In a previous post Ishowed a small example of how to create a grammar and a parser using Antlrand Python. I made the point that simply parsing a file, and checking itconforms to the language grammar, isn’t usually all we want to do. As wellas creating parsers for a language Antlr also makes it easy to create moreuseful programs through actions.
To follow these examples you should take a look at my previous postwhere I used a very simple CSS like language grammar which is shown below.
grammar simplecss;options { language = Python;}/* Parser rules */cssdocument : cssrule+ ;cssrule : IDENTIFIER '{' declarationlist '}' ;declarationlist : declaration+ ;declaration : IDENTIFIER ':' IDENTIFIER ';' ;/* Lexer rules */fragment LETTER : 'A'..'Z' | 'a'..'z' | '_' ;fragment DIGIT : '0'..'9' ;IDENTIFIER : LETTER (LETTER | DIGIT)* ;COMMENT : '/*' .* '*/' {$channel = HIDDEN} ;WHITESPACE : ( 't' | ' ' | 'r' | 'n'| 'u000C' )+ {$channel = HIDDEN} ;
CSS is built up of selectors and a list of properties for selectors. Itseems that a useful data-type in Python would be a dictionary ofdictionaries. The outer set of keys being the selectors, and the inner setbeing the property names. The css banana {color:yellow; length:10cm;shape:curved;}
would be translated to the Python data type: {'banana':{'color': 'yellow', 'length': '10cm', 'shape': 'curved'}}
.
A simple way of doing this would be to add actions to the Antlr grammar. Anaction appears as a block of code in curly-braces. The position of theaction roughly equates to where in the parsing algorithm it will occur. Forexample we could print out the name of each css selector as we parse thedocument.
cssrule : i=IDENTIFIER {print $i.text} '{' declarationlist '}' ;
Notice that the $i
symbol refers to a variable outside the code block.This is substituted for the real IDENTIFIER
variable when Antlr generatesthe parser. In this sense the code block can be thought of as a template forthe real code generated by Antlr. Detailed information about what symbolsare allowed in actions can be found in the Antlr documentation.Details about what attributes can be used on the parser and lexer rules isfound separately in the Attributes section. Thecode block will be executed after parsing the identifier, but before thedeclarationlist
.
At this point we have to give a little thought to the code generated byAntlr from our grammar description. Each ‘rule’ within the grammar file istranslated in to a function on the parser class that takes the stream ofsymbols as the input. For example, when parsing a whole document, thetop-level rule function is used: res = parser.cssdocument ()
. Thecssdocument
function corresponds to the grammar rule of the same name.These rule functions have return values. It is possible to add members tothe return type within the grammar file.
cssdocument returns [selectordict] : cssrule+ ;
To access this rule within an action the attribute selectordict
is used asfollows.
cssdocument returns [selectordict] :{$cssdocument.selectordict = {}} cssrule+ ;
The problem with this is that the selectordict
variable is only availablewithin the cssdocument
rule. To make it available to sub-rules, where itis needed, it would have to be passed as an argument. This is cumbersome.The solution is Antlr scopes. A scope is a stack-type variable that isavailable to all sub-rules. The stack type is useful for language featuressuch as namespaces and variable scopes, hence the name. Scopes are declaredusing the scope
keyword.
cssdocumentscope { selectorscope;} : cssrule+ ;
Access for scopes is different to that of rule attributes and uses a syntaxlike C++ namespaces: $cssdocument::selectorscope
.
Using actions, return-values and scopes it is now possible to create aparser that generates a python dictionary from a source file.
grammar simplecssactions;options { language = Python;}/* Parser rules */cssdocument returns [ruledict]scope { ruledictscope; rulename;}@init { $cssdocument::ruledictscope = {} $cssdocument::rulename = ''} : cssrule+ ;cssrule : IDENTIFIER{ $cssdocument::rulename = $IDENTIFIER.text $cssdocument::ruledictscope[$IDENTIFIER.text] = {}}'{' declarationlist '}' ;declarationlist : declaration+ ;declaration : p=IDENTIFIER ':' v=IDENTIFIER ';'{$cssdocument::ruledictscope[$cssdocument::rulename][$p.text] =[$v.text]};/* Lexer rules */fragment LETTER : 'A'..'Z' | 'a'..'z' | '_' ;fragment DIGIT : '0'..'9' ;IDENTIFIER : LETTER (LETTER | DIGIT)* ;COMMENT : '/*' .* '*/' {$channel = HIDDEN} ;WHITESPACE : ( 't' | ' ' | 'r' | 'n'| 'u000C' )+ {$channel = HIDDEN} ;
The new syntax in this example is the @init
block of the cssdocument
rule. This is called before anything else in the rule and is generally usedfor initialising local variables.
Using instructions given in my first blog post, it is now possible tocompile this grammar in-to a parser using Antlr. Once the parser has beengenerated a compiler can be written that outputs a Python dictionary.
import sysimport antlr3from simplecssactionsLexer import simplecssactionsLexerfrom simplecssactionsParser import simplecssactionsParserdef main (argv): filename = argv[1] input = antlr3.FileStream (filename) lexer = simplecssactionsLexer (input) tokens = antlr3.CommonTokenStream (lexer) parser = simplecssactionsParser (tokens) res = parser.cssdocument () print res if __name__ == "__main__": sys.exit (main (sys.argv))
As well as actions Antlr also has the ability to generate ASTs (AbstractSyntax Trees) from a conforming document. This is a tree data type thatsomewhat matches the syntax of the language. It is very useful when makingmultiple passes over the document, as it saves parsing it twice. Antlr hasspecial grammar file syntax for easily generating the desired AST from anylanguage.
Retrospective
I have had fun with Antlr. Despite only having rudimentary knowledge ofparsing and compiling I created a decent compiler for a language of my owndesign. Perhaps the most impressive part of Antlr’s parser generation is theerror reporting. Error reporting is extremely important not only foreventual users of any compiler but also during development. Antlr allowed meto create a professional feel to the program when I would not otherwise havehad the time.
There are some issues. Antlr is big, and tries to remain consistent betweenits different modes with only some success. The different data types ofToken, Rule, Tree, Template often caused me confusion. They are all handledin similar ways in the grammar files, but very differently in the eventualcode. The code generation is also sometimes difficult. There does not seemto be a well worked template system for this, possibly because the code usedisn’t restricted to any one particular language. The best example is thecommon use of the %
symbol in python, which is also used in Antlr andcauses issues.
If you are thinking about playing about with your-own language grammar youshould probably try Antlr first. If you are lucky enough to be working in alanguage with a well worked parser combinator library then this may be moresuitable. I doubt many are as complete or actively-maintained as Antlr.