In this section we present the complete syntax reference for Dolmen parser descriptions. We start with the lexical structure of Dolmen parsers before describing the actual grammar of the language.
Lexical conventions
The following lexical conventions explain how the raw characters in a Dolmen parser description are split into the lexical elements that are then used as terminals of the grammar.
- White space
-
The five following characters are considered as white space: space
(0x20), horizontal tab(0x09), line feed(0x0A), form feed(0x0C)and carriage return(0x0D). The line feed and carriage return characters are called line terminators. White space does not produce lexical elements but can serve as separator between other lexical elements. - Comments
-
Comments follow the same rules as in the Java language. They can be either end-of-line comments
// …extending up to a line terminator, or traditional multi-line comments/* ... */. Comments cannot be nested, which means in particular that neither//nor/*are interpreted inside a traditional multi-line comment. As with white space, comments do not produce lexical elements but can serve as separators between other lexical elements. - Identifiers (
IDENT) -
Identifiers are formed by non empty-sequences of letters (a lowercase letter from
atoz, an uppercase letter fromAtoZ, or the underscore character_) and digits (from0to9), and must start with a letter. For instance,id,_FOOor_x32_yare valid identifiers. Some otherwise valid sequences of letters are reserved as keywords and cannot be used as identifiers. - String literals (
MLSTRING) -
A string literal is expressed as a sequence of simple characters and escape sequences between double quotes
"…". A simple character is any character other than"and\; in particular string literals can span multiple lines in a Dolmen parser description. An escape sequence can be any of the following:-
an octal character code between
\000and\377, representing the character with the given octal ASCII code; -
an escape sequence amongst
\\,\',\",\r(0x0D),\n(0x0A),\b(0x08),\t(0x09) and\f(0x0C); -
a Unicode character code between
\u0000and\uFFFF, representing the corresponding UTF-16 code unit; just like in Java, there can be any positive number ofucharacters before the actual hexadecimal code.
For instance, possible characters in a string literal are
g,$,\',\047or\uuu0027. The last three happen to all represent the same character. -
Characters and escape sequences in a string literal are
interpreted greedily in the order in which they appear, and
therefore a \ character will only be understood as the start of an
escape sequence if the number of other backslash \ that
contiguously precede it is even (which includes zero).Therefore, the string literal "\\u006E" is interpreted as an
escaped backslash followed by "u006E".
|
As in Java, Unicode code points outside the Basic Multilingual
Plane cannot be represented with a single character or escape
sequence; one must use two code points (a
surrogate
pair) instead. For instance, the Unicode code point U+1D11E,
which is the symbol for the musical G clef 𝄞, can be
obtained with two Unicode escape sequences \uD834\uDD1E.
|
- Java actions (
ACTION) -
Java actions are lexical elements which represent verbatim excerpts of Java code. They are used as part of the semantic actions associated to a parser rule’s productions, to express the Java return type of a rule or the Java type of the value associated to a token, and also for the top-level header and footer of the generated syntactic analyzer.
A Java action is a block of well-delimited Java code between curly braces:{ …well-delimited Java code… }. A snippet of Java code is well-delimited if every character literal, string literal, instruction block, method, class or comment that it contains is correctly closed and balanced. This essentially ensures that Dolmen is able to correctly and safely identify the closing}which delimits the end of the Java action.
Because Dolmen grammar rules can be parametric, Java actions are not necessarily copied verbatim in the generated parser: some generic parts of the Java action may undergo a substitution to account for the actual instantiation of a rule’s formal parameters. To that end, Java actions can contain placeholders of the form#keycalled holes, wherekeymust be any identifier starting with a lower-case letter. When instantiating grammar rules, these holes are filled by Dolmen with the Java types of the formal parameters whose names match the holes' keys. One limitation of this mechanism is that holes are not interpreted when they appear inside a Java comment, character literal or string literal. In other contexts, the#character can safely be interpreted as a hole marker since it is not a Java separator or operator symbol.
The following snippet:
{ // TODO Later }
is not a valid Java action because the internal end-of-line comment
is not closed inside the action. In fact the closing } is understood
as being part of the Java snippet and thus part of the comment (as
revealed by the syntax highlighting). Adding
a line break makes this a valid Java action:
{ // TODO Later
}
The following snippet:
{ System.out.printf("In action \"Foo\"); }
is not a valid Java action because the internal String literal is not properly closed inside the action. Closing the literal makes this a valid Java action:
{ System.out.printf("In action \"Foo\""); }
The following snippet:
{
// See A#foo()
List<#elt> l = new ArrayList<#elt>();
}
is a valid Java action with holes referencing some formal
parameter elt. The #foo in the comment is not
interpreted as a hole.
| Dolmen’s companion Eclipse plug-in offers editors with syntax highlighting, including relevant syntax highlighting inside Java actions which also recognizes holes. It is obvious in the examples above that decent Java-aware syntax highlighting goes a long way in helping one avoid silly typos and syntactic mistakes inside Java actions. |
|
Dolmen’s lexical conventions for white space, comments and
literals follow those used in the Java programming
language. In particular, Dolmen will follow the same rules
when encountering character or string literals inside and
outside Java actions. There is a subtle but important
difference between Dolmen and Java though:
unlike
Java, Dolmen does not unescape Unicode sequences in a
preliminary pass but during the main lexical translation
instead. Therefore, if one uses a Unicode escape code to
stand for a line terminator, a delimiter or a character in an
escape sequence, it is possible to write valid Java code that
is not a valid Java action, or the other way around. Consider for instance the Java literal "Hello\u0022: this
is a valid Java string because \u0022 is first replaced by
the double quote character ", but as far as Dolmen is
concerned this is an incomplete string literal whose first
six characters were Hello". Another example is "\u005C"
which is a valid Dolmen string representing the single
character \, and is understood by Java as being an
incomplete string literal whose first character is ". |
- Java arguments (
ARGUMENTS) -
Java arguments are very similar to Java actions in that they are lexical elements which represent verbatim excerpts of Java code. They are used to declare a parser rule's arguments and to pass arguments to a non-terminal in a production.
Java arguments are formed by a block of well-delimited Java code between parentheses:( …well-delimited Java code… ). The snippet of Java code is well-delimited if every character literal, string literal, or parenthesized expression that it contains is correctly closed and balanced. This essentially ensures that Dolmen is able to correctly and safely identify the closing)which delimits the end of the Java arguments.
Like Java actions, Java arguments can contain holes in the form of identifiers introduced by a#character. Holes are only interpreted outside Java comments and literals. - Keywords
-
The following lower-case identifiers are reserved keywords of the language:
continue,import,private,public,rule,static,token. - Operators and punctuation
-
The following symbols serve as operators or punctuation symbols in Dolmen lexer descriptions:
=,|,[,],.,<,>,,,;.
Any input sequence which does not match any of the categories above will result in a lexical error.
Grammar of Dolmen Parsers
We give the complete grammar for Dolmen parser descriptions below. The
terminals of the grammar are the lexical elements described infra,
and keywords, punctuation and operator symbols are displayed in
boldface. The main symbol is Parser and we use traditional
BNF syntax to
present the grammar’s rules, augmented with the usual repetition
operators ? (at most one), + (at least one) and * (any number
of repetitions).
Options, Imports and Auxiliary Definitions
Import := import (static)? Typename ;
Grammar Rules and Productions
Rule := (public | private) ACTION // rule's return type rule IDENT // must start with lowercase letter (FormalParams)? // rule's optional formal parameters (ARGUMENTS)? // rule's optional arguments = Production+ ; FormalParams := < IDENT (, IDENT)* > // formals must start with lowercase letter Production := | Item*
Production Items and Actuals
Actual := ActualExpr (ARGUMENTS)? // optional arguments to the actual
ActualExpr := | IDENT // a terminal or ground non-terminal | IDENT < ActualExpr (, ActualExpr)* > // application of a parametric non-terminal