Dolmen Lexers: Syntax Reference

In this section we present the complete syntax reference for Dolmen lexer descriptions. We start with the lexical structure of Dolmen lexers before describing the actual grammar of the language.

Lexical conventions

The following lexical conventions explain how the raw characters in a Dolmen lexer description are split into the lexical elements that are then used as terminals of the grammar.

White space

The following five characters are considered white space: space (0x20), horizontal tab (0x09), line feed (0x0A), form feed (0x0C) and carriage return (0x0D). The line feed and carriage return characters are called line terminators. White space does not produce lexical elements but can serve as a separator between other lexical elements.

Comments

Comments follow the same rules as in the Java language. They can be either end-of-line comments // … extending up to a line terminator, or traditional multi-line comments /* ... */. Comments cannot be nested, which means in particular that neither // nor /* are interpreted inside a traditional multi-line comment. As with white space, comments do not produce lexical elements but can serve as separators between other lexical elements.

Identifiers (IDENT)

Identifiers are formed by non-empty sequences of letters (a lowercase letter from a to z, an uppercase letter from A to Z, or the underscore character _) and digits (from 0 to 9), and must start with a letter. For instance, id, _FOO or _x32_y are valid identifiers. Some otherwise valid sequences of letters are reserved as keywords and cannot be used as identifiers.
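The rule above can be sketched as a small Java check. This is only an illustration and not Dolmen's actual implementation: the class and method names are hypothetical, and the keyword set is copied from the list of reserved keywords given later in this section.

```java
import java.util.Set;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
final class IdentCheck {
    // A letter (including '_') first, then any number of letters or digits.
    private static final Pattern IDENT = Pattern.compile("[A-Za-z_][A-Za-z_0-9]*");

    // Reserved keywords, as listed in this reference.
    private static final Set<String> KEYWORDS = Set.of(
        "as", "eof", "import", "orelse", "private", "public", "rule", "shortest", "static");

    // Note: a lone "_" also matches here, although the grammar uses _ as the
    // wildcard; how Dolmen tokenizes it is not covered by this sketch.
    static boolean isIdentifier(String s) {
        return IDENT.matcher(s).matches() && !KEYWORDS.contains(s);
    }
}
```

With this sketch, id, _FOO and _x32_y are accepted, whereas 32x (starts with a digit) and rule (a keyword) are rejected.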

Literal integers (LNAT)

A literal integer is either 0 or a non-zero digit followed by any number of digits from 0 to 9; in particular, leading zeros are not allowed. Its value is interpreted as a decimal integer. Any such sequence of digits which produces a value that does not fit in a 32-bit signed integer results in a lexical error.
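As a rough illustration of both conditions (shape of the literal and 32-bit range), here is a minimal Java sketch. The class name, the method name and the use of -1 to signal an invalid literal are assumptions made for this example; Dolmen itself reports a lexical error instead.

```java
// Hypothetical helper, for illustration only.
final class NatCheck {
    /** Returns the value of a literal integer, or -1 if the text is not a valid LNAT. */
    static int parseNat(String s) {
        // Either "0" alone, or a non-zero digit followed by any digits.
        if (!s.matches("0|[1-9][0-9]*")) return -1;
        try {
            // Integer.parseInt rejects values above 2^31 - 1.
            return Integer.parseInt(s);
        } catch (NumberFormatException e) {
            return -1;
        }
    }
}
```

For example, 2147483647 is accepted but 2147483648 is not, and 007 is rejected because of its leading zeros.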

Character literals (LCHAR)

A character literal is expressed as a simple character or an escape sequence between single quotes '…'. A simple character can be any character other than ', \ and line terminators. An escape sequence can be any of the following:

an octal character code between \000 and \377, representing the character with the given octal ASCII code;
an escape sequence amongst \\, \', \", \r (0x0D), \n (0x0A), \b (0x08), \t (0x09) and \f (0x0C);
a Unicode character code between \u0000 and \uFFFE, representing the corresponding UTF-16 code unit; just like in Java, there can be any positive number of u characters before the actual hexadecimal code.

For instance, possible character literals are 'g', '$', '\'', '\047' or '\uuu0027'. The last three happen to all represent the same character.

Note that the character \uFFFF is not allowed as it is reserved to represent the end-of-input; it is not a valid Unicode character anyway.
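The decoding rules for character literals can be summarized by the following Java sketch, which takes the text between the single quotes and returns the character it denotes. It is not Dolmen's implementation: the names are hypothetical, errors are reported with IllegalArgumentException for simplicity, and we assume that octal codes are always written with exactly three digits, as the bounds \000 and \377 suggest.

```java
// Hypothetical helper, for illustration only.
final class CharLiteral {
    /** Decodes the text found between the single quotes of a character literal. */
    static char decode(String body) {
        if (body.isEmpty()) throw new IllegalArgumentException("empty literal");
        if (body.charAt(0) != '\\') {
            // Simple character: anything but ', backslash and line terminators.
            if (body.length() != 1 || "'\n\r".indexOf(body.charAt(0)) >= 0)
                throw new IllegalArgumentException("bad simple character");
            return body.charAt(0);
        }
        // Octal code from \000 to \377 (assumed to use exactly three digits).
        if (body.matches("\\\\[0-3][0-7]{2}"))
            return (char) Integer.parseInt(body.substring(1), 8);
        // Unicode code: a backslash, one or more 'u', then four hexadecimal digits.
        if (body.matches("\\\\u+[0-9a-fA-F]{4}")) {
            int code = Integer.parseInt(body.substring(body.length() - 4), 16);
            if (code == 0xFFFF)
                throw new IllegalArgumentException("\\uFFFF is reserved for end-of-input");
            return (char) code;
        }
        // Two-character escapes: backslash, quote, double quote, r, n, b, t, f.
        if (body.length() == 2) {
            int i = "\\'\"rnbtf".indexOf(body.charAt(1));
            if (i >= 0) return "\\'\"\r\n\b\t\f".charAt(i);
        }
        throw new IllegalArgumentException("bad escape sequence");
    }
}
```

With this sketch, the bodies of '\'', '\047' and '\uuu0027' all decode to the same single-quote character, as stated above.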

String literals (LSTRING, MLSTRING)

A string literal is expressed as a sequence of simple characters and escape sequences between double quotes "…". A simple character is any character other than " and \. An escape sequence is exactly as described for character literals.
Unlike in Java, line terminators are allowed inside string literals, where they represent their own value. Nonetheless, single-line string literals and multi-line string literals produce different lexical elements (resp. LSTRING and MLSTRING) which can then be distinguished in the grammar. Indeed, multi-line string literals are only syntactically valid as option values, and their usage elsewhere will result in a syntax error during the parsing phase.

Characters and escape sequences in a string literal are interpreted greedily in the order in which they appear. As a consequence, a \ character is only understood as the start of an escape sequence if the number of other backslashes that contiguously precede it is even (which includes zero).
For instance, the string literal "\\u006E" is interpreted as an escaped backslash followed by "u006E".
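The parity rule on backslashes can be made concrete with a few lines of Java. This is merely a sketch with hypothetical names; it works on the raw text of the literal, that is, on the characters exactly as written in the lexer description.

```java
// Hypothetical helper, for illustration only.
final class BackslashParity {
    /** True if the backslash at index i of the raw text starts an escape sequence. */
    static boolean startsEscape(String raw, int i) {
        if (raw.charAt(i) != '\\') return false;
        // Count the backslashes that contiguously precede position i.
        int preceding = 0;
        for (int j = i - 1; j >= 0 && raw.charAt(j) == '\\'; j--) preceding++;
        return preceding % 2 == 0;
    }
}
```

On the raw text of the example above (two backslashes followed by u006E), the first backslash starts an escape sequence whereas the second one does not, so the trailing u006E is read as five plain characters.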

As in Java, Unicode code points outside the Basic Multilingual Plane cannot be represented with a single character or escape sequence; one must use two UTF-16 code units (a surrogate pair) instead. For instance, the Unicode code point U+1D11E, which is the symbol for the musical G clef 𝄞, can be obtained with two Unicode escape sequences \uD834\uDD1E.
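When in doubt about which surrogate pair encodes a given code point, the standard Java method Character.toChars can compute it. The small helper below, whose names are chosen for this example only, prints the escape sequences to write for an arbitrary code point.

```java
// Hypothetical helper, for illustration only.
final class SurrogateDemo {
    /** Formats a code point as the Unicode escape sequence(s) needed to write it. */
    static String toEscapes(int codePoint) {
        StringBuilder sb = new StringBuilder();
        // Character.toChars yields one UTF-16 code unit inside the BMP, two outside.
        for (char unit : Character.toChars(codePoint)) {
            sb.append(String.format("\\u%04X", (int) unit));
        }
        return sb.toString();
    }
}
```

Calling SurrogateDemo.toEscapes(0x1D11E) yields the two escape sequences \uD834\uDD1E given above for the G clef.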

Java actions (ACTION)

Java actions are lexical elements which represent verbatim excerpts of Java code. They are used as part of the semantic actions associated to a lexer entry’s clauses, to express the Java return type and arguments of a lexer entry, and also for the top-level header and footer of the generated lexical analyzer.
A Java action is a block of well-delimited Java code between curly braces: { …well-delimited Java code… }. A snippet of Java code is well-delimited if every character literal, string literal, instruction block, method, class or comment that it contains is correctly closed and balanced. This essentially ensures that Dolmen is able to correctly and safely identify the closing } which delimits the end of the Java action.

Example (Comments)

The following snippet:

{ // TODO Later }

is not a valid Java action because the internal end-of-line comment is not closed inside the action. In fact the closing } is understood as being part of the Java snippet and thus part of the comment (as revealed by the syntax highlighting). Adding a line break makes this a valid Java action:

{ // TODO Later
}

Example (Literals)

The following snippet:

{ System.out.printf("In action \"Foo\"); }

is not a valid Java action because the internal String literal is not properly closed inside the action. Closing the literal makes this a valid Java action:

{ System.out.printf("In action \"Foo\""); }
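To give an idea of how the closing } of a Java action can be located, here is a simplified Java sketch which skips comments and literals while counting braces. It is an approximation written for this reference and not Dolmen's actual lexer: all names are hypothetical, and the only escape handling inside literals is that a backslash protects the character that follows it.

```java
// Hypothetical helper, for illustration only.
final class ActionScanner {
    /**
     * Given text starting at an opening '{', returns the index of the
     * matching '}', or -1 if the text is not well-delimited.
     */
    static int findClosingBrace(String src) {
        int depth = 0;
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            if (src.startsWith("//", i)) {
                // End-of-line comment: skip up to the next line terminator.
                while (i < src.length() && src.charAt(i) != '\n' && src.charAt(i) != '\r') i++;
                continue;
            }
            if (src.startsWith("/*", i)) {
                // Traditional comment: skip up to the first closing delimiter.
                int end = src.indexOf("*/", i + 2);
                if (end < 0) return -1;
                i = end + 2;
                continue;
            }
            if (c == '"' || c == '\'') {
                i = skipLiteral(src, i);
                if (i < 0) return -1;
                continue;
            }
            if (c == '{') depth++;
            else if (c == '}' && --depth == 0) return i;
            i++;
        }
        return -1;
    }

    /** Returns the index just after the literal opened at start, or -1 if it is unclosed. */
    private static int skipLiteral(String src, int start) {
        char quote = src.charAt(start);
        for (int i = start + 1; i < src.length(); i++) {
            char c = src.charAt(i);
            if (c == '\\') i++;                    // skip the escaped character
            else if (c == quote) return i + 1;
        }
        return -1;
    }
}
```

On the first snippet of each example above, this function returns -1 because the comment or the string literal swallows the final }, whereas on the corrected snippets it returns the index of the last character.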

Dolmen’s companion Eclipse plug-in offers editors with syntax highlighting, including relevant syntax highlighting inside Java actions. It is obvious in the examples above that decent Java-aware syntax highlighting goes a long way in helping one avoid silly typos and syntactic mistakes inside Java actions.

Dolmen’s lexical conventions for white space, comments and literals follow those used in the Java programming language. In particular, Dolmen will follow the same rules when encountering character or string literals inside and outside Java actions. There is a subtle but important difference between Dolmen and Java though: unlike Java, Dolmen does not unescape Unicode sequences in a preliminary pass but during the main lexical translation instead. Therefore, if one uses a Unicode escape code to stand for a line terminator, a delimiter or a character in an escape sequence, it is possible to write valid Java code that is not a valid Java action, or the other way around.
Consider for instance the Java literal "Hello\u0022: this is a valid Java string because \u0022 is first replaced by the double quote character ", but as far as Dolmen is concerned this is an incomplete string literal whose first six characters are Hello". Conversely, "\u005C" is a valid Dolmen string representing the single character \, but is understood by Java as an incomplete string literal whose first character is ".
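To make the difference tangible, the following Java sketch mimics the preliminary pass performed by a Java compiler, in which Unicode escapes are replaced before any token is recognized. It is a simplified reconstruction of the behaviour described in the Java Language Specification, with hypothetical names; in particular, a malformed escape is copied as-is here, whereas a real Java compiler would reject it.

```java
// Hypothetical helper, for illustration only.
final class UnicodePrePass {
    /** Replaces eligible Unicode escapes in raw source text, as a Java compiler would. */
    static String translate(String raw) {
        StringBuilder out = new StringBuilder();
        int backslashes = 0;   // raw backslashes contiguously preceding position i
        int i = 0;
        while (i < raw.length()) {
            char c = raw.charAt(i);
            if (c == '\\' && backslashes % 2 == 0) {
                int j = i + 1;
                while (j < raw.length() && raw.charAt(j) == 'u') j++;
                boolean hasU = j > i + 1;
                if (hasU && j + 4 <= raw.length()
                        && raw.substring(j, j + 4).matches("[0-9a-fA-F]{4}")) {
                    // Eligible escape: emit the code unit it denotes.
                    out.append((char) Integer.parseInt(raw.substring(j, j + 4), 16));
                    i = j + 4;
                    backslashes = 0;
                    continue;
                }
            }
            backslashes = (c == '\\') ? backslashes + 1 : 0;
            out.append(c);
            i++;
        }
        return out.toString();
    }
}
```

Applied to the raw text "Hello\u0022, this pass produces "Hello", which Java then reads as a complete string literal; Dolmen performs no such pass and sees an unterminated literal instead.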

Keywords

The following lower-case identifiers are reserved keywords of the language:
as, eof, import, orelse, private, public, rule, shortest, static.

Operators and punctuation

The following symbols serve as operators or punctuation symbols in Dolmen lexer descriptions:
=, |, [, ], *, ?, +, (, ), ^, -, #, ., <, >, ,, ;.

Any input sequence which does not match any of the categories above will result in a lexical error.

Grammar of Dolmen Lexers

We give the complete grammar for Dolmen lexer descriptions below. The terminals of the grammar are the lexical elements described above; keywords, punctuation and operator symbols are displayed in boldface. The main symbol is Lexer, and we use traditional BNF syntax to present the grammar’s rules, augmented with the usual repetition operators ? (at most one), + (at least one) and * (any number of repetitions).

Lexer :=
  Option*
  Import*
  ACTION        // header
  Definition*
  Entry+
  ACTION        // footer

Options, Imports and Auxiliary Definitions

Option :=
| [ IDENT = LSTRING ]
| [ IDENT = MLSTRING ]

Import :=
  import (static)? Typename ;

Typename :=
| IDENT
| IDENT . Typename
| IDENT . *

Definition :=
  IDENT = Regular ;

Lexer Entries

Entry :=
  (public | private)
  ACTION            // entry's return type
  rule IDENT
  (ACTION)?         // entry's optional arguments
  = (shortest)?     // whether shortest-match or longest-match rule is used
  Clause+

Clause :=
| | Regular ACTION
| | orelse  ACTION

Regular Expressions

Regular :=
| Regular as IDENT    // as is left-associative
| AltRegular

AltRegular :=
| SeqRegular | AltRegular        // choice
| SeqRegular

SeqRegular :=
| PostfixRegular SeqRegular      // concatenation
| PostfixRegular

PostfixRegular :=
| DiffRegular *                    // zero, one or more occurrences
| DiffRegular +                    // at least one occurrence
| DiffRegular ?                    // at most one occurrence
| DiffRegular < LNAT >             // fixed # of occurrences
| DiffRegular < LNAT , LNAT >      // min. and max. # of occurrences
| DiffRegular

DiffRegular :=
| AtomicRegular # AtomicRegular  // only with char classes
| AtomicRegular

AtomicRegular :=
| _                           // all characters except eof
| eof                         // end-of-input
| LCHAR                       // a single character
| LSTRING                     // an exact sequence of characters
| IDENT                       // defined regular expression
| CharClass                   // a character class
| ( Regular )

CharClass :=
  [ CharSet ]

CharSet :=
| ^ CharSetPositive         // complement character set
| CharSetPositive

CharSetPositive :=
| LCHAR                     // a single character
| LCHAR - LCHAR             // a range of characters (inclusive)
| IDENT                     // defined character set
| CharSetPositive CharSetPositive // union of character sets
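The character set constructs of the grammar can be read as plain set operations. The Java sketch below models them with java.util.BitSet; it is only meant to illustrate the intended meaning of ranges, unions, complements and the # operator, and all names in it are hypothetical. We assume here that sets range over the code units 0x0000 to 0xFFFE, since 0xFFFF is reserved for end-of-input, and that # denotes set difference.

```java
import java.util.BitSet;

// Hypothetical model of character sets, for illustration only.
final class CharSets {
    // Assumption: 0xFFFF is excluded because it stands for end-of-input.
    static final int LIMIT = 0xFFFF;

    /** LCHAR - LCHAR: an inclusive range of characters. */
    static BitSet range(char lo, char hi) {
        BitSet s = new BitSet(LIMIT);
        s.set(lo, hi + 1);
        return s;
    }

    /** LCHAR: a single character. */
    static BitSet single(char c) {
        return range(c, c);
    }

    /** CharSetPositive CharSetPositive: union of two character sets. */
    static BitSet union(BitSet a, BitSet b) {
        BitSet s = (BitSet) a.clone();
        s.or(b);
        return s;
    }

    /** ^ CharSetPositive: every character below LIMIT that is not in the set. */
    static BitSet complement(BitSet a) {
        BitSet s = (BitSet) a.clone();
        s.flip(0, LIMIT);
        return s;
    }

    /** AtomicRegular # AtomicRegular: characters of a which are not in b. */
    static BitSet difference(BitSet a, BitSet b) {
        BitSet s = (BitSet) a.clone();
        s.andNot(b);
        return s;
    }
}
```

For instance, under these assumptions, the expression ['a'-'z'] # ['a' 'e' 'i' 'o' 'u'] corresponds to difference(range('a', 'z'), vowels), which contains the 21 lowercase consonants.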