EBNF parser and generic parser generator.
In the primary mode, it supports a Parsing Expression Grammar (PEG) parser generator. This performs more minimal transformations on the parsed grammar to extract sub-productions, which allows each component of a rule to generate its own parsing event.
The resulting Packrat memoizer to reduce extra work when backtracking.objects then parse each associated rule according to the operator semantics and use a
These rules are driven using themodule which calls invokes the starting rule and ensures that all input is consumed.
a ::= b?into
a ::= _empty | b
a ::= b+into
a ::= b b*
a ::= b*into
a ::= _empty | (b a)
a ::= op1 (op2)into two rules:
a ::= op1 _a_1 _a_1_ ::= op2
Of note in this implementation is that the tokenizer and parser are streaming, so that they can process inputs of arbitrary size.
The exception operator (
A - B) is only supported on terminals.
Seeand for further information.
Parsing an EBNF Grammar
require 'ebnf' grammar = .(File.open('./etc/ebnf.ebnf'))
puts grammar.to_sxp puts grammar.to_ttl puts grammar.to_html puts grammar.to_s
Generate First/Follow rules for BNF grammars (using "ebnf" as the starting production):
Generate Terminal, First/Follow, Cleanup and Branch tables as Ruby for parsing grammars:
Generate formatted grammar using HTML (requires Haml gem):
Parsing an ISO/IEC 14977 Grammar
The EBNF gem can also parse [ISO/EIC 14977] Grammars (ISOEBNF) to S-Expressions.
grammar = .(File.open('./etc/iso-ebnf.isoebnf'), format: :isoebnf)
Parsing an ABNF Grammar
grammar = .(File.open('./etc/abnf.abnf'), format: :abnf)
Inevitably while implementing a parser for some specific grammar, a developer will need greater insight into the operation of the parser. While this can involve sorting through a tremendous amount of data, the parser can be provided a Logger instance which will output messages at varying levels of detail to document the state of the parser at any given point. Most useful is likely the
INFO level of debugging, but even more detail is revealed using the
ERROR statements will typically also be provided as part of an exception if parsing fails, but can be shown in the context of other parsing state with appropriate indentation as part of the logger.
The S-Expressions, this provides a means of transforming between grammar formats (e.g., W3C EBNF to ABNF), although with some potential loss in semantic fidelity (case-insensitive string matching vs. case-sensitive matching).class can be used to write parsed grammars out, either as formatted text, or HTML. Because grammars are written from the Abstract Syntax Tree, represented as
The formatted HTML results are designed to be appropriate for including in specifications.
On a parsing failure, and exception is raised with information that may be useful in determining the source of the error.
Note that the grammar includes an optional
[identifer] in front of rule names, which can be in conflict with the
RANGE terminal. It is typically not a problem, but if it comes up, try parsing with the
native parser, add comments or sequences to disambiguate. EBNF does not have beginning of line checks as all whitespace is treated the same, so the common practice of identifying each rule inherently leads to such ambiguity.
The character set for EBNF is UTF-8.
The general form of a rule is:
symbol ::= expression
which can also be proceeded by an optional number enclosed in square brackets to identify the rule number:
 symbol ::= expression
(Note, this can introduce an ambiguity if the previous rule ends in a range or enum and the current rule has no identifier. In this case, enclosing
expression within parentheses, or adding intervening comments can resolve the ambiguity.)
Symbols are written in CAPITAL CASE if they are the start symbol of a regular language (terminals), otherwise with they are treated as non-terminal rules. Literal strings are quoted.
Within the expression on the right-hand side of a rule, the following expressions are used to match strings of one or more characters:
|matches any Char or HEX with a value in the range(s) indicated (inclusive).|
||matches any UTF-8 R\_CHAR or HEX with a value among the characters enumerated. The last component may be '-'. Enumerations and ranges may be mixed in one set of brackets.|
||matches any UTF-8 Char or HEX a value outside the range indicated.|
||matches any UTF-8 R\_CHAR or HEX with a value not among the characters given. The last component may be '-'. Enumerations and ranges of excluded values may be mixed in one set of brackets.|
||matches a literal string matching that given inside the double quotes.|
||matches a literal string matching that given inside the single quotes.|
||matches A or nothing; optional A.|
||matches any string that matches |
||matches one or more occurrences of |
||matches zero or more occurrences of |
||Defines consumed whitespace in the document. Any whitespace found between non-terminal rules is consumed and ignored.|
||Introduces terminal rules. All rules defined after this point are treated as terminals.|
- Comments include
#through end of line (other than hex character) and
/* ... */ (* ... *) which may cross lines
- All rules MAY start with an identifier, contained within square brackets. For example
 rule, where the value within the brackets is a symbol
([a-z] | [A-Z] | [0-9] | "_" | ".")+
@terminalscauses following rules to be treated as terminals. Any terminal which is all upper-case (eg
TERMINAL), or any rules with expressions that match characters (
A - B), are also treated as terminals.
@passdefines the expression used to detect whitespace, which is removed in processing.
- No support for
wfc(well-formedness constraint) or
Intermediate representations of the grammar may be serialized to Lisp-like S-Expressions. For example, the rule
 ebnf ::= (declaration | rule)*
is serialized as
(rule ebnf "1" (star (alt declaration rule)))
Different components of an EBNF rule expression are transformed into their own operator:
|Case-insensitive string matching|
|Negative look-ahead, used for non-terminal uses of `B - A`.|
Additionally, rules defined with an UPPERCASE symbol are treated as terminals.
For an LL(1) parser generator, the method can be used to transform the EBNF rule into a BNF rule.
(rule ebnf "1" (alt _empty _ebnf_2)) (rule _ebnf_1 "1.1" (alt declaration rule)) (rule _ebnf_2 "1.2" (seq _ebnf_1 ebnf)) (rule _ebnf_3 "1.3" (seq ebnf))
This allows First/Follow and other tables used by a parser to parse examples of the associated grammar. For more, see .
For a PEG parser generator, there is a simpler transformation that reduces rules containing sub-expressions (composed of
seq and similar expressions) and creates named rules to allow appropriate callbacks and for naming elements of the generating abstract syntax tree. The method transforms the original rule into the following two rules:
(rule ebnf "1" (star _ebnf_1)) (rule _ebnf_1 "1.1" (alt declaration rule))
For an example parser built using this gem that parses the EBNF grammar, see EBNF PEG Parser example. This example creates a parser for the EBNF grammar which generates the same Abstract Syntax Tree as the built-in parser in the gem.
There is also an EBNF LL(1) Parser example.
Much of this work, particularly the generic parser, is inspired by work originally done by Tim Berners-Lee's Python predictive parser.
Full documentation available on Rubydoc.info.
- Better LL(1) parser tests
This repository uses Git Flow to mange development and release activity. All submissions must be on a feature branch based on the develop branch to ease staging and integration.
- Do your best to adhere to the existing coding conventions and idioms.
- Don't use hard tabs, and don't leave trailing whitespace on any line.
- Do document every method you add using YARD annotations. Read the tutorial or just look at the existing code for examples.
- Don't touch the
AUTHORSfiles. If you need to change them, do so on your private branch only.
- Do feel free to add yourself to the
CREDITSfile and the corresponding list in the the
README. Alphabetical order applies.
- Do note that in order for us to merge any non-trivial changes (as a rule of thumb, additions larger than about 15 lines of code), we need an explicit public domain dedication on record from you, which you will be asked to agree to on the first commit to a repo within the organization.