ISO EBNF Parser example
This example implements an ISO/IEC 14977 parser which parses compatible grammars into S-Expressions. This allows the resulting S-Expressions to drive a PEG/Packrat Parser to parser documents defined using ISO/IEC 14977.
Parsing the Grammar
require 'ebnf'
ebnf = ISOEBNFPegParser.new(File.open("examples/ebnf.isoebnf"))
Output rules and terminals as S-Expressions:
puts ebnf.to_sxp
This generates a S-Expression form of the grammar suitable for use by EBNF.
(
(rule syntax (star syntax_rule))
(rule syntax_rule
(seq meta_identifier defining_symbol definitions_list terminator_symbol))
(rule definitions_list
(seq single_definition (star (seq definition_separator_symbol definitions_list))))
(rule single_definition (seq term (star (seq "," term))))
(rule term (seq factor (opt (seq "-" exception))))
(rule exception (seq factor))
(rule factor (seq (opt (seq integer "*")) primary))
(rule primary
(alt optional_sequence repeated_sequence special_sequence grouped_sequence
meta_identifier terminal_string empty ))
(rule optional_sequence
(seq start_option_symbol definitions_list end_option_symbol))
(rule repeated_sequence
(seq start_repeat_symbol definitions_list end_repeat_symbol))
(rule grouped_sequence (seq "(" definitions_list ")"))
(rule letter
(alt "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R"
"S" "T" "U" "V" "W" "X" "Y" "Z" "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k"
"l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z" ))
(rule decimal_digit (alt "0" "1" "2" "3" "4" "5" "6" "7" "8" "9"))
(rule integer (seq decimal_digit (star decimal_digit)))
(rule meta_identifier (seq letter (star meta_identifier_character)))
(rule meta_identifier_character (alt letter decimal_digit "_"))
(rule terminal_string
(alt
(seq (seq "'" first_terminal_character (star first_terminal_character) "'"))
(seq (seq "\"" second_terminal_character (star second_terminal_character) "\""))) )
(rule first_terminal_character (seq terminal_character))
(rule second_terminal_character (seq terminal_character))
(rule special_sequence (seq "?" (star special_sequence_character) "?"))
(rule special_sequence_character (seq terminal_character))
(rule terminal_character
(alt letter decimal_digit concatenate_symbol defining_symbol
definition_separator_symbol end_comment_symbol end_group_symbol
end_option_symbol end_repeat_symbol except_symbol first_quote_symbol
repetition_symbol second_quote_symbol special_sequence_symbol
start_comment_symbol start_group_symbol start_option_symbol
start_repeat_symbol terminator_symbol other_character ))
(rule other_character
(alt " " ":" "+" "_" "%" "@" "&" "#" "$" "<" ">" "\\" "^" "`" "~"))
(rule empty (seq ""))
(rule defining_symbol (alt "=" ":"))
(rule definition_separator_symbol (alt "|" "/" "!"))
(rule terminator_symbol (alt ";" "."))
(rule start_option_symbol (alt "[" "(/"))
(rule end_option_symbol (alt "]" "/)"))
(rule start_repeat_symbol (alt "{" "(:"))
(rule end_repeat_symbol (alt "}" ":)")))
This can then be used as input to EBNF.parse to transform EBNF to PEG for parsing examples of the grammar using EBNF::PEG::Parser.
ebnf --input-format sxp --peg ebnf.sxp -o ebnf.peg.sxp
Note, however, that ISO EBNF doesn't distinguish between terminal rules and non-terminal rules, so all rules are parsed as non-terminal rules with strings the only terminals. Whereas, the W3C EBNF definition of the grammar does use terminal rules.
When parsing files with this grammar, rules that are all capitalized will be treated as terminal productions, although this is an proprietary interpretation of the specification.
Example Walkthrough
This example uses the EBNF grammar from iso-ebnf to generate meta, which includes the resulting RULES
table, used by parser to implement a parser for the grammar.
The first step is defining regular expressions for terminals used within the grammar. Note that the parser can operate without terminal definitions, but this can greatly improve parser performance.
The parser is implemented using the ISOEBNFPegParser class, which includes EBNF::PEG::Parser.
Parser basics
The parser operates directly using the rules from the abstract syntax tree generated by turning the original ISO EBNF grammar using EBNF::PEG#make_peg. Tokens are derived from terminal rules defined in the grammar or contained inline through non-terminal rule definitions. Tokens are either strings, which must be matched exactly, or symbols, which identify a regular expression used to match the terminal and yield a token. The association between terminal symbols and their regular expressions along with processing rules to invoke when they are identified are described in Terminal definitions.
The parser starts with the specified rule, syntax
in this case, and executes that rule, which is expected to completely parse the input file potentially leaving some whitespace.
Non-terminal rules have an expression using one of the following:
seq
: A sequence of rules or terminals. If any (other than opt
or star
) to not parse, the rule is terminated as unmatched.
opt
: An optional rule or terminal. It either results in the matching rule or returns nil
.
alt
: A list of alternative rules, which are attempted in order. It terminates with the first matching rule, or is terminated as unmatched, if no such rule is found.
plus
: A sequence of one or more of the matching rule. If there is no such rule, it is terminated as unmatched; otherwise, the result is an array containing all matched input.
rept m n
: A sequence of at lest m
and at most n
of the matching rule. It will always return an array.
star
: A sequence of zero or more of the matching rule. It will always return an array.
The starting rule will typically be of the form (star sub_rule)
which will attempt to parse that sub rule until the end of input.
If a rule matches, it enters a production, which may invoke a start production before matching is attempted, and will call any _production either if matched, or unmatched. That production may choose to evaluate the returned abstract syntax tree to simplify the result, or create some semantic representation of that value.
Due to the nature of PEG parsers, the same rule may be attempted at the same input location many times; this is optimized by use of a Packrat memoizing cache, which remembers the result of a previous successful evaluation and short-circuits further execution.
Processing continues by continuing to look for productions sequence and pushing those productions onto the stack. When a production is complete, any associated production handler is invoked, after popping off the top of the prod_data
stack. The just removed hash is passed as current
to the production handler. This is typically where the work of the parser happens. See Production definitions for more information.
Terminal definitions
The parser uses a DSL to specify terminals
and productions
associated with rules in the grammar. Each terminal
specifies the rule name, associated regular expression, and a block which is invoked when the parser recognizes the terminal:
terminal(:integer, /\d+/) do |value, prod|
value.to_i
end
In this terminal definition, the integer
terminal is recognized using the /\d+/
. When found, the value of the integer is returned for use by productions which include it.
Production definitions
Looking at the grammar itself, we can see that the first declaration is
[1] syntax ::= syntax_rule*