ISO EBNF Parser example

This example implements an ISO/IEC 14977 parser which parses compatible grammars into S-Expressions. This allows the resulting S-Expressions to drive a PEG/Packrat Parser to parser documents defined using ISO/IEC 14977.

Parsing the Grammar

require 'ebnf'

ebnf = ISOEBNFPegParser.new(File.open("examples/ebnf.isoebnf"))

Output rules and terminals as S-Expressions:

puts ebnf.to_sxp

This generates a S-Expression form of the grammar suitable for use by EBNF.

(
 (rule syntax (star syntax_rule))
 (rule syntax_rule
  (seq meta_identifier defining_symbol definitions_list terminator_symbol))
 (rule definitions_list
  (seq single_definition (star (seq definition_separator_symbol definitions_list))))
 (rule single_definition (seq term (star (seq "," term))))
 (rule term (seq factor (opt (seq "-" exception))))
 (rule exception (seq factor))
 (rule factor (seq (opt (seq integer "*")) primary))
 (rule primary
  (alt optional_sequence repeated_sequence special_sequence grouped_sequence
   meta_identifier terminal_string empty ))
 (rule optional_sequence
  (seq start_option_symbol definitions_list end_option_symbol))
 (rule repeated_sequence
  (seq start_repeat_symbol definitions_list end_repeat_symbol))
 (rule grouped_sequence (seq "(" definitions_list ")"))
 (rule letter
  (alt "A" "B" "C" "D" "E" "F" "G" "H" "I" "J" "K" "L" "M" "N" "O" "P" "Q" "R"
   "S" "T" "U" "V" "W" "X" "Y" "Z" "a" "b" "c" "d" "e" "f" "g" "h" "i" "j" "k"
   "l" "m" "n" "o" "p" "q" "r" "s" "t" "u" "v" "w" "x" "y" "z" ))
 (rule decimal_digit (alt "0" "1" "2" "3" "4" "5" "6" "7" "8" "9"))
 (rule integer (seq decimal_digit (star decimal_digit)))
 (rule meta_identifier (seq letter (star meta_identifier_character)))
 (rule meta_identifier_character (alt letter decimal_digit "_"))
 (rule terminal_string
  (alt
   (seq (seq "'" first_terminal_character (star first_terminal_character) "'"))
   (seq (seq "\"" second_terminal_character (star second_terminal_character) "\""))) )
 (rule first_terminal_character (seq terminal_character))
 (rule second_terminal_character (seq terminal_character))
 (rule special_sequence (seq "?" (star special_sequence_character) "?"))
 (rule special_sequence_character (seq terminal_character))
 (rule terminal_character
  (alt letter decimal_digit concatenate_symbol defining_symbol
   definition_separator_symbol end_comment_symbol end_group_symbol
   end_option_symbol end_repeat_symbol except_symbol first_quote_symbol
   repetition_symbol second_quote_symbol special_sequence_symbol
   start_comment_symbol start_group_symbol start_option_symbol
   start_repeat_symbol terminator_symbol other_character ))
 (rule other_character
  (alt " " ":" "+" "_" "%" "@" "&" "#" "$" "<" ">" "\\" "^" "`" "~"))
 (rule empty (seq ""))
 (rule defining_symbol (alt "=" ":"))
 (rule definition_separator_symbol (alt "|" "/" "!"))
 (rule terminator_symbol (alt ";" "."))
 (rule start_option_symbol (alt "[" "(/"))
 (rule end_option_symbol (alt "]" "/)"))
 (rule start_repeat_symbol (alt "{" "(:"))
 (rule end_repeat_symbol (alt "}" ":)")))

This can then be used as input to EBNF.parse to transform EBNF to PEG for parsing examples of the grammar using EBNF::PEG::Parser.

ebnf --input-format sxp --peg ebnf.sxp -o ebnf.peg.sxp

Note, however, that ISO EBNF doesn't distinguish between terminal rules and non-terminal rules, so all rules are parsed as non-terminal rules with strings the only terminals. Whereas, the W3C EBNF definition of the grammar does use terminal rules.

When parsing files with this grammar, rules that are all capitalized will be treated as terminal productions, although this is an proprietary interpretation of the specification.

Example Walkthrough

This example uses the EBNF grammar from iso-ebnf to generate meta, which includes the resulting RULES table, used by parser to implement a parser for the grammar.

The first step is defining regular expressions for terminals used within the grammar. Note that the parser can operate without terminal definitions, but this can greatly improve parser performance.

The parser is implemented using the ISOEBNFPegParser class, which includes EBNF::PEG::Parser.

Parser basics

The parser operates directly using the rules from the abstract syntax tree generated by turning the original ISO EBNF grammar using EBNF::PEG#make_peg. Tokens are derived from terminal rules defined in the grammar or contained inline through non-terminal rule definitions. Tokens are either strings, which must be matched exactly, or symbols, which identify a regular expression used to match the terminal and yield a token. The association between terminal symbols and their regular expressions along with processing rules to invoke when they are identified are described in Terminal definitions.

The parser starts with the specified rule, syntax in this case, and executes that rule, which is expected to completely parse the input file potentially leaving some whitespace.

Non-terminal rules have an expression using one of the following:

seq : A sequence of rules or terminals. If any (other than opt or star) to not parse, the rule is terminated as unmatched. opt : An optional rule or terminal. It either results in the matching rule or returns nil. alt : A list of alternative rules, which are attempted in order. It terminates with the first matching rule, or is terminated as unmatched, if no such rule is found. plus : A sequence of one or more of the matching rule. If there is no such rule, it is terminated as unmatched; otherwise, the result is an array containing all matched input. rept m n : A sequence of at lest m and at most n of the matching rule. It will always return an array. star : A sequence of zero or more of the matching rule. It will always return an array.

The starting rule will typically be of the form (star sub_rule) which will attempt to parse that sub rule until the end of input.

If a rule matches, it enters a production, which may invoke a start production before matching is attempted, and will call any _production either if matched, or unmatched. That production may choose to evaluate the returned abstract syntax tree to simplify the result, or create some semantic representation of that value.

Due to the nature of PEG parsers, the same rule may be attempted at the same input location many times; this is optimized by use of a Packrat memoizing cache, which remembers the result of a previous successful evaluation and short-circuits further execution.

Processing continues by continuing to look for productions sequence and pushing those productions onto the stack. When a production is complete, any associated production handler is invoked, after popping off the top of the prod_data stack. The just removed hash is passed as current to the production handler. This is typically where the work of the parser happens. See Production definitions for more information.

Terminal definitions

The parser uses a DSL to specify terminals and productions associated with rules in the grammar. Each terminal specifies the rule name, associated regular expression, and a block which is invoked when the parser recognizes the terminal:

terminal(:integer, /\d+/) do |value, prod|
  value.to_i
end

In this terminal definition, the integer terminal is recognized using the /\d+/. When found, the value of the integer is returned for use by productions which include it.

Production definitions

Looking at the grammar itself, we can see that the first declaration is

[1]  syntax            ::= syntax_rule*