EBNF Parser example
This example implements an EBNF parser equivalent to the built-in parser. The proximate result is an Abstract S-Expression composed of sub-rules which can be directly executed by the parser. Effectively, this is a re-implementation of EBNF::Parser itself.
Parsing the Grammar
require 'ebnf'
ebnf = EBNFPegParser.new(File.open("../../etc/ebnf.ebnf"))
Output rules and terminals as S-Expressions, Turtle or EBNF
puts ebnf.to_sxp
This generates a S-Expression form of the grammar suitable for use by EBNF.
(
(pass _pass (seq PASS))
(rule ebnf "1" (star (alt declaration rule)))
(rule declaration "2" (alt "@terminals" pass))
(rule rule "3" (seq LHS expression))
(rule expression "4" (seq alt))
(rule alt "5" (seq seq (star (seq "|" seq))))
(rule seq "6" (plus diff))
(rule diff "7" (seq postfix (opt (seq "-" postfix))))
(rule postfix "8" (seq primary (opt POSTFIX)))
(rule primary "9"
(alt HEX SYMBOL ENUM O_ENUM RANGE O_RANGE STRING1 STRING2 (seq "(" expression ")")))
(rule pass "10" (seq "@pass" expression))
(terminal LHS "11" (seq (opt (seq "[" SYMBOL "]" (plus " "))) SYMBOL (star " ") "::="))
(terminal SYMBOL "12" (plus (alt (range "a-z") (range "A-Z") (range "0-9") "_" ".")))
(terminal HEX "13" (seq "#x" (plus (alt (range "a-f") (range "A-F") (range "0-9")))))
(terminal ENUM "14" (diff (alt (seq "[" (plus R_CHAR)) (seq (plus HEX) "]")) LHS))
(terminal O_ENUM "15" (alt (seq "[^" (plus R_CHAR)) (seq (plus HEX) "]")))
(terminal RANGE "16" (seq "[" (plus (alt (seq R_CHAR "-" R_CHAR) (seq HEX "-" HEX))) "]"))
(terminal O_RANGE "17"
(seq "[^" (plus (alt (seq R_CHAR "-" R_CHAR) (seq HEX "-" HEX))) "]"))
(terminal STRING1 "18" (seq "\"" (star (diff CHAR "\"")) "\""))
(terminal STRING2 "19" (seq "'" (star (diff CHAR "'")) "'"))
(terminal CHAR "20"
(alt
(range "#x9#xA#xD")
(range "#x20-#xD7FF")
(range "#xE000-#xFFFD")
(range "#x10000-#x10FFFF")) )
(terminal R_CHAR "21" (diff CHAR "]"))
(terminal POSTFIX "22" (range "?*+"))
(terminal PASS "23"
(plus
(alt
(range "#x00-#x20")
(seq (alt (diff "#" "#x") "//") (star (range "^#x0A#x0Dx")))
(seq "/*" (star (alt (opt (seq "*" (range "^/"))) (range "^*"))) "*/")
(seq "(*" (star (alt (opt (seq "*" (range "^)"))) (range "^*"))) "*)")) )) )
This can then be used as input to EBNF.parse to transform EBNF to PEG for parsing examples of the grammar using EBNF::PEG::Parser.
ebnf --input-format sxp --peg ebnf.sxp -o ebnf.peg.sxp
An example S-Expression for rule ebnf
, which is decomposed into sub-rules as follows:
(rule ebnf "1" (star _ebnf_1))
(rule _ebnf_1 "1.1" (alt declaration rule))
Note that sub-production _ebnf_1
is created, could be useful for some productions when creating parser logic, as described in the example walkthrough below.
Example Walkthrough
This example uses the EBNF grammar from ebnf to generate meta, which includes the resulting RULES
table, used by parser to implement a parser for the grammar.
The first step is defining regular expressions for terminals used within the grammar. For the EBNF grammar, this is done in EBNF::Terminals. Note that the parser can operate without terminal definitions, but this can greatly improve parser performance.
The parser is implemented using the EBNFPegParser class, which includes EBNF::PEG::Parser.
Parser basics
The parser operates directly using the rules from the abstract syntax tree generated by turning the original EBNF grammar using EBNF::PEG#make_peg. Tokens are derived from terminal rules defined in the grammar or contained inline through non-terminal rule definitions. Tokens are either strings, which must be matched exactly, or symbols, which identify a regular expression used to match the terminal and yield a token. The association between terminal symbols and their regular expressions along with processing rules to invoke when they are identified are described in Terminal definitions.
The parser starts with the specified rule, ebnf
in this case, and executes that rule, which is expected to completely parse the input file potentially leaving some whitespace.
Non-terminal rules have an expression using one of the following:
seq
: A sequence of rules or terminals. If any (other than opt
or star
) to not parse, the rule is terminated as unmatched.
opt
: An optional rule or terminal. It either results in the matching rule or returns nil
.
alt
: A list of alternative rules, which are attempted in order. It terminates with the first matching rule, or is terminated as unmatched, if no such rule is found.
plus
: A sequence of one or more of the matching rule. If there is no such rule, it is terminated as unmatched; otherwise, the result is an array containing all matched input.
star
: A sequence of zero or more of the matching rule. It will always return an array.
The starting rule will typically be of the form (star sub_rule)
which will attempt to parse that sub rule until the end of input.
If a rule matches, it enters a production, which may invoke a start production before matching is attempted, and will call any _production either if matched, or unmatched. That production may choose to evaluate the returned abstract syntax tree to simplify the result, or create some semantic representation of that value.
Due to the nature of PEG parsers, the same rule may be attempted at the same input location many times; this is optimized by use of a Packrat memoizing cache, which remembers the result of a previous successful evaluation and short-circuits further execution.
Processing continues by continuing to look for productions sequence and pushing those productions onto the stack. When a production is complete, any associated production handler is invoked, after popping off the top of the prod_data
stack. The just removed hash is passed as current
to the production handler. This is typically where the work of the parser happens. See Production definitions for more information.
Terminal definitions
The parser uses a DSL to specify terminals
and productions
associated with rules in the grammar. Each terminal
specifies the rule name, associated regular expression, and a block which is invoked when the parser recognizes the terminal:
terminal(:SYMBOL, SYMBOL) do |value, prod|
value.to_sym
end
In this terminal definition, the SYMBOL terminal is recognized using the SYMBOL
regular expression from EBNF::Terminals::SYMBOL. When found, the value of the symbol returned for use by productions which include it.
Production definitions
During parsing, when a non-terminal production is identified, it attempts to invoke an associated start_production
block. Typically there is nothing to do at the start of a production, so these are often left out. However, at times, it is necessary to prepare the production stack with information. For example, consider the start production for _alt_1
(a sub-production of alt
).
start_production(:_alt_1) do |data, callback|
data[:i_was_here] = true
end
This is associated with the '|' part of the alt
production.
[5] alt ::= seq ('|' seq)*
When this is invoked, we have already processed one seq
, which provided as part of the input value of the :alt
production. The result is to remove the seq
data and append it to the alt
data in value[:alt]
.
production(:_alt_1) do |value, data, callback|
data[:i_was_here] == true
value.map {|a1| a1.last[:seq]}.compact # Get rid of '|'
end
The final result of alt
, will then be the hash containing :alt and an array of data matching the seq
sub-productions.
The _alt_1
production comes from the PEG definition:
(rule _alt_1 "5.1" (star _alt_2))
(rule _alt_2 "5.2" (seq "|" seq))
On completion, the associated production for :alt
is invoked:
production(:alt) do |value|
if value.last[:_alt_1].length > 0
[:alt, value.first[:seq]] + value.last[:_alt_1]
else
value.first[:seq]
end
end
Looking at the EBNF grammar itself, we can see that the first declaration is
[1] ebnf ::= (declaration | rule)*
This is reduced to the PEG S-Expression noted above:
(rule ebnf "1" (star _ebnf_1))
(rule _ebnf_1 "1.1" (alt declaration rule))
The ebnf
production uses the alt
operator. When matching the production itself we can see that it is either a declaration
or a rule
. In this case of this parser, the result of parsing EBNF is an Abstract Syntax Tree, but in other cases it may create something else. In the case of the Turtle gem, the parser generates RDF Triples. Because the parser uses a streaming scanner, a file of any length can be passed to the parser, which emits triples as sufficient processing completes.