YACC Preprocessor
The
yaccpp
project is a preprocessor for YACC/Bison grammars.
Yaccpp implements a few features on top of traditional YACC. As you will see after reading this article, Yaccpp is more than just a preprocessor. Yaccpp is a purely functional dynamically typed domain specific programming language for describing context-free grammars. That is quite a mouthful, but this power does not get in the way. Yaccpp is still about writing grammars, not programs.
// Comments
Yacc allows
/* */
comments. Yaccpp also allows C99-style
//
comments.
Generate foreign language actions
If a foreign language such as
Aldor or Perl includes a C FFI,
Yaccpp can generate calls into this language, so that actions can be written in
a non-C language while keeping the parser implementation in C. This way, new
languages can get a well-tested parser generator without having to write their
own Bison template. A Perl module or Aldor domain will be generated to
encapsulate the action code. This feature can be controlled using the
%action-lang
directive.
%action-lang "Aldor" MyLanguageParserDom // or: %action-lang "Perl" Parse::MyLanguage
Merge multiple input files
By listing several files on the command line, you can merge multiple grammar
fragments into a single grammar. This aids in modularising grammars. You may
optionally define an entry point into the fragment with
%start
. All
nonterminals in the fragment except the start symbol will be prefixed with the
start symbol name, thus creating a kind of scope.
Automatic parse tree generation
Yaccpp can generate parse tree classes for C++ or structs for C using the
%parse-tree
directive. These represent the original grammar structure rather
than the expanded grammar. Note that unlike abstract syntax trees (ASTs), the
parse tree contains everything, including all tokens such as
'+'
,
';'
and
'{'
tokens. The class members will carry the names of the rhs parts. If more
than one rhs part has the same name, an 1-based index is appended.
%parse-tree "C++" %% add_exp : mul_exp | add_exp '+' mul_exp %class add_exp | add_exp '-' mul_exp %class add_exp ;
The above grammar productions will yield two classes. By using the
%class
directive, the common productions are written directly in the base class
add_exp_node
. This class will have three members
add_exp
,
tok
,
mul_exp
.
If
%class
is omitted, each production will get its own class,
unless a pattern can be detected. If
%class
had been omitted in the above
example, the three classes would be:
add_exp_node <<abstract>> add_exp_node1 : add_exp_node { node *mul_exp; }; add_exp_node2 : add_exp_node { node *add_exp, *tok, *mul_exp; };
Expand macro grammars
Yaccpp implements a variant of macro grammars as described in Peter Thiemann's Macros for Context-Free Grammars paper.
// E (sep E)* list(E, sep, list_type) : E // You may refer to arguments of macros in action code using the ` prefix. { ($$ = new `list_type)->add ($1); } | list(E, sep, list_type) sep E { ($$ = $1)->add ($3); } ; argument_list: '(' list (argument, ';', node_list) ')';
In addition to the expansion described in the paper, Yaccpp supports higher order programming by passing unexpanded macro names to other macros. These macros can then be expanded within the other macro.
Currying and variadic macros
Macros can also be partially expanded by passing a strict subset of the required
arguments. This is a form of currying. Arguments can be passed explicitly by name.
Macros can be overloaded by argument count. Using the
...
operator at the end
of a parameter list, a macro may accept any number of arguments. Overloading
resolution will select
variadic
macros only if there is no overload with the exact number of parameters.
opt(E) : /* empty */ | E ; // partially apply the macro apply(macro, arg): macro (arg) ; // recursively expand the apply macro with less and less arguments // until there is no "rest", anymore, and the first "apply" macro is called apply(macro, arg, rest...): apply (apply (macro, arg), rest...) ; expr.opt : opt(E: expr) // named arguments ;
Variadic arguments can not be directly expanded in the grammar. If you need a space separated list of all arguments passed to a macro, you will need to use something like this:
expand(arg, rest...): arg expand(rest...);
Specialisation (@spec)
Sometimes you may want to specialise a generic macro by production. Macros can be specialised by overloading them and defining one or more of its arguments with a production name.
// Generic list macro for the + operator list.1(E) : E | list.1(E) E ; // specialise the macro for list of "statement" list.1(E: statement) : E ';' | list.1(E) E ';' ;
Pattern matching
Macros can not only be specialised with production names, but also with ML-style patterns.
// specialise the '+' macro for list of optional statements list.1(E: opt(E: statement)) : ';' | list.1(E) E ';' ; // partially specialise to extract the argument // this macro reverses the effect of opt(E) notopt(E: opt(E)) : E ; // anything else is used as-is notopt(E): E ; foo : notopt(opt(statement)) // => statement | notopt(expression) // => expression ;
Analysis
Yaccpp performs an extensive analysis of the macro grammar before attempting to expand it, to ensure that the grammar terminates. It is also an error to have unexpanded or partially expanded macros when the grammar terminates. Anonymous macros defined inline in an argument list are not supported.
Refer to other symbols on the rhs
As a form of abbreviation, you may use variable references on the rhs of a rule.
some_macro(with,many,arguments,it,would,be,unfeasible,to,write,it,again) : $$ with it | many $$ arguments | unfeasible to write $$ | again $$ to write ; initialiser : '[' assignment.expression "..." $2 ']' basic.initialiser ; asm.statement : "asm" type.qualifier '(' string.literal.list ':' asm.argument.list?[L] // define a named component ':' $L // refer to it here ':' asm.clobbered.list ')' ';'
Regular grammars on the rhs
Yaccpp supports regular grammars known from regular expressions and EBNF on the right hand side of rules.
identifier // Note that you need to write out the "..." yourself. Character ranges // are not supported. : ('a' | 'b' | ... | 'z' | 'A' | ... | 'Z') ('a' | 'b' | ... | 'z' | 'A' | ... | 'Z' | '0' | ... '9')* ; // Be careful: macro(A): A; nonterm: macro (foo | bar)*; // does not call macro nonterm: macro ((foo | bar))*; // calls macro and repeats the expansion {0,n} nonterm: macro ((foo | bar)*); // repeats the alternatives {0,n} and calls macro on that // The second definition would expand to the the following: nonterm: macro (anon123)*; nonterm: list.0 (macro (anon123)); nonterm: list.0 (macro__anon123); nonterm: list.0__macro__anon123;
As a generic tool, Yaccpp cannot know how you want to handle lists and optional
elements, it requires the grammar to contain the following definitions. Defaults
are provided, if a definition is not given. The list macros do not, however, know how to
handle their values and therefore do not have a type, unless the automatic tree
generation feature is used, in which case a list implementation is provided and
used (
std::list
for C++, a custom list implementation for C).
- *
The
*
postfix operator calls thelist.0(E)
macro on the preceding atom. - +
The
*
postfix operator calls thelist.1(E)
macro on the preceding atom. - ?
The
?
postfix operator calls theopt(E)
macro on the preceding atom. - |
The
|
infix operator does not require any special support, because it is actually just an anonymous rule. It is copied to a separate rule with a unique name and that rule name is written in its place. You can write semantic actions in anonymous rules just as you would in global rules. If all alternatives have the same type, the type of the anonymous rule is inferred to be that type. Otherwise, you need to specify the type, explicitly.to be filled
Named references
Added in Bison 2.5, this feature is very useful for long rules. Especially when these rules change a lot, it is easy to make a mistake in counting the indices.
funcall: id '(' args ')' { $funcall = new funcall ($id, $args); } addition: exp[left] '+' exp[right-side] { $$ = $left + $[right-side]; } // use [] because the name has a - in it
See the Bison manual for more examples and details.
Import token names and numbers from an enum
The
%include-enum
directive can be used to parse a C enumeration in order to
extract token names and numbers. This is useful for unlinking the lexical
analysis from the parsing. The usual practice is to generate an
enum
or
#define
list from the grammar and use that in the scanner. Using
%include-enum
, both use the same external token information source without
depending on each other.
// we import symbols from "enum token_type" in "tokens.h": %include-enum "tokens.h" token_type
Specify types on lhs of rules
Defining the types at the top of a grammar file splits information about rules in two parts: the type information and the actual rule. In Yaccpp, you can declare the type on the nonterminal you are defining.
add_exp<add_exp_node>: exp + exp ; // becomes %type<add_exp_node> add_exp %% add_exp: exp + exp ;
If you split the definition of a nonterminal into several separate rules, you only need to define the type on one of them. Defining a different type on the same nonterminal is an error.
Default token and rule types
Using the
%default-token-type
, you can specify the type to be used on tokens
read from the included enum. They may be overridden using the
%token
directive.
The
%default-rule-type
defines the default type of lhs symbols.
%default-rule-type <ast_node> %% decls: decl | decls decl ; // becomes: %type<ast_node> decls decl %% decls: decl | decls decl ;
You may override the type analogously to the token types.
"typeof"
You may refer to the type of any typed nonterminal or terminal using the new
#
symbol.
%type<funcall_node> funcall %% funcall: id '(' args ')' { $$ = new #$ ($1, $3); } // is translated to: funcall: id '(' args ')' { $$ = new funcall_node ($1, $3); }
You may also refer to the types of rhs symbols, which may be useful in template instantiations.
%type<add_exp_node> add_exp %type<mul_exp_node> mul_exp %type<div_exp_node> div_exp %% add_exp: mul_exp + div_exp { $$ = new $#<#1, #3> ($1, $3); } // template instantiation // is translated to: add_exp: mul_exp + div_exp { $$ = new add_exp_node<add_exp_node, mul_exp_node> ($1, $3); }
Referring to the type of untyped (non)terminals is an error.
Extensible through Perl
If all the added functionality is not enough for you or you need something done slightly differently, you can use the Perl API to influence the translation and extract additional information from the grammar.