YACC Preprocessor

The yaccpp project is a preprocessor for YACC/Bison grammars.

Git repository

Yaccpp implements a few features on top of traditional YACC. As you will see after reading this article, Yaccpp is more than just a preprocessor. Yaccpp is a purely functional dynamically typed domain specific programming language for describing context-free grammars. That is quite a mouthful, but this power does not get in the way. Yaccpp is still about writing grammars, not programs.

// Comments

Yacc allows /* */ comments. Yaccpp also allows C99-style // comments.

Generate foreign language actions

If a foreign language such as Aldor or Perl includes a C FFI, Yaccpp can generate calls into this language, so that actions can be written in a non-C language while keeping the parser implementation in C. This way, new languages can get a well-tested parser generator without having to write their own Bison template. A Perl module or Aldor domain will be generated to encapsulate the action code. This feature can be controlled using the %action-lang directive.

%action-lang "Aldor" MyLanguageParserDom
// or:
%action-lang "Perl" Parse::MyLanguage

Merge multiple input files

By listing several files on the command line, you can merge multiple grammar fragments into a single grammar. This aids in modularising grammars. You may optionally define an entry point into the fragment with %start. All nonterminals in the fragment except the start symbol will be prefixed with the start symbol name, thus creating a kind of scope.

Automatic parse tree generation

Yaccpp can generate parse tree classes for C++ or structs for C using the %parse-tree directive. These represent the original grammar structure rather than the expanded grammar. Note that unlike abstract syntax trees (ASTs), the parse tree contains everything, including all tokens such as '+', ';' and '{' tokens. The class members will carry the names of the rhs parts. If more than one rhs part has the same name, an 1-based index is appended.

%parse-tree "C++"
%%
add_exp
   : mul_exp
   | add_exp '+' mul_exp %class add_exp
   | add_exp '-' mul_exp %class add_exp
   ;

The above grammar productions will yield two classes. By using the %class directive, the common productions are written directly in the base class add_exp_node. This class will have three members add_exp, tok, mul_exp.

If %class is omitted, each production will get its own class, unless a pattern can be detected. If %class had been omitted in the above example, the three classes would be:

add_exp_node <<abstract>>
add_exp_node1 : add_exp_node { node *mul_exp; };
add_exp_node2 : add_exp_node { node *add_exp, *tok, *mul_exp; };

Expand macro grammars

Yaccpp implements a variant of macro grammars as described in Peter Thiemann's Macros for Context-Free Grammars paper.

// E (sep E)*
list(E, sep, list_type)
   : E
      // You may refer to arguments of macros in action code using the ` prefix.
      { ($$ = new `list_type)->add ($1); }
   | list(E, sep, list_type) sep E
      { ($$ = $1)->add ($3); }
   ;

argument_list: '(' list (argument, ';', node_list) ')';

In addition to the expansion described in the paper, Yaccpp supports higher order programming by passing unexpanded macro names to other macros. These macros can then be expanded within the other macro.

Currying and variadic macros

Macros can also be partially expanded by passing a strict subset of the required arguments. This is a form of currying. Arguments can be passed explicitly by name. Macros can be overloaded by argument count. Using the ... operator at the end of a parameter list, a macro may accept any number of arguments. Overloading resolution will select variadic macros only if there is no overload with the exact number of parameters.

opt(E)
   : /* empty */
   | E
   ;

// partially apply the macro
apply(macro, arg): macro (arg) ;
// recursively expand the apply macro with less and less arguments
// until there is no "rest", anymore, and the first "apply" macro is called
apply(macro, arg, rest...): apply (apply (macro, arg), rest...) ;

expr.opt
   : opt(E: expr) // named arguments
   ;

Variadic arguments can not be directly expanded in the grammar. If you need a space separated list of all arguments passed to a macro, you will need to use something like this:

expand(arg, rest...): arg expand(rest...);

Specialisation (@spec)

Sometimes you may want to specialise a generic macro by production. Macros can be specialised by overloading them and defining one or more of its arguments with a production name.

// Generic list macro for the + operator
list.1(E)
   : E
   | list.1(E) E
   ;

// specialise the macro for list of "statement"
list.1(E: statement)
   : E ';'
   | list.1(E) E ';'
   ;

Pattern matching

Macros can not only be specialised with production names, but also with ML-style patterns.

// specialise the '+' macro for list of optional statements
list.1(E: opt(E: statement))
   : ';'
   | list.1(E) E ';'
   ;

// partially specialise to extract the argument
// this macro reverses the effect of opt(E)
notopt(E: opt(E)) : E ;
// anything else is used as-is
notopt(E): E ;

foo
   : notopt(opt(statement)) // => statement
   | notopt(expression) // => expression
   ;

Analysis

Yaccpp performs an extensive analysis of the macro grammar before attempting to expand it, to ensure that the grammar terminates. It is also an error to have unexpanded or partially expanded macros when the grammar terminates. Anonymous macros defined inline in an argument list are not supported.

Refer to other symbols on the rhs

As a form of abbreviation, you may use variable references on the rhs of a rule.

some_macro(with,many,arguments,it,would,be,unfeasible,to,write,it,again)
   : $$ with it
   | many $$ arguments
   | unfeasible to write $$
   | again $$ to write
   ;

initialiser
   : '[' assignment.expression "..." $2 ']' basic.initialiser
   ;

asm.statement
   : "asm" type.qualifier '(' string.literal.list
     ':' asm.argument.list?[L] // define a named component
     ':' $L // refer to it here
     ':' asm.clobbered.list ')' ';'

Regular grammars on the rhs

Yaccpp supports regular grammars known from regular expressions and EBNF on the right hand side of rules.

identifier
   // Note that you need to write out the "..." yourself. Character ranges
   // are not supported.
   : ('a' | 'b' | ... | 'z' | 'A' | ... | 'Z')
     ('a' | 'b' | ... | 'z' | 'A' | ... | 'Z' | '0' | ... '9')*
   ;

// Be careful:
macro(A): A;
nonterm: macro (foo | bar)*; // does not call macro
nonterm: macro ((foo | bar))*; // calls macro and repeats the expansion {0,n}
nonterm: macro ((foo | bar)*); // repeats the alternatives {0,n} and calls macro on that

// The second definition would expand to the the following:
nonterm: macro (anon123)*;
nonterm: list.0 (macro (anon123));
nonterm: list.0 (macro__anon123);
nonterm: list.0__macro__anon123;

As a generic tool, Yaccpp cannot know how you want to handle lists and optional elements, it requires the grammar to contain the following definitions. Defaults are provided, if a definition is not given. The list macros do not, however, know how to handle their values and therefore do not have a type, unless the automatic tree generation feature is used, in which case a list implementation is provided and used ( std::list for C++, a custom list implementation for C).

Named references

Added in Bison 2.5, this feature is very useful for long rules. Especially when these rules change a lot, it is easy to make a mistake in counting the indices.

funcall: id '(' args ')'
      { $funcall = new funcall ($id, $args); }

addition: exp[left] '+' exp[right-side]
      { $$ = $left + $[right-side]; } // use [] because the name has a - in it

See the Bison manual for more examples and details.

Import token names and numbers from an enum

The %include-enum directive can be used to parse a C enumeration in order to extract token names and numbers. This is useful for unlinking the lexical analysis from the parsing. The usual practice is to generate an enum or #define list from the grammar and use that in the scanner. Using %include-enum, both use the same external token information source without depending on each other.

// we import symbols from "enum token_type" in "tokens.h":
%include-enum "tokens.h" token_type

Specify types on lhs of rules

Defining the types at the top of a grammar file splits information about rules in two parts: the type information and the actual rule. In Yaccpp, you can declare the type on the nonterminal you are defining.

add_exp<add_exp_node>: exp + exp ;
// becomes
%type<add_exp_node> add_exp
%%
add_exp: exp + exp ;

If you split the definition of a nonterminal into several separate rules, you only need to define the type on one of them. Defining a different type on the same nonterminal is an error.

Default token and rule types

Using the %default-token-type, you can specify the type to be used on tokens read from the included enum. They may be overridden using the %token directive. The %default-rule-type defines the default type of lhs symbols.

%default-rule-type <ast_node>
%%
decls: decl | decls decl ;

// becomes:
%type<ast_node> decls decl
%%
decls: decl | decls decl ;

You may override the type analogously to the token types.

"typeof"

You may refer to the type of any typed nonterminal or terminal using the new # symbol.

%type<funcall_node> funcall
%%
funcall: id '(' args ')'
      { $$ = new #$ ($1, $3); }

// is translated to:
funcall: id '(' args ')'
      { $$ = new funcall_node ($1, $3); }

You may also refer to the types of rhs symbols, which may be useful in template instantiations.

%type<add_exp_node> add_exp
%type<mul_exp_node> mul_exp
%type<div_exp_node> div_exp
%%
add_exp: mul_exp + div_exp
      { $$ = new $#<#1, #3> ($1, $3); } // template instantiation

// is translated to:
add_exp: mul_exp + div_exp
      { $$ = new add_exp_node<add_exp_node, mul_exp_node> ($1, $3); }

Referring to the type of untyped (non)terminals is an error.

Extensible through Perl

If all the added functionality is not enough for you or you need something done slightly differently, you can use the Perl API to influence the translation and extract additional information from the grammar.