Meta-Error
Merr (Meta-ERRor) is a syntax error message generator for LR parsers. It passes samples of erroneous input to the parser and generates an error message function that returns the message written in the sample file when the same syntax error occurs again. It works with a modified version of the menhir parser generator. Adapting other parser generators such as ocamlyacc is probably not difficult. The modified menhir can be found at https://github.com/pippijn/menhir.
This tool is based on ideas from Clinton Jeffery's "merr" tool at http://unicon.sourceforge.net/merr/. It is recommended to read the technical paper introducing the concepts.
Usage
The merr program takes several inputs from which it produces its error message function.
-t <terminals> -e <errors.ml.in> -a <automaton> -p <parser command> -o <output.ml>
Terminal names
The terminals file contains an ocamlyacc grammar or tokens file amended with token names. An extract of this file could look like this:
%token<string> TkIDENTIFIER "identifier"
%token TkIF "if"
%token TkTHEN "then"
%token TkELSE "else"
Merr uses these strings when producing default error messages, so that the user doesn't see internal names like "TkIDENTIFIER". Since menhir doesn't understand these extra string literals after the token name, the grammar file needs to be preprocessed before passing it to the parser generator. A simple
sed -e 's/^\(%token[^"]*\w\+\)\s*".*"$/\1/'
will remove the string literals together with the whitespace preceding them. This can be done in a make target like %.mly: %.mly.in.
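For projects that prefer to do this preprocessing in OCaml rather than with sed, the following is a minimal sketch of the same idea. The helper name strip_description is hypothetical and not part of merr, and it only roughly mirrors the sed expression: it drops everything from the first double quote onwards on %token lines that end in a quote.

```ocaml
(* Sketch: drop the trailing string literal from annotated %token lines.
   [strip_description] is a hypothetical helper, not part of merr. *)
let strip_description line =
  let is_token_decl =
    String.length line >= 6 && String.sub line 0 6 = "%token"
  in
  match String.index_opt line '"' with
  | Some i when is_token_decl && line.[String.length line - 1] = '"' ->
      (* Keep everything before the first quote, minus trailing blanks. *)
      String.trim (String.sub line 0 i)
  | _ -> line

let () =
  (* Lines that are not annotated token declarations pass through unchanged. *)
  List.iter
    (fun l -> print_endline (strip_description l))
    [ "%token<string> TkIDENTIFIER \"identifier\"";
      "%token TkIF \"if\"";
      "%start parse" ]
```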
Sample file
The second input expected by merr is the sample file. Merr works by passing each sample input to the parser, which should be instrumented to print a pair of (state, token) to its standard output. The state should be a number, and token is the token type constructor name, e.g. TkIF.
The parser should have a flag to disable error messages and print this state/token pair. The parser command and this flag should be passed to merr with the -p command line option.
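The instrumentation described above could look roughly like the sketch below. The token type, the StateError exception declared here, the merr_mode flag and the exact textual format of the pair are all assumptions made for illustration; check merr's own parser driver for the format it actually expects.

```ocaml
(* Hypothetical token type and exception, standing in for the real parser's. *)
type token = TkIF | TkTHEN | TkELSE | TkIDENTIFIER of string | EOF

exception StateError of int * token

(* Constructor name only, without any payload. *)
let token_name = function
  | TkIF -> "TkIF"
  | TkTHEN -> "TkTHEN"
  | TkELSE -> "TkELSE"
  | TkIDENTIFIER _ -> "TkIDENTIFIER"
  | EOF -> "EOF"

(* Render the pair; the "(state, token)" layout is an assumption. *)
let state_token_pair state token =
  Printf.sprintf "(%d, %s)" state (token_name token)

(* With [merr_mode] set, print the pair instead of a normal error message.
   Without it, the exception propagates to the regular error handling. *)
let run ~merr_mode (parse : unit -> unit) =
  try parse () with
  | StateError (state, token) when merr_mode ->
      print_endline (state_token_pair state token)
```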
The sample file has a syntax similar to that of OCaml itself. The following is an excerpt from merr's own error description.
module Tokens = Etokens
open Etoken (* provides string_of_token *)
open Tokens (* provides the 'token' type *)

let message = function
  | "open" -> function
    | EOF -> "expected module name after 'open'"
    | TkIDENTIFIER -> "unexpected identifier `%s' after 'open'"
                      "expected module name (capitalised)"
    | _ -> "unexpected token '%s' in handler definition"
  | "open Foo let message = function" -> function
    | EOF -> "expected '|'-separated code fragments"
    | _ -> "unexpected token '%s' where code fragments expected"
One or more open directives are always required, and the opened modules should provide the function val string_of_token : token -> string and bring the token type constructors into scope. Merr can generate a default function that returns the strings in the terminals file, but usually you will want a better description, using the token data. There can be any number of open and module directives.
In error messages, the %s part of the message is replaced by the result of applying string_of_token to the erroneous token. So, for instance, if that function returns the string argument of the TkIDENTIFIER constructor, the identifier can be printed in the error message.
Error messages can consist of multiple consecutive format strings. In the error message returned from the generated error function, these are joined with the newline character '\n'.
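The substitution and joining rules can be illustrated with a small sketch. This is not the code merr generates; the token type, the string_of_token implementation and the helpers substitute and render are assumptions chosen to show the described behaviour.

```ocaml
(* Hypothetical token type; a real parser defines its own. *)
type token = TkIDENTIFIER of string | TkIF | EOF

(* A string_of_token that uses the token data, as suggested above. *)
let string_of_token = function
  | TkIDENTIFIER name -> name
  | TkIF -> "if"
  | EOF -> "end of file"

(* Replace the first "%s" in [fmt] by [arg]; return [fmt] unchanged
   if it contains no "%s". *)
let substitute fmt arg =
  let n = String.length fmt in
  let rec find i =
    if i + 1 >= n then None
    else if fmt.[i] = '%' && fmt.[i + 1] = 's' then Some i
    else find (i + 1)
  in
  match find 0 with
  | None -> fmt
  | Some i -> String.sub fmt 0 i ^ arg ^ String.sub fmt (i + 2) (n - i - 2)

(* Consecutive format strings become one message, joined with '\n'. *)
let render formats token =
  let arg = string_of_token token in
  String.concat "\n" (List.map (fun fmt -> substitute fmt arg) formats)

let () =
  print_endline
    (render
       [ "unexpected identifier `%s' after 'open'";
         "expected module name (capitalised)" ]
       (TkIDENTIFIER "foo"))
```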
The top-level patterns contain erroneous sample input to be passed to the parser. The inner match dispatches over the current parser token, and a default catch-all case can be assigned that only regards the state, not the token. Instead of a nested match, one can also write the message directly after the code sample. Multi-matches are also possible, if you want the same error handling for multiple distinct code samples (and therefore states).
Merr will check that all code samples are unique in that they arrive at different states in the parser, so that there can be no ambiguities about which error message to display for a given error.
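One way to picture this uniqueness check is to group the samples by the state they reach and report every state reached more than once. The sketch below is an assumption about how such a check could be written, not merr's implementation; the function name conflicts and the state numbers are made up.

```ocaml
(* Given (sample, state) pairs, return the states reached by more than one
   sample, each with its samples in input order. Hypothetical helper. *)
let conflicts samples =
  let table = Hashtbl.create 16 in
  List.iter
    (fun (sample, state) ->
      let previous =
        Option.value (Hashtbl.find_opt table state) ~default:[]
      in
      Hashtbl.replace table state (sample :: previous))
    samples;
  Hashtbl.fold
    (fun state found acc ->
      if List.length found > 1 then (state, List.rev found) :: acc else acc)
    table []
  |> List.sort compare

let () =
  (* The state numbers below are invented for the example. *)
  conflicts [ ("sample A", 3); ("sample B", 9); ("sample C", 3) ]
  |> List.iter (fun (state, found) ->
         Printf.printf "state %d is reached by: %s\n" state
           (String.concat ", " found))
```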
Automaton file
As a secondary source of error messages, the merr program will try to find out what tokens would allow a shift action in the parser. It will present the user with a list of token names (from the terminals file) that the parser might expect at the state where the error occurred. Menhir can produce an automaton description parseable by merr, using the option -dump.
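As an illustration of such a fallback message, the sketch below turns a list of token descriptions into a single line. The wording of the message and the function name describe_expected are assumptions; the text merr actually produces may differ.

```ocaml
(* Build a fallback message from the descriptions of the tokens that
   could be shifted in the failing state. Hypothetical wording. *)
let describe_expected = function
  | [] -> "syntax error"
  | [ only ] -> "expected " ^ only
  | names ->
      (* Separate the last name so it can be introduced with "or". *)
      let reversed = List.rev names in
      let last = List.hd reversed in
      let init = List.rev (List.tl reversed) in
      Printf.sprintf "expected %s or %s" (String.concat ", " init) last

let () =
  print_endline (describe_expected [ "identifier"; "if"; "then" ])
```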
Using the function
After the parser has been run on the inputs and the error message function has been generated, it can be called when an error occurs in the parser. The modified menhir parser will raise an exception StateError of int * token where token is the token type used by the parser and the first argument is the state. These two can be passed to the error function with the type signature val message : int -> token -> string. Further formatting can be done in the client of the error module.
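A client might wire these pieces together as in the following sketch. The token type and the body of message are stand-ins: in a real project the exception comes from the generated parser and message from the module merr produces. The wrapper parse_with_errors is a hypothetical name.

```ocaml
(* Stand-ins for definitions that the parser and merr would provide. *)
type token = TkIF | TkIDENTIFIER of string | EOF

exception StateError of int * token

(* Placeholder with the documented signature int -> token -> string. *)
let message (state : int) (token : token) : string =
  match state, token with
  | 1, EOF -> "expected module name after 'open'"
  | _ -> Printf.sprintf "syntax error in state %d" state

(* Run [parse] and turn a StateError into a readable message. *)
let parse_with_errors parse input =
  try Ok (parse input) with
  | StateError (state, token) -> Error (message state token)

let () =
  (* A fake parser that always fails in state 1 at end of input. *)
  match parse_with_errors (fun () -> raise (StateError (1, EOF))) () with
  | Ok () -> print_endline "parsed"
  | Error msg -> prerr_endline ("error: " ^ msg)
```

Returning a result keeps further formatting, such as adding source positions, in the client code.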
Ocamlbuild integration
Merr can be easily integrated with ocamlbuild. See its own myocamlbuild.ml and _tags files to understand how to do that.