Intro
Giza is a parser, but not in the same vein as what you're used to. For one, it's not a parser-generator. Rather than turning a grammar into a bunch of source code, it instead converts the grammar into a ready-to-use form, and can then start parsing immediately. No compiling necessary!
One of the cool benefits of this is that you can try out and edit your grammar in real time. Take a look at the giza
program. It's actually a REPL for editing grammars and parsing text. How cool is that!?
Giza would probably be classified as a kind of GLR parser. It goes through all possible paths without the need for backtracking. It can also handle ambiguities, by producing multiple parse trees, if needed. The format of its grammars is not quite BNF; instead the syntax is more inspired by programming languages and regexes. The syntax of grammars is described entirely by the "Supergrammar", which is written in it's own syntax. Take a look at Supergrammar.txt
.
Giza maintains a distinction between "tokenized" and "non-tokenized" processing. The former is what parsers usually do, constructing a parse tree from a stream of input tokens. The latter is more akin to what regexes do, matching against a stream of input characters instead. Herein, non-tokenized processing often goes by the name "spanning" and is done by a "spanner".
Example
Here's how you would work with a grammar in the REPL:
>>> expr = ( mult-expr | add-expr | sub-expr );
>>> mult-expr = sub-expr ( [*/%] sub-expr )+;
>>> add-expr = ( mult-expr | sub-expr ) ( [-+] ( mult-expr | sub-expr )
)+;
>>> sub-expr = ( number | var | paren );
>>> <token> number = [\d]+;
>>> <token> var = [\l]+;
Then you can check the definitions for errors like this:
>>> check --tokenized
There are errors in the grammar:
Definition 'sub-expr' references a definition 'paren' which is not
defined.
Oops, we forgot to define paren
:
>>> paren = '(' expr ')';
>>> check --tokenized
There are no errors or warnings.
Once that's settled, we can parse some text:
>>> parse expr '1 + 2 * 3 - four / 5 % 6 + seven'
There is 1 valid parse of the input.
To get more info, use the --verbose
flag to print out the parse tree:
>>> parse expr --verbose '1 + 2 * 3 - four / 5 % 6 + seven'
There is 1 valid parse of the input.
expr
add-expr
sub-expr
1 number
+ $implicit char class +-
mult-expr
sub-expr
2 number
* $implicit char class %*/
sub-expr
3 number
- $implicit char class +-
mult-expr
sub-expr
four var
/ $implicit char class %*/
sub-expr
5 number
% $implicit char class %*/
sub-expr
6 number
+ $implicit char class +-
sub-expr
seven var
In the left column, you see the sequence of tokens in the input. On the right is the parse tree with indentation to show hierarchy, with each matching node on the same line of text as the token that matched it.
But that's not all. There's tons of stuff the REPL can do. It has a built-in help system to explain everything:
>>> help
Usage:
>>> [options]
>>> help [command_or_topic]
>>> command [args...]
Commands:
help Display general help, or help on a specific topic.
list List all of the definitions currently defined.
print Print definitions as text in giza grammar format
delete Delete the specified definitions.
save Save definitions to a file as text in giza grammar format
load Load definitions from a file
check Check definitions for errors
parse Parse one or more inputs with a tokenized grammar, starting with a given definition, and print how many valid parse trees are found
span Span one or more inputs with a non-tokenized grammar, starting with a given definition, and print how many valid span trees are found
render Convert definitions to state machine format and render the state machines to a C# class.
Example in Code
Once you've worked out your language's grammar, you typically want to use it in some other application. To do that, follow these steps:
- Save all relevant definitions to a file. By convention, the file extension is
.giza
, but you can use anything. See an example here. - Use the
render
command. This takes your grammar, converts all of the definitions into state machine format, and then emits the C# code for a class that creates the same state machine representation [1]. See example here. - Create a class that takes your
*Grammar
class and plugs it into aParser
object. Whenever your want to parse some input text, pass it to theParser.Parse
method to get a list of parse trees. See example here - [Optional] If there's more than one parse tree, then there's some ambiguity in your grammar with that particular input. You can do some kind of semantic analysis to choose which of the parse trees is the 'right' one. Or you can just pick the first one, if you're lazy.
- Convert the generalized parse tree(s) into whatever domain objects you need. See example here.
[1] This kinda violates the "not turning a grammar into a bunch of
source code" assertion above, but not entirely. The code so generated is not
capable of parsing anything. It's basically a data structure serialized as
C#. We could just as well store the raw text of the grammar file and run it
through SupergrammarSpanner
to generate the state machine
representation, which is what the render
command does for you.
Whatever. We're working on making it prettier.
State of the project
Unfortunately, you caught us right in the middle of a major architecture overhaul. Much of the internals of how the system work are being completely re-worked. In particular, we're trying to treat tokenized and non-tokenized processing as special cases of a generalized pattern-matching system. Hence, the hideous Spanner2
class. It's a work in progress. Code and API may change at any time, although the command-line tool will be pretty stable. There's a lot of cosmetic stuff to be fixed up as well, such as the fact that Supergrammar.txt
should really be named Supergrammar.giza
. Stay tuned!