ALeRT is one of my projects. It’s designed to show the properties of a grammar graphically. It was started out as a command-line script, I then added a website to allow anyone to run it. The website was still using the original script and now I want to refactor it to make the program more flexible.

This series of posts will cover insights from that refactoring:

  • Part 1: Less Complex Parsers, More Maintainable Code
  • Part 2: Checking refactoring with graph isomorphism (not yet posted)

Simplifying the parser

I used the PyParsing module to parse the user-provided grammar. That requires creating a meta-grammar which describes the structure of the users grammar. The script used a much more complex grammar than was required to parse the input, so my refactoring started with simplifying the parser.

The supported format is:

<non-terminal> ::= (<non-terminal> | <terminal>)+
                (| (<non-terminal> | <terminal>)+)*

Essentially a non-terminal derives a string of terminals/non-terminals, and optionally followed by more productions preceded by a pipe character. The new-line character is significant, all other whitespace does not matter.

This was the original grammar:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
terminal = pp.Group(
    pp.QuotedString('"')
    | pp.Keyword("epsilon")
    | pp.Keyword("IDENT")
).set_results_name("terminal")
non_terminal = pp.Group(pp.Word(pp.alphas + "_")).set_results_name("non-terminal")

expression = pp.OneOrMore(pp.Group(terminal | non_terminal), stop_on=pp.White("\n"))

head = pp.Word(pp.alphas + "_").set_results_name("head")
body = pp.Group(
    pp.Group(expression)
    + pp.ZeroOrMore(pp.Keyword("|").suppress() + pp.Group(expression))
).set_results_name("body")

production = pp.Group(head + pp.Keyword("::=") + body).set_results_name(
    "production"
)
grammar = pp.OneOrMore(pp.Group(production)).set_results_name("grammar")

This meta-grammar is much more complex than the defined format. Two problems with this grammar show where it can be simplified:

  • Unnecessary usage of pp.Group
  • The head/body/expression rules have no special meaning in our program so they are unnecessary

This makes the grammar harder to for other programmers to understand. It also complicates the parse tree by adding unnecessary layers. This means the parse tree processing algorithms must be able to cope with these extra layers, this adds complexity than required.

I can’t remember why I made the original grammar so complex. I think it was because I was trying to add more detail to the parse tree than was required, and when I encountered problems with my grammar, I fixed them by adding more code.

Instead of attempting to simplify the existing grammar, I deleted the existing code and started again. My goal was to simplify the grammar as much as possible. I was happy to lose detail in the parse tree so long as the post-processing was able to generate the same graph as before.

Here is the end result:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
terminal = pp.QuotedString('"') | pp.Keyword("epsilon") | pp.Keyword("IDENT")

non_terminal = pp.Word(pp.alphas + "_")

derivation = pp.OneOrMore(terminal | non_terminal, stop_on=pp.White("\n"))

production = (
    non_terminal + pp.Keyword("::=").suppress() + derivation
    + pp.ZeroOrMore(pp.Keyword("|").suppress() + derivation)
)

grammar = pp.OneOrMore(production)

This is a big reduction, a lot of noise is removed and 2 productions are gone.

In comparison to the previous grammar, this version removes some contextual information from the parse tree. The code which analyzed the parse tree did not require this additional information so it is fine to remove. I could add it back in if I change the post-processing calculations.

I was able to remove the .set_results_name on the ParserElements by using .add_parse_action with class constructors instead, this gave a typed parse tree which tidied up some of the post-processing.

Aiming for the parser grammar to be as close to the “real” grammar, and the benefits which come from it, seems similar to DDD. When designing models with DDD, they should map closely to the real world objects they are modelling. If you implement your models in accordance with this policy, the rest of your code should be simpler to understand. This example from ALeRT shows the same is true for parsing. If the grammar and parse tree tightly map to the real-world grammar of the source document, it makes the code more readable. This is something I will keep in mind when creating parser grammars in future.

Perfection is achieved, not when there is nothing more to add, but when there is nothing left to take away. - Antoine de Saint-Exupéry