A collection of recipes to use Lark and its various features

Use a transformer to parse integer tokens

Transformers are the common interface for processing matched rules and tokens.

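With the LALR parser, they can be passed to Lark through the transformer option and applied during parsing, which performs better than transforming the finished tree afterwards.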

from lark import Lark, Transformer

class T(Transformer):
    def INT(self, tok):
        "Convert the value of `tok` from string to int, while maintaining line number & column."
        return tok.update(value=int(tok))

parser = Lark("""
start: INT*
%import common.INT
%ignore " "
""", parser="lalr", transformer=T())

print(parser.parse('3 14 159'))

Prints out:

Tree(start, [Token(INT, 3), Token(INT, 14), Token(INT, 159)])

Collect all comments with lexer_callbacks

lexer_callbacks can be used to interface with the lexer as it generates tokens.

It accepts a dictionary of the form

{TOKEN_TYPE: callback}

where callback is of type f(Token) -> Token.

It only works with the basic and contextual lexers.

This has the same effect as using a transformer, but can also process ignored tokens.

from lark import Lark

comments = []

parser = Lark("""
    start: INT*

    COMMENT: /#.*/

    %import common (INT, WS)
    %ignore COMMENT
    %ignore WS
""", parser="lalr", lexer_callbacks={'COMMENT': comments.append})

parser.parse("""
1 2 3  # hello
# world
4 5 6
""")

print(comments)


Prints out:

[Token(COMMENT, '# hello'), Token(COMMENT, '# world')]

Note: We don’t have to return a token, because comments are ignored.


CollapseAmbiguities

Parsing ambiguous texts with earley and ambiguity='explicit' produces a single tree with _ambig nodes to mark where the ambiguity occurred.

However, it’s sometimes more convenient instead to work with a list of all possible unambiguous trees.

Lark provides a utility transformer for that purpose:

from lark import Lark, Tree, Transformer
from lark.visitors import CollapseAmbiguities

grammar = """
    !start: x y

    !x: "a" "b"
      | "ab"
      | "abc"

    !y: "c" "d"
      | "cd"
      | "d"
"""

parser = Lark(grammar, ambiguity='explicit')

t = parser.parse('abcd')
for x in CollapseAmbiguities().transform(t):
    print(x.pretty())

This prints out:

start
  x
    a
    b
  y
    c
    d

start
  x     ab
  y     cd

start
  x     abc
  y     d

While convenient, this should be used carefully, as highly ambiguous trees will soon create an exponential explosion of such unambiguous derivations.

Keeping track of parents when visiting

The following visitor assigns a parent attribute for every node in the tree.

If your tree nodes aren’t unique (if there is a shared Tree instance), the assert will fail.

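from weakref import proxy

from lark import Tree, Visitor

class Parent(Visitor):
    def __default__(self, tree):
        for subtree in tree.children:
            if isinstance(subtree, Tree):
                # A subtree that already has a parent is shared between two places
                assert not hasattr(subtree, 'parent')
                # Weak reference, so a child does not keep its parent alive
                subtree.parent = proxy(tree)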

Unwinding VisitError after a transformer/visitor exception

Errors that happen inside visitors and transformers get wrapped inside a VisitError exception.

This is often inconvenient if you want the original error to propagate upwards, or want to catch it by its own type.

However, it’s easy to unwrap at the point where the transformer is called: catch the VisitError and raise its orig_exc attribute.

For example:

from lark import Lark, Transformer
from lark.visitors import VisitError

tree = Lark('start: "a"').parse('a')

class T(Transformer):
    def start(self, x):
        raise KeyError("Original Exception")

t = T()
try:
    print(t.transform(tree))
except VisitError as e:
    raise e.orig_exc

Adding a Progress Bar to Parsing with tqdm

Parsing large files can take a long time, even with the parser='lalr' option. To make this process more user-friendly, it’s useful to add a progress bar. One way to achieve this is to use the InteractiveParser to step through the input token by token and advance the bar after each one. In this example, we use tqdm, but a similar approach should work with GUIs.

from lark import Lark
from tqdm import tqdm

def parse_with_progress(parser: Lark, text: str, start=None):
    last = 0
    progress = tqdm(total=len(text))
    pi = parser.parse_interactive(text, start=start)
    for token in pi.iter_parse():
        if token.end_pos is not None:
            progress.update(token.end_pos - last)
            last = token.end_pos
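    progress.close()
    # Assumption: iter_parse() stops before the end-of-input token, so feed it
    # explicitly; feed_eof() then returns the finished parse result.
    return pi.feed_eof()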

Note that we don’t simply wrap the iterable because tqdm would not be able to determine the total. Additionally, keep in mind that this implementation relies on the InteractiveParser and, therefore, only works with the LALR(1) parser, not earley.