Advanced LookML parsing

lkml.load() and lkml.dump() provide a simple interface between LookML and Python primitive data structures. However, lkml.load() discards information about comments and whitespace, making lossless modification of LookML impossible.

For example, let’s say we wanted to programmatically add a description to the dimension in this snippet of LookML:

# Inventory-related dimensions here

dimension: days_in_inventory { sql: ${TABLE}.days_in_inventory ;; }

If we parse this LookML with lkml.load(), we’ll lose the comment and any information about the surrounding whitespace:

>>> text = """
... # Inventory-related dimensions here
...
... dimension: days_in_inventory { sql: ${TABLE}.days_in_inventory ;; }
... """

>>> parsed = lkml.load(text)
>>> parsed
{'dimensions': [{'sql': '${TABLE}.days_in_inventory', 'name': 'days_in_inventory'}]}

Writing this dictionary back to LookML with lkml.dump() yields the following:

>>> print(lkml.dump(parsed))
dimension: days_in_inventory {
  sql: ${TABLE}.days_in_inventory ;;
}

The comment is missing and the whitespace has been overridden by lkml.dump()’s opinionated formatting. If we want to preserve the exact whitespace and comments surrounding this dimension, we’ll need to dive under the hood of lkml and directly modify the parse tree.

The parse tree

The parse tree is an immutable tree structure generated by lkml that holds the relevant information about the parsed LookML. Each node in the tree is either a syntax node (a node with children) or a syntax token (a leaf node).

class lkml.tree.SyntaxNode

Abstract base class for members of the parse tree that have child nodes.

abstract accept(visitor: Visitor) → Any

Accepts a Visitor that can interact with the node.

The visitor pattern allows for flexible algorithms that can traverse the tree without needing to be defined as methods on the tree itself.

abstract property children: Tuple[SyntaxNode, ...] | None

Returns all child SyntaxNodes, but not SyntaxTokens.

abstract property line_number: int | None

Returns the line number of the first SyntaxToken in the node.

class lkml.tree.SyntaxToken(value: str, line_number: int | None = None, prefix: str = '', suffix: str = '')

Stores a text value with optional prefix or suffix trivia.

For example, a syntax token might represent meaningful punctuation like a curly brace, or the type or value of a LookML field. A syntax token can also store trivia: comments or whitespace that precede or follow the token value. The parser attempts to assign these prefixes and suffixes intelligently to the corresponding tokens.

value

The text represented by the token.

Type: str

prefix

Comments or whitespace preceding the token.

Type: str

suffix

Comments or whitespace following the token.

Type: str

You can think of syntax tokens as the fundamental pieces of text that make up LookML. Whitespace and comments are collectively referred to as trivia and are stored in syntax tokens in their prefix and suffix attributes.
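To make the role of trivia concrete, here is a minimal sketch using only the standard library. ToyToken is a hypothetical stand-in written for this illustration, not lkml's own SyntaxToken, and the way trivia is split between tokens here is just one possible assignment. It shows why storing trivia on tokens is enough to reproduce the source text exactly.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToyToken:
    """Hypothetical stand-in for a syntax token; not lkml's own class."""

    value: str
    prefix: str = ""  # comments or whitespace before the value
    suffix: str = ""  # comments or whitespace after the value

    def __str__(self) -> str:
        # Serializing a token emits its trivia around the value.
        return self.prefix + self.value + self.suffix


source = "# Inventory-related dimensions here\n\ndimension: days_in_inventory"

# One possible assignment of trivia: the comment and blank line become the
# prefix of the first token, and the space after the colon becomes its suffix.
tokens = [
    ToyToken("dimension", prefix="# Inventory-related dimensions here\n\n"),
    ToyToken(":", suffix=" "),
    ToyToken("days_in_inventory"),
]

rebuilt = "".join(str(token) for token in tokens)
print(rebuilt == source)  # True
```

Because every character of the input ends up in some token's value, prefix, or suffix, joining the tokens back together loses nothing.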

Types of nodes

All lkml parse trees begin with a lkml.tree.DocumentNode, the root node of the tree. A DocumentNode has a single attribute, container, a lkml.tree.ContainerNode, which stores all of the top-level nodes in the document.

Children of the ContainerNode can be instances of lkml.tree.BlockNode, lkml.tree.ListNode, or lkml.tree.PairNode.

Block nodes store children in container nodes of their own, and list nodes may have block nodes or pair nodes as children.

Creating an example node

In lkml, hidden: yes is represented as a PairNode. A PairNode has two attributes, type and value, each of which is a SyntaxToken.

You’ll notice the PairNode also stores a special kind of SyntaxToken to represent the colon “:” between the type and the value.

class lkml.tree.PairNode(type: SyntaxToken, value: SyntaxToken, colon: Colon = Colon(value=':', line_number=None, prefix='', suffix=' '))

A simple LookML field, e.g. hidden: yes.

type

The field type, the value that precedes the colon.

Type: lkml.tree.SyntaxToken

value

The field value, the value that follows the colon.

Type: lkml.tree.SyntaxToken

colon

An optional Colon SyntaxToken. If not supplied, a default colon is created with a single space suffix after the colon.

Type: lkml.tree.Colon

We could build a PairNode for hidden: yes from scratch as follows:

>>> from lkml.tree import PairNode, SyntaxToken

>>> node = PairNode(
...    type=SyntaxToken('hidden'),
...    value=SyntaxToken('yes')
... )

>>> print(str(node))
hidden: yes

We could include this simple pair node as a child in a block or container node, the beginnings of a more complex piece of LookML.

Generating the parse tree

Creating nodes and tokens by hand is tedious, so it’s more likely that you will be parsing a LookML string into a parse tree with lkml.parse().

>>> lkml.parse('hidden: yes')
DocumentNode(container=ContainerNode(), prefix='', suffix='')

This tree can be analyzed with a visitor or modified with a transformer.

To learn more about the parse tree and the different kinds of nodes, read the full API reference for the lkml.tree module.

Traversing and modifying the parse tree

The parse tree follows a design pattern called the visitor pattern. The visitor pattern allows us to define flexible algorithms that interact with the tree without having to implement those algorithms as methods on the tree’s node classes.

Each node implements a method called accept that takes a lkml.tree.Visitor instance and passes itself to the corresponding visit_ method on the visitor.

Here’s the accept method for a ListNode.

ListNode.accept(visitor: Visitor) → Any

Accepts a visitor and calls the visitor’s list method on itself.

def accept(self, visitor: Visitor) -> Any:
    """Accepts a visitor and calls the visitor's list method on itself."""
    return visitor.visit_list(self)

In our visitor, we can define visit_list however we want—giving us tons of flexibility over how we design the visitor.
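The dispatch between accept and the visit_ methods can be sketched with a few toy classes. ToyPair, ToyList, and ToyVisitor are hypothetical names invented for this illustration and only mimic the shape of the pattern; they are not lkml classes.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass(frozen=True)
class ToyPair:
    """Hypothetical leaf-like node, e.g. ``hidden: yes``."""

    type: str
    value: str

    def accept(self, visitor: "ToyVisitor") -> Any:
        # The node picks the visitor method that matches its own kind.
        return visitor.visit_pair(self)


@dataclass(frozen=True)
class ToyList:
    """Hypothetical node holding several children."""

    items: Tuple[ToyPair, ...]

    def accept(self, visitor: "ToyVisitor") -> Any:
        return visitor.visit_list(self)


class ToyVisitor:
    """Counts pairs; the counting logic lives here, not on the nodes."""

    def visit_pair(self, node: ToyPair) -> int:
        return 1

    def visit_list(self, node: ToyList) -> int:
        # Recurse by asking each child to accept this same visitor.
        return sum(child.accept(self) for child in node.items)


tree = ToyList(items=(ToyPair("hidden", "yes"), ToyPair("type", "number")))
pair_count = tree.accept(ToyVisitor())
print(pair_count)  # 2
```

A different algorithm, such as a linter or a pretty-printer, would be another visitor class with the same method names, and the node classes would not need to change.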

A simple visitor class

For example, we could write a linting visitor that traverses the parse tree and throws an error if it finds a dimension without a description:

from lkml.tree import BlockNode
from lkml.visitors import BasicVisitor

class DescriptionVisitor(BasicVisitor):
    def visit_block(self, block: BlockNode):
        """For each block, check if it's a dimension and if it has a description."""
        if block.type.value == 'dimension':
            child_types = [node.type.value for node in block.container.items]
            if 'description' not in child_types:
                raise KeyError(f'Dimension {block.name.value} does not have a description')

# Assume we already have a parse tree to visit
tree.accept(DescriptionVisitor())

lkml.visitors.BasicVisitor traverses the tree but does nothing else. We override its default visit_block behavior so we can inspect the dimensions, which are BlockNodes.

For each block that is a dimension, we iterate through its children and throw an error if a child with the description type is not present.

Modifying the parse tree with transformers

Because syntax nodes and tokens are immutable, you can’t change them once created; you can only replace or remove them. This makes modifying the parse tree challenging, because the tree has to be rebuilt for each change.

lkml includes a basic transformer, lkml.visitors.BasicTransformer, which, like the visitors, traverses the tree. Unlike a visitor, the transformer visits and replaces each node’s children, allowing the immutable parse tree to be rebuilt with modifications.

As an example, let’s write a transformer that injects a user attribute into each sql_table_name field. This is something that could easily be solved with regex, but let’s write a transformer as an example instead:

from dataclasses import replace

from lkml.tree import ExpressionSyntaxToken, PairNode
from lkml.visitors import BasicTransformer

class TableNameTransformer(BasicTransformer):
    def visit_pair(self, node: PairNode) -> PairNode:
        """Visit each pair and replace the SQL table schema with a user attribute."""
        if node.type.value == 'sql_table_name':
            try:
                schema, table_name = node.value.value.split('.')
            # Sometimes the table name won't have a schema
            except ValueError:
                table_name = node.value.value

            new_value: str = '{{ _user_attributes["dbt_schema"] }}.' + table_name
            new_node: PairNode = replace(node, value=ExpressionSyntaxToken(new_value))
            return new_node
        else:
            return node

# Assume we already have a parse tree to visit.
# The original tree is immutable, so keep the rebuilt tree that accept() returns.
new_tree = tree.accept(TableNameTransformer())

This transformer traverses the parse tree and modifies all PairNodes that have the sql_table_name type, injecting a user attribute into the expression.

We rely on the dataclasses function dataclasses.replace(), which allows us to copy an immutable node (all lkml nodes are frozen, immutable dataclasses) with modifications—in this case, to the value attribute of the PairNode.
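The behavior of dataclasses.replace() on a frozen dataclass can be seen in isolation with a small standard-library sketch. ToyPair below is a hypothetical stand-in for illustration, not lkml's PairNode, and the table names are made up.

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class ToyPair:
    """Hypothetical frozen node; a stand-in, not lkml.tree.PairNode."""

    type: str
    value: str


original = ToyPair(type="sql_table_name", value="analytics.orders")

# Assigning to a field of a frozen dataclass raises FrozenInstanceError.
try:
    original.value = "other.orders"
    mutated = True
except FrozenInstanceError:
    mutated = False

# replace() builds a new instance, copying every field that isn't overridden.
updated = replace(original, value="staging.orders")

print(mutated)         # False
print(original.value)  # analytics.orders
print(updated.value)   # staging.orders
print(updated.type)    # sql_table_name
```

The original object is left untouched, which is why a transformer has to return the new node and why its caller has to hold on to the rebuilt tree.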

Generating LookML from the parse tree

Generating LookML from the parse tree is simple because each node class defines its own __str__ method to serialize its contents. To generate a LookML string from any part of the tree, just cast it with str:

tree: DocumentNode
str(tree)

How does lkml build the parse tree?

lkml is made up of two components, a lexer and a parser. The parser is a recursive descent parser with backtracking.

First, the lexer scans through the input string character by character and generates a stream of relevant tokens. The lexer skips over whitespace when it’s not relevant.

For example, the input string:

"sql: ${TABLE}.order_date ;;"

would be broken into the tuple of tokens:

(
    LiteralToken(sql),
    ValueToken(),
    ExpressionBlockToken(${TABLE}.order_date),
    ExpressionBlockEndToken()
)
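As a rough illustration of this lexing step, the sketch below scans a single field character by character. It is not lkml's lexer: toy_lex and the token kind names are hypothetical, tokens are plain tuples, and it makes the simplifying assumption that every colon is followed by an expression terminated by ";;", which holds for the sql example above but not for fields like hidden: yes.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (kind, text); kind names are illustrative only


def toy_lex(text: str) -> List[Token]:
    """Scan ``text`` left to right and emit tokens for one ``key: expr ;;`` field."""
    tokens: List[Token] = []
    i = 0
    while i < len(text):
        char = text[i]
        if char.isspace():
            i += 1  # whitespace between tokens is skipped
        elif char == ":":
            tokens.append(("value", ""))
            # Everything up to the terminating ";;" is one expression block.
            end = text.find(";;", i + 1)
            if end == -1:
                raise SyntaxError("expression block is missing its closing ';;'")
            tokens.append(("expression_block", text[i + 1 : end].strip()))
            tokens.append(("expression_block_end", ""))
            i = end + 2
        elif char.isalnum() or char == "_":
            # Consume a run of identifier characters as one literal.
            start = i
            while i < len(text) and (text[i].isalnum() or text[i] == "_"):
                i += 1
            tokens.append(("literal", text[start:i]))
        else:
            raise SyntaxError(f"unexpected character {char!r} at position {i}")
    return tokens


tokens = toy_lex("sql: ${TABLE}.order_date ;;")
for token in tokens:
    print(token)
```

The four tuples printed correspond, in order, to the literal, value, expression block, and expression block end tokens listed above.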

Next, the parser scans through the stream of tokens. It marks its position in the stream, then attempts to identify a matching rule in the grammar. If the rule is made up of other rules (this is called a non-terminal), it descends recursively through the constituent rules looking for tokens that match.

If it doesn’t find a match for a rule, it backtracks to a previously marked point in the stream and tries the next available rule. If the parser runs out of rules to try, it raises a syntax error.

As the parser finds matches, it adds the relevant token values to its syntax tree, which is eventually returned to the user if the input parses successfully.
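The mark, descend, and backtrack cycle described above can be sketched with a toy parser. This is not lkml's parser: ToyParser, its grammar (a field is either a block or a pair), and the pre-split list of string tokens are all assumptions made for the illustration, and the result is built from plain tuples rather than syntax nodes.

```python
from typing import Callable, List, Optional


class ToyParser:
    """Recursive descent over string tokens, with mark/reset backtracking."""

    def __init__(self, tokens: List[str]) -> None:
        self.tokens = tokens
        self.pos = 0

    def expect(self, predicate: Callable[[str], bool]) -> Optional[str]:
        """Consume and return the next token if it satisfies ``predicate``."""
        if self.pos < len(self.tokens) and predicate(self.tokens[self.pos]):
            self.pos += 1
            return self.tokens[self.pos - 1]
        return None

    def literal(self) -> Optional[str]:
        return self.expect(lambda tok: tok not in {":", "{", "}"})

    def parse_block(self):
        mark = self.pos  # remember where this rule started
        key = self.literal()
        if key and self.expect(lambda tok: tok == ":"):
            name = self.literal()
            if name and self.expect(lambda tok: tok == "{"):
                children = self.parse_fields()  # descend into nested rules
                if self.expect(lambda tok: tok == "}"):
                    return (key, name, children)
        self.pos = mark  # no match: backtrack so another rule can try
        return None

    def parse_pair(self):
        mark = self.pos
        key = self.literal()
        if key and self.expect(lambda tok: tok == ":"):
            value = self.literal()
            if value:
                return (key, value)
        self.pos = mark
        return None

    def parse_fields(self) -> list:
        fields = []
        while True:
            # Non-terminal: try each alternative rule in order.
            field = self.parse_block() or self.parse_pair()
            if field is None:
                return fields
            fields.append(field)

    def parse(self) -> list:
        fields = self.parse_fields()
        if self.pos != len(self.tokens):
            # Out of rules to try with input left over.
            raise SyntaxError(f"unexpected token {self.tokens[self.pos]!r}")
        return fields


tokens = "dimension : id { hidden : yes }".split()
result = ToyParser(tokens).parse()
print(result)  # [('dimension', 'id', [('hidden', 'yes')])]
```

When the parser reaches hidden : yes it first tries the block rule, fails at the missing opening brace, resets to its mark, and then succeeds with the pair rule.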