.. testsetup::

   import lkml

Advanced LookML parsing
=======================

:py:func:`lkml.load` and :py:func:`lkml.dump` provide a simple interface between
LookML and Python primitive data structures. However, :py:func:`lkml.load`
discards information about comments and whitespace, making lossless modification
of LookML impossible.

For example, let's say we wanted to programmatically add a description to the
dimension in this snippet of LookML:

.. code-block::

   # Inventory-related dimensions here

   dimension: days_in_inventory {
     sql: ${TABLE}.days_in_inventory ;;
   }

If we parse this LookML with :py:func:`lkml.load`, we'll lose the comment and
any information about the surrounding whitespace:

.. doctest::

   >>> text = """
   ... # Inventory-related dimensions here
   ...
   ... dimension: days_in_inventory {
   ...   sql: ${TABLE}.days_in_inventory ;;
   ... }
   ... """
   >>> parsed = lkml.load(text)
   >>> parsed
   {'dimensions': [{'sql': '${TABLE}.days_in_inventory', 'name': 'days_in_inventory'}]}

Writing this dictionary back to LookML with :py:func:`lkml.dump` yields the
following:

.. doctest::

   >>> print(lkml.dump(parsed))
   dimension: days_in_inventory {
     sql: ${TABLE}.days_in_inventory ;;
   }

The comment is missing and the whitespace has been overridden by
:py:func:`lkml.dump`'s opinionated formatting. If we want to preserve the exact
whitespace and comments surrounding this dimension, we'll need to dive under
the hood of lkml and directly modify the **parse tree**.

The parse tree
--------------

The parse tree is an `immutable` tree structure generated by lkml that holds
the relevant information about the parsed LookML. Each node in the tree is
either a **syntax node** (a node with **children**) or a **syntax token** (a
leaf node).

.. autoclass:: lkml.tree.SyntaxNode
   :members:
   :noindex:

.. autoclass:: lkml.tree.SyntaxToken
   :noindex:

You can think of syntax tokens as the fundamental pieces of text that make up
LookML.
Whitespace and comments are collectively referred to as **trivia** and are
stored in syntax tokens in their **prefix** and **suffix** attributes.

Types of nodes
^^^^^^^^^^^^^^

All lkml parse trees begin with a :py:class:`lkml.tree.DocumentNode`, the root
node of the tree. A ``DocumentNode`` has a single attribute, ``container``, a
:py:class:`lkml.tree.ContainerNode`, which stores all of the top-level nodes
in the document.

Children of the ``ContainerNode`` can be instances of
:py:class:`lkml.tree.BlockNode`, :py:class:`lkml.tree.ListNode`, or
:py:class:`lkml.tree.PairNode`. Block nodes store children in container nodes
of their own, and list nodes may have block nodes or pair nodes as children.

Creating an example node
^^^^^^^^^^^^^^^^^^^^^^^^

In lkml, ``hidden: yes`` is represented as a ``PairNode``. A ``PairNode`` has
two attributes, ``type`` and ``value``, each of which is a ``SyntaxToken``.
You'll notice the ``PairNode`` also stores a special kind of ``SyntaxToken``
to represent the colon ":" between the type and the value.

.. autoclass:: lkml.tree.PairNode
   :noindex:

We could build a ``PairNode`` for ``hidden: yes`` from scratch as follows:

.. doctest::

   >>> from lkml.tree import PairNode, SyntaxToken
   >>> node = PairNode(
   ...     type=SyntaxToken('hidden'),
   ...     value=SyntaxToken('yes')
   ... )
   >>> print(str(node))
   hidden: yes

We could include this simple pair node as a child in a block or container
node, the beginnings of a more complex piece of LookML.

Generating the parse tree
^^^^^^^^^^^^^^^^^^^^^^^^^

Creating nodes and tokens by hand is tedious, so it's more likely that you
will be parsing a LookML string into a parse tree with :py:func:`lkml.parse`.

.. doctest::

   >>> lkml.parse('hidden: yes')
   DocumentNode(container=ContainerNode(), prefix='', suffix='')

This tree can be analyzed with a visitor or modified with a transformer. To
learn more about the parse tree and the different kinds of nodes, read the
full API reference for the :ref:`tree-ref`.
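To make the trivia and serialization ideas concrete outside of lkml, here is a minimal, self-contained sketch using hypothetical ``MiniToken`` and ``MiniPair`` classes (these are illustrative stand-ins, not lkml's actual ``SyntaxToken`` and ``PairNode``). Each token carries its own prefix and suffix trivia, so comments and whitespace survive a round trip through ``str``:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MiniToken:
    """Hypothetical stand-in for a syntax token: a value plus its trivia."""

    value: str
    prefix: str = ""  # trivia (whitespace, comments) before the token
    suffix: str = ""  # trivia after the token

    def __str__(self) -> str:
        return f"{self.prefix}{self.value}{self.suffix}"


@dataclass(frozen=True)
class MiniPair:
    """Hypothetical stand-in for a pair node, serializing as "type: value"."""

    type: MiniToken
    value: MiniToken

    def __str__(self) -> str:
        return f"{self.type}:{self.value}"


# A comment stored as prefix trivia on the "hidden" token is reproduced
# verbatim when the node is serialized back to text.
pair = MiniPair(
    type=MiniToken("hidden", prefix="# visibility\n"),
    value=MiniToken("yes", prefix=" ", suffix="\n"),
)
print(str(pair))
```

Because the trivia lives on the tokens themselves rather than being discarded at parse time, serialization is lossless by construction.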
Traversing and modifying the parse tree
---------------------------------------

The parse tree follows a design pattern called the `visitor pattern
<https://en.wikipedia.org/wiki/Visitor_pattern>`_. The visitor pattern allows
us to define flexible algorithms that interact with the tree without having to
implement those algorithms as methods on the tree's node classes.

Each node implements a method called ``accept``, which accepts a
:py:class:`lkml.tree.Visitor` instance and passes itself to the corresponding
``visit_`` method on the visitor. Here's the ``accept`` method for a
``ListNode``.

.. automethod:: lkml.tree.ListNode.accept
   :noindex:

::

   def accept(self, visitor: Visitor) -> Any:
       """Accepts a visitor and calls the visitor's list method on itself."""
       return visitor.visit_list(self)

In our visitor, we can define ``visit_list`` however we want---giving us tons
of flexibility over how we design the visitor.

A simple visitor class
^^^^^^^^^^^^^^^^^^^^^^

For example, we could write a linting visitor that traverses the parse tree
and throws an error if it finds a dimension without a description::

   from lkml.tree import BlockNode
   from lkml.visitors import BasicVisitor


   class DescriptionVisitor(BasicVisitor):
       def visit_block(self, block: BlockNode):
           """For each block, check if it's a dimension and if it has a description."""
           if block.type.value == 'dimension':
               child_types = [node.type.value for node in block.container.items]
               if 'description' not in child_types:
                   raise KeyError(f'Dimension {block.name.value} does not have a description')


   # Assume we already have a parse tree to visit
   tree.accept(DescriptionVisitor())

By default, :py:class:`lkml.visitors.BasicVisitor` traverses the tree but does
nothing. We simply override that default behavior for ``visit_block`` so we
can inspect the dimensions, which are ``BlockNodes``. For each block that is a
dimension, we iterate through its children and throw an error if a child with
the description type is not present.
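The double-dispatch mechanic behind ``accept`` can be seen in isolation with a self-contained sketch. The ``Pair``, ``Block``, and ``Visitor`` classes below are hypothetical miniatures for illustration, not lkml's real classes (those live in ``lkml.tree`` and ``lkml.visitors``), but they follow the same shape: each node's ``accept`` calls the matching ``visit_`` method, and subclasses override only the methods they care about:

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass(frozen=True)
class Pair:
    """Hypothetical leaf node, e.g. "hidden: yes"."""

    type: str
    value: str

    def accept(self, visitor: "Visitor") -> Any:
        # Double dispatch: the node picks the visitor method to call.
        return visitor.visit_pair(self)


@dataclass(frozen=True)
class Block:
    """Hypothetical block node holding child pairs."""

    type: str
    items: Tuple[Pair, ...]

    def accept(self, visitor: "Visitor") -> Any:
        return visitor.visit_block(self)


class Visitor:
    """Base visitor: traverses children and otherwise does nothing."""

    def visit_block(self, block: Block) -> None:
        for item in block.items:
            item.accept(self)

    def visit_pair(self, pair: Pair) -> None:
        pass


class PairCounter(Visitor):
    """Counts pairs by overriding a single visit_ method."""

    def __init__(self) -> None:
        self.count = 0

    def visit_pair(self, pair: Pair) -> None:
        self.count += 1


tree = Block("dimension", (Pair("hidden", "yes"), Pair("label", "Days")))
counter = PairCounter()
tree.accept(counter)
print(counter.count)  # 2
```

The payoff of this design is that new traversal algorithms (linting, counting, searching) can be added as new visitor classes without touching the node classes at all.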
Modifying the parse tree with transformers
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Because syntax nodes and tokens are immutable, you can't change them once they
are created; you may only replace or remove them. This makes modifying the
parse tree challenging, because the entire tree needs to be rebuilt for each
change.

lkml includes a basic transformer, :py:class:`lkml.visitors.BasicTransformer`,
which, like the visitors, traverses the tree. However, the transformer visits
and replaces each node's children, allowing the immutable parse tree to be
rebuilt with modifications.

As an example, let's write a transformer that injects a user attribute into
each ``sql_table_name`` field. This is something that could easily be solved
with regex, but let's write a transformer as an example instead::

   from dataclasses import replace

   from lkml.tree import ExpressionSyntaxToken, PairNode
   from lkml.visitors import BasicTransformer


   class TableNameTransformer(BasicTransformer):
       def visit_pair(self, node: PairNode) -> PairNode:
           """Visit each pair and replace the SQL table schema with a user attribute."""
           if node.type.value == 'sql_table_name':
               try:
                   schema, table_name = node.value.value.split('.')
               # Sometimes the table name won't have a schema
               except ValueError:
                   table_name = node.value.value
               new_value: str = '{{ _user_attributes["dbt_schema"] }}.' + table_name
               new_node: PairNode = replace(node, value=ExpressionSyntaxToken(new_value))
               return new_node
           else:
               return node


   # Assume we already have a parse tree to visit
   tree.accept(TableNameTransformer())

This transformer traverses the parse tree and modifies all ``PairNodes`` that
have the ``sql_table_name`` type, injecting a user attribute into the
expression. We rely on :py:func:`dataclasses.replace`, which allows us to copy
an immutable node (all lkml nodes are frozen, immutable dataclasses) with
modifications---in this case, to the ``value`` attribute of the ``PairNode``.
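The copy-with-modifications mechanic that :py:func:`dataclasses.replace` provides can be demonstrated with any plain frozen dataclass. This is a generic Python sketch, not lkml-specific code; the ``Pair`` class and field names here are hypothetical:

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class Pair:
    """A hypothetical immutable node, like lkml's frozen dataclasses."""

    type: str
    value: str


original = Pair(type="sql_table_name", value="analytics.orders")

# Frozen dataclasses reject in-place mutation...
mutation_failed = False
try:
    original.value = "something_else"
except FrozenInstanceError:
    mutation_failed = True

# ...so instead we build a modified copy, leaving the original intact.
table_name = original.value.split(".")[-1]
updated = replace(original, value='{{ _user_attributes["dbt_schema"] }}.' + table_name)

print(original.value)  # analytics.orders
print(updated.value)   # {{ _user_attributes["dbt_schema"] }}.orders
```

``replace`` copies every field except the ones you override, which is exactly what a transformer needs: each ``visit_`` method returns either the node unchanged or a fresh copy, and the parent rebuilds itself around the returned children.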
Generating LookML from the parse tree
-------------------------------------

Generating LookML from the parse tree is simple because each node class
defines its own ``__str__`` method to serialize its contents. To generate a
LookML string from any part of the tree, just cast it with ``str``::

   tree: DocumentNode
   str(tree)

How does lkml build the parse tree?
-----------------------------------

lkml is made up of two components: a `lexer
<https://en.wikipedia.org/wiki/Lexical_analysis>`_ and a parser. The parser is
a `recursive descent parser
<https://en.wikipedia.org/wiki/Recursive_descent_parser>`_ with backtracking.

First, the lexer scans through the input string character by character and
generates a stream of relevant tokens. The lexer skips over whitespace when
it's not relevant.

For example, the input string::

   "sql: ${TABLE}.order_date ;;"

would be broken into the tuple of tokens::

   (
       LiteralToken(sql),
       ValueToken(),
       ExpressionBlockToken(${TABLE}.order_date),
       ExpressionBlockEndToken()
   )

Next, the parser scans through the stream of tokens. It marks its position in
the stream, then attempts to identify a matching rule in the grammar. If the
rule is made up of other rules (this is called a non-terminal), it descends
recursively through the constituent rules looking for tokens that match.

If it doesn't find a match for a rule, it backtracks to a previously marked
point in the stream and tries the next available rule. If the parser runs out
of rules to try, it raises a syntax error.

As the parser finds matches, it adds the relevant token values to its syntax
tree, which is eventually returned to the user if the input parses
successfully.
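The lexing step described above can be sketched with a toy tokenizer. This handles only bare ``key: value`` pairs and uses made-up token names; lkml's real lexer covers expression blocks, quoted strings, comments, and more:

```python
import re
from typing import List, Tuple


def tokenize(text: str) -> List[Tuple[str, str]]:
    """Toy lexer: scan character by character, emitting (token_type, text) pairs.

    Illustrative only; not lkml's actual lexer or token names.
    """
    tokens: List[Tuple[str, str]] = []
    pos = 0
    while pos < len(text):
        ch = text[pos]
        if ch.isspace():
            # Skip whitespace that isn't relevant to the token stream.
            pos += 1
        elif ch == ":":
            tokens.append(("ValueToken", ":"))
            pos += 1
        else:
            # Consume a run of literal characters as a single token.
            word = re.match(r"[^\s:]+", text[pos:]).group(0)
            tokens.append(("LiteralToken", word))
            pos += len(word)
    return tokens


print(tokenize("hidden: yes"))
# [('LiteralToken', 'hidden'), ('ValueToken', ':'), ('LiteralToken', 'yes')]
```

A parser then consumes this flat stream: for ``key: value`` it would expect a ``LiteralToken``, then a ``ValueToken``, then another literal, backtracking and trying another grammar rule if that expectation fails.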