Index

Requirement: All examples are compatible with at least Python v3.6, except for using ast.dump() with the attribute indent= which has been added in Python v3.9.

What is an Abstract Syntax Tree (AST)?

An Abstract Syntax Tree (AST) is a data structure used to reason about the grammar of a programming language in the context of the instructions provided into source code.

From source code to binary

For instance, compilers use ASTs when transforming source code into binary code:

  1. Given some text source code, the compiler first tokenizes the text to identify programming language keywords, variables, literals, etc. Each token represents an "atom" of an instruction.

  2. Tokens are then rearranged into an AST, a tree where nodes are the "atoms" of the instructions, and edges the relationships between the atoms based on the programming language grammar. For instance, the AST make explicit the presence of a function call, the related input arguments, the instructions composing the function, etc.

  3. The compiler then can apply multiple optimizations to the AST, and ultimately converts it into binary code.

Despite their role for compilers, ASTs are useful for a broader set of use-cases. Let's discuss this more in details.

The ast Python module and its use

The ast module in the Python standard library can be used to create, visit, and modify AST related to Python source code. It has been introduced in Python 2.6, and since then it evolved alongside the Python grammar.

Even if it is part of standard library since a long time, it is not common to use it directly. Rather, you might have used it indirectly as popular tools use it under-the-hood:

Using the ast module to investigate the PyBites Bite exercises

Still not convinced of the relevance of an AST? Fair enough: let's consider a more practical, and closer to the PyBites Platform, use-case.

The PyBites Platform is currently offering 300+ Bite exercises, and the number is constantly increasing. Given the (semi)hidden intention of the platform is to offer a varied set of challenges covering different Python modules and functionalities, it starts to be more and more challenging to identify what is covered by already available exercises, and what is instead left to explore.

This is where we can take advantage of the ast module. Specifically, we can process the source code of the solution of the exercises (as provided by the authors of the challenges) and recover some statistics about their content. For instance, which are the popular modules and builtin functions used.

Here some of the results. To follow along check out this Jupyter notebook.

Builtins popularity

Pybites exercises - builtins popularity

The histogram above shows the Python builtin calls sorted by their popularity. In other words, using the ast module one can detect when a function call has been made, and if it relates to the builtins module or not. Three colors are used to visually distinguish between exception types, the creation of base types (int, float, bool, list, and set), or other functions. The histogram is a normalized frequency count, i.e., the frequency of each element is cumulated across all exercises, and divided by the sum of all elements occurrence across all exercises.

A few observations:

Modules popularity

Pybites exercises - modules popularity

The histogram above shows the ranking for modules. For simplicity we limit to report on the root modules only. If submodules are used, their frequencies are cumulated into the frequency of the respective root modules.

As before, the histogram is heavy tailed, a testament that the PyBites Bite exercises try to "cover a little bit of everything".

We can observe the presence of non-standard modules, such as pandas and pytest, as well more ad-hoc modules such as as zodiac and fibonacci that are created for the purpose of the challenges themselves.

One can easily expand the analysis to understand the functions used in each module/submodule, as well as dive into more specific analysis. What is relevant to highlight is that the results reported here are generated with about 50 lines of Python code and using ast module. Processing the 300+ source code files with tools like awk, grep, or anything else would have been significantly harder.

Hopefully this examples gave you a rough idea of what you can achieve with an AST. The next step is to understand how to create such data structures, and investigate their composition.

Dissecting an assignment instruction using the ast module

To start familiarize with the ast module, let's see what happens when we analyze a single instruction: one_plus_two = 1+2

>>> import ast
>>> code = "one_plus_two = 1+2"
>>> tree = ast.parse(code)
>>> ast.dump(tree, indent=4)

This will output:

Module(
    body=[
        Assign(
            targets=[
                Name(id='one_plus_two', ctx=Store())],
            value=BinOp(
                left=Constant(value=1),
                op=Add(),
                right=Constant(value=2)))],
    type_ignores=[])

It might not be obvious at first, but the output generated by ast.dump() is actually a tree:

Let's rework the output into a diagram with the following conventions:

AST sketch

With this visualization at hand we can observe a few things.

The root of the tree is a Module node. In fact, even if our example is a single line program, it is still a true Python module. The node contains two attributes body and type_ignores. Let's put the aside type_ignores for a moment and focus on body.

As a Python module contains a series of instructions, the Module.body attribute is a list of nodes, one for each instruction in the program. Our example consists of a single assignment operation, hence Module.body contains only one Assign node.

An assignment operation has a right-hand side specifying the operation to perform, and a left-hand side specifying the destination of the operation. The two sides are associated to the Assign.value and Assign.targets attributes of the Assign node.

Considering the right-hand side, the Assign.value attribute is a BinOp node, since the instruction is a binary operation between two operands, which is fully specified with three attributes:

Considering the left-side, as Python supports multiple assignments and tuple unpacking, the Assign.targets attribute is a list collecting the different destinations of the operation. In our case the assignment is for a single variable, so a single Name node is used. In turn, the Name node has 2 attributes:

The Module.type_ignores attribute and type comments

The attribute Module.type_ignores in the vast majority of the cases is going to be an empty list. This is why in the sketch is colored in blue. To understand why this is the case and what is the actual purpose of the attribute, we need to make a digression.

Python 3.0 introduced annotations, and few years later those have been expanded into type hints. If you are not familiar with those concepts, check this Real Python tutorial and the official doc.

Those changes were not back ported to Python 2, which instead was using type comments as a form of annotation. For more information, see PEP 484 or this Real Python tutorial.

The attribute Module.type_ignores refers to a special type comment # type: ignore that was used to indicate to type checker (such as mypy) to suppress errors if one was found. For legacy reasons, the ast module is still reporting on those comments, but only when asked to do so.

Let's see an example.

>>> code = 'one_plus_two = 1+2 # type: int'
>>> tree = ast.parse(code, type_comments=True)
>>> print(ast.dump(tree, indent=4))

This will output:

Module(
    body=[
        Assign(
            targets=[
                Name(id='one_plus_two', ctx=Store())],
            value=BinOp(
                left=Constant(value=1),
                op=Add(),
                right=Constant(value=2)),
            type_comment='int')],
    type_ignores=[])

Notice that the only difference with respect to the detailed analysis of the AST previously discussed is the presence of the attribute Assign.type_comment='int'. The attribute reflects the metadata provided by type comment # type: int, and is added to the AST tree Assign node because we specified type_comment=True when triggering the parsing.

However, # type: ignore is treated differently. Those type comments are stored into the Module.type_ignores attribute as TypeIgnore objects rather than being collected as metadata in the inner nodes of the tree.

>>> code = 'one_plus_two = 1+2 # type: ignore'
>>> tree = ast.parse(code, type_comments=True)
>>> print(ast.dump(tree, indent=4))

This will output:

Module(
    body=[
        Assign(
            targets=[
                Name(id='one_plus_two', ctx=Store())],
            value=BinOp(
                left=Constant(value=1),
                op=Add(),
                right=Constant(value=2)))],
    type_ignores=[
        TypeIgnore(lineno=1, tag='')])

The ast module APIs

The ast module is mostly a large collection of classes, one for each of the different aspects of the Python grammar. Overall, there are about 100 classes, ranging from literals, to more complex construct such as list comprehensions.

ast.AST is the base class for all other classes in the module, and it defines the following base attributes for all AST nodes:

When dealing with an AST the trickiest part is understanding nodes and attributes semantic. In fact, there are a lot of variants and corner cases, so it is easy to get confused.

A good way to start to familiarize with an AST is to use an interactive console such as ipython similarly to what we did in the previous examples. If you are used to an IDE, both PyCharm and Visual Studio Code provide plugins to visualize an AST (notice that PyCharm uses its own version of AST called Program Structure Interface - PSI)

No matter your preferred choice, the documentation is a fundamental resource to keep at hand. Yet, a couple of remarks:

Beside the classes, the ast module defines how to perform a visit of a tree, and how to do transformations.

Visiting an AST

You can visit an AST in two ways: using helper functions, or via an ast.NodeVisitor class.

Let's starts reviewing the helper functions:

Here some examples:

>>> import ast
>>> code = "one_plus_two = 1+2"
>>> tree = ast.parse(code)

>>> for node in ast.walk(tree):
        print(node.__class__.__name__)

Module
Assign
Name
BinOp
Store
Constant
Add
Constant

>>> for name, value in ast.iter_fields(tree):
        print(name, value)

body [<_ast.Assign object at 0x10c60be80>]
type_ignores []

>>> for node in ast.iter_child_nodes(tree):
        print(node.__class__.__name__)

Assign

When using an ast.NodeVisitor instead, one can register specific callbacks to trigger when visiting specific node types:

class BinOpVisitor(ast.NodeVisitor):

    def visit_BinOp(self, node):
        print(f"found BinOp at line: {node.lineno}")
        self.generic_visit(node)

In this example:

What sort of black magic happens behind the scene to trigger the callbacks? It is actually simple. A ast.NodeVisitor also defines a visit() function which is always invoked first: if the input node type matches one of the callbacks, such callback is called, otherwise generic_visit() is invoked to visit the node children. In our example we are not overwriting visit(), hence we can trigger a visit of the tree simply invoking the method:

>>> vis = BinOpVisitor()
>>> vis.visit(tree)

Here the complete example:

import ast

class BinOpVisitor(ast.NodeVisitor):

    def visit_BinOp(self, node):
        print(f"found BinOp at line: {node.lineno}")
        self.generic_visit(node)


code = """
left_op = 1
right_op = 2
sum_two_things = left_op + right_op
other_sum = sum_two_things - 1

print(sum_two_things)
print(other_sum)
"""

tree = ast.parse(code)

print("=== full AST ===")
print(ast.dump(tree, indent=4))

print()
print("=== visit ===")
vis = BinOpVisitor()
vis.visit(tree)

Running the program we obtain the following output:

=== full AST ===
Module(
    body=[
        Assign(
            targets=[
                Name(id='left_op', ctx=Store())],
            value=Constant(value=1)),
        Assign(
            targets=[
                Name(id='right_op', ctx=Store())],
            value=Constant(value=2)),
        Assign(
            targets=[
                Name(id='sum_two_things', ctx=Store())],
            value=BinOp(
                left=Name(id='left_op', ctx=Load()),
                op=Add(),
                right=Name(id='right_op', ctx=Load()))),
        Assign(
            targets=[
                Name(id='other_sum', ctx=Store())],
            value=BinOp(
                left=Name(id='sum_two_things', ctx=Load()),
                op=Sub(),
                right=Constant(value=1))),
        Expr(
            value=Call(
                func=Name(id='print', ctx=Load()),
                args=[
                    Name(id='sum_two_things', ctx=Load())],
                keywords=[])),
        Expr(
            value=Call(
                func=Name(id='print', ctx=Load()),
                args=[
                    Name(id='other_sum', ctx=Load())],
                keywords=[]))],
    type_ignores=[])

=== visit ===
found BinOp at line: 4
found BinOp at line: 5

Modifying an AST

A ast.NodeTransformer can be used as base class for a transformers, similarly to the logic used for the visitor class. This time, rather than simply visiting the nodes, the callbacks are used to modify, replace, add new nodes.

Here an example:

import ast
import random

class ConstantTransformer(ast.NodeTransformer):
    def visit_Constant(self, node):
        new_value = random.randint(-10, 10)
        new_node = ast.Constant(new_value)
        print(f"replacing constant {node.value} -> {new_value} at lineno: {node.lineno}")
        return new_node

In this example:

The reference returned by the callbacks represent the node to use in the AST. In this example we are replacing the original node. When returning None instead, the visited node is removed from the tree.

To trigger the transformation, we use the same operation used for the visit. This time the visit returns the reference of the tree modified:

>>> vis = ConstantTransformer()
>>> new_tree = vis.visit(tree)

Here the full example:

import ast
import random

class ConstantTransformer(ast.NodeTransformer):
    def visit_Constant(self, node):
        new_value = random.randint(-10, 10)
        print(f"replacing constant {node.value} -> {new_value} at lineno: {node.lineno}")
        new_node = ast.Constant(new_value)
        return new_node

def exec_tree(tree):
    print()
    print("=== AST ===")
    print(ast.dump(tree, indent=4))
    print('---')

    tree_fixed = ast.fix_missing_locations(tree)
    code_obj = compile(tree_fixed, '<string>', 'exec')
    exec(code_obj)

code = """
left_op = 1
right_op = 2
sum_two_things = left_op + right_op
other_sum = sum_two_things - 1

print(sum_two_things)
print(other_sum)
"""

tree = ast.parse(code)
exec_tree(tree)

## fix seed
random.seed(10)

## apply the transformations
vis = ConstantTransformer()
new_tree = vis.visit(tree)

The source code in code is the same as the one used to do a simple visit. Likewise, the process to generate the related tree.

Then we fix a seed for the random number generator via random.seed(), so to have consistent output when running the program multiple times.

We create a ConstantTransformer() object, and we visit it obtaining new_tree, which is a transformed version of the original tree.

To verify the transformations, we can print the AST, and "run it" by transforming into executable code. To do so, we use the helper function exec_tree():

Here the output related to exec_tree(tree):

=== AST ===
Module(
    body=[
        Assign(
            targets=[
                Name(id='left_op', ctx=Store())],
            value=Constant(value=1)),
        Assign(
            targets=[
                Name(id='right_op', ctx=Store())],
            value=Constant(value=2)),
        Assign(
            targets=[
                Name(id='sum_two_things', ctx=Store())],
            value=BinOp(
                left=Name(id='left_op', ctx=Load()),
                op=Add(),
                right=Name(id='right_op', ctx=Load()))),
        Assign(
            targets=[
                Name(id='other_sum', ctx=Store())],
            value=BinOp(
                left=Name(id='sum_two_things', ctx=Load()),
                op=Sub(),
                right=Constant(value=1))),
        Expr(
            value=Call(
                func=Name(id='print', ctx=Load()),
                args=[
                    Name(id='sum_two_things', ctx=Load())],
                keywords=[])),
        Expr(
            value=Call(
                func=Name(id='print', ctx=Load()),
                args=[
                    Name(id='other_sum', ctx=Load())],
                keywords=[]))],
    type_ignores=[])
---
3
2

The output for exec_tree(new_tree):

replacing constant 1 -> 8 at lineno: 2
replacing constant 2 -> -9 at lineno: 3
replacing constant 1 -> 3 at lineno: 5

=== AST ===
Module(
    body=[
        Assign(
            targets=[
                Name(id='left_op', ctx=Store())],
            value=Constant(value=8)),
        Assign(
            targets=[
                Name(id='right_op', ctx=Store())],
            value=Constant(value=-9)),
        Assign(
            targets=[
                Name(id='sum_two_things', ctx=Store())],
            value=BinOp(
                left=Name(id='left_op', ctx=Load()),
                op=Add(),
                right=Name(id='right_op', ctx=Load()))),
        Assign(
            targets=[
                Name(id='other_sum', ctx=Store())],
            value=BinOp(
                left=Name(id='sum_two_things', ctx=Load()),
                op=Sub(),
                right=Constant(value=3))),
        Expr(
            value=Call(
                func=Name(id='print', ctx=Load()),
                args=[
                    Name(id='sum_two_things', ctx=Load())],
                keywords=[])),
        Expr(
            value=Call(
                func=Name(id='print', ctx=Load()),
                args=[
                    Name(id='other_sum', ctx=Load())],
                keywords=[]))],
    type_ignores=[])
---
-1
-4

The output now is "randomized", as expected by the transformation. However, the transformation has overwritten the original tree, as new_tree and tree are the same object.

>>> id(tree), id(new_tree)
4350659920, 4350659920

To avoid this however one simply use the copy module to clone the whole tree before triggering the transformation, or overwrite the visit() method and define the ad-hoc logic for the use-case at hand.

Keep calm and happy Python coding!

-- Alessandro

PyBites Python Tips

Do you want to get 250+ concise and applicable Python tips in an ebook that will cost you less than 10 bucks (future updates included), check it out here.

Get our Python Tips Book

"The discussions are succinct yet thorough enough to give you a solid grasp of the particular problem. I just wish I would have had this book when I started learning Python." - Daniel H

"Bob and Julian are the masters at aggregating these small snippets of code that can really make certain aspects of coding easier." - Jesse B

"This is now my favourite first Python go-to reference." - Anthony L

"Do you ever go on one of those cooking websites for a recipe and have to scroll for what feels like an eternity to get to the ingredients and the 4 steps the recipe actually takes? This is the opposite of that." - Sergio S

Get the book