To run this locally, install Ploomber and execute: ploomber examples -n guides/sql-templating

SQL templating¶

Introductory tutorial teaching how to develop modular SQL pipelines.

Basic templating¶

SQL templating is a powerful way to make your SQL scripts more concise. It works by using a templating language (Ploomber uses jinja) to generate SQL code on the fly.

You’ve already used SQL templating if you’ve followed the SQL pipelines tutorial in the Get started section. Let’s take a look at the structure of a SQL script in Ploomber:

DROP TABLE IF EXISTS {{product}};

CREATE TABLE {{product}} AS
SELECT * FROM {{upstream['clean']}}
WHERE x > 10

The {{product}} placeholder will be replaced at runtime by whatever value this task has in the product section. For example, if you have product: [schema, name, table], {{product}} becomes schema.name.

To ensure the table is re-created on each run, we add a DROP TABLE IF EXISTS ...; statement before CREATE TABLE ..;, since both of them take the table name as an argument, we can use the {{product}} placeholder.

Finally, the {{upstream['clean']}} placeholder tells Ploomber that the current script uses the product from a task named clean as input data. This defines the dependency relationship between these two scripts and implies that the placeholder will be replaced by the actual table/view generated by the clean task.

These are the essential elements for templated SQL scripts; you don’t have to use more if you don’t want to but sometimes it is convenient to write concise code and maximize reusability. Let’s see a few more examples.

Control structures¶

jinja offers control structures that help us write SQL code on the fly. Say we want to compute summary statistics on a given column:

SELECT
    some_column,
    AVG(another_column) as avg_another_column,
    STDEV(another_column) as stdev_another_column,
    COUNT(another_column) as count_another_column,
    SUM(another_column) as sum_another_column,
    MAX(another_column) as max_another_column,
    MIN(another_column) as min_another_column,
FROM some_table
GROUP BY some_column

This code is very repetitive; now imagine how repetitive. We can generate the same code succinctly using a for loop:

SELECT
    some_column,
-- loop over aggregation functions
{% for fn in ['AVG', 'STDEV', 'COUNT', 'SUM', 'MAX', 'MIN'] %}
    -- apply function to the column, name the column
    -- and only add a comma if we are not in the last loop element
    {{fn}}({{col_agg}}) as {{fn}}_{{col_agg}}{{ ',' if not loop.last else '' }}
{% endfor %}
FROM some_table
GROUP BY some_column

Macros¶

Macros let us maximize SQL code reusability by defining snippets that we can “import” into other files. To define a macro, enclose your snippet between the {% macro MACRO_NAME %} ... {% endmacro %} tags. Let’s create a macro using our previous snippet:

# Content of sql/macros.sql
{% macro agg(col_group, col_agg, from_table) -%}

SELECT
    {{col_group}},
{% for fn in ['AVG', 'STDEV', 'COUNT', 'SUM', 'MAX', 'MIN'] %}
    {{fn}}({{col_agg}}) as {{fn}}_{{col_agg}}{{ ',' if not loop.last else '' }}
{% endfor %}
FROM {{from_table}}
GROUP BY {{col_group}}

{%- endmacro %}

The {% macro %} tag defines the macro name and parameters (if any). To use our macro in a different file, we must import it. Let’s say we define the previous macro in a macros.sql file:

# Content of sql/create-table.sql
-- import macros
{% import "macros.sql" as m %}

DROP TABLE IF EXISTS {{product}};

CREATE TABLE {{product}} AS
-- use macro
{{m.agg(col_group='country', col_agg='price', from_table='sales')}}

Configuring support for macros¶

We have to make a small change to our pipeline.yaml file to work with macros. So far, to specify which SQL files to use, we’ve just passed the file’s path in the source key. However, to import macros in our scripts, we must configure a source loader.

A source loader is simply a folder with files, with small addition: it defines a “jinja environment” that makes imports work (to know more about jinja environments, click here.

Let’s say all the scripts in our pipeline are in a sql/ directory. sql/ has two scripts, which correspond to the files shown in the previous section:

[1]:

%%sh
tree sql

sql
├── create-table.sql
└── macros.sql

0 directories, 2 files

To configure our source loader. We need to add a source_loader section like this:

# Content of pipeline.yaml
meta:
  # initialize source loader
  source_loader:
    # use the sql/ folder as the "root" for loading files
    path: sql/


tasks:
  # sources are now loaded from the source loader, paths are relative
  # to the source loader root directory
  - source: create-table.sql
    name: sql-task
    product: [some_table, table]
    client: db.get_client

Printing rendered code¶

Templated SQL helps us write more concise SQL code, but if your template renders to an invalid SQL script, you’ll get syntax errors, only use it when the benefits outweigh this risk. One way to debug SQL templates is to see how the rendered code looks like, you can do so from the command line:

[2]:

%%sh
ploomber task sql-task --source

Loading pipeline...
-- import macros


DROP TABLE IF EXISTS some_table;

CREATE TABLE some_table AS
-- use macro
SELECT
    country,

    AVG(price) as AVG_price,

    STDEV(price) as STDEV_price,

    COUNT(price) as COUNT_price,

    SUM(price) as SUM_price,

    MAX(price) as MAX_price,

    MIN(price) as MIN_price

FROM sales
GROUP BY country

100%|██████████| 1/1 [00:00<00:00, 874.18it/s]

As we can see, our template is generating a valid SQL script. But it’d be easier to spot errors in the rendered code than in the templated source if it didn’t.

Where to go next¶

Jinja documentation

Contents