To run this example locally, install Ploomber and execute: ploomber examples -n guides/spec-api-python

To start a free, hosted JupyterLab instance, launch this example on Binder.

Found an issue? Let us know.

Have questions? Ask us anything on Slack.

Your first Python pipeline

Introductory tutorial to learn the basics of Ploomber.

Note: This tutorial is a quick introduction. If you want to learn about Ploomber’s core concepts and design rationale, go to the next tutorial.

Why Ploomber?

Notebooks are hard to maintain. Teams often prototype projects in notebooks, but maintaining them is an error-prone process that slows progress down. Ploomber overcomes the challenges of working with .ipynb files, allowing teams to develop collaborative, production-ready pipelines interactively using JupyterLab or any text editor.


A pipeline (or DAG) is a group of tasks with a particular execution order, where subsequent (or downstream) tasks use the outputs of previous (or upstream) tasks as inputs.

This example pipeline contains three tasks: the first gets some data, the second cleans it, and the third generates a visualization:

ls *.py

Note: These tasks are Python scripts, but you can use Python functions, Jupyter notebooks, R scripts and SQL scripts.

Note: This is a simple three-task pipeline, but Ploomber can manage arbitrarily complex pipelines and dependencies among tasks.

Integration with Jupyter

Ploomber integrates with Jupyter. If you open the scripts inside the Jupyter Notebook app, they will render as notebooks. If you’re using JupyterLab, you need to right-click -> open with -> Notebook as shown below:


Note: You can use regular .ipynb files for your pipeline; however, using plain .py files is recommended since they’re easier to manage with git.

Along with the *.py files, there is a pipeline.yaml file where we declare which files we use as tasks:

# Content of pipeline.yaml
tasks:
  # source is the code you want to execute
  - source: 1-get.py
    # product = task outputs
    product:
      # nb is short for 'notebook'
      nb: output/1-get.ipynb
      # you can define as many outputs (of any type) as you want
      data: output/data.csv
      # e.g., another: output/another.parquet

  # the outputs of '1-get' become inputs of '2-clean'
  - source: 2-clean.py
    product:
      nb: output/2-clean.ipynb
      data: output/clean.csv

  # the outputs of '2-clean' become inputs of '3-plot'
  - source: 3-plot.py
    product: output/3-plot.ipynb

  # add more tasks by adding new entries here

Note: YAML is a human-readable text format similar to JSON; Ploomber uses it to describe the tasks in our pipeline.
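Since YAML parses into the same plain dictionaries and lists as JSON, the same task entry could be expressed as JSON. A minimal sketch with the standard library (task names taken from this example):

```python
import json

# The first task entry from pipeline.yaml, written as equivalent JSON
spec = json.loads("""
{
  "tasks": [
    {"source": "1-get.py",
     "product": {"nb": "output/1-get.ipynb", "data": "output/data.csv"}}
  ]
}
""")

print(spec["tasks"][0]["product"]["data"])  # output/data.csv
```

Either way, the parsed result is an ordinary nested data structure describing the tasks.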

Let’s plot the pipeline:

ploomber plot
Plot saved at: pipeline.png
100%|██████████| 3/3 [00:00<00:00, 3096.19it/s]
from IPython.display import Image
Image(filename='pipeline.png')

You can see that our pipeline has a defined execution order: 1-get -> 2-clean -> 3-plot.
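The order is a topological sort of the dependency graph. A toy sketch of how this example's upstream declarations resolve to an execution order, using the standard library (Ploomber's internals differ):

```python
from graphlib import TopologicalSorter  # Python 3.9+

# each task mapped to its upstream dependencies, as in this example
deps = {'1-get': [], '2-clean': ['1-get'], '3-plot': ['2-clean']}

order = list(TopologicalSorter(deps).static_order())
print(order)  # ['1-get', '2-clean', '3-plot']
```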

Let’s now execute the status command, which gives us an overview of the pipeline:

ploomber status
name     Last run      Outdated?    Product       Doc (short)    Location
-------  ------------  -----------  ------------  -------------  ------------
1-get    Has not been  Source code  MetaProduct(                 /Users/Edu/d
         run                        {'data': Fil                 ev/projects-
                                    e('output/da                 ploomber/gui
                                    ta.csv'),                    des/spec-
                                    'nb': File('                 api-python/1
2-clean  Has not been  Source code  MetaProduct(                 /Users/Edu/d
         run           & Upstream   {'data': Fil                 ev/projects-
                                    e('output/cl                 ploomber/gui
                                    ean.csv'),                   des/spec-
                                    'nb': File('                 api-python/2
3-plot   Has not been  Source code  File('output                 /Users/Edu/d
         run           & Upstream   /3-plot.ipyn                 ev/projects-
                                    b')                          ploomber/gui
100%|██████████| 3/3 [00:00<00:00, 3187.97it/s]

We can see a summary of each task: last execution date, if it’s outdated (i.e., source code changed since previous execution), product (output files), documentation (if any), and the source code location.

How is execution order determined?

Ploomber infers the pipeline structure from your code. For example, to clean the data, we must get it first; hence, we declare the following in 2-clean.py:

# this tells Ploomber to execute the '1-get' task before '2-clean'
upstream = ['1-get']
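At runtime, Ploomber replaces this declaration with a dictionary mapping each dependency to its products, so the downstream script can locate its inputs. A minimal standalone sketch of what 2-clean.py might do (the fake data, the 'NA' marker, and the cleaning step are assumptions for illustration; the paths mirror this example's pipeline.yaml):

```python
import csv
import os

os.makedirs('output', exist_ok=True)

# Fake the upstream product so this sketch runs standalone
# (in the real pipeline, the '1-get' task creates this file)
with open('output/data.csv', 'w', newline='') as f:
    csv.writer(f).writerows([['x'], ['1'], ['NA'], ['2']])

# Ploomber injects `upstream` and `product` into the script at runtime;
# the values below mirror what this example's pipeline.yaml declares
upstream = {'1-get': {'data': 'output/data.csv'}}
product = {'data': 'output/clean.csv'}

# read the previous task's output
with open(upstream['1-get']['data'], newline='') as f:
    rows = list(csv.reader(f))

# example cleaning step: drop rows with a missing-value marker
header, body = rows[0], [r for r in rows[1:] if r[0] != 'NA']

# write this task's product
with open(product['data'], 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(header)
    writer.writerows(body)
```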

Once we finish cleaning the data, we must save it somewhere (in Ploomber, an output is known as a product). Products can be files or SQL relations. Our current example only generates files.

To specify where to save the output of each task, we use the product key. For example, the 1-get task definition looks like this:

- source: 1-get.py
  # task outputs
  product:
    # nb is generated by executing the file as a notebook
    nb: output/1-get.ipynb
    # declare any other outputs here
    data: output/data.csv

Scripts automatically generate a copy of themselves in Jupyter notebook format (.ipynb). That’s why we see a notebook in the product dictionary (under the nb key). Generating a copy on each execution allows us to create standalone reports for each task; no need to write extra code to save our charts! Notebooks as outputs are an essential concept: 1-get.py is part of the pipeline’s source code; in contrast, output/1-get.ipynb is an artifact generated by the source code.

If you don’t want to generate output notebooks, you can use Python functions as tasks. Our upcoming tutorial goes deeper into the different types of tasks.
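With function-based tasks, Ploomber calls each function with the paths it should write to (and, if the task has dependencies, an upstream mapping). A minimal sketch of what such a module might look like; the module name, function names, and file contents are assumptions for illustration:

```python
# tasks.py -- a hypothetical module with function-based tasks
from pathlib import Path


def get(product):
    """Ploomber calls this with the paths declared under `product` in pipeline.yaml."""
    out = Path(str(product['data']))
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text('x\n1\n2\n')  # illustrative raw data


def clean(upstream, product):
    """`upstream` maps each declared dependency to its products."""
    raw = Path(str(upstream['get']['data'])).read_text()
    # illustrative cleaning step: normalize trailing whitespace
    Path(str(product['data'])).write_text(raw.strip() + '\n')
```

In pipeline.yaml, such a task would point at the function with a dotted path (e.g., `source: tasks.get`) instead of a script file.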

Building the pipeline

Let’s build the pipeline:

# takes a few seconds to finish
ploomber build
name     Ran?      Elapsed (s)    Percentage
-------  ------  -------------  ------------
1-get    True          7.14704       54.1592
2-clean  True          2.35617       17.8547
3-plot   True          3.69315       27.9861
Building task '1-get':   0%|          | 0/3 [00:00<?, ?it/s]
Executing:   0%|          | 0/6 [00:00<?, ?cell/s]
Executing:  17%|█▋        | 1/6 [00:06<00:32,  6.43s/cell]
Executing: 100%|██████████| 6/6 [00:07<00:00,  1.18s/cell]
Building task '2-clean':  33%|███▎      | 1/3 [00:07<00:14,  7.15s/it]
Executing:   0%|          | 0/5 [00:00<?, ?cell/s]
Executing:  20%|██        | 1/5 [00:01<00:07,  1.86s/cell]
Executing: 100%|██████████| 5/5 [00:02<00:00,  2.16cell/s]
Building task '3-plot':  67%|██████▋   | 2/3 [00:09<00:04,  4.33s/it]
Executing:   0%|          | 0/7 [00:00<?, ?cell/s]
Executing:  14%|█▍        | 1/7 [00:02<00:15,  2.53s/cell]
Executing:  57%|█████▋    | 4/7 [00:02<00:01,  1.94cell/s]
Executing: 100%|██████████| 7/7 [00:03<00:00,  1.94cell/s]
Building task '3-plot': 100%|██████████| 3/3 [00:13<00:00,  4.40s/it]

This pipeline saves all the output in the output/ directory; we have a few data files:

ls output/*.csv

And a notebook for each script:

ls output/*.ipynb

Updating the pipeline

Quick experimentation is essential to analyze data. Ploomber allows you to iterate faster and run more experiments.

Say you found a problematic column and need to add a few more lines to your 2-clean.py script. Since 1-get does not depend on 2-clean, we don’t have to rerun it. However, if we modify 2-clean.py and want to bring our results up-to-date, we must run 2-clean and then 3-plot, in that order. To save you valuable time, Ploomber keeps track of those dependencies and only reruns outdated tasks.

To see how it works, execute the following to modify the 2-clean.py script:

from pathlib import Path

path = Path('2-clean.py')
clean = path.read_text()
path.write_text(clean + '\nprint("hello")')

Let’s now build again:

# takes a few seconds to finish
ploomber build
name     Ran?      Elapsed (s)    Percentage
-------  ------  -------------  ------------
2-clean  True          2.296         38.6155
3-plot   True          3.64979       61.3845
1-get    False         0              0
Building task '2-clean':   0%|          | 0/2 [00:00<?, ?it/s]
Executing:   0%|          | 0/6 [00:00<?, ?cell/s]
Executing:  17%|█▋        | 1/6 [00:01<00:09,  1.84s/cell]
Executing: 100%|██████████| 6/6 [00:02<00:00,  2.67cell/s]
Building task '3-plot':  50%|█████     | 1/2 [00:02<00:02,  2.30s/it]
Executing:   0%|          | 0/7 [00:00<?, ?cell/s]
Executing:  14%|█▍        | 1/7 [00:02<00:15,  2.53s/cell]
Executing:  57%|█████▋    | 4/7 [00:02<00:01,  1.96cell/s]
Executing: 100%|██████████| 7/7 [00:03<00:00,  1.96cell/s]
Building task '3-plot': 100%|██████████| 2/2 [00:05<00:00,  2.98s/it]
# restore contents
path.write_text(clean)

You’ll see that 1-get didn’t run because it was not affected by the change!

Incremental builds are a powerful feature: you can open any of the .py files in Jupyter, edit them interactively (as if they were notebooks), then call ploomber build to quickly get your results up-to-date.
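Conceptually, an incremental build skips a task when its products exist and its source code hasn’t changed since the last run. A toy sketch of the idea (Ploomber’s actual implementation is more sophisticated and also tracks upstream changes and metadata):

```python
import hashlib
from pathlib import Path


def fingerprint(path):
    """Hash a file's contents; a changed hash marks the task as outdated."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def is_outdated(source, products, last_fingerprint):
    """Rerun if any product is missing or the source code changed."""
    if any(not Path(p).exists() for p in products):
        return True
    return fingerprint(source) != last_fingerprint
```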

Where to go from here

This tutorial shows a bit of what Ploomber can do for you. However, there are many other features to discover: task parallelization, parametrization, execution in the cloud, among others.

Want to learn more about what Ploomber is good for? Check out the use cases documentation.

Want to dig deeper into Ploomber’s core concepts? Check out the basic concepts tutorial.

Want to take a look at some examples? Check out how to download templates.

Have questions? Ask us anything on Slack or open an issue on GitHub.

Do you like our project? Show your support with a star on GitHub!
