To run this example locally, install Ploomber and execute: ploomber examples -n guides/first-pipeline

Have questions? Ask us anything on Slack.

Your first Python pipeline

Introductory tutorial to learn the basics of Ploomber.

Introduction

Ploomber helps you build modular pipelines. A pipeline (or DAG) is a group of tasks with a particular execution order, where subsequent (or downstream) tasks use the outputs of previous (or upstream) tasks as inputs.

Pipeline declaration

This example pipeline contains five tasks: 1-get.py, 2-profile-raw.py, 3-clean.py, 4-profile-clean.py, and 5-plot.py. We declare them in a pipeline.yaml file:

# Content of pipeline.yaml
tasks:
  # source is the code you want to execute (.ipynb also supported)
  - source: 1-get.py
    # products are the task's outputs
    product:
      # scripts generate executed notebooks as outputs
      nb: output/1-get.ipynb
      # you can define as many outputs as you want
      data: output/raw_data.csv

  - source: 2-profile-raw.py
    product: output/2-profile-raw.ipynb

  - source: 3-clean.py
    product:
      nb: output/3-clean.ipynb
      data: output/clean_data.parquet

  - source: 4-profile-clean.py
    product: output/4-profile-clean.ipynb

  - source: 5-plot.py
    product: output/5-plot.ipynb

Note: YAML is a human-readable text format similar to JSON.
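To make the comparison with JSON concrete, here is a minimal sketch of the first two task entries written as plain Python data and serialized with the standard json module. It only illustrates the structure of the file; it is not how Ploomber loads pipeline.yaml, and the variable names (pipeline, first_task) are made up for this example.

```python
import json

# The first two task entries of pipeline.yaml as Python data:
# YAML mappings become dicts and YAML sequences become lists.
pipeline = {
    "tasks": [
        {
            "source": "1-get.py",
            "product": {
                "nb": "output/1-get.ipynb",
                "data": "output/raw_data.csv",
            },
        },
        {
            "source": "2-profile-raw.py",
            "product": "output/2-profile-raw.ipynb",
        },
    ]
}

# The same structure serialized as JSON
print(json.dumps(pipeline, indent=2))

# A product can be a single path or a mapping of named paths
first_task = pipeline["tasks"][0]
print(first_task["product"]["data"])
```

Note how a task with a single output uses a plain string as its product, while a task with several outputs uses a mapping with one key per output.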

Note: Ploomber supports Python scripts, Python functions, Jupyter notebooks, R scripts, and SQL scripts.

Opening .py files as notebooks

Ploomber integrates with Jupyter. Among other things, it allows you to open .py files as notebooks (via jupytext).
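As a rough illustration of what a notebook stored as a script can look like, the sketch below uses jupytext's percent format, where each "# %%" comment marks the start of a cell. The file name, variables, and values are hypothetical, and the percent format is only one of the text formats jupytext can read, so the scripts in this example may be laid out differently.

```python
# example.py (hypothetical file, not part of this pipeline)

# %% [markdown]
# # Summarize some values
# This comment block shows up as a Markdown cell in Jupyter.

# %%
# First code cell: define some data
values = [1, 2, 3]

# %%
# Second code cell: compute and display a result
total = sum(values)
print(total)
```

Because the file is still a regular Python script, it can be versioned and diffed like any other source file while remaining editable as a notebook.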

What sets the execution order?

Ploomber infers the pipeline structure from your code. For example, to clean the data, we must get it first; hence, we declare the following in 3-clean.py:

# 3-clean.py

# this tells Ploomber to execute the '1-get' task before '3-clean'
upstream = ['1-get']
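
To show how upstream declarations translate into an execution order, here is a minimal sketch that uses graphlib from the Python standard library (Python 3.9+). It is not Ploomber's implementation. Only the dependency of 3-clean on 1-get is stated in this tutorial; the other entries in upstream_by_task are assumptions made for illustration.

```python
from graphlib import TopologicalSorter

# Maps each task to the tasks it depends on (its upstream tasks).
# Only '3-clean' -> '1-get' comes from the tutorial; the remaining
# dependencies are assumed for the sake of the example.
upstream_by_task = {
    "1-get": [],
    "2-profile-raw": ["1-get"],
    "3-clean": ["1-get"],
    "4-profile-clean": ["3-clean"],
    "5-plot": ["3-clean"],
}

# A topological order places every task after all of its upstream tasks
order = list(TopologicalSorter(upstream_by_task).static_order())
print(order)
```

Several valid orders can exist for the same graph; any of them guarantees that a task runs only after the tasks it depends on.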

Plotting the pipeline

[1]:
%%bash
ploomber plot
Loading pipeline...
Plot saved at: pipeline.png
[2]:
from IPython.display import Image
Image(filename='pipeline.png')
[2]:
[Image: pipeline.png, a diagram of the pipeline's tasks and their dependencies]

You can see that our pipeline has a defined execution order.

Note: This is a sample, predefined five-task pipeline; Ploomber can manage arbitrarily complex pipelines and dependencies among tasks.

Running the pipeline

[3]:
%%bash
# takes a few seconds to finish
ploomber build
Loading pipeline...
name             Ran?      Elapsed (s)    Percentage
---------------  ------  -------------  ------------
1-get            True          2.68205       13.9314
2-profile-raw    True          4.57552       23.7666
3-clean          True          2.53486       13.1668
4-profile-clean  True          3.8043        19.7607
5-plot           True          5.65512       29.3745
Building task '1-get':   0%|          | 0/5 [00:00<?, ?it/s]
Executing:   0%|          | 0/6 [00:00<?, ?cell/s]
Executing:  17%|█▋        | 1/6 [00:01<00:08,  1.77s/cell]
Executing: 100%|██████████| 6/6 [00:02<00:00,  2.30cell/s]
Building task '2-profile-raw':  20%|██        | 1/5 [00:02<00:10,  2.69s/it]
Executing:   0%|          | 0/7 [00:00<?, ?cell/s]
Executing:  14%|█▍        | 1/7 [00:01<00:07,  1.32s/cell]
Executing:  43%|████▎     | 3/7 [00:02<00:03,  1.24cell/s]
Executing:  71%|███████▏  | 5/7 [00:03<00:01,  1.48cell/s]
Executing:  86%|████████▌ | 6/7 [00:03<00:00,  1.77cell/s]
Executing: 100%|██████████| 7/7 [00:04<00:00,  1.56cell/s]
Building task '3-clean':  40%|████      | 2/5 [00:07<00:11,  3.80s/it]
Executing:   0%|          | 0/9 [00:00<?, ?cell/s]
Executing:  11%|█         | 1/9 [00:01<00:14,  1.78s/cell]
Executing:  44%|████▍     | 4/9 [00:01<00:01,  2.66cell/s]
Executing:  67%|██████▋   | 6/9 [00:02<00:00,  4.23cell/s]
Executing: 100%|██████████| 9/9 [00:02<00:00,  3.68cell/s]
Building task '4-profile-clean':  60%|██████    | 3/5 [00:09<00:06,  3.22s/it]
Executing:   0%|          | 0/7 [00:00<?, ?cell/s]
Executing:  14%|█▍        | 1/7 [00:01<00:07,  1.18s/cell]
Executing:  43%|████▎     | 3/7 [00:02<00:02,  1.40cell/s]
Executing:  57%|█████▋    | 4/7 [00:02<00:01,  1.76cell/s]
Executing:  71%|███████▏  | 5/7 [00:02<00:00,  2.03cell/s]
Executing:  86%|████████▌ | 6/7 [00:03<00:00,  2.39cell/s]
Executing: 100%|██████████| 7/7 [00:03<00:00,  1.88cell/s]
Building task '5-plot':  80%|████████  | 4/5 [00:13<00:03,  3.45s/it]
Executing:   0%|          | 0/8 [00:00<?, ?cell/s]
Executing:  12%|█▎        | 1/8 [00:02<00:16,  2.32s/cell]
Executing:  50%|█████     | 4/8 [00:02<00:01,  2.07cell/s]
Executing:  75%|███████▌  | 6/8 [00:04<00:01,  1.64cell/s]
Executing: 100%|██████████| 8/8 [00:05<00:00,  1.44cell/s]
Building task '5-plot': 100%|██████████| 5/5 [00:19<00:00,  3.85s/it]

This pipeline saves all the output in the output/ directory; we have the output notebooks and data files:

[4]:
%%bash
ls output
1-get.ipynb
2-profile-raw.ipynb
3-clean.ipynb
4-profile-clean.ipynb
5-plot.ipynb
clean_data.parquet
raw_data.csv

Updating the pipeline

Ploomber automatically caches your pipeline’s previous results and, on the next build, only runs the tasks whose source code changed since your last execution, together with the tasks downstream of them.

Execute the following to modify the 3-clean.py script:

[5]:
from pathlib import Path

path = Path('3-clean.py')
clean = path.read_text()

# add a print statement at the end of 3-clean.py
path.write_text(clean + """
print("hello")
""")
[5]:
417

Execute the pipeline again:

[6]:
%%bash
# takes a few seconds to finish
ploomber build
Loading pipeline...
name             Ran?      Elapsed (s)    Percentage
---------------  ------  -------------  ------------
3-clean          True          2.35839       20.2362
4-profile-clean  True          3.73398       32.0396
5-plot           True          5.56192       47.7242
1-get            False         0              0
2-profile-raw    False         0              0
Building task '3-clean':   0%|          | 0/3 [00:00<?, ?it/s]
Executing:   0%|          | 0/9 [00:00<?, ?cell/s]
Executing:  11%|█         | 1/9 [00:01<00:12,  1.61s/cell]
Executing:  44%|████▍     | 4/9 [00:01<00:01,  2.92cell/s]
Executing:  67%|██████▋   | 6/9 [00:01<00:00,  4.56cell/s]
Executing: 100%|██████████| 9/9 [00:02<00:00,  3.97cell/s]
Building task '4-profile-clean':  33%|███▎      | 1/3 [00:02<00:04,  2.36s/it]
Executing:   0%|          | 0/7 [00:00<?, ?cell/s]
Executing:  14%|█▍        | 1/7 [00:01<00:08,  1.36s/cell]
Executing:  43%|████▎     | 3/7 [00:02<00:03,  1.33cell/s]
Executing:  71%|███████▏  | 5/7 [00:02<00:00,  2.10cell/s]
Executing:  86%|████████▌ | 6/7 [00:03<00:00,  2.40cell/s]
Executing: 100%|██████████| 7/7 [00:03<00:00,  1.91cell/s]
Building task '5-plot':  67%|██████▋   | 2/3 [00:06<00:03,  3.17s/it]
Executing:   0%|          | 0/8 [00:00<?, ?cell/s]
Executing:  12%|█▎        | 1/8 [00:02<00:15,  2.21s/cell]
Executing:  50%|█████     | 4/8 [00:02<00:01,  2.16cell/s]
Executing:  75%|███████▌  | 6/8 [00:03<00:01,  1.67cell/s]
Executing: 100%|██████████| 8/8 [00:05<00:00,  1.46cell/s]
Building task '5-plot': 100%|██████████| 3/3 [00:11<00:00,  3.89s/it]
[7]:
# restore contents
path.write_text(clean)
[7]:
401

You’ll see that 1-get.py and 2-profile-raw.py didn’t run because they were not affected by the change!
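
The skipping behavior can be approximated with a small sketch: store a hash of each task's source, then mark a task as outdated if its hash changed or if any of its upstream tasks is outdated. This is only an illustration under stated assumptions, not Ploomber's actual caching mechanism. The helper names (source_hash, outdated_tasks), the placeholder sources, and every dependency other than 3-clean on 1-get are hypothetical.

```python
import hashlib


def source_hash(code: str) -> str:
    """Fingerprint a task's source code."""
    return hashlib.sha256(code.encode("utf-8")).hexdigest()


def outdated_tasks(sources, stored_hashes, upstream_by_task):
    """Return tasks whose source changed, plus every task downstream of them."""
    outdated = {
        name for name, code in sources.items()
        if stored_hashes.get(name) != source_hash(code)
    }
    # Propagate: a task is outdated if any of its upstream tasks is outdated
    changed = True
    while changed:
        changed = False
        for task, deps in upstream_by_task.items():
            if task not in outdated and any(dep in outdated for dep in deps):
                outdated.add(task)
                changed = True
    return outdated


# Assumed dependencies (only '3-clean' -> '1-get' is stated in the tutorial)
upstream_by_task = {
    "1-get": [],
    "2-profile-raw": ["1-get"],
    "3-clean": ["1-get"],
    "4-profile-clean": ["3-clean"],
    "5-plot": ["3-clean"],
}

# Placeholder sources and the hashes recorded after a first full build
sources = {name: f"# code for {name}\n" for name in upstream_by_task}
stored = {name: source_hash(code) for name, code in sources.items()}

# Modify 3-clean, as in the tutorial
sources["3-clean"] += 'print("hello")\n'

to_run = sorted(outdated_tasks(sources, stored, upstream_by_task))
print(to_run)  # ['3-clean', '4-profile-clean', '5-plot']
```

Under these assumed dependencies, the result matches the build table above: only 3-clean and the tasks that depend on it need to run again.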

Where to go from here

Bring your own code! Check out the tutorial to migrate your code to Ploomber.

Want to dig deeper into Ploomber’s core concepts? Check out the basic concepts tutorial.

Want to start a new project quickly? Check out how to get examples.