To run this locally, install Ploomber and execute: ploomber examples -n guides/parametrized
Parametrized pipelines
Tutorial showing how to parametrize pipelines and change parameters from the command-line.
Often, pipelines perform the same operation over different subsets of the data. For example, say you are developing visualizations of economic data. You might want to generate the same charts for different countries.
One way to approach the problem is to have a for loop on each pipeline task to process all needed countries. But such an approach adds unnecessary complexity to our code; it’s better to keep our logic simple (each task processes a single country) and take the iterative logic out of our pipeline.
Ploomber allows you to do so using parametrized pipelines. Let's see a sample using a pipeline.yaml file.
Spec API (pipeline.yaml)
# Content of pipeline.yaml
tasks:
  - source: print.py
    name: print
    product:
      nb: 'output/{{some_param}}/notebook.html'
    papermill_params:
      log_output: True
    params:
      some_param: '{{some_param}}'
The pipeline.yaml above has a placeholder called some_param. It comes from a file called env.yaml:
# Content of env.yaml
some_param: default_value
When reading your pipeline.yaml, Ploomber looks for an env.yaml file. If found, all defined keys become available to your pipeline definition. You can use these placeholders (strings between double curly brackets) in any of the fields of your pipeline.yaml file.
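To build intuition for how a placeholder resolves, here is a minimal sketch of the substitution. Ploomber uses jinja2 under the hood; plain string replacement stands in for it here, and the values are the ones from the env.yaml above:

```python
# Minimal sketch of '{{some_param}}' resolution. Ploomber actually renders
# placeholders with jinja2; str.replace emulates the idea for illustration.
env = {'some_param': 'default_value'}
template = 'output/{{some_param}}/notebook.html'

rendered = template
for key, value in env.items():
    rendered = rendered.replace('{{' + key + '}}', value)

print(rendered)  # output/default_value/notebook.html
```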
In our case, we are using it in two places. First, we save the executed notebook in a folder named after the value of some_param; this lets you keep copies of the generated output in a different folder depending on your parameter. Second, if we want to use the parameter in our code, we have to pass it to our tasks; all tasks accept an optional params section with arbitrary parameters.
Let's see what the code looks like:
# Content of print.py
# + tags=["parameters"]
upstream = None
product = None
some_param = None
# +
print('some_param: ', some_param, ' type: ', type(some_param))
Our task is a Python script, meaning that parameters are passed as an injected cell at runtime. Let’s see what happens if we build our pipeline.
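To make the injection concrete, here is a rough sketch of what the injected cell might contain after running with the default env.yaml value (the exact shape of the injected product variable is an assumption for illustration):

```python
# Sketch (assumption): roughly what the papermill-injected cell looks like
# when the pipeline runs with the default value from env.yaml
upstream = None
product = {'nb': 'output/default_value/notebook.html'}
some_param = 'default_value'

print('some_param: ', some_param, ' type: ', type(some_param))
```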
[1]:
%%capture captured
%%sh
ploomber build --force --log INFO
[2]:
def filter_output(captured, startswith):
return print('\n'.join([
line for line in captured.stderr.split('\n')
if line.startswith(startswith)
]))
filter_output(captured, startswith='INFO:papermill:some_param')
INFO:papermill:some_param: default_value type: <class 'str'>
We see that our parameter some_param takes the default value (default_value) defined in env.yaml. The command-line interface is aware of any parameters; you can see them using the --help option:
[3]:
%%sh
ploomber build --help
usage: ploomber [-h] [--log LOG] [--log-file LOG_FILE]
[--entry-point ENTRY_POINT] [--force] [--skip-upstream]
[--partially PARTIALLY] [--debug]
[--env--some_param ENV__SOME_PARAM]
Build pipeline
optional arguments:
-h, --help show this help message and exit
--log LOG, -l LOG Enables logging to stdout at the specified level
--log-file LOG_FILE, -F LOG_FILE
Enables logging to the given file
--entry-point ENTRY_POINT, -e ENTRY_POINT
Entry point, defaults to pipeline.yaml
--force, -f Force execution by ignoring status
--skip-upstream, -su Skip building upstream dependencies. Only applicable
when using --partially
--partially PARTIALLY, -p PARTIALLY
Build a pipeline partially until certain task
--debug, -d Drop a debugger session if an exception happens
--env--some_param ENV__SOME_PARAM
Default: default_value
Apart from the default options of the ploomber build command, Ploomber automatically adds any parameters from env.yaml, so we can easily override the default value. Let's do that:
[4]:
%%capture captured
%%sh
ploomber build --force --env--some_param another_value --log INFO
[5]:
filter_output(captured, startswith='INFO:papermill:some_param')
INFO:papermill:some_param: another_value type: <class 'str'>
We see that our task effectively changed the value!
Finally, let's see what the output/ folder looks like:
[2]:
%%sh
tree output
output
├── another_value
│   └── notebook.html
└── default_value
    └── notebook.html
2 directories, 2 files
We have separate folders for each parameter, helping to keep things organized and taking the looping logic out of our pipeline.
Notes
There are some built-in placeholders that you can use without having an env.yaml file. For example, {{here}} will expand to the pipeline.yaml parent directory. Check out the Spec API documentation for more information.

This example uses a Python script as a task. In a SQL pipeline, you can achieve the same effect by using the placeholder in the product's schema or in a table/view name prefix.

If the parameter takes many different values and you want to run your pipeline with all of them, calling them by hand gets tedious. You have two options: 1) write a bash script that calls the CLI with different parameter values, or 2) use the Python API (everything the CLI can do, you can do with Python directly); take a look at the DAGSpec documentation.

Parametrized pipeline.yaml files are a great way to simplify a task's logic, but don't overdo it. If you find yourself adding too many parameters, it's a better idea to use the Python API directly; factory functions are the correct pattern for highly customized pipeline construction.

Given that the two pipelines are entirely independent, we could even run them in parallel.
Python API (factory functions)
Parametrization is straightforward when using a factory function. If your factory takes parameters, they’ll also be available in the command-line interface. Types are inferred from type hints. Let’s see an example:
# Content of factory.py
from ploomber import DAG
def make(param: str, another: int = 10):
dag = DAG()
# add tasks to your pipeline...
return dag
Our function takes two parameters: param and another. Parameters without default values (param) become positional arguments, and parameters with default values become optional arguments (another). To see the auto-generated CLI, use the --help option:
[7]:
%%sh
ploomber build --entry-point factory.make --help
usage: ploomber [-h] [--log LOG] [--log-file LOG_FILE]
[--entry-point ENTRY_POINT] [--force] [--skip-upstream]
[--partially PARTIALLY] [--debug] [--another ANOTHER]
param
Build pipeline
positional arguments:
param
optional arguments:
-h, --help show this help message and exit
--log LOG, -l LOG Enables logging to stdout at the specified level
--log-file LOG_FILE, -F LOG_FILE
Enables logging to the given file
--entry-point ENTRY_POINT, -e ENTRY_POINT
Entry point, defaults to pipeline.yaml
--force, -f Force execution by ignoring status
--skip-upstream, -su Skip building upstream dependencies. Only applicable
when using --partially
--partially PARTIALLY, -p PARTIALLY
Build a pipeline partially until certain task
--debug, -d Drop a debugger session if an exception happens
--another ANOTHER
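The signature-to-CLI mapping shown above can be sketched with the standard library. This mimics the behavior described in this section; it is not Ploomber's actual implementation:

```python
# Sketch: derive a CLI from a factory's signature using inspect + argparse
# (illustration only, not Ploomber's real code)
import argparse
import inspect

def make(param: str, another: int = 10):
    """Stand-in for the factory in factory.py."""

parser = argparse.ArgumentParser()
for name, arg in inspect.signature(make).parameters.items():
    if arg.default is inspect.Parameter.empty:
        # no default value -> positional argument
        parser.add_argument(name, type=arg.annotation)
    else:
        # default value -> optional argument, type taken from the hint
        parser.add_argument(f'--{name}', type=arg.annotation,
                            default=arg.default)

args = parser.parse_args(['my_value', '--another', '42'])
print(args.param, args.another)  # my_value 42
```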
Note that the Python API requires more work than a pipeline.yaml file, but it is more flexible. See the documentation for examples using the Python API.