To run this example locally, install Ploomber and execute: ploomber examples -n guides/versioning

# Versioning

Note: This feature requires Ploomber 0.17.1 or higher.

A tutorial showing how to version pipeline products.

Although Ploomber is not a data versioning solution, it offers a simple way to organize pipeline artifacts via placeholders. Note that the git-based placeholders ({{git}} and {{git_hash}}) require your project to be in a git repository.

## Using {{git}}

Let’s look at the first example, which uses the {{git}} placeholder:

# Content of pipeline.git.yaml
product:
  data: 'output/{{git}}/data.csv'

product:
  nb: 'output/{{git}}/plot.html'


You can see that both tasks use {{git}}. When Ploomber executes the pipeline, it replaces the placeholder by applying the first of the following rules that matches:

1. If currently at the tip of the branch, return the branch name

2. If the current commit has a tag, return the tag name

3. Otherwise, return the hash for the current commit (appending -dirty if there are uncommitted changes)
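The three rules above can be sketched as a small pure function. This is only an illustration of the documented order, not Ploomber's implementation: the function name `resolve_git` and its parameters are hypothetical, and the repository state is passed in explicitly rather than read from git.

```python
from typing import Optional


def resolve_git(branch: Optional[str], tag: Optional[str],
                commit_hash: str, dirty: bool) -> str:
    """Hypothetical helper that mirrors the documented {{git}} rules.

    branch: branch name if HEAD is at the tip of a branch, else None
    tag: tag name if the current commit is tagged, else None
    commit_hash: hash of the current commit
    dirty: True if there are uncommitted changes
    """
    # Rule 1: at the tip of a branch, use the branch name
    if branch is not None:
        return branch
    # Rule 2: otherwise, use the tag if the commit has one
    if tag is not None:
        return tag
    # Rule 3: fall back to the hash, marking uncommitted changes
    return f"{commit_hash}-dirty" if dirty else commit_hash


print(resolve_git("master", "v1.0", "62a3494", dirty=False))  # master
print(resolve_git(None, "v1.0", "62a3494", dirty=False))      # v1.0
print(resolve_git(None, None, "62a3494", dirty=True))         # 62a3494-dirty
```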

Let’s see how it works:

[1]:

from pathlib import Path

from ploomber.spec import DAGSpec

[2]:

dag = DAGSpec('pipeline.git.yaml').to_dag()
# Inspect where the notebook product of the load task will be stored
dag['load'].product['nb']

[2]:

File('output/master/load.html')


The product will be stored in the output/master directory: {{git}} resolves to master because we’re at the tip of that branch.

## Using {{git_hash}}

The {{git_hash}} placeholder is similar to {{git}}, except that it never returns the branch name. The rules are as follows:

1. If the current commit has a tag, return the tag name

2. Otherwise, return the hash for the current commit (appending -dirty if there are uncommitted changes)
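As a rough sketch of these two rules (again, not Ploomber's actual code; `resolve_git_hash` and its parameters are hypothetical names used for illustration), note that the branch never enters the decision, so the result identifies the commit even when you are at the tip of a branch:

```python
from typing import Optional


def resolve_git_hash(tag: Optional[str], commit_hash: str,
                     dirty: bool) -> str:
    """Hypothetical helper that mirrors the documented {{git_hash}} rules.

    tag: tag name if the current commit is tagged, else None
    commit_hash: hash of the current commit
    dirty: True if there are uncommitted changes
    """
    # Rule 1: a tagged commit resolves to the tag name
    if tag is not None:
        return tag
    # Rule 2: otherwise use the hash, marking uncommitted changes
    return f"{commit_hash}-dirty" if dirty else commit_hash


print(resolve_git_hash("v1.0", "62a3494", dirty=False))  # v1.0
print(resolve_git_hash(None, "62a3494", dirty=True))     # 62a3494-dirty
```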

This is what our sample pipeline.git_hash.yaml looks like:

# Content of pipeline.git_hash.yaml
product:
  data: 'output/{{git_hash}}/data.csv'

product:
  nb: 'output/{{git_hash}}/plot.html'

[3]:

dag = DAGSpec('pipeline.git_hash.yaml').to_dag()
# Inspect where the notebook product of the load task will be stored
dag['load'].product['nb']

[3]:

File('output/62a3494-dirty/load.html')


This time, the product will be stored in a directory named after the hash of the current commit. The -dirty suffix appears because there were uncommitted changes when this example ran.

## Adding the current timestamp with {{now}}

Alternatively, you can use the {{now}} placeholder, which doesn’t require your project to be in a git repository and will resolve to the current timestamp:

# Content of pipeline.now.yaml
product:
  data: 'output/{{now}}/data.csv'

product:
  nb: 'output/{{now}}/plot.html'

[4]:

dag = DAGSpec('pipeline.now.yaml').to_dag()
# Get the notebook product of the load task and print its path
path = dag['load'].product['nb']
print(path)

output/2022-03-26T17:00:38.060493/load.html


You can see that the load.html file will go into a folder named with the timestamp computed when running this example.
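The timestamp in the output above has the same shape as Python's ISO 8601 format. The following minimal sketch assumes the value is equivalent to `datetime.now().isoformat()` and expands the placeholder with plain string replacement; `timestamped_path` is a hypothetical helper, not part of Ploomber.

```python
from datetime import datetime


def timestamped_path(template: str, now: datetime) -> str:
    """Hypothetical helper: substitute {{now}} with an ISO 8601 timestamp."""
    return template.replace("{{now}}", now.isoformat())


# A fixed datetime keeps the example reproducible;
# in practice you would pass datetime.now()
fixed = datetime(2022, 3, 26, 17, 0, 38, 60493)
print(timestamped_path("output/{{now}}/data.csv", fixed))
# output/2022-03-26T17:00:38.060493/data.csv
```

Because the timestamp includes microseconds, two runs will almost never resolve to the same directory, so the products of earlier runs are kept rather than overwritten.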

## Using placeholders in selected tasks

You can choose which tasks to organize by git commit. The following example uses the {{git}} placeholder only in the last task:

# Content of pipeline.partial.yaml
product:
  data: 'output/data.csv'

product:
  nb: 'output/{{git}}/plot.html'

[5]:

dag = DAGSpec('pipeline.partial.yaml').to_dag()
# Inspect where the notebook product of the load task will be stored
dag['load'].product['nb']

[5]:

File('output/load.html')

[6]:

dag['plot'].product['nb']

[6]:

File('output/master/plot.html')


Here, you can see that the product of the load task goes to output/, while the product of the plot task goes to output/master/.
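To illustrate why only some paths change, here is a small sketch that expands placeholders with plain string replacement. It is a stand-in for Ploomber's own templating, and the `expand` function and the `products` mapping are hypothetical names chosen for this example. Paths that contain no placeholder pass through untouched.

```python
from typing import Dict


def expand(products: Dict[str, str], values: Dict[str, str]) -> Dict[str, str]:
    """Hypothetical helper: replace {{name}} placeholders in each path."""
    expanded = {}
    for task, template in products.items():
        path = template
        for name, value in values.items():
            path = path.replace("{{" + name + "}}", value)
        expanded[task] = path
    return expanded


# Product templates taken from pipeline.partial.yaml
products = {
    "load": "output/data.csv",
    "plot": "output/{{git}}/plot.html",
}

print(expand(products, {"git": "master"}))
# {'load': 'output/data.csv', 'plot': 'output/master/plot.html'}
```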

## Using an env.yaml

If you’re using an env.yaml file, you can still use the placeholders:

# env.yaml
directory: '{{git}}' # or '{{git_hash}}'


Then add references to {{directory}} in your pipeline.yaml:

# pipeline.yaml