To run this locally, install Ploomber and execute: ploomber examples -n guides/versioning

binder-logo

Found an issue? Let us know.

Questions? Ask us on Slack.

Versioning

Note: This feature requires Ploomber 0.17.1 or higher.

A tutorial showing how to version pipeline products.

Although Ploomber is not a data versioning solution, it offers a simple way to organize pipeline artifacts via placeholders. Note that this requires your project to be in a git repository.

Using {{git}}

Let’s look at the first example, which uses the {{git}} placeholder:

# Content of pipeline.git.yaml
tasks:
  - source: tasks/load.py
    product:
      nb: 'output/{{git}}/load.html'
      data: 'output/{{git}}/data.csv'

  - source: tasks/plot.py
    product:
      nb: 'output/{{git}}/plot.html'

You can see that both tasks use {{git}}. When Ploomber executes the pipeline, it will replace the placeholder using the following order:

  1. If currently at the tip of the branch, return the branch name

  2. If the current commit has a tag, return the tag name

  3. Otherwise, return the hash for the current commit (appending -dirty if there are uncommitted changes)

Let’s see how it works:

[1]:
from pathlib import Path

from ploomber.spec import DAGSpec
[2]:
dag = DAGSpec('pipeline.git.yaml').to_dag()
dag['load'].product['nb']
[2]:
File('output/master/load.html')

We can see the product will be stored in the output/master directory, {{git}} is resolved to master since we’re at the tip of such branch.

Using {{git_hash}}

The {{git_hash}} placeholder is similar to {{git}}, except it doesn’t return the branch name, the rules are as follows:

  1. If the current commit has a tag, return the tag name

  2. Otherwise, return the hash for the current commit (appending -dirty if there are uncommitted changes)

This is how our sample pipeline.git_hash.yaml looks like:

# Content of pipeline.git_hash.yaml
tasks:
  - source: tasks/load.py
    product:
      nb: 'output/{{git_hash}}/load.html'
      data: 'output/{{git_hash}}/data.csv'

  - source: tasks/plot.py
    product:
      nb: 'output/{{git_hash}}/plot.html'
[3]:
dag = DAGSpec('pipeline.git_hash.yaml').to_dag()
dag['load'].product['nb']
[3]:
File('output/62a3494-dirty/load.html')

This time, the product will be stored in a directory with the hash of the current commit.

Adding the current timestamp with {{now}}

Alternatively, you can use the {{now}} placeholder, which doesn’t require your project to be in a git repository and will resolve to the current timestamp:

# Content of pipeline.now.yaml
tasks:
  - source: tasks/load.py
    product:
      nb: 'output/{{now}}/load.html'
      data: 'output/{{now}}/data.csv'

  - source: tasks/plot.py
    product:
      nb: 'output/{{now}}/plot.html'
[4]:
dag = DAGSpec('pipeline.now.yaml').to_dag()

path = Path(dag['load'].product['nb']).relative_to(Path().resolve())
print(path)
output/2022-03-26T17:00:38.060493/load.html

You can see that the load.html file will to into a folder with the timestamp computed when running this example.

Using placeholders in selected tasks

You can selectively choose which tasks to organize based on the git repository commit, the following example only uses the {{git}} placeholder in the last task:

# Content of pipeline.partial.yaml
tasks:
  - source: tasks/load.py
    product:
      nb: 'output/load.html'
      data: 'output/data.csv'

  - source: tasks/plot.py
    product:
      nb: 'output/{{git}}/plot.html'
[5]:
dag = DAGSpec('pipeline.partial.yaml').to_dag()
dag['load'].product['nb']
[5]:
File('output/load.html')
[6]:
dag['plot'].product['nb']
[6]:
File('output/master/plot.html')

Here, you can see that the product of the load task goes to output/, while the output of plot goes to output/master/

Using an env.yaml

If you’re using an env.yaml file, you can still use the placeholders:

# env.yaml
directory: '{{git}}' # or '{{git_hash}}'

Then add references to {{directory}} in your pipeline.yaml:

# pipeline.yaml
tasks:
  - source: tasks/load.py
    product:
      nb: 'output/{{directory}}/load.html'
      data: 'output/{{directory}}/data.csv'