Configuration (dev/prod)

In the previous guide (Parametrized pipelines), we saw how to use an env.yaml file to parametrize our pipeline and switch parameters from the command line.

Sometimes we want to change all the parameters at once. The most common scenario is switching between a development configuration and a production configuration.

For example, say you’re working on a Machine Learning pipeline whose pipeline.yaml looks like this:

tasks:

  - source: get.py
    product:
      nb: get.ipynb
      data: raw.csv
    params:
      sample_pct: '{{sample_pct}}'

  - source: get.py
    product:
      nb: get.ipynb
      data: raw.csv

  - source: get.py
    product:
      nb: get.ipynb
      data: raw.csv

The pipeline above has one placeholder, '{{sample_pct}}', which controls what percentage of the raw data to download. You may want to develop locally with a fraction of the data, say 20%, to iterate quickly. To smoke test quickly, you may run it with a smaller sample, say 1%. Finally, to train a model, you’ll use 100% of the data.
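
As a rough illustration of what a task could do with this parameter, the sketch below keeps a given percentage of rows. The helper name sample_rows and the in-memory list standing in for the raw data are hypothetical; the actual contents of get.py are not shown in this guide.

```python
import random


def sample_rows(rows, sample_pct, seed=0):
    """Return roughly sample_pct percent of rows (hypothetical helper)."""
    if not 0 < sample_pct <= 100:
        raise ValueError("sample_pct must be in the range (0, 100]")
    # Keep at least one row so tiny samples are never empty
    n = max(1, round(len(rows) * sample_pct / 100))
    # A fixed seed makes repeated development runs reproducible
    return random.Random(seed).sample(rows, n)


rows = list(range(1000))        # stand-in for the raw data
dev = sample_rows(rows, 20)     # local development
smoke = sample_rows(rows, 1)    # smoke test
full = sample_rows(rows, 100)   # training
```

Fixing the seed is a design choice for this sketch: it keeps the development sample stable between runs, so changes in results come from code changes rather than from a different sample.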

Tip

You can use placeholders (e.g., {{sample_pct}}) anywhere in the pipeline.yaml file. Another typical use case is to switch the product location (e.g., product: '{{product_directory}}/some-data.csv').
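
To make the substitution idea concrete, here is a minimal sketch that expands {{name}} placeholders from a dictionary. It is not Ploomber's rendering code, which may support more than plain name lookup; the render function and the output/dev value are assumptions made for illustration.

```python
import re

# Matches {{name}} with optional whitespace inside the braces
PLACEHOLDER = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def render(template, env):
    """Replace each {{name}} in template with str(env[name])."""
    def substitute(match):
        key = match.group(1)
        if key not in env:
            raise KeyError(f"no value for placeholder {key!r}")
        return str(env[key])

    return PLACEHOLDER.sub(substitute, template)


env = {"product_directory": "output/dev"}  # hypothetical env.yaml contents
path = render("{{product_directory}}/some-data.csv", env)
```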

By default, Ploomber looks for an env.yaml. To enable rapid local development with 20% of the data, you may create an env.yaml file like this:

sample_pct: 20

For smoke testing, env.test.yaml:

sample_pct: 1

And for training, env.train.yaml:

sample_pct: 100

To switch configurations, you can set the PLOOMBER_ENV_FILENAME environment variable to env.test.yaml in the testing environment and to env.train.yaml in the training environment.

Whenever PLOOMBER_ENV_FILENAME has a value, Ploomber looks for a file with that name. Note that this must be a filename, not a path, since Ploomber expects env files to exist in the same folder as the pipeline.yaml file. For example, if you’re on Linux or macOS:

export PLOOMBER_ENV_FILENAME=env.train.yaml && ploomber build
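
The filename-only restriction is easy to get wrong in deployment scripts, so a small check may help before invoking the build. The sketch below is an assumption-laden helper, not part of Ploomber: env_filename is a hypothetical name, and the check simply rejects any value with a directory component.

```python
import os
from pathlib import PurePath


def env_filename(environ=os.environ, default="env.yaml"):
    """Pick the env file name, preferring PLOOMBER_ENV_FILENAME when set."""
    name = environ.get("PLOOMBER_ENV_FILENAME")
    if not name:
        return default
    # A bare filename has no directory component
    if PurePath(name).name != name:
        raise ValueError(f"expected a filename, got a path: {name!r}")
    return name
```

Calling env_filename() with no arguments reads the current process environment; passing a plain dictionary instead makes the function easy to exercise in tests.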

Important

If you’re using the Jupyter integration and want to see the changes reflected in the injected cell, you need to shut down Jupyter, set PLOOMBER_ENV_FILENAME, and start Jupyter again.

Managing multiple pipelines

If your project has more than one pipeline, they’ll likely need different env.yaml files.

Say you have two pipelines, one for training a model (pipeline.yaml) and one for serving it (pipeline.serve.yaml). You can create an env.yaml file to parametrize pipeline.yaml and an env.serve.yaml to parametrize pipeline.serve.yaml:

project/
    pipeline.yaml
    pipeline.serve.yaml
    env.yaml
    env.serve.yaml

The general rule is as follows: when loading a pipeline.{name}.yaml, extract the {name} portion. Then look for an env.{name}.yaml file; if no such file exists, look for an env.yaml file. Note that the PLOOMBER_ENV_FILENAME environment variable overrides this process.
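
The lookup rule can be sketched as a small function. This is an illustration of the rule as stated here, not Ploomber's implementation: resolve_env_name is a hypothetical name, available stands in for the files in the pipeline's folder, and override stands in for the value of PLOOMBER_ENV_FILENAME.

```python
def resolve_env_name(pipeline_name, available, override=None):
    """Return the env file name to use for a pipeline file, or None."""
    # PLOOMBER_ENV_FILENAME (passed in as override) wins over the lookup
    if override:
        return override
    parts = pipeline_name.split(".")
    # pipeline.{name}.yaml -> try env.{name}.yaml first
    if len(parts) == 3 and parts[0] == "pipeline":
        candidate = f"env.{parts[1]}.yaml"
        if candidate in available:
            return candidate
    # Fall back to the default file
    return "env.yaml" if "env.yaml" in available else None


files = {"pipeline.yaml", "pipeline.serve.yaml", "env.yaml", "env.serve.yaml"}
train_env = resolve_env_name("pipeline.yaml", files)
serve_env = resolve_env_name("pipeline.serve.yaml", files)
```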

Alternatively, you may separate the pipelines into different directories and put an env.yaml in each one:

project-a/
    pipeline.yaml
    env.yaml
project-b/
    pipeline.yaml
    env.yaml

env.yaml composition (DRY)

Note

New in version 0.18

In many cases, your development and production configurations share many values. To avoid repeating them, you may create an env.yaml (development configuration) and have your env.prod.yaml (production configuration) inherit from it:

key: value
key_another: dev-value

Then in your env.prod.yaml:

meta:
  # import development config
  import_from: env.yaml

# no need to declare key: value here, it'll be imported from env.yaml

# overwrite value
key_another: production-value

Note that if the value in import_from is a relative path, it is resolved relative to the location of the env file that declares it (in our case, env.prod.yaml).
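
The inheritance described above can be sketched with plain dictionaries. The files mapping stands in for parsed YAML on disk, compose_env is a hypothetical helper, and the project/ folder is invented for the example. The sketch assumes a shallow merge and drops the meta section from the result; how Ploomber treats nested sections is not covered in this guide.

```python
from pathlib import PurePosixPath


def compose_env(env_path, files):
    """Load env_path from files, applying meta.import_from if present."""
    raw = dict(files[env_path])
    meta = raw.pop("meta", None) or {}
    parent = meta.get("import_from")
    if parent is None:
        return raw
    # A relative import_from is resolved against the importing file's folder
    parent_path = str(PurePosixPath(env_path).parent / parent)
    base = compose_env(parent_path, files)
    # Values declared in the importing file overwrite imported ones
    return {**base, **raw}


files = {
    "project/env.yaml": {"key": "value", "key_another": "dev-value"},
    "project/env.prod.yaml": {
        "meta": {"import_from": "env.yaml"},
        "key_another": "production-value",
    },
}
prod = compose_env("project/env.prod.yaml", files)
```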

You can also switch values in env.yaml from the command line. To see the available options, run:

ploomber build --help

For example, if your env.yaml has an entry named key, you can override its value like this:

ploomber build --env--key new-value