Batch processing

You can export Ploomber pipelines to production schedulers for batch processing. Check out our package Soopervisor, which allows you to export to Kubernetes (via Argo Workflows), AWS Batch, Airflow, and SLURM.

Composing batch pipelines

To compose a batch pipeline, add the import_tasks_from key to the meta section of your pipeline.yaml file.

For example, define your feature generation tasks in a features.yaml file:

# generate one feature...
- source: features.a_feature
  product: features/a_feature.csv

# another feature...
- source: features.another_feature
  product: features/another_feature.csv

# join the two previous features...
- source: features.join
  product: features/all.csv
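
Each source above refers to a Python function. Below is a minimal sketch of what a backing features.py module might look like; the column names and transformations are hypothetical, and it assumes the pipelines that import this file expose their raw-data task under the shared name get (for example, via a name: key). Ploomber calls function tasks with product and upstream arguments and infers task dependencies from the upstream[...] references:

# features.py -- hypothetical implementation of the tasks above
import pandas as pd


def _read(path):
    # the training pipeline stores raw data as CSV and the serving pipeline
    # as parquet, so dispatch on the file extension
    path = str(path)
    return pd.read_parquet(path) if path.endswith('.parquet') else pd.read_csv(path)


def a_feature(product, upstream):
    # read the raw data produced by the upstream task (assumed to be named 'get')
    df = _read(upstream['get'])
    out = pd.DataFrame({'a_feature': df['some_column'] ** 2})
    out.to_csv(str(product), index=False)


def another_feature(product, upstream):
    df = _read(upstream['get'])
    out = pd.DataFrame({'another_feature': df['some_column'] + 1})
    out.to_csv(str(product), index=False)


def join(product, upstream):
    # concatenate the two feature tables generated above
    a = pd.read_csv(str(upstream['a_feature']))
    b = pd.read_csv(str(upstream['another_feature']))
    pd.concat([a, b], axis='columns').to_csv(str(product), index=False)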

Then import those tasks in your training pipeline, pipeline.yaml:

meta:
    # import feature generation tasks
    import_tasks_from: features.yaml

tasks:
    # Get raw data for training
    - source: train.get_historical_data
      product: raw/get.csv

    # The import_tasks_from injects your feature generation tasks here

    # Train a model
    - source: train.train_model
      product: model/model.pickle
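
The train.* sources would map to functions along these lines. This is only a sketch: the model choice and the target column are assumptions, and it relies on the same hypothetical column names and the shared get task name used in the features.py sketch above:

# train.py -- hypothetical implementation of the training tasks
import pickle

import pandas as pd
from sklearn.linear_model import LogisticRegression


def get_historical_data(product):
    # in a real project this would query a database or a data warehouse
    df = pd.DataFrame({'some_column': [1, 2, 3, 4], 'target': [0, 1, 0, 1]})
    df.to_csv(str(product), index=False)


def train_model(product, upstream):
    # train on the joined features plus the target from the raw data,
    # then persist the model to model/model.pickle
    X = pd.read_csv(str(upstream['join']))
    y = pd.read_csv(str(upstream['get']))['target']
    model = LogisticRegression().fit(X, y)
    with open(str(product), 'wb') as f:
        pickle.dump(model, f)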

Your serving pipeline, pipeline-serve.yaml, would look like this:

meta:
    # import feature generation tasks
    import_tasks_from: features.yaml

tasks:
    # Get new data for predictions
    - source: serve.get_new_data
      product: serve/get.parquet

    # The import_tasks_from injects your feature generation tasks here

    # Make predictions using a trained model
    - source: serve.predict
      product: serve/predictions.csv
      params:
        path_to_model: model.pickle
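
The params entry is forwarded to the function as a keyword argument, so serve.predict receives path_to_model in addition to product and upstream. A sketch under the same assumptions as above:

# serve.py -- hypothetical implementation of the serving tasks
import pickle

import pandas as pd


def get_new_data(product):
    # in a real project this would fetch the new records to score
    df = pd.DataFrame({'some_column': [5, 6, 7]})
    df.to_parquet(str(product), index=False)


def predict(product, upstream, path_to_model):
    # load the model produced by the training pipeline and score the
    # features generated by the imported 'join' task
    with open(path_to_model, 'rb') as f:
        model = pickle.load(f)
    features = pd.read_csv(str(upstream['join']))
    out = pd.DataFrame({'prediction': model.predict(features)})
    out.to_csv(str(product), index=False)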

Here’s an example project showing how to use import_tasks_from to create a training (pipeline.yaml) and serving (pipeline-serve.yaml) pipeline.

Scheduling

For an example showing how to schedule runs with cron and Ploomber, click here.