Running experiments in parallel

Running notebooks in parallel lets us execute different tasks or processes simultaneously across multiple computing resources, so experiments finish sooner. (More details can be found in our blog post.)

With Ploomber and Ploomber Cloud you can parametrize notebooks and run multiple copies in parallel (each one with a different set of parameters). This guide will show you how!

Pre-requisites

This section will help you set up your local environment to run notebooks in Ploomber Cloud. You only need to install Ploomber and set the API key from your Ploomber Cloud account.

Installing Ploomber

To install the latest version of Ploomber, open a terminal and run the following command:

pip install ploomber --upgrade

Setting up the Ploomber Cloud API key

For this, you’ll need to sign in to Ploomber Cloud. Once you sign in, copy your API key and run the following command in your terminal:

ploomber cloud set-key {your-key}

A detailed tutorial to get and set your API Key can be found here.

Notebook configuration

In this section you’ll learn how to configure your notebook to run with different parameter values.

First, add a cell at the top of your notebook with the notebook parameters:

# PARAMETERS
n_estimators = 1

Important: You must add the comment # PARAMETERS in the cell. This is how Ploomber identifies the parameters to set during execution.

Next, ensure that these parameters are used in the notebook’s body. Ploomber Cloud will change their values at runtime.
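As a minimal sketch of what "used in the notebook’s body" means, the snippet below places the parameters cell first and then a body that reads n_estimators. The bagged_mean function and the data list are hypothetical stand-ins for a real model and dataset (they are not part of the sample notebook); the point is only that the body consumes the variable instead of hard-coding a value.

```python
# PARAMETERS
n_estimators = 1

# --- notebook body (hypothetical stand-in for a real model) ---
import random
import statistics


def bagged_mean(values, n_estimators, seed=0):
    """Average the means of n_estimators bootstrap resamples of values."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(n_estimators):
        # each "estimator" sees a resample drawn with replacement
        sample = rng.choices(values, k=len(values))
        estimates.append(statistics.mean(sample))
    return statistics.mean(estimates)


data = [2.0, 4.0, 6.0, 8.0]
# the value from the parameters cell is read here, so overriding it
# at runtime changes what this run computes
prediction = bagged_mean(data, n_estimators=n_estimators)
print(f"n_estimators={n_estimators} -> prediction={prediction:.3f}")
```

If the body ignored n_estimators and used a literal instead, every run in the grid would produce the same result.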

Now, add a raw cell at the top of the notebook. In the raw cell, put the parameter values you want to use under the grid section:

grid:
    n_estimators: [1, 5, 10, 20]

Your notebook can have more than one parameter. In that case, Ploomber Cloud will run the notebook with all possible combinations of their values.
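To get a feel for how many runs "all possible combinations" produces, here is a small sketch that expands a grid locally with itertools.product. It illustrates the combinatorics only and is not Ploomber Cloud's implementation; the max_depth parameter is a made-up second parameter added for the example.

```python
from itertools import product

# Hypothetical grid with two parameters; max_depth is invented for illustration
grid = {
    "n_estimators": [1, 5, 10, 20],
    "max_depth": [2, 4],
}


def expand_grid(grid):
    """Return one dict per combination of parameter values."""
    names = list(grid)
    return [dict(zip(names, values)) for values in product(*grid.values())]


runs = expand_grid(grid)
print(len(runs))  # 4 values x 2 values = 8 combinations
for params in runs[:3]:
    print(params)
```

The number of runs is the product of the list lengths, so it grows quickly as parameters are added.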

Note: the raw cell must be a valid YAML string. YAML is a data serialization language that is often used for writing configuration files. It usually follows a simple format to list attributes. You can read more about YAML here.

Notebook submission to Ploomber Cloud

In the previous section, you configured different parameter values to run as separate processes in parallel. In this section, we will submit these processes to Ploomber Cloud.

Let’s submit a notebook that fits a regressor and uses 4 parameter values. We have prepared a notebook for you that already contains the parameter configuration described above.

To download the notebook, simply run the following command in your terminal:

# Create a folder named notebooks
mkdir notebooks

# Download the sample notebook to the created folder
curl https://raw.githubusercontent.com/ploomber/projects/master/guides/cloud-notebook-parametrized/notebooks/grid.ipynb -o notebooks/grid.ipynb

Now we can submit the notebook that fits the regressor with the 4 specified parameter values. In your terminal run:

ploomber cloud nb notebooks/grid.ipynb
Uploading grid-7e41ace5.ipynb...
Triggering execution of grid-7e41ace5.ipynb...

Check that the task was submitted:

ploomber cloud list
created_at      runid                                 status
--------------  ------------------------------------  --------
5 seconds ago   b39238a2-3826-495d-90ca-b29139e324f0  created
53 minutes ago  2d4bcadf-5acb-49a5-8806-af2dbe1b32fe  finished
7 hours ago     ee78f4c1-ee42-4ba5-ba2f-9e73ae9228d6  finished

Wait 1-2 minutes for the Docker image to build. You’ll see the following message once it’s done:

ploomber cloud logs @latest --image | tail -n 10
[Container] 2022/10/26 03:59:05 Phase complete: BUILD State: SUCCEEDED

[Container] 2022/10/26 03:59:05 Phase context status code:  Message:

[Container] 2022/10/26 03:59:05 Entering phase POST_BUILD

[Container] 2022/10/26 03:59:05 Phase complete: POST_BUILD State: SUCCEEDED

[Container] 2022/10/26 03:59:05 Phase context status code:  Message:

Now you’ll see that the notebook has started:

ploomber cloud list
created_at      runid                                 status
--------------  ------------------------------------  --------
3 minutes ago   b39238a2-3826-495d-90ca-b29139e324f0  started
57 minutes ago  2d4bcadf-5acb-49a5-8806-af2dbe1b32fe  finished
7 hours ago     ee78f4c1-ee42-4ba5-ba2f-9e73ae9228d6  finished

Let’s see the status of each task (one task per parameter value):

ploomber cloud status @latest
Geting latest ID...
Got ID: b39238a2-3826-495d-90ca-b29139e324f0
Unknown status: started
taskid                     name             runid                      status
-------------------------  ---------------  -------------------------  --------
d9c5d4d0-b076-44ba-a807-8  grid-7e41ace5-1  b39238a2-3826-495d-90ca-b  created
d6689c7b8ed                                 29139e324f0
0ac99557-5c30-4160-869d-e  grid-7e41ace5-3  b39238a2-3826-495d-90ca-b  created
65e007fbd17                                 29139e324f0
6442ea6b-cece-4530-8af6-6  grid-7e41ace5-2  b39238a2-3826-495d-90ca-b  created
8d38ac230ed                                 29139e324f0
bd3363a5-c223-4673-9a52-3  grid-7e41ace5-0  b39238a2-3826-495d-90ca-b  created
3fa9dbef681                                 29139e324f0

After a few minutes, they are done:

ploomber cloud status @latest
Geting latest ID...
Got ID: b39238a2-3826-495d-90ca-b29139e324f0
Pipeline finished...
taskid                     name             runid                      status
-------------------------  ---------------  -------------------------  --------
d9c5d4d0-b076-44ba-a807-8  grid-7e41ace5-1  b39238a2-3826-495d-90ca-b  finished
d6689c7b8ed                                 29139e324f0
0ac99557-5c30-4160-869d-e  grid-7e41ace5-3  b39238a2-3826-495d-90ca-b  finished
65e007fbd17                                 29139e324f0
6442ea6b-cece-4530-8af6-6  grid-7e41ace5-2  b39238a2-3826-495d-90ca-b  finished
8d38ac230ed                                 29139e324f0
bd3363a5-c223-4673-9a52-3  grid-7e41ace5-0  b39238a2-3826-495d-90ca-b  finished
3fa9dbef681                                 29139e324f0
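If you want to check task statuses from a script rather than by eye, the following sketch parses text shaped like the status output above. It assumes the table layout stays as shown (a header row starting with taskid, a dashed rule, and long IDs wrapped onto a second line), which is an observation from this sample and not a documented format. The task_statuses helper is hypothetical.

```python
from collections import Counter

# Sample text copied from the `ploomber cloud status @latest` output above
SAMPLE = """\
Got ID: b39238a2-3826-495d-90ca-b29139e324f0
Pipeline finished...
taskid                     name             runid                      status
-------------------------  ---------------  -------------------------  --------
d9c5d4d0-b076-44ba-a807-8  grid-7e41ace5-1  b39238a2-3826-495d-90ca-b  finished
d6689c7b8ed                                 29139e324f0
0ac99557-5c30-4160-869d-e  grid-7e41ace5-3  b39238a2-3826-495d-90ca-b  finished
65e007fbd17                                 29139e324f0
"""


def task_statuses(status_output):
    """Map task name -> status, skipping wrapped continuation lines."""
    lines = status_output.splitlines()
    # the table starts at the header row; the row after it is the dashed rule
    start = next(i for i, line in enumerate(lines) if line.startswith("taskid"))
    statuses = {}
    for line in lines[start + 2:]:
        fields = line.split()
        if len(fields) == 4:  # wrapped lines only carry 2 fields
            _, name, _, status = fields
            statuses[name] = status
    return statuses


statuses = task_statuses(SAMPLE)
print(Counter(statuses.values()))
print(all(status == "finished" for status in statuses.values()))
```

In practice the text could come from running the CLI command and capturing its standard output.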

Let’s see what’s in our outputs workspace:

ploomber cloud products
path
-----------------------------------------------------
grid-7e41ace5/output/notebook-n_estimators=1-0.ipynb
grid-7e41ace5/output/notebook-n_estimators=10-2.ipynb
grid-7e41ace5/output/notebook-n_estimators=20-3.ipynb
grid-7e41ace5/output/notebook-n_estimators=5-1.ipynb
plot-aebe61a1/output/notebook.ipynb
plot-f7ad8452/output/notebook.ipynb

Download all the executed notebooks:

ploomber cloud download 'grid-7e41ace5/*'
Writing file into path grid-7e41ace5/output/.notebook-n_estimators=1-0.ipynb.metadata
Writing file into path grid-7e41ace5/output/.notebook-n_estimators=20-3.ipynb.metadata
Writing file into path grid-7e41ace5/output/.notebook-n_estimators=5-1.ipynb.metadata
Writing file into path grid-7e41ace5/output/.notebook-n_estimators=10-2.ipynb.metadata
Writing file into path grid-7e41ace5/output/notebook-n_estimators=5-1.ipynb
Writing file into path grid-7e41ace5/output/notebook-n_estimators=10-2.ipynb
Writing file into path grid-7e41ace5/output/notebook-n_estimators=20-3.ipynb
Writing file into path grid-7e41ace5/output/notebook-n_estimators=1-0.ipynb

Note that we’re using the identifier printed when we submitted the notebook.

For a better understanding of the previous cells, you can read more details about execution monitoring and downloading results in the previous guide.
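Once the executed notebooks are downloaded, you may want to map each file back to the parameter value that produced it. The sketch below does this with a regular expression, assuming the single-parameter naming pattern seen in the listing above (notebook-{param}={value}-{index}.ipynb). That pattern is inferred from this output only, and the parse_product helper is hypothetical.

```python
import re
from pathlib import PurePosixPath

# Pattern inferred from the file names listed above (single parameter only)
PATTERN = re.compile(
    r"^(?P<stem>.+?)-(?P<key>\w+)=(?P<value>[^-]+)-(?P<index>\d+)\.ipynb$"
)


def parse_product(path):
    """Return parameter name, value and task index, or None if no match."""
    match = PATTERN.match(PurePosixPath(path).name)
    if match is None:
        return None
    return {
        "param": match["key"],
        "value": match["value"],
        "index": int(match["index"]),
    }


paths = [
    "grid-7e41ace5/output/notebook-n_estimators=1-0.ipynb",
    "grid-7e41ace5/output/notebook-n_estimators=10-2.ipynb",
    "grid-7e41ace5/output/notebook-n_estimators=20-3.ipynb",
    "grid-7e41ace5/output/notebook-n_estimators=5-1.ipynb",
    "plot-aebe61a1/output/notebook.ipynb",
]

for path in paths:
    print(path, "->", parse_product(path))
```

Files that do not come from a grid run, such as plot-aebe61a1/output/notebook.ipynb, simply return None.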

If your notebook requires input data, you can upload it.

We have prepared two sample notebooks that will allow you to work with uploads of input data. To download the first one that will be used, run in your terminal:

curl https://raw.githubusercontent.com/ploomber/projects/master/guides/cloud-notebook-parametrized/notebooks/input-data.ipynb -o notebooks/input-data.ipynb

Let’s see what happens if we try to run a notebook with missing input data:

ploomber cloud nb notebooks/input-data.ipynb
Uploading input-data-49dc8734.ipynb...
Triggering execution of input-data-49dc8734.ipynb...
Error: Error validating inputs/outputs: {'missing': {'../data/penguins.csv'}} (status: 400)


Ploomber Cloud will parse your notebook and look for referenced files. If they’re missing in your data workspace, it’ll show an error like the one above.
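To build intuition for how referenced files can be detected in code, here is a rough sketch that walks a cell's syntax tree and collects string literals that look like data files. This is not Ploomber Cloud's actual detection logic, which is not described in this guide; the find_file_references helper and the DATA_SUFFIXES list are assumptions made for illustration.

```python
import ast

# Hypothetical list of extensions treated as data files
DATA_SUFFIXES = (".csv", ".parquet", ".json", ".pickle")


def find_file_references(source):
    """Return string literals in the code that look like data file paths."""
    tree = ast.parse(source)
    found = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            if node.value.endswith(DATA_SUFFIXES):
                found.append(node.value)
    return found


# The code is only parsed, never executed, so pandas need not be installed
cell = """
import pandas as pd
df = pd.read_csv('../data/penguins.csv')
df.head()
"""

print(find_file_references(cell))
```

A static scan like this cannot tell whether a path is read or written, which is why the notebook has to declare it explicitly.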

In our notebook, we have the following line:

df = pd.read_csv('../data/penguins.csv')

Ploomber realizes you’re using a local file at ../data/penguins.csv. Since files can be either inputs or outputs, you have to tell Ploomber which they are. To do so, add a raw cell at the top:

# this determines where to look for input data and where
# to store outputs
prefix: penguins-classification

# for each path in our notebook, indicate if it's an input or output
# the values must be the same as in your notebook
inputs:
    - ../data/penguins.csv

# no outputs, so no need to add an "outputs" section

The second sample notebook to be used will contain the raw cell example. To download it, simply run:

curl https://raw.githubusercontent.com/ploomber/projects/master/guides/cloud-notebook-parametrized/notebooks/input-data-with-raw-cell.ipynb -o notebooks/input-data-with-raw-cell.ipynb

Let’s run the notebook that contains the raw cell:

ploomber cloud nb notebooks/input-data-with-raw-cell.ipynb
Uploading input-data-with-raw-cell-d896c53b.ipynb...
Triggering execution of input-data-with-raw-cell-d896c53b.ipynb...
Error: Cannot start execution. The following inputs are missing:
        - ../data/penguins.csv
Upload them to your data workspace or using the CLI:
ploomber cloud data --upload ../data/penguins.csv --prefix penguins-classification/input --name data-penguins.csv
 (status: 400)


This time, Ploomber Cloud is telling us the files are not in our data workspace. So let’s upload them.

First, let’s get the data:

curl https://raw.githubusercontent.com/mwaskom/seaborn-data/master/penguins.csv -o penguins.csv
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 13478  100 13478    0     0  54957      0 --:--:-- --:--:-- --:--:-- 57110

Use the command printed in the error message:

# NOTE: you may need to change the path in the --upload argument if the file is somewhere else
ploomber cloud data --upload penguins.csv --prefix penguins-classification/input --name data-penguins.csv
Uploading data-penguins.csv...

Let’s submit the notebook:

ploomber cloud nb notebooks/input-data-with-raw-cell.ipynb
Uploading input-data-with-raw-cell-d539ba23.ipynb...
Triggering execution of input-data-with-raw-cell-d539ba23.ipynb...

Wait a couple of minutes for the run to finish (its status will appear as finished):

ploomber cloud list
created_at      runid                                 status
--------------  ------------------------------------  --------
9 minutes ago   19f8242e-373b-4b2b-bee4-0181a3edfc51  finished
31 minutes ago  b39238a2-3826-495d-90ca-b29139e324f0  finished
an hour ago     2d4bcadf-5acb-49a5-8806-af2dbe1b32fe  finished
7 hours ago     ee78f4c1-ee42-4ba5-ba2f-9e73ae9228d6  finished

The prefix in the raw cell determines where the outputs are stored. Hence, to download all outputs:

ploomber cloud download 'penguins-classification/*'
Writing file into path penguins-classification/output/.notebook.ipynb.metadata
Writing file into path penguins-classification/output/notebook.ipynb

If your notebook writes output files, list them under an outputs section in the raw cell:

prefix: some-experiment

outputs:
    - path/to/model.pickle

Resources

You can request more resources for your notebook execution by adding the following in the raw cell:

task_resources:
    vcpus: 8 # number of CPUs
    memory: 16384 # memory in MiB

See this notebook for an example (Note: the configuration cell is not visible on GitHub, you have to view it with Jupyter). If you want to download this sample notebook and test it locally, run the following command:

curl https://raw.githubusercontent.com/ploomber/projects/master/guides/cloud-notebook-parametrized/notebooks/resources.ipynb -o notebooks/resources.ipynb

Note: The free community plan is capped at 2 CPUs and 4 GiB of memory, with no GPUs. If you need more resources, you can subscribe to the Teams plan. If you’re a student or researcher, join our Slack and we’ll lift the restrictions.
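Because memory is requested in MiB while the plan limit is stated in GiB, a quick local check can help before submitting. The sketch below compares a task_resources request against the community caps mentioned in the note. The exceeds_community_plan helper is hypothetical, and how Ploomber Cloud reacts to an oversized request is not covered in this guide.

```python
MIB_PER_GIB = 1024

# Caps stated in the note above for the free community plan
COMMUNITY_MAX_VCPUS = 2
COMMUNITY_MAX_MEMORY_MIB = 4 * MIB_PER_GIB  # 4 GiB expressed in MiB


def exceeds_community_plan(task_resources):
    """Return the names of requested resources above the community caps."""
    problems = []
    if task_resources.get("vcpus", 0) > COMMUNITY_MAX_VCPUS:
        problems.append("vcpus")
    if task_resources.get("memory", 0) > COMMUNITY_MAX_MEMORY_MIB:
        problems.append("memory")
    return problems


# Same values as the task_resources example above
requested = {"vcpus": 8, "memory": 16384}
print(requested["memory"] / MIB_PER_GIB)  # 16384 MiB is 16 GiB
print(exceeds_community_plan(requested))
```

With the example values, both vcpus and memory are above the community limits, so that configuration needs a plan with higher caps.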

By default, Ploomber Cloud will parse your import statements and install the latest version of each package. If you want specific versions, add this to your raw cell:

dependencies:
    - matplotlib==3.5.3
    - scikit-learn==1.1.0

See this notebook for an example (Note: the configuration cell is not visible on GitHub, you have to view it with Jupyter). If you want to download this sample notebook and test it locally, run the following command:

curl https://raw.githubusercontent.com/ploomber/projects/master/guides/cloud-notebook-parametrized/notebooks/dependencies.ipynb -o notebooks/dependencies.ipynb
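To see what parsing import statements can look like, the sketch below extracts top-level module names from a cell with Python's ast module. It is an illustration under assumptions and not Ploomber Cloud's implementation; the imported_modules helper is hypothetical.

```python
import ast


def imported_modules(source):
    """Return the sorted top-level module names imported by the code."""
    tree = ast.parse(source)
    modules = set()
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                modules.add(alias.name.split(".")[0])
        elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
            # level == 0 skips relative imports such as `from . import x`
            modules.add(node.module.split(".")[0])
    return sorted(modules)


# The code is only parsed, never executed
cell = """
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
"""

print(imported_modules(cell))  # ['matplotlib', 'sklearn']
```

Note that an import name is not always the package name: sklearn is installed as scikit-learn, which is why the dependencies section above lists scikit-learn==1.1.0.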

The free community plan allows you to run parallel jobs via the grid feature. However, you cannot start a new execution until the current one is done. If you need concurrent runs, you can subscribe to the Teams plan. If you’re a student or researcher, join our Slack and we’ll lift the restrictions.

To abort your latest run:

ploomber cloud abort @latest

To see the status of your runs:

ploomber cloud list

To see tasks within a given run:

ploomber cloud status {runid}

# or for the latest run
ploomber cloud status @latest

Even if your notebook fails, the failed notebook is uploaded, so you can download it for debugging:

ploomber cloud download 'path/to/notebook.ipynb'

To list existing files in your products workspace:

ploomber cloud products

To get the logs for all tasks in the run:

ploomber cloud logs {runid}

# or for the latest run
ploomber cloud logs @latest

To get the logs for the Docker building process:

ploomber cloud logs {runid} --image

# or for the latest run
ploomber cloud logs @latest --image