ploomber.tasks.NotebookRunner

class ploomber.tasks.NotebookRunner(source, product, dag, name=None, params=None, executor='papermill', executor_params=None, papermill_params=None, kernelspec_name=None, nbconvert_exporter_name=None, ext_in=None, nb_product_key='nb', static_analysis='regular', nbconvert_export_kwargs=None, local_execution=False, check_if_kernel_installed=True, debug_mode=None)

Run a Jupyter notebook using papermill. Supports several input formats via jupytext and several output formats via nbconvert.

Parameters
  • source (str or pathlib.Path) – Notebook source. If str, the content is interpreted as the actual notebook; if pathlib.Path, the content of the file is loaded. When loading from a str, ext_in must be passed.

  • product (ploomber.File) – The output file

  • dag (ploomber.DAG) – A DAG to add this task to

  • name (str, optional) – A str to identify this task. Should not already exist in the dag.

  • params (dict, optional) – Notebook parameters. These are passed as the “parameters” argument to the papermill.execute_notebook function. By default, “product” and “upstream” are included.

  • executor (str, optional) – executor to use. Currently supports “ploomber-engine” and “papermill”. Defaults to papermill executor. Can also be passed as “engine_name” in executor_params

  • executor_params (dict, optional) – Parameters passed to the executor, defaults to None. Refer to each executor’s execute_notebook API to learn more.

  • papermill_params (dict, optional) – Other parameters passed to papermill.execute_notebook, defaults to None

  • kernelspec_name (str, optional) – Kernelspec name to use. If the file extension provides enough information to choose a kernel, or the notebook already includes kernelspec data (in metadata.kernelspec), this is ignored; otherwise, the kernel is looked up using jupyter_client.kernelspec.get_kernel_spec.

  • nbconvert_exporter_name (str or dict, optional) – Once the notebook is run, this parameter controls whether to export the notebook to a different format using the nbconvert package. It is not needed unless the extension cannot be used to infer the final output format, in which case nbconvert.get_exporter is used. If nb_product_key is a list of multiple notebook product keys, nbconvert_exporter_name should be a dict whose keys come from that list.

  • ext_in (str, optional) – Source extension. Required if loading from a str. If source is a pathlib.Path, the extension from the file is used.

  • nb_product_key (str or list, optional) – If the notebook is expected to generate other products, pass the key to identify the output notebook (i.e. if product is a list with 3 ploomber.File, pass the index pointing to the notebook path). If the only output is the notebook itself, this parameter is not needed. If multiple notebook conversions are required (e.g. html, pdf), this parameter should be a list of keys such as ‘nb_ipynb’, ‘nb_html’, ‘nb_pdf’.

  • static_analysis (('disabled', 'regular', 'strict'), default='regular') – Check for various errors in the notebook. In ‘regular’ mode, it aborts execution if the notebook has syntax issues or similar problems that would cause the code to break if executed. In ‘strict’ mode, it performs the same checks but raises the error before starting execution of any task; it also verifies that the parameters cell and the params passed to the notebook match, making the notebook behave like a function with a signature.

  • nbconvert_export_kwargs (dict) – Keyword arguments to pass to the nbconvert.export function (this is only used if exporting the output ipynb notebook to another format). You can use this, for example, to hide code cells using the exclude_input parameter. See nbconvert documentation for details. Ignored if the product is a file with .ipynb extension.

  • local_execution (bool, optional) – Change the working directory to the parent of the notebook’s source. Defaults to False. Setting it to True resembles the default behavior when running notebooks interactively via jupyter notebook.

  • debug_mode (None, 'now' or 'later', default=None) – If ‘now’, runs notebook in debug mode, this will start debugger if an error is thrown. If ‘later’, it will serialize the traceback for later debugging. (Added in 0.20)
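
The ‘strict’ mode of static_analysis described above treats the parameters cell like a function signature. The following is a minimal sketch of that idea using only the standard library. It is not ploomber's implementation: declared_params, check_params and the sample cell are hypothetical names introduced for illustration, and only plain top-level assignments are considered.

```python
import ast


def declared_params(cell_source):
    """Return the names assigned at the top level of a parameters cell."""
    names = set()
    for node in ast.parse(cell_source).body:
        # Only plain assignments such as `alpha = 0.1` are considered
        if isinstance(node, ast.Assign):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    names.add(target.id)
    return names


def check_params(cell_source, passed):
    """Compare declared names against passed params (hypothetical helper).

    Returns (missing, unexpected): names declared in the cell but not
    passed, and names passed but not declared in the cell.
    """
    declared = declared_params(cell_source)
    passed = set(passed)
    return sorted(declared - passed), sorted(passed - declared)


# Hypothetical parameters cell, as it might appear in a notebook
cell = "upstream = None\nproduct = None\nalpha = 0.1\n"

missing, unexpected = check_params(
    cell, {"upstream": None, "product": None, "beta": 2}
)
print(missing)     # ['alpha']
print(unexpected)  # ['beta']
```

A mismatch in either direction corresponds to the situation that ‘strict’ mode is described as rejecting before any task starts.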

Examples

Spec API:

tasks:
  - source: nb.ipynb
    product: report.html

Spec API (multiple outputs):

tasks:
  - source: nb.ipynb
    product:
        # generated automatically by ploomber
        nb: report.html
        # must be generated by nb.ipynb
        data: data.csv

Spec API (multiple notebook products, added in 0.19.6):

(generate the executed notebooks in multiple formats)

tasks:
  - source: script.py
    # keys can be named as per user's choice. None
    # of the keys are mandatory. However, every key mentioned
    # in this list should be a part of the product dict below.
    nb_product_key: [nb_ipynb, nb_pdf, nb_html]
    # When nb_product_key is a list, nbconvert_exporter_name
    # should be a dict with required keys from nb_product_key
    # only. If missing, it uses the default exporter
    nbconvert_exporter_name:
        nb_pdf: webpdf
    # Every notebook product defined here should correspond to key
    # defined in nb_product_key.
    product:
        nb_ipynb: nb.ipynb
        nb_pdf: doc.pdf
        nb_html: report.html
        # must be generated by script.py
        data: data.csv
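
The constraints stated in the comments above (every entry in nb_product_key must be a key in product, and nbconvert_exporter_name may only use keys from nb_product_key) can be sketched as a small check. This is an illustration only, not ploomber's validation code; check_nb_products is a hypothetical helper.

```python
def check_nb_products(nb_product_key, product, nbconvert_exporter_name=None):
    """Return a list of problems found in a multi-format task definition.

    Hypothetical helper: it only encodes the two rules described in the
    spec comments, nothing more.
    """
    exporters = nbconvert_exporter_name or {}
    problems = []
    for key in nb_product_key:
        if key not in product:
            problems.append(f"{key!r} is in nb_product_key but not in product")
    for key in exporters:
        if key not in nb_product_key:
            problems.append(
                f"{key!r} is in nbconvert_exporter_name but not in nb_product_key"
            )
    return problems


# Same values as the spec above
product = {
    "nb_ipynb": "nb.ipynb",
    "nb_pdf": "doc.pdf",
    "nb_html": "report.html",
    "data": "data.csv",
}
keys = ["nb_ipynb", "nb_pdf", "nb_html"]

ok = check_nb_products(keys, product, {"nb_pdf": "webpdf"})
bad = check_nb_products(keys, product, {"nb_md": "markdown"})
print(ok)   # []
print(bad)  # one problem: 'nb_md' is not listed in nb_product_key
```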

Python API:

>>> from pathlib import Path
>>> from ploomber import DAG
>>> from ploomber.tasks import NotebookRunner
>>> from ploomber.products import File
>>> dag = DAG()
>>> NotebookRunner(Path('nb.ipynb'), File('report.html'), dag=dag)
NotebookRunner: nb -> File('report.html')
>>> dag.build() 

Python API (customize output notebook):

>>> from pathlib import Path
>>> from ploomber import DAG
>>> from ploomber.tasks import NotebookRunner
>>> from ploomber.products import File
>>> dag = DAG()
>>> # do not include input code (only cell's output)
>>> NotebookRunner(Path('nb.ipynb'), File('out-1.html'), dag=dag,
...                nbconvert_export_kwargs={'exclude_input': True},
...                name='one')
NotebookRunner: one -> File('out-1.html')
>>> # Selectively remove cells with the tag "remove"
>>> config = {'TagRemovePreprocessor': {'remove_cell_tags': ('remove',)},
...        'HTMLExporter':
...         {'preprocessors':
...    ['nbconvert.preprocessors.TagRemovePreprocessor']}}
>>> NotebookRunner(Path('nb.ipynb'), File('out-2.html'), dag=dag,
...                nbconvert_export_kwargs={'config': config},
...                name='another')
NotebookRunner: another -> File('out-2.html')
>>> dag.build() 

Notes

changelog

Changed in version 0.22.4: Added native ploomber-engine support with executor parameter

Changed in version 0.20: debug constructor flag renamed to debug_mode to prevent conflicts with the debug method

Changed in version 0.19.6: Support for generating output notebooks in multiple formats, see example above.

nbconvert’s documentation

Methods

build([force, catch_exceptions])

Build a single task

debug([kind])

Opens the notebook (with injected parameters) in debug mode in a temporary location

load([key])

Load task as pandas.DataFrame.

render([force, outdated_by_code, remote])

Renders code and product. All upstream tasks must have been rendered first; for that reason, this method is usually not called directly but via DAG.render(), which renders in the right order.

run()

This is the only required method Task subclasses must implement

set_upstream(other[, group_name])

status([return_code_diff, sections])

Prints the current task status

build(force=False, catch_exceptions=True)

Build a single task

Although Tasks are primarily designed to execute via DAG.build(), it is possible to build them in isolation. However, this only works if the task does not have any unrendered upstream dependencies. If it does, call DAG.render() before calling Task.build().

Returns

A dictionary with keys ‘run’ and ‘elapsed’

Return type

dict

Raises
  • TaskBuildError – If the task failed to build because it has upstream dependencies, the build itself failed, or the build succeeded but the on_finish hook failed

  • DAGBuildEarlyStop – If any task or on_finish hook raises a DAGBuildEarlyStop error

debug(kind='ipdb')

Opens the notebook (with injected parameters) in debug mode in a temporary location

Parameters

kind (str, default='ipdb') – Debugger to use: ‘ipdb’ to use the line-by-line IPython debugger, ‘pdb’ to use the line-by-line Python debugger, or ‘pm’ to do post-mortem debugging using IPython

Notes

Be careful when debugging tasks. If the task has run successfully and you overwrite products but don’t save the updated source code, your DAG will enter an inconsistent state where the metadata won’t match the overwritten product.

load(key=None, **kwargs)

Load task as pandas.DataFrame. Only implemented in certain tasks

render(force=False, outdated_by_code=True, remote=False)

Renders code and product. All upstream tasks must have been rendered first; for that reason, this method is usually not called directly but via DAG.render(), which renders in the right order.

Render fully determines whether a task should run or not.

Parameters
  • force (bool, default=False) – If True, mark status as WaitingExecution/WaitingUpstream even if the task is up-to-date (if there are any File(s) with clients, this also ignores the status of the remote copy), otherwise, the normal process follows and only up-to-date tasks are marked as Skipped.

  • outdated_by_code (bool, default=True) – Whether to mark Task.product as outdated when the source code changes. If False, only the upstream timestamps are used.

  • remote (bool, default=False) – Use remote metadata to determine status

Notes

This method tries to avoid calls to check for product status whenever possible, since checking product’s metadata can be a slow operation (e.g. if metadata is stored in a remote database)

When passing force=True, product status checking is skipped altogether. This can be useful when we only want to quickly get a rendered DAG object to interact with.
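
The effect of force on the resulting status can be sketched as a small decision function. This is a simplified illustration, not ploomber's rendering logic: render_status is a hypothetical name, and the rule used to pick between WaitingExecution and WaitingUpstream (whether any upstream task still has to run) is an assumption.

```python
def render_status(up_to_date, upstream_pending, force=False):
    """Return the status a task would get after rendering (simplified).

    up_to_date: whether the product is considered up-to-date
    upstream_pending: whether any upstream task still has to run
        (assumption: this is what separates the two "Waiting" states)
    force: mirrors the force parameter described above
    """
    if up_to_date and not force:
        return "Skipped"
    if upstream_pending:
        return "WaitingUpstream"
    return "WaitingExecution"


print(render_status(up_to_date=True, upstream_pending=False))
# Skipped
print(render_status(up_to_date=True, upstream_pending=False, force=True))
# WaitingExecution
print(render_status(up_to_date=False, upstream_pending=True))
# WaitingUpstream
```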

run()

This is the only required method Task subclasses must implement

set_upstream(other, group_name=None)

status(return_code_diff=False, sections=None)

Prints the current task status

Parameters

sections (list, optional) – Sections to include. Defaults to “name”, “last_run”, “outdated”, “product”, “doc”, “location”

Attributes

PRODUCT_CLASSES_ALLOWED

client

debug_mode

exec_status

name

A str that represents the name of the task. You can access tasks in a dag using dag['some_name']

on_failure

Callable to be executed if task fails (passes Task as first parameter and the exception as second parameter)

on_finish

Callable to be executed after this task is built successfully (passes Task as first parameter)

on_render

params

dict that holds the parameters that will be passed to the task upon execution.

product

The product this task will create upon execution

source

The source is used by the task to compute its output. In most cases this is source code; for example, PythonCallable takes a function as source and SQLScript takes a string with SQL code as source.

static_analysis

upstream

A mapping for upstream dependencies {task name} -> [task object]