To run this locally, install Ploomber and execute: ploomber examples -n guides/serialization
Serialization
Tutorial explaining how the serializer and unserializer fields in a pipeline.yaml file work.
Incremental builds allow Ploomber to skip tasks whose source code hasn’t changed; to enable this feature, each task must save its products to disk. However, there are some cases when we don’t want our pipeline to perform disk operations. For example, if we’re going to deploy our pipeline, eliminating disk operations reduces runtime considerably.
To enable a pipeline to work in both disk-based and in-memory scenarios, we can declare a serializer and unserializer in our pipeline declaration, effectively separating our task’s logic from the read/write logic. Note that this only applies to function tasks; other tasks are unaffected by the serializer/unserializer configuration.
Built-in pickle serialization
The easiest way to get started is to use the built-in serializer and unserializer, which use the pickle
module.
Let’s see an example; the following pipeline has two tasks. The first one generates a dictionary, and the second one creates two dictionaries. Since we are using the pickle-based serialization, each dictionary is saved in the pickle binary format:
# Content of simple.yaml
serializer: ploomber.io.serializer_pickle
unserializer: ploomber.io.unserializer_pickle

tasks:
  - source: tasks.first
    product: output/one_dict

  - source: tasks.second
    product:
      another: output/another_dict
      final: output/final_dict
Let’s take a look at the task’s source code:
# Content of tasks.py
def first():
    return dict(a=1, b=2)


def second(upstream):
    first = upstream['first']
    another = dict(a=first['b'] + 1, b=first['a'] + 1)
    final = dict(a=100, b=200)
    return dict(another=another, final=final)
Since we configured a serializer and unserializer, function tasks must return their outputs instead of saving them to disk in the function’s body. first does not have any upstream dependencies and returns a dictionary. second has the previous task as a dependency and returns two dictionaries. Note that the keys in the returned dictionary must match the names of the products declared in pipeline.yaml (another, final).
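To see how the pieces fit together, we can call the functions directly. This is a minimal sketch (assuming tasks.py is importable from the current directory), not part of the example files:

from tasks import first, second

one = first()                                # {'a': 1, 'b': 2}
outputs = second(upstream={'first': one})    # upstream maps task names to their outputs
assert set(outputs) == {'another', 'final'}  # keys match the product names in pipeline.yaml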
Let’s now run the pipeline.
[1]:
%%sh
ploomber build --entry-point simple.yaml --force
name Ran? Elapsed (s) Percentage
------ ------ ------------- ------------
first True 0.001281 42.9433
second True 0.001702 57.0567
Building task 'second': 100%|██████████| 2/2 [00:06<00:00, 3.39s/it]
The pickle format has important security concerns; remember to unpickle only data you trust.
Custom serialization logic
We can also define our own serialization logic by using the @serializer and @unserializer decorators. Let’s replicate what our pickle-based serializer/unserializer is doing as an example:
# Content of custom.py
from pathlib import Path
import pickle

from ploomber.io import serializer, unserializer


@serializer()
def my_pickle_serializer(obj, product):
    Path(product).write_bytes(pickle.dumps(obj))


@unserializer()
def my_pickle_unserializer(product):
    return pickle.loads(Path(product).read_bytes())
A @serializer function must take two arguments: the object to serialize and the product object (taken from the task declaration). An @unserializer function must take a single argument (the product to unserialize) and return the unserialized object.
Let’s modify our original pipeline to use this serializer/unserializer:
# Content of custom.yaml
serializer: custom.my_pickle_serializer
unserializer: custom.my_pickle_unserializer

tasks:
  - source: tasks.first
    product: output/one_dict

  - source: tasks.second
    product:
      another: output/another_dict
      final: output/final_dict
[2]:
%%sh
ploomber build --entry-point custom.yaml --force
name Ran? Elapsed (s) Percentage
------ ------ ------------- ------------
first True 0.001216 19.4342
second True 0.005041 80.5658
Building task 'second': 100%|██████████| 2/2 [00:07<00:00, 3.87s/it]
Custom serialization logic based on the product’s extension
Under many circumstances, there are more suitable formats than pickle. For example, we may want to store lists or dictionaries as JSON files and everything else using pickle. The @serializer/@unserializer decorators accept a mapping as their first argument to dispatch to different functions depending on the product’s extension. Let’s see an example:
# Content of custom.py
from pathlib import Path
import pickle
import json

from ploomber.io import serializer, unserializer


def write_json(obj, product):
    Path(product).write_text(json.dumps(obj))


def read_json(product):
    return json.loads(Path(product).read_text())


@serializer({'.json': write_json})
def my_serializer(obj, product):
    Path(product).write_bytes(pickle.dumps(obj))


@unserializer({'.json': read_json})
def my_unserializer(product):
    return pickle.loads(Path(product).read_bytes())
Let’s modify our example pipeline. The product in the first task does not have an extension (output/one_dict), hence it will use the pickle-based logic. However, the products in the second task have a .json extension and will be saved as JSON files.
# Content of with-json.yaml
serializer: custom.my_serializer
unserializer: custom.my_unserializer

tasks:
  - source: tasks.first
    product: output/one_dict

  - source: tasks.second
    product:
      another: output/another_dict.json
      final: output/final_dict.json
[3]:
%%sh
ploomber build --entry-point with-json.yaml --force
name Ran? Elapsed (s) Percentage
------ ------ ------------- ------------
first True 0.001193 38.5834
second True 0.001899 61.4166
Building task 'second': 100%|██████████| 2/2 [00:06<00:00, 3.26s/it]
Let’s print the .json
files to verify they’re not pickle files:
[4]:
%%sh
cat output/another_dict.json
{"a": 3, "b": 2}
[5]:
%%sh
cat output/final_dict.json
{"a": 100, "b": 200}
Using a fallback format
Since it’s common to have a fallback serialization format, the decorators have a fallback argument that, when enabled, uses the pickle module whenever the product’s extension does not match any of the ones registered in the first argument. The example works the same as the previous one, except we don’t have to write our pickle-based logic.
fallback can also take the values joblib or cloudpickle. They’re similar to the pickle format but have some advantages: for example, joblib produces smaller files when the serialized object contains many NumPy arrays, while cloudpickle supports serializing some objects that the pickle module doesn’t. To use fallback='joblib' or fallback='cloudpickle', the corresponding module must be installed (a sketch of the joblib variant appears after the example below).
# Content of custom.py
from ploomber.io import serializer, unserializer


@serializer({'.json': write_json}, fallback=True)
def my_fallback_serializer(obj, product):
    pass


@unserializer({'.json': read_json}, fallback=True)
def my_fallback_unserializer(product):
    pass
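If we preferred joblib over pickle as the fallback format, the declaration would look like this. A minimal sketch, not part of the example files; it assumes joblib is installed and reuses the write_json and read_json helpers defined earlier:

from ploomber.io import serializer, unserializer


@serializer({'.json': write_json}, fallback='joblib')
def my_joblib_serializer(obj, product):
    pass


@unserializer({'.json': read_json}, fallback='joblib')
def my_joblib_unserializer(product):
    pass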
# Content of fallback.yaml
serializer: custom.my_fallback_serializer
unserializer: custom.my_fallback_unserializer

tasks:
  - source: tasks.first
    product: output/one_dict

  - source: tasks.second
    product:
      another: output/another_dict.json
      final: output/final_dict.json
[6]:
%%sh
ploomber build --entry-point fallback.yaml --force
name Ran? Elapsed (s) Percentage
------ ------ ------------- ------------
first True 0.002278 56.8505
second True 0.001729 43.1495
Building task 'second': 100%|██████████| 2/2 [00:06<00:00, 3.45s/it]
Let’s print the JSON files to verify their contents:
[7]:
%%sh
cat output/another_dict.json
{"a": 3, "b": 2}
[8]:
%%sh
cat output/final_dict.json
{"a": 100, "b": 200}
Using default serializers
Ploomber comes with a few convenient serialization functions to write more succinct serializers. We can request the use of such default serializers with the defaults argument, which takes a list of extensions:
# Content of custom.py
from ploomber.io import serializer, unserializer


@serializer(fallback=True, defaults=['.json'])
def my_defaults_serializer(obj, product):
    pass


@unserializer(fallback=True, defaults=['.json'])
def my_defaults_unserializer(product):
    pass
Here we’re asking to dispatch .json products and use pickle for all other extensions, the same as we did in the previous examples, except this time we don’t have to pass the mapping argument to the decorators.
defaults supports:
.json: the returned object must be JSON-serializable (e.g., a list or a dictionary)
.txt: the returned object must be a string
.csv: the returned object must be a pandas.DataFrame
.parquet: the returned object must be a pandas.DataFrame, and a parquet library should be installed (such as pyarrow)
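For instance, if a task returned a pandas.DataFrame, we could request the .csv default. A minimal sketch, not part of the example files; it assumes pandas is installed and uses a hypothetical task function named clean:

import pandas as pd

from ploomber.io import serializer, unserializer


@serializer(fallback=True, defaults=['.csv'])
def my_csv_serializer(obj, product):
    pass


@unserializer(fallback=True, defaults=['.csv'])
def my_csv_unserializer(product):
    pass


# a task whose product has a .csv extension simply returns a DataFrame
def clean():
    return pd.DataFrame({'a': [1, 2], 'b': [3, 4]})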
# Content of defaults.yaml
serializer: custom.my_defaults_serializer
unserializer: custom.my_defaults_unserializer

tasks:
  - source: tasks.first
    product: output/one_dict

  - source: tasks.second
    product:
      another: output/another_dict.json
      final: output/final_dict.json
[9]:
%%sh
ploomber build --entry-point defaults.yaml --force
name Ran? Elapsed (s) Percentage
------ ------ ------------- ------------
first True 0.001148 39.1675
second True 0.001783 60.8325
Building task 'second': 100%|██████████| 2/2 [00:07<00:00, 3.64s/it]
Let’s print the JSON files to verify their contents:
[10]:
%%sh
cat output/another_dict.json
{"a": 3, "b": 2}
[11]:
%%sh
cat output/final_dict.json
{"a": 100, "b": 200}
Wrapping up
Configuring a serializer
and unserializer
in your pipeline.yaml
is optional, but it helps you quickly generate a fully in-memory pipeline for serving predictions.
If you want to learn more about in-memory pipelines, check out the following guide.
For a complete example showing how to manage a training and a serving pipeline, and deploy it as a Flask API, click here.