Metaflow logo

Metaflow is (my favorite, simply the best 😍) a machine learning library that offers simple Python decorators to define reproducible data engineering, model training, model validation, and other steps, and to execute them locally or in the cloud on AWS, Kubernetes, or Titus.

Metaflow is open-source and used at Netflix and many other companies in production machine learning and data science workflows.

What problems does Metaflow help to solve?

  • Get training data, train a model on a schedule, and keep an audit trail of all the training executions ✅

I've seen Metaflow being used for small ETL jobs as well as for multi-day training marathons. Its simplicity makes it an extremely versatile library.

Want to try it out? This exercise will take just a few minutes of your time! I advise performing the steps below in a Python virtual environment. ❗️

You can quickly create a virtual environment with the virtualenv Python module.
The commands are similar on Mac and Linux and slightly different on Windows.

python3 -m virtualenv venv
# activate new virtualenv
source venv/bin/activate
pip3 install metaflow

Let's start with a simple flow to make sure everything works. Create a file named metaflow_start.py with the code snippet below:

from metaflow import FlowSpec, step

class LinearFlow(FlowSpec):

    @step
    def start(self):
        self.my_var = 'hello world'
        self.next(self.step_one)

    @step
    def step_one(self):
        print('the data artifact is: %s' % self.my_var)
        self.next(self.end)

    @step
    def end(self):
        print('the data artifact is still: %s' % self.my_var)

if __name__ == '__main__':
    LinearFlow()

To execute the flow, run:

python3 metaflow_start.py run

You should see output similar to this in the console:

Metaflow 2.4.3 executing LinearFlow for user:{your_user_name}
Validating your flow...
The graph looks good!
Running pylint...
Pylint is happy!
Workflow starting (run-id 1637382785717584):
2021-11-19 20:33:05.736 [1637382785717584/start/1 (pid 6096)] Task is starting.
... Task finished successfully.
...Task is starting.
.../step_one/2 the data artifact is: hello world... Task finished successfully.
... Task is starting.
... the data artifact is still: hello world... Task finished successfully.
... Done!

🎉🎉🎉 You have created your first flow! 🎉🎉🎉

There are a few essential features of Metaflow showcased in the above example.

  • When you assign a value to self inside a step, it is saved as a data artifact and becomes available to every subsequent step, all the way to the last one, ❗️ unless the flow splits into parallel steps somewhere in the middle. ❗️
The image shows steps A, B, and C connected in a flow from A to B and from B to C. Step B is a parallel step with many instances of itself. Step C is a join step that merges all the compute performed in step B.

As you can see, it is pretty easy to enhance your Python code with Metaflow and add parallel processing or cloud computing to your data science project. Thank you for reading!

Software Engineer at Netflix focusing on AI. Co-founder of Vortle https://www.vortle.com