Starting with Metaflow
Metaflow is (my favorite, just the best) a machine learning library that offers simple python decorators to establish reproducible data engineering, model training, model validation, and other steps, and to execute them locally or in the cloud on AWS, Kubernetes, or Titus.
Metaflow is open-source and used at Netflix and many other companies in production machine learning and data science workflows.
What problems does Metaflow help to solve?
- Get training data, train a model on a schedule, and keep an audit trail of all the training executions ✅
- Establish an ETL pipeline with just a few lines of python code ✅
- Train a large-scale model on AWS, Kubernetes, or Titus, again with just a few lines of python ✅
- Quickly establish directed graphs of different computation steps for parallel computing? Check! ✅ (a short sketch follows this list)
- Resume your compute from a certain step? ✅
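To make the directed-graph and parallel-computing points concrete, here is a minimal sketch of a fan-out/fan-in flow. The flow name, step names, and the params list are made up for illustration; the foreach argument to self.next, the self.input attribute, and the join-step signature are Metaflow's standard constructs for this pattern.

from metaflow import FlowSpec, step

class FanOutFlow(FlowSpec):

    @step
    def start(self):
        # each item in this list gets its own parallel task
        self.params = ['a', 'b', 'c']
        self.next(self.process, foreach='params')

    @step
    def process(self):
        # self.input holds the item assigned to this branch
        self.result = self.input.upper()
        self.next(self.join)

    @step
    def join(self, inputs):
        # fan back in: collect one result per branch
        self.results = [branch.result for branch in inputs]
        self.next(self.end)

    @step
    def end(self):
        print(self.results)

if __name__ == '__main__':
    FanOutFlow()

Running it works the same way as the example further down (python3 fan_out_flow.py run, assuming that filename), and the resume point from the checklist applies here too: if a step fails, python3 fan_out_flow.py resume restarts from the failed step instead of recomputing the whole graph.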
I've seen Metaflow being used for small ETL jobs as well as for multi-day training marathons. Its simplicity makes it an extremely versatile library.
Want to try it out? This exercise will take just a few minutes of your time! I advise performing the steps below in a python virtual environment.
You can quickly create a virtual environment with the python module virtualenv.
The commands are similar between Mac and Linux and slightly different on Windows.
python3 -m virtualenv venv
# activate the new virtualenv
source venv/bin/activate
# install metaflow into it
pip3 install metaflow
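On Windows the activation script lives in a Scripts folder instead of bin, so the virtualenv part looks roughly like this (assuming the same venv name):

python -m virtualenv venv
# activate the virtualenv from cmd (PowerShell uses venv\Scripts\Activate.ps1)
venv\Scripts\activate
pip install metaflow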
Let's start with a simple flow to make sure everything works. Create a metaflow_start.py file with the below code snippet:
from metaflow import FlowSpec, step

class LinearFlow(FlowSpec):

    @step
    def start(self):
        self.my_var = 'hello world'
        self.next(self.step_one)

    @step
    def step_one(self):
        print('the data artifact is: %s' % self.my_var)
        self.next(self.end)

    @step
    def end(self):
        print('the data artifact is still: %s' % self.my_var)

if __name__ == '__main__':
    LinearFlow()
To execute the flow, let's run
python3 metaflow_start.py run
You should see output similar to this in the console:
Metaflow 2.4.3 executing LinearFlow for user:{your_user_name}
Validating your flow...
The graph looks good!
Running pylint...
Pylint is happy!
Workflow starting (run-id 1637382785717584):
2021-11-19 20:33:05.736 [1637382785717584/start/1 (pid 6096)] Task is starting.
... Task finished successfully.
... [.../step_one/2] Task is starting.
... [.../step_one/2] the data artifact is: hello world
... [.../step_one/2] Task finished successfully.
... Task is starting.
... the data artifact is still: hello world
... Task finished successfully.
... Done!
🎉 You have created your first flow! 🎉
There are a few essential features of Metaflow showcased in the above example.
- When you assign a value to self inside a step, it becomes a data artifact that is automatically available in every subsequent step, all the way to the last one.
- The exception is a parallel split: artifacts do not flow across the join automatically. In the join step you either read them from the individual branches or call merge_artifacts to carry them forward (see the branch sketch below).
- When you run your flow on AWS, values assigned to self are pickled and stored in the S3 object store. In the example above, my_var gets the value 'hello world' and can then be read in any later step. You can use this pattern to pass DataFrames, media files, and other artifacts between steps.
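To illustrate the split-and-join caveat from the bullets above, here is a minimal sketch with a static branch; the flow, step, and artifact names are invented for illustration, but merge_artifacts is Metaflow's standard way to carry non-conflicting artifacts across a join.

from metaflow import FlowSpec, step

class JoinFlow(FlowSpec):

    @step
    def start(self):
        self.common = 'set before the split'  # visible inside both branches
        self.next(self.branch_a, self.branch_b)

    @step
    def branch_a(self):
        self.score = 1  # conflicts with branch_b's value
        self.next(self.join)

    @step
    def branch_b(self):
        self.score = 2
        self.next(self.join)

    @step
    def join(self, inputs):
        # resolve the conflicting artifact explicitly...
        self.best = max(branch.score for branch in inputs)
        # ...and let merge_artifacts carry over everything that did not diverge
        self.merge_artifacts(inputs, exclude=['score'])
        self.next(self.end)

    @step
    def end(self):
        print(self.common, self.best)

if __name__ == '__main__':
    JoinFlow()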
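Because artifacts are versioned and stored per run, you can also read them back after the fact with Metaflow's Client API. A minimal sketch, assuming the LinearFlow above has been run at least once on this machine:

from metaflow import Flow

# grab the most recent run of LinearFlow and read its stored artifact
run = Flow('LinearFlow').latest_run
print(run.id, run.data.my_var)  # e.g. 1637382785717584 hello world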
As you can see, it is pretty easy to enhance your python code with Metaflow and add parallel processing or cloud computing to your data science project. Thank you for reading!