Flexible ML Experiment Tracking System for Python Coders with DVC and Streamlit
- https://2022.pycon.de/program/WADNGC/
- Antoine Toubhans
- Slides: https://github.com/sicara/pycon-2022-dvc-streamlit
- YouTube: https://www.youtube.com/watch?v=YOSVMMwTlHM
Motivation:
- in addition to the code, the data should also be versioned;
- at its core, ML engineering is exploratory work: one cannot know whether a model will work before testing it;
- there is no clear way to guarantee the quality of a trained model: the data scientist has to play with it to make it “talk”.
DVC (Data Version Control): Versioning your ML data, code, and models
Streamlit: Build apps to play with trained models
MLflow vs DVC + Streamlit
- MLflow is a complete framework.
- DVC and Streamlit give you flexibility.
Train pipeline
- download_data.py
- split_dataset.py
- train.py
- evaluate.py
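A minimal sketch of how these four scripts could be described as dependent stages, in the spirit of the cmd/deps/outs keys that DVC uses in dvc.yaml. The script names are written in lowercase here, and every data and metrics path (data/raw, data/train, data/test, metrics.json) is a hypothetical placeholder, not taken from the talk. The helper only derives a valid run order from the declared dependencies and assumes there are no cycles; it is not how DVC itself schedules stages.

```python
# Stage names mirror the scripts listed above.
# All data/metrics paths are hypothetical placeholders for illustration.
STAGES = {
    "download_data": {
        "cmd": "python download_data.py",
        "deps": ["download_data.py"],
        "outs": ["data/raw"],
    },
    "split_dataset": {
        "cmd": "python split_dataset.py",
        "deps": ["split_dataset.py", "data/raw"],
        "outs": ["data/train", "data/test"],
    },
    "train": {
        "cmd": "python train.py",
        "deps": ["train.py", "data/train"],
        "outs": ["model.h5"],
    },
    "evaluate": {
        "cmd": "python evaluate.py",
        "deps": ["evaluate.py", "model.h5", "data/test"],
        "outs": ["metrics.json"],
    },
}


def execution_order(stages):
    """Order stages so each one runs after the stages producing its deps.

    Assumes the dependency graph has no cycles.
    """
    # Map every declared output to the stage that produces it.
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    order, seen = [], set()

    def visit(name):
        if name in seen:
            return
        seen.add(name)
        for dep in stages[name]["deps"]:
            if dep in producer:  # dep is another stage's output
                visit(producer[dep])
        order.append(name)

    for name in stages:
        visit(name)
    return order


print(execution_order(STAGES))
```

Declaring outputs and dependencies explicitly is what lets a tool work out which stages are affected when a script or a data file changes.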
DVC (Data Version Control)
- What goes to Git and what goes to DVC?
- Git: train.py, evaluate.py
- Running dvc add on the model generates model.h5.dvc (Git tracked); model.h5 itself becomes DVC tracked.
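To make the split concrete: the small Git-tracked file is essentially a pointer that records a content hash of the large DVC-tracked file. The sketch below imitates that idea in plain Python. It is a simplified illustration, not DVC's actual implementation; the make_pointer helper is hypothetical, and the field layout only loosely follows what a real .dvc file contains.

```python
import hashlib
import tempfile
from pathlib import Path


def make_pointer(path: Path) -> dict:
    """Build a simplified, .dvc-like pointer for `path` (illustration only)."""
    data = path.read_bytes()
    return {
        "outs": [
            {
                "md5": hashlib.md5(data).hexdigest(),  # content hash
                "size": len(data),                     # size in bytes
                "path": path.name,                     # file being pointed at
            }
        ]
    }


# Demo with a throwaway file standing in for model.h5.
with tempfile.TemporaryDirectory() as tmp:
    model = Path(tmp) / "model.h5"
    model.write_bytes(b"fake model weights")
    pointer = make_pointer(model)
    print(pointer)
```

Because the pointer is tiny and changes whenever the model content changes, committing it to Git ties each commit to one exact version of the model.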
Data work has a non-linear workflow, whereas classical software engineering has a linear one.
With DVC, you can get back:
- all experiments for a given commit hash
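One way to read this: checking out a commit with Git restores the code and the pointer files, and a following dvc checkout syncs the data and model files to match those pointers. A minimal sketch, assuming both git and dvc are installed and the working directory is a DVC-enabled repository. The restore helper and the revision abc1234 are hypothetical, and the default is a dry run that only prints the commands.

```python
import subprocess


def restore_commands(rev: str) -> list[list[str]]:
    """Commands that bring code and DVC-tracked files back to `rev`."""
    return [
        ["git", "checkout", rev],  # code and .dvc pointer files
        ["dvc", "checkout"],       # data/model files matching the pointers
    ]


def restore(rev: str, dry_run: bool = True) -> list[list[str]]:
    """Print the commands (dry run) or execute them in order."""
    commands = restore_commands(rev)
    for cmd in commands:
        if dry_run:
            print(" ".join(cmd))
        else:
            subprocess.run(cmd, check=True)  # stop on the first failure
    return commands


# "abc1234" is a placeholder commit hash.
restore("abc1234")
```

Passing dry_run=False would actually run both commands, so it is best tried on a clean working tree.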
Streamlit
- Streamlit + DVC don't need any infrastructure, unlike MLflow.
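As an illustration of "playing" with a model, here is a very small Streamlit app sketch. The classify function is a hypothetical stand-in for a real trained model, and the widget labels are made up for the example. The import is guarded only so that the helper can still be imported where Streamlit is not installed.

```python
try:
    import streamlit as st  # third-party: pip install streamlit
except ImportError:  # allows importing classify() without Streamlit
    st = None


def classify(score: float, threshold: float) -> str:
    """Hypothetical stand-in for a trained model's decision rule."""
    return "positive" if score >= threshold else "negative"


def main() -> None:
    st.title("Model playground")
    # Sliders let the user vary inputs and see the prediction update.
    threshold = st.slider("Decision threshold", 0.0, 1.0, 0.5)
    score = st.slider("Model score for one example", 0.0, 1.0, 0.7)
    st.write("Prediction:", classify(score, threshold))


if __name__ == "__main__" and st is not None:
    main()
```

Saved as, say, app.py (a hypothetical file name), this would be started with streamlit run app.py and served locally.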