Valohai blog

Insights from the deep learning industry.

Using DVC to version control your ML experiment data

In this post we explore how to use DVC as a version control system for machine learning data, how it integrates with the Valohai platform, and how Valohai can automate data version control with or without DVC.

DVC (https://dvc.org/) is an open-source command-line tool for version controlling binary data in the same way you version control code in Git. You hook it up to your data store (e.g. AWS S3 or Azure Blob Storage) and then pull and push files much as you would with Git.

In a nutshell, you would use DVC together with Git in the following way:

git pull # to pull a specific version of your code
dvc pull # if properly configured, this will download files used locally
dvc run ... python train.py # run something and record the results
dvc push # sends local files to remote storage
git add .
git commit -m 'Did some changes'
git push # save code changes, allowing others to `dvc pull` after `git pull`


DVC will create meta-files (*.dvc) for:

  • all datasets and artifacts relating to the project
  • each `dvc run` you do, recording the inputs and outputs of the command
  • metric files that record results from commands (the metric files themselves are saved to Git as-is)
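
To make the idea of a meta-file concrete, here is a minimal Python sketch of a pointer file: a small text file that goes into Git in place of the data and records a content hash of it. The `.pointer.json` name and its fields are hypothetical and chosen for illustration only. Real `.dvc` files are written by DVC itself and use their own layout.

```python
import hashlib
import json
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def write_pointer_file(data_path: Path) -> Path:
    """Write '<name>.pointer.json' next to data_path and return its path.

    Hypothetical format for illustration; this is not DVC's file format.
    """
    pointer = {
        "path": data_path.name,
        "md5": file_md5(data_path),
        "size": data_path.stat().st_size,
    }
    pointer_path = data_path.with_name(data_path.name + ".pointer.json")
    pointer_path.write_text(json.dumps(pointer, indent=2))
    return pointer_path
```

The pointer file is tiny and diffs cleanly, so it can live in Git while the large binary stays in the data store.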

Integrating DVC with Valohai

Before jumping into the how, you should first ask why you’d like to use DVC. Valohai on its own already version controls all input and output data from every experiment and pre-processing step automatically. Thus the only case where DVC makes sense is when you already use DVC and, for compliance (or nostalgic?) reasons, don’t want to get rid of it. In that case the instructions below apply.

Using DVC with Valohai is as easy as calling the `dvc` command-line tool. Valohai builds on defining machine learning pipeline steps, so you can add the `dvc` calls directly to a step’s commands. In the example below we run a “train-my-model” step on a TensorFlow container with a few additional commands for calling DVC. This way DVC runs automatically as part of every training run and stays transparent to the user.

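- step:
    name: train-my-model
    image: tensorflow/tensorflow:1.13.1-py3
    command:
        # configure your AWS credentials and DVC remote here if they are not set up in the Git repository
        # assumption: DVC is not preinstalled in this image, so install it first
        - pip install dvc
        - dvc pull
        - dvc run (dvc configuration) python train.py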

This, however, doesn’t add much functionally, as Valohai already provides data management for all inputs and outputs of your training runs.

Valohai’s automatic data management to the rescue!

The core difference in record keeping is that with DVC you track metadata about your data in your code repository (Git), whereas Valohai tracks it for you automatically in a dedicated database. Below are a few examples of the differences between DVC and Valohai.

Storing files... 

...in DVC:

  • You run `dvc add path/to/filename.ext` or `dvc run -o path/to/filename.ext <YOUR-COMMAND>` which generates the metafile `filename.ext.dvc`
  • Then you use `git add; git commit; git push` and `dvc push` to record the data.

...in Valohai:

  • In the Valohai runtime, you just write files to `/valohai/outputs`; Valohai stores them in your data store and records a reference to each. (You can also upload files by hand through the web UI.)
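
As a rough sketch of what writing an output looks like from training code, the Python helper below serializes a dictionary into the outputs directory. The helper name `save_output` and the `outputs_dir` parameter are our own illustrative choices, not part of any Valohai API; only the `/valohai/outputs` path comes from the text above.

```python
import json
from pathlib import Path

# Path taken from the text above; override it when testing outside Valohai.
DEFAULT_OUTPUTS_DIR = Path("/valohai/outputs")


def save_output(filename: str, payload: dict,
                outputs_dir: Path = DEFAULT_OUTPUTS_DIR) -> Path:
    """Write payload as JSON into outputs_dir and return the file path."""
    outputs_dir.mkdir(parents=True, exist_ok=True)
    target = outputs_dir / filename
    target.write_text(json.dumps(payload))
    return target
```

Inside a training script you would call something like `save_output("model-summary.json", {"layers": 3})` once the run has produced its results. Any other way of writing a file into that directory works just as well.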

Using files...

...in DVC:

  • you first go to a specific git commit with the file version you want using `git checkout` 
  • then run `dvc pull` to download the right version of the data
  • and finally train on that data

...in Valohai:

  • you specify the address of the files e.g. `s3://my-data/path/to/file.ext` and Valohai downloads them automatically before running your training code

 

Using old files with updated code…

...in DVC:

  • you pull old code with git and then
  • cherry pick specific file changes using `git` to get the old `.dvc` metafiles
  • and then run `dvc pull` to download the right version of the data

...in Valohai:

  • you select the older dataset from a dropdown in the UI, or if using the CLI you specify the address to the file (e.g. by looking it up in the web UI)

 

Tracking metadata & metrics while training...

...in DVC:

  • you write files in the format of your choosing using `dvc run -M my-metrics.csv <YOUR-COMMAND>` 
  • and you store the data with your code 
  • and you view the metadata as you wish (e.g. in a file editor)

...in Valohai:

  • you simply print JSON to stdout, e.g. `print(json.dumps({'loss': 0.123}))`, and Valohai will store it and visualize it automatically as graphs
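
A minimal sketch of the Valohai side in Python could look like the following. The `log_metrics` helper and the loss values are made up for illustration. The only behaviour taken from the text above is that each metric record is a JSON object printed to stdout.

```python
import json


def log_metrics(**metrics) -> str:
    """Print one JSON object on its own line and return that line."""
    line = json.dumps(metrics)
    print(line)
    return line


# Illustrative values, not real training results.
for epoch, loss in enumerate([0.9, 0.5, 0.3], start=1):
    log_metrics(epoch=epoch, loss=loss)
```

Printing one object per line keeps each record self-contained, so partial logs from an interrupted run stay readable.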

Build machine learning pipelines...

...in DVC:

  • you run multiple `dvc run`s with varying `-d` and `-o` configuration that will organically build a pipeline that you can then rerun with `dvc repro`.
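
The way dependencies (`-d`) and outputs (`-o`) add up to a pipeline can be sketched in a few lines of Python. This is not how DVC is implemented; it is an illustration of the underlying idea, with hypothetical stage and file names, that a stage must run after every stage producing one of its dependencies.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical stages: the files each depends on (-d) and produces (-o).
stages = {
    "preprocess": {"deps": ["raw.csv"], "outs": ["clean.csv"]},
    "train": {"deps": ["clean.csv"], "outs": ["model.pkl"]},
    "evaluate": {"deps": ["model.pkl", "clean.csv"], "outs": ["report.json"]},
}


def execution_order(stages: dict) -> list:
    """Return stage names ordered so each runs after the stages it needs."""
    # Map every output file to the stage that produces it.
    producer = {out: name for name, spec in stages.items() for out in spec["outs"]}
    # For each stage, collect the stages that produce its dependencies.
    # Files with no producer (such as raw.csv) are external inputs.
    graph = {
        name: {producer[dep] for dep in spec["deps"] if dep in producer}
        for name, spec in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())
```

For the stages above, `execution_order(stages)` gives `preprocess`, `train`, `evaluate`, the order in which a rerun would have to execute them.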

...in Valohai:

  • you specify pipelines with a dynamic syntax and then run them from the web UI

Note that Valohai pipelines are a more full-fledged processing solution where different steps can be run on different hardware or even operating systems, whereas DVC pipelines are run as a sequence on a single machine.

Conclusions

While we have seen that DVC can be used together with Valohai, we have also shown that Valohai automatically takes care of everything you do by hand with DVC. In conclusion, the only benefit of using DVC with Valohai comes when you’re already using DVC and have a specific need to keep versioning data with it. Otherwise, Valohai does all of the things above automatically.

Fredrik Rönnlund
Software Engineer turned marketing lizard turned product dadbod turned ML nerd. In charge of growth at Valohai, i.e. the co-operation between products, marketing and sales.
