
Level Up Your ML Code from Notebook to Production

[Image: leveling up machine learning code from notebook to production]

Developing a machine learning model for a new project starts with common groundwork and exploration: understanding your data and figuring out which approaches to try. A popular choice for this groundwork is Jupyter, an environment where you write Python code interactively. In a Jupyter notebook's cells you can evaluate and revise code on the fly, which makes it an attractive, visual choice (and often the right one) for this stage of data science work. Since Jupyter kernels, the processes backing a notebook's execution, retain their internal state while the code is being edited and revised, they provide a highly interactive, fast-feedback environment.

[Image: Jupyter notebook is not for scaling machine learning code]

However, while convenient, Jupyter notebooks can be hard to reason about precisely because of this retention of state: after re-evaluating an earlier cell, the state of your environment may have changed in a non-linear fashion (or, worse yet, been left inconsistent). It's entirely possible to have a saved notebook that can't be successfully evaluated after relaunching the kernel. Since production-grade code is supposed to be easy to test and review, as we've learned as an industry, this isn't desirable at all.

It can also be difficult to keep track of the exact versions of the dependencies you used during development. For instance, a model that worked fine with a certain version of TensorFlow might not run at all with a newer one down the line, and it's tedious to reconstruct what exactly was being run at the time of exploration.
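
A common safeguard, not specific to this post, is to pin your environment while you're still exploring; the package names and version numbers below are only illustrative:

    # Snapshot the exact versions currently installed:
    pip freeze > requirements.txt

    # requirements.txt now pins every dependency, e.g.:
    #   tensorflow==1.10.0
    #   numpy==1.15.1
    #
    # Recreate the same environment later with:
    pip install -r requirements.txt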

Jupyter Notebook in Production. Or Not.

Let's assume you've played nice and been fastidious enough to avoid these problems: all your dependencies are locked down, and you've discovered that you can actually run your notebook non-interactively with jupyter nbconvert --to notebook --execute notebook.ipynb, perhaps piping the output into a file to track results. You'll inevitably want to run your training code with different parameters (say, learning rates or network structures); jupyter nbconvert --execute isn't really conducive to that, and editing the notebook, or perhaps a separate configuration file, just to change constants is silly, too.
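
For reference, a non-interactive run that also saves the executed notebook might look roughly like this (the output filename is our choice; --output is a standard nbconvert option):

    # Execute all cells top to bottom and write the result,
    # including cell outputs, to a separate notebook file:
    jupyter nbconvert --to notebook --execute notebook.ipynb --output executed.ipynb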

From Jupyter Notebook to Regular Code for Production

These are some of the reasons why we advocate using regular, linear Python scripts instead of Jupyter/IPython notebooks when moving from initial exploration to something resembling production. Another benefit is that you get to develop in your favorite editor or IDE, be it vim, Emacs, VS Code, or PyCharm (which, by the way, has an excellent Scientific Mode – and support for notebooks, too), instead of being confined to a browser. Switching from notebooks to regular code also lets you refactor your solution into a more modular, more easily testable and reviewable package.
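
As a sketch of what that refactoring might look like (the parameter names and defaults here are ours, purely illustrative), a minimal training script could expose its knobs via Python's standard argparse module:

    import argparse

    def train(learning_rate, hidden_units, epochs):
        # Stand-in for the real training loop; being a plain function,
        # it can be imported and unit-tested on its own.
        print(f"Training with lr={learning_rate}, "
              f"hidden_units={hidden_units}, epochs={epochs}")

    if __name__ == "__main__":
        parser = argparse.ArgumentParser(description="Train the model.")
        parser.add_argument("--learning-rate", type=float, default=0.001)
        parser.add_argument("--hidden-units", type=int, default=128)
        parser.add_argument("--epochs", type=int, default=10)
        args = parser.parse_args()
        train(args.learning_rate, args.hidden_units, args.epochs)

Running python train.py --learning-rate 0.01 --epochs 20 then covers the parameter-sweeping use case from the previous section without touching the code at all.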

Of course there are drawbacks: development is less interactive, and since there is no persistent state, everything needs to be evaluated or loaded from scratch at every invocation of your script. On the other hand, this property pushes you to think about preprocessing your data into a faster-to-load format earlier. And when you extract the preprocessing code from the rest, it becomes easier to maintain and more reproducible, as well.
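
As an illustration of that preprocessing split (the file names, column name, and format choice here are ours), you might do the slow parsing once and cache a binary artifact:

    # preprocess.py -- run once; cache the parsed data in a fast binary format.
    import numpy as np
    import pandas as pd

    def preprocess(csv_path, out_path):
        df = pd.read_csv(csv_path)  # slow, text-based parsing happens only here
        features = df.drop(columns=["label"]).to_numpy(dtype="float32")
        labels = df["label"].to_numpy()
        np.savez_compressed(out_path, features=features, labels=labels)

    if __name__ == "__main__":
        preprocess("raw_data.csv", "dataset.npz")

The training script then only needs np.load("dataset.npz"), which is far faster than re-parsing the raw files on every run, and the preprocessing logic lives in one reviewable place.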

 

EDIT: @joelgrus apparently talked about this very same thing at JupyterCon last week! The slides are hilarious – I suggest checking them out.

Summary

Notebooks
  • Fast and interactive
  • Visual feedback

Scripts
  • Reviewable
  • Testable
  • Reproducible
  • Easily version controlled

* Image of stairs derived from https://unsplash.com/photos/pKvmGR4qHrg

Aarni Koskela
CTO and Founder of Valohai
