By now you’ve surely heard about Kubeflow, the machine learning platform based out of Google. Kubeflow basically connects TensorFlow’s ML model building with Kubernetes’ scalable infrastructure (thus the name Kube and Flow) so that you can concentrate on building your predictive model logic, without having to worry about the underlying infrastructure. At least in theory.
In this blog post, we’ll look at what Kubeflow consists of and how you would go about setting up your own Kubeflow pipelines. We assume you have basic knowledge about TensorFlow and Kubernetes, although in an ideal world you’d be up and running without knowing about the latter.
What is Kubeflow?
Kubeflow consists of 4 main components that you’ll see when you open the admin console:
JupyterHub: allows spawning Notebook servers for interactive development.
TFJobs: allows monitoring your running Kubernetes training jobs (there are other more or less maintained job types too)
Katib: hyperparameter tuning tools (Study, StudyJob)
Pipelines: acyclic graphs of containerized operations written in Python, passing outputs to inputs (strings)
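To make the Pipelines idea a bit more concrete, here is a minimal sketch in plain Python of an acyclic graph of steps that hand string outputs to downstream inputs. It does not use the Kubeflow Pipelines SDK; the `run_pipeline` helper, the step names and the file paths are all hypothetical, and each function merely stands in for a containerized operation.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(steps, dependencies):
    """Run each step after its upstream steps, passing string outputs along.

    steps:        {name: function(dict of upstream outputs) -> str}
    dependencies: {name: set of upstream step names}
    """
    outputs = {}
    # static_order() raises CycleError if the graph is not acyclic.
    for name in TopologicalSorter(dependencies).static_order():
        upstream = {dep: outputs[dep] for dep in dependencies.get(name, ())}
        outputs[name] = str(steps[name](upstream))
    return outputs


# Hypothetical three-step pipeline; each function stands in for a container.
steps = {
    "preprocess": lambda inputs: "data/clean.csv",
    "train": lambda inputs: "models/from-" + inputs["preprocess"],
    "evaluate": lambda inputs: "report for " + inputs["train"],
}
dependencies = {
    "preprocess": set(),
    "train": {"preprocess"},
    "evaluate": {"train"},
}

results = run_pipeline(steps, dependencies)
```

Because only strings travel between steps, anything larger (datasets, model files) would typically be passed as a path or URI rather than as the data itself.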
To get started with all of this, you need at least a basic understanding of how to configure scalable Kubernetes clusters. You’ll also need to install ksonnet and learn how to use it before you can use Kubeflow, which might put off a few people. [Update Feb 10th 2019: Ksonnet was discontinued a few days ago]
Most software engineers know how to build and set up their own Docker images, so this shouldn’t be a problem – but for data scientists without a Software Engineering background, it’s one more thing to learn. Setting up everything requires some research, but isn’t too hard for an experienced software developer.
In summary, Kubeflow is a hassle to set up the first time, so prototyping with it takes more than a little effort. But once you have it up and running, experimenting is a whole lot faster. Ideally, somebody on your infrastructure team would set it up so that you can start playing around with it.
How to use Kubeflow as a Data Scientist?
In practice, you’d write your code (in JupyterHub or elsewhere, using TensorFlow; other frameworks are supported too, but TF is clearly the first-class citizen), wrap it all up in a Docker container together with its dependencies, and pass all of that to Kubeflow. To deploy your code to Kubernetes, you build your local project into a Docker image and push that image to Container Registry so that it’s available to the cluster.
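The build-and-push step can be scripted in a few lines. The sketch below only assembles the standard `docker build` and `docker push` invocations and leaves the actual execution commented out. The project name, image name and tag are hypothetical, and pushing assumes Docker is installed and already authenticated against your registry.

```python
import subprocess


def image_reference(registry, project, name, tag):
    """Compose a full image reference such as gcr.io/my-project/trainer:v1."""
    return f"{registry}/{project}/{name}:{tag}"


def build_and_push_commands(image, context_dir="."):
    """Return the docker CLI invocations as argument lists, without running them."""
    return [
        ["docker", "build", "-t", image, context_dir],
        ["docker", "push", image],
    ]


def run_all(commands):
    """Execute the commands in order; check=True stops at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


# Hypothetical project, image name and tag.
image = image_reference("gcr.io", "my-project", "trainer", "v1")
commands = build_and_push_commands(image)
# run_all(commands)  # uncomment on a machine with Docker and registry access
```

Keeping command construction separate from execution makes it easy to log or review exactly what will run before anything touches the registry.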
Kubeflow will then launch your GCP instances (other cloud providers will most probably follow shortly, but some Kubeflow components like Pipelines are only available on GCP as of today), fetch your data through TensorFlow’s native APIs and give you your results. Easy peasy, as long as you know your way around Docker and have a Kubeflow farm set up somewhere for you. Our current customers don’t need to worry about Docker images or Kubernetes clusters, as all of that is provided as a service. However, there is quite a high probability that Kubeflow will power the Valohai backend in the future.
What about Version Control?
Kubeflow’s main focus is on orchestration, and with Kubernetes in the background it shines at it. But at least for our current customers, machine orchestration isn’t everything. With Kubeflow, the metadata about your run isn’t stored anywhere centrally, like it is with Valohai today. You should always store the Docker image you built so that you can dig into it later to find out which version of the code was run and in which environment. Job parameters are stored as ksonnet component parameters in local `params.libsonnet` files, which you need to version manually.
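One way to fill that gap yourself is to write a small manifest next to every run. The sketch below is an assumption about what such a record could look like, not something Kubeflow or Valohai provides: the `run_record` helper, the field names and the example values are all hypothetical.

```python
import hashlib
import json


def run_record(image, code_version, params):
    """Bundle what is needed to trace a run back to its code and settings."""
    record = {
        "image": image,
        "code_version": code_version,
        "params": params,
    }
    # A stable fingerprint: identical inputs always hash to the same ID.
    canonical = json.dumps(record, sort_keys=True)
    record["run_id"] = hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]
    return record


# Hypothetical image reference, commit hash and job parameters.
record = run_record(
    image="gcr.io/my-project/trainer:v1",
    code_version="abc1234",
    params={"learning_rate": 0.01, "epochs": 10},
)
manifest = json.dumps(record, indent=2, sort_keys=True)
```

Committing the resulting `manifest` text to version control, or storing it alongside the run's outputs, gives you one place to look up which image and parameters produced a given result.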
We didn’t find best practices for version controlling your input or output data, so you’ll have to figure that out on your own – but as Kubeflow gets more traction, best practices are bound to emerge. The good news is that everything is Dockerized so as long as you store the container, you’ll have the code and libraries in one place.
What are the main benefits of Kubeflow?
The main benefits of running on Kubeflow come from Kubernetes and its scalability. Once you have everything up, running your training at scale is a breeze. Also, hyperparameter tuning with Katib is really cool!
Going forward with Kubeflow?
We at Valohai are seriously evaluating Kubeflow as our backend for the future. Once it matures a bit more, it would let us remove one piece of the orchestration puzzle and concentrate on version control and a nice UI & API for everything.
At the time of writing, however, our customers would lose several features (automatic input & output management, support for all major cloud providers and zero-setup infrastructure, to name a few), so today Kubeflow on Valohai is more of a technical PoC. Also, we aim to abstract all of that away in the long run. We think that data scientists shouldn’t have to worry about what is running their code in the background, much less about setting up an environment. Data scientists should be able to just write their code, bring in their data and BOOM – get their results. Iteration speed in experimentation is everything in data science.
Going forward we see large potential for Kubeflow and are big fans of the project. If you’d like to try out Kubeflow yourself, head over to https://github.com/kubeflow/kubeflow or sign up for our Kubeflow beta to be among the first ones to run Valohai on top of Kubeflow!