Valohai blog

Insights from the deep learning industry.

All Posts

Michelangelo – Machine Learning Infrastructure at Uber

When we founded Valohai two years ago, we were lucky to make friends with team leads for Uber’s Michelangelo machine learning platform. Michelangelo has been an inspiration in building Valohai for the other 99.999...% of companies that aren’t Uber but still need to speed up their machine learning through automation.


We’re thus happy to sit down with Mike Del Balso, the former lead of Uber’s Michelangelo, to talk about his thoughts on machine learning automation and DevOps. Mike was the Product Manager for the Machine Learning Infrastructure team at Uber. He recently left Uber to start a new venture, about which he isn’t saying much yet.

Thanks for taking this interview, Mike. Could you briefly describe what your role was at Uber?

Thanks for having me! At Uber, I was the founding product manager for Uber’s machine learning platform, Michelangelo. As the product manager, I was responsible for understanding Uber’s business needs, crafting a practical but ambitious vision for how machine learning can transform decision-making across Uber, and defining and managing a machine learning platform (Michelangelo) to scalably address those needs by making world-class machine learning easy and accessible to every data scientist and engineer in the company.

Excellent. So if we were to crystallize the goal of Michelangelo, how would you describe it? What problem does Michelangelo solve?

The core goal was to democratize machine learning across the company and bring state-of-the-art, intelligence-driven decision-making to every corner of the product and business. Achieving this goal required developing new tools and methods to build, organize, and productionize machine learning systems. We dealt with technology challenges as well as people and organization challenges, such as ensuring we had a vibrant and collaborative ML community.

However, most of our time was spent solving difficult technical challenges around building a world-class machine learning platform. This involved innovating in every area of the ML stack. A big area of focus was around making it as easy as possible for data scientists to access and use the company’s data to build new ML models at scale and making a “1-click” experience for deploying those models to production.

How does Michelangelo support bringing models faster to production?

That’s a good question. The main challenge with bringing models to production quickly is that data scientists often don’t have the skills to bring a system all the way to production and therefore often need support from engineering teams. Michelangelo alleviates this need by helping them navigate the entire ML workflow without depending on engineers or other job functions. This is achieved through tight integration with Uber’s data infrastructure, which makes data lakes easily accessible and analyzable, and by making training as simple as possible through automated orchestration, a simple UI, and sophisticated metadata management.
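The orchestration-plus-metadata idea Mike describes can be sketched in miniature. This is a purely illustrative toy, not Michelangelo's implementation: a pipeline runs each workflow step in order and records metadata about each run, which is what makes runs auditable and repeatable.

```python
# Illustrative sketch (not Michelangelo's actual system): an orchestrator
# runs workflow steps in order, passing a shared context along, and records
# per-step metadata so every run is reproducible and auditable.
import time

def run_pipeline(steps, context=None):
    context, metadata = context or {}, []
    for name, step in steps:
        start = time.time()
        context = step(context)  # each step transforms the shared context
        metadata.append({"step": name, "seconds": time.time() - start})
    return context, metadata

# Toy three-step workflow: extract data, "train" a trivial model, evaluate it.
steps = [
    ("extract",  lambda ctx: {**ctx, "rows": [1, 2, 3, 4]}),
    ("train",    lambda ctx: {**ctx, "model_mean": sum(ctx["rows"]) / len(ctx["rows"])}),
    ("evaluate", lambda ctx: {**ctx, "error": abs(ctx["model_mean"] - 2.5)}),
]
result, meta = run_pipeline(steps)
```

The recorded `meta` list is the seed of an audit trail: which steps ran, in what order, and how long each took.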

That sounds cool! And that’s been our guiding principle at Valohai as well: to help data scientists work by automating ML DevOps. What have you done at Michelangelo to help with training on your big data?

When we got started with Michelangelo, there was no tooling that explicitly supported training on large datasets. From the beginning, we focused Michelangelo on distributed learning on big data and allowing those models to get into production. We later built and open sourced a library for distributed deep learning called Horovod.
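The core operation behind distributed data-parallel training, which Horovod implements at scale with ring-allreduce, can be shown with a toy sketch. This is a conceptual illustration in plain Python, not Horovod's API: each worker computes a gradient on its own data shard, the gradients are averaged across workers, and all workers apply the same synchronized update.

```python
# Conceptual sketch of synchronous data-parallel training (the idea behind
# Horovod's ring-allreduce), not Horovod's actual API. Each worker computes
# a gradient on its shard; the averaged gradient drives one shared update.

def local_gradient(weight, shard):
    # Toy mean-squared-error gradient for the model y = weight * x.
    return sum(2 * (weight * x - y) * x for x, y in shard) / len(shard)

def allreduce_mean(values):
    # Stand-in for an allreduce: average one gradient per worker.
    return sum(values) / len(values)

def distributed_step(weight, shards, lr=0.01):
    grads = [local_gradient(weight, s) for s in shards]  # one per "worker"
    return weight - lr * allreduce_mean(grads)           # synchronized update

# Two workers, two shards drawn from the line y = 3x.
shards = [[(1.0, 3.0), (2.0, 6.0)], [(3.0, 9.0), (4.0, 12.0)]]
w = 0.0
for _ in range(200):
    w = distributed_step(w, shards)
# w converges toward 3.0, matching training on the pooled data.
```

The point of the averaging step is that every worker ends each iteration with identical weights, so adding workers scales throughput without changing the model that training converges to.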

Wow, that’s awesome! So I’ve heard stories about Uber-scale deployments. Tell me a little about your inference.

Production models at Uber have high load requirements: many models need to serve millions of predictions per second. It doesn’t make sense for individual data science teams to build these systems on their own.

Michelangelo gives data scientists a 1-click deploy experience to spin up the right data and serving infrastructure to meet their serving needs. This isn’t just about serving model predictions; it’s also about serving feature data to the models in real time. Our blog posts go into quite some depth on Michelangelo’s feature computation and serving infrastructure, which really makes the experience magical for data scientists.
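The real-time feature-serving idea can be sketched as follows. All names and structures here are hypothetical, chosen for illustration, and are not Michelangelo's real API: precomputed features live in a low-latency online store and are joined to each request at prediction time, so the model is scored on the same features it was trained on.

```python
# Hypothetical sketch of online feature serving (illustrative names only,
# not Michelangelo's API): precomputed features sit in a fast key-value
# store and are joined to the request at inference time.

class OnlineFeatureStore:
    def __init__(self):
        self._store = {}  # entity_id -> {feature_name: value}

    def put(self, entity_id, features):
        self._store[entity_id] = features

    def get(self, entity_id, feature_names):
        row = self._store.get(entity_id, {})
        # Missing features default to 0.0 so serving never blocks a request.
        return [row.get(name, 0.0) for name in feature_names]

def predict(store, entity_id, weights, feature_names):
    # Join the request with stored features, then score a toy linear model.
    features = store.get(entity_id, feature_names)
    return sum(w * f for w, f in zip(weights, features))

store = OnlineFeatureStore()
store.put("rider_42", {"avg_trip_km": 5.0, "trips_last_week": 3.0})
score = predict(store, "rider_42", weights=[0.2, 0.1],
                feature_names=["avg_trip_km", "trips_last_week"])
# score ≈ 0.2 * 5.0 + 0.1 * 3.0 = 1.3
```

In production the dictionary would be replaced by a low-latency store such as a key-value database, and the hard part, as Mike notes later, is keeping the online features consistent with those used at training time.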

Knowing all this, what would you say has been hardest about building and planning Michelangelo?

One of the hardest parts is making everything work smoothly together in a seamless experience. We went through many ups and downs with that before we nailed it. If you support a long workflow with a wide surface area, you have to get lots of things right to provide a good experience to your users. We were lucky to have great data infrastructure to build on top of and many talented data scientists and engineers to work with as our partners.

One often overlooked challenge is the difficulty of the data side across the ML workflow. There’s a lot of discussion and interest around building and deploying models, but you also need to productionize the data to get it to the models at inference time. That’s an area that’s really tough to get right and I think Michelangelo went a long way to solving that for Uber. We see many companies struggling with this today.

Thanks a million for shedding some light on Uber’s Michelangelo. I’m sure all Data Science leads will be reading this with interest! Any parting words to our readers?

A principle that we learned along the way was to meet the users where they are: don’t try to force your data scientists to adopt tools or workflows that are unnatural to them. It’s a lot easier to go to the development tools they love and use every day and support them there than to rebuild new experiences from scratch. The Michelangelo team built CLIs, notebook add-ons, a web UI, and various integrations with cloud providers and frameworks. The work never ends. :)

I spend a lot of time talking to other organizations who are going through the same struggles and trying to figure this out on their own as well. It’s a huge, complicated, crowded, and hyped-up space so it’s hard to navigate. The best advice I can offer is to find someone who has done it before and has the battle scars. Learn how to avoid their mistakes and find the path to repeatable, standardized, and safe enterprise AI. Good luck!


If you're interested in trying out what a machine orchestration and version control platform such as Michelangelo could do for your organization (tip: shave off 90% of your model development time and add an audit trail to all experiments) email me at fredu@valohai.com for a private demo!

Fredrik Rönnlund
Software Engineer turned marketing lizard turned product dadbod turned ML nerd. In charge of growth at Valohai, i.e. the co-operation between products, marketing and sales.

Related Posts

Machine Learning Infrastructure Lessons from Netflix

Ville Tuulos, machine learning infrastructure architect, was the first to publicly dissect Netflix’s machine learning infrastructure at QCon in November 2018 in San Francisco. If you haven’t seen the talk yet, read the summary of it here! All the pictures used here are from Ville's presentation. The full talk is 49 minutes long and you can watch it in its entirety on YouTube. From a scattered toolset to a coherent machine learning platform: Ville starts by comparing machine learning infrastructure to an online store and how building one was truly a technical problem twenty years ago. Back then you needed to build the whole online shop yourself, starting from setting up the servers, because the cloud did not exist. New platforms and technologies have since emerged that allow basically anyone to build an online store, and nowadays it is more about knowing the customers than setting up the webshop.

Building Machine Learning Infrastructure at Netflix

In our series of machine learning infrastructure blog posts, we recently featured Uber’s Michelangelo. Today we’re happy to be interviewing Ville Tuulos from Netflix. Ville is a machine learning infrastructure architect at Netflix’s Los Gatos, CA office.

Build vs. Buy – A Scalable Machine Learning Infrastructure

In this blog post we’ll look at what parts a machine learning platform consists of and compare building your own infrastructure from scratch with buying a ready-made service that does everything for you.