A software engineer’s view of data science.
If developers used to be the rock stars of the dotcom era, Data Scientists are quickly overtaking them as the new Whitesnake cover bands of the 2020s. Although both might be sporting the same hobo beards, Data Scientists are getting their work done with just sticks and stones as their tools while we Software Engineers have every tool in the universe.
Data Science Tools
Whether you accept our new overlords or not, there’s one thing you, as fellow software engineers, will agree with me on. Data Scientists may be building the next Tower of Babel for all we know, but they’re stuck with steam engines and pitchforks as their tools.
Where we software engineers couldn’t live without version control for our code, the reality for data scientists is too often manual bookkeeping of experiment data, model algorithms, testing environments and training parameters in an Excel file tossed around over Slack to other team members. It’s like ClearCase and Lotus Notes had an illegitimate baby.
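To make the bookkeeping problem concrete, here is a minimal sketch of what even a homemade replacement for that Excel file could look like: every training run gets appended to a log file together with a fingerprint of the training data. Everything in it is an assumption for illustration (the names `log_run` and `load_runs`, the JSON Lines log format, the fields recorded); it is not how Valohai or any other tool does it.

```python
import hashlib
import json
import time
from pathlib import Path


def file_sha256(path):
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def log_run(log_path, data_path, params, metrics):
    """Append one experiment record to a JSON Lines log and return it.

    Hypothetical helper: the record layout is an assumption, not a standard.
    """
    record = {
        "timestamp": time.time(),
        "data_sha256": file_sha256(data_path),  # which data was used
        "params": params,                       # training parameters
        "metrics": metrics,                     # results of the run
    }
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record, sort_keys=True) + "\n")
    return record


def load_runs(log_path):
    """Read every recorded run back as a list of dicts."""
    lines = Path(log_path).read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines if line.strip()]
```

Calling something like `log_run("runs.jsonl", "train.csv", {"lr": 0.01}, {"accuracy": 0.9})` after each training run (the file names and values here are made up) gives you an append-only history you can commit next to your code. It still leaves out the training environment and the code version, which is exactly the part that gets tedious to build by hand.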
When we Software Engineers want to deploy an app to production, we don’t need to build our own servers anymore; we just deploy to the cloud. Data Scientists, on the other hand, too often have to do the equivalent of mining their own ore and dancing a rain dance before training their models on a server. They need to SSH into a server, install the latest NVIDIA drivers and Python dependencies, and set up a cluster of 100 GPUs hosting Docker containers over Kubernetes. I know you’d know how to do it, but the difference is, that’s our job. Wanna guess how relevant this is for getting their job done, and how much more they could get done if they could just double the 10% of their time they spend on actual machine learning algorithms?
And it doesn’t stop at the tools: the same is true for team collaboration and methodologies. Where we software engineers use JIRA and go agile, data scientists need to invent their own ways. In every fricking company and project. Imagine if you had to re-learn everything every time you switched projects! (Oh, wait… I guess you bleeding-edge JS haxxors do know what I mean…)
The ML infrastructure at Facebook, Amazon, Netflix, Google et al.
Most big players in machine learning have seen this problem and started solving it for themselves. For example, Uber has built their own Michelangelo toolset for doing version control and server management. Likewise, Airbnb has Bighead, Netflix has something of their own, as do Google and Facebook. These are all in-house proprietary toolsets to make sure their data scientists’ time is not wasted on pipeline orchestration and management tools.
Not every company, however, has the muscle to invest in ML orchestration in the same way, and many large corporations still fail to see the big impact it will have on business. “Fat pigs become lazy”, as they say… As a result, these early movers have a HYUUUUGE advantage over the rest, not to mention how impossible it will be for startups to get started when most of their time goes into building tools for themselves.
Data science is not a one-off thing: the world changes, data changes and you constantly need to retrain your model. Unlike in software engineering, a model is not a simple if-clause somebody wrote in a stored procedure in 1979 that still runs on your DB2; it is something you have to keep retraining. And the further behind you start, the harder it becomes to catch up with the competition.
The knight on the white horse
There are, however, solutions that don’t involve inventing it yourself. For example, at Valohai we’re in the business of building the same tools for the masses that the big players have built in-house for themselves.
Whether you’re a software engineer moving to data science or a data scientist tired of “not-invented-here” discussions, you’ll get automatic version control of everything from training data and training environment to training scripts and algorithms.
You could of course build it yourself as well, and at the end of the day you already know what to build: version control, server orchestration and support for ML frameworks and distributed learning. The important thing is that you’ll have something to help you concentrate on your actual work: building predictive models.
Valohai, on the other hand, is free for all research purposes and super easy to get started with.