Valohai blog

Insights from the deep learning industry.

Identify relevant text from complex documents

Selko.io builds solutions for multi-disciplinary project teams working in large companies. These teams work from project documents that usually run to several hundred pages. Finding the sections relevant to each team member is a real burden in a project-based working environment.

The normal workflow for project teams is to go through the project documents manually at the beginning of each project and allocate parts of the text to the respective teams. With Selko’s solution, a machine learning model instead classifies sections of the text automatically into pre-defined categories. For example, certain parts of the text are highlighted and marked as relevant to software engineers.

Selko's service identifying text sections

Multilabel text classification with domain knowledge

Selko is working with a couple of different pre-trained models – for example, open-sourced projects from Fast.ai and Hugging Face – and they use transfer learning to customize the models for their customers’ use cases.

To build a customer-specific model, we chop off the prediction part from a pre-trained language model and replace it with a feedforward network for classifying the texts. The classification layer includes the labels that our customer needs, and then we retrain the classifier.
-Aditya Jitta, Senior Data Scientist
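
The idea in the quote above can be illustrated with a minimal, standard-library-only sketch. It is not Selko's implementation: `frozen_encoder` is a hypothetical stand-in for the body of a pre-trained language model (a real setup would use a Fast.ai or Hugging Face model), and `MultilabelHead`, the label names, and the training sentences are invented for illustration. What it shows is the general pattern: the pre-trained part stays fixed, and only a new classification layer with one sigmoid output per customer label is trained.

```python
import math
import zlib

DIM = 32  # size of the frozen feature vector (arbitrary for this sketch)


def frozen_encoder(text):
    """Hypothetical stand-in for the body of a pre-trained language model.

    Maps text to a fixed-size, unit-length vector. It is never updated.
    """
    vec = [0.0] * DIM
    for token in text.lower().split():
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


class MultilabelHead:
    """New classification layer: one independent sigmoid output per label."""

    def __init__(self, labels):
        self.labels = list(labels)
        self.weights = [[0.0] * DIM for _ in self.labels]
        self.biases = [0.0] * len(self.labels)

    def probabilities(self, features):
        return [
            sigmoid(sum(w * x for w, x in zip(row, features)) + b)
            for row, b in zip(self.weights, self.biases)
        ]

    def loss(self, batch):
        """Mean binary cross-entropy over all examples and labels."""
        total = 0.0
        for features, targets in batch:
            for p, y in zip(self.probabilities(features), targets):
                p = min(max(p, 1e-12), 1.0 - 1e-12)
                total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
        return total / (len(batch) * len(self.labels))

    def train_step(self, batch, lr=1.0):
        """One full-batch gradient-descent step; only the head is updated."""
        grad_w = [[0.0] * DIM for _ in self.labels]
        grad_b = [0.0] * len(self.labels)
        for features, targets in batch:
            probs = self.probabilities(features)
            for k, (p, y) in enumerate(zip(probs, targets)):
                err = p - y
                grad_b[k] += err
                for j, x in enumerate(features):
                    grad_w[k][j] += err * x
        scale = lr / len(batch)
        for k in range(len(self.labels)):
            self.biases[k] -= scale * grad_b[k]
            for j in range(DIM):
                self.weights[k][j] -= scale * grad_w[k][j]


# Invented example labels and sentences; a section may carry several labels.
LABELS = ["software", "legal"]
TRAINING_TEXTS = [
    ("the api must expose a rest endpoint", [1, 0]),
    ("the supplier is liable for delays", [0, 1]),
    ("source code licensing terms and liability", [1, 1]),
    ("the site office opens at eight", [0, 0]),
    ("the database schema needs migration scripts", [1, 0]),
]

batch = [(frozen_encoder(text), targets) for text, targets in TRAINING_TEXTS]
head = MultilabelHead(LABELS)
initial_loss = head.loss(batch)
for _ in range(300):
    head.train_step(batch)
final_loss = head.loss(batch)
```

Using a sigmoid per label rather than a single softmax is what makes the head multilabel: each category is decided independently, so one section of text can be relevant to several teams at once.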

The needs and the labels for the classification layer come from Selko’s customers. The number of labels varies from just a couple to tens, and some customers even need hierarchical label structures. Customers choose the labels they need in the user interface, and based on that choice, Selko knows which model to use for the specific case.

Selko UI ❤️ Valohai API

After choosing the desired labels in Selko’s tool, the user uploads text files for training the model, containing approximately 200 sentences for each desired category. Under the hood, the user interface calls the Valohai API to start the training with the user-defined labels and data set, and with a suitable model that is chosen automatically based on the need. Valohai then runs the training on Selko’s AWS cloud instances with automatic versioning. This way, the model is ready for the inference stage without any intervention from a DevOps specialist.
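
To make the hand-off from the user interface to the training run more concrete, here is a rough sketch of how a backend might assemble such a request. Everything in it is an assumption for illustration: the endpoint URL, the payload field names, the token header format, and the helper `build_training_request` are not taken from Selko's code and should be checked against the Valohai API documentation. The request is only built, not sent, so the sketch runs offline.

```python
import json
import urllib.request

# Assumed endpoint; confirm against the Valohai API documentation before use.
EXECUTIONS_URL = "https://app.valohai.com/api/v0/executions/"


def build_training_request(token, project_id, commit, step, labels, dataset_url):
    """Assemble (but do not send) a request that would start a training run.

    All payload field names below are assumptions made for this sketch.
    """
    payload = {
        "project": project_id,
        "commit": commit,
        "step": step,  # the training step for the automatically chosen model
        "parameters": {"labels": ",".join(labels)},
        "inputs": {"training-data": [dataset_url]},
    }
    return urllib.request.Request(
        EXECUTIONS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Token {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values; none of these identifiers are real.
request = build_training_request(
    token="example-token",
    project_id="example-project-id",
    commit="main",
    step="train-classifier",
    labels=["software", "legal"],
    dataset_url="s3://example-bucket/training/sentences.csv",
)
# urllib.request.urlopen(request) would submit it; omitted to stay offline.
```

Keeping the request construction separate from the network call makes the payload easy to inspect and test before anything is launched on cloud instances.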

When the user uploads the actual project documents containing the text to be categorized, the inference step is run via the Valohai API, and the user gets a categorized document back in Selko’s tool. Valohai thus works as an orchestration layer between Selko’s user interface and the lower-level architecture.

We, a team of two full-stack developers and one data scientist, took up the job to build ourselves a complete machine learning orchestration system. We faced a lot of hurdles to reach a fully functional and working system. Since time was definitely a constraint we decided that we should concentrate on our customers' needs and let Valohai take care of the ML infrastructure.

-Aditya Jitta, Senior Data Scientist

Read how Selko's technical team describes their initial steps with machine learning infrastructure and how they ended up using Valohai.

Future for Selko

Because the field is advancing rapidly, Selko's data scientists continuously explore different models to find out whether new technologies would suit their customers’ needs. With the Valohai platform, all of their experiments are stored automatically, and the findings are easy to share across the whole team.

Aditya says that, in the future, they plan to move towards active learning, in which the machine learning algorithm asks the user whether the model has predicted the right label for a section of text. They also intend to use unsupervised learning to identify similarities across different documents.
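
One common way to decide which sections to ask the user about is uncertainty sampling: query the predictions the model is least sure of. The sketch below illustrates that general technique only. The post does not say which active learning strategy Selko plans to use, and the function names, section ids, and probabilities are invented for illustration.

```python
def uncertainty(probabilities):
    """Score in [0, 1]: how close the most ambiguous label is to 0.5."""
    return max(1.0 - 2.0 * abs(p - 0.5) for p in probabilities)


def select_queries(predictions, budget):
    """Return the ids of the `budget` sections the model is least sure about."""
    ranked = sorted(
        predictions,
        key=lambda section_id: uncertainty(predictions[section_id]),
        reverse=True,
    )
    return ranked[:budget]


# Invented per-label probabilities for four document sections.
predictions = {
    "section-1": [0.97, 0.02],
    "section-2": [0.55, 0.10],
    "section-3": [0.90, 0.48],
    "section-4": [0.05, 0.99],
}

to_review = select_queries(predictions, budget=2)
print(to_review)  # ['section-3', 'section-2']
```

The user's answers for the selected sections would then be added to the training data, so each round of feedback is spent where the model benefits most.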

Joanna Purosto
Technology-oriented marketer training a model to recognize sequences in my golf swing.
