San Francisco-headquartered Databricks, which provides a unified analytics platform, has released MLflow, a new open source project that aims to standardize the complex processes machine learning engineers face when building, testing, and deploying machine learning models. Announcing the release, CTO Matei Zaharia, who is also the creator of Apache Spark, noted that even though a number of open source tools cover each phase of the machine learning lifecycle, such as data preparation and model training, it is still hard to track experiments and reproduce results.
In a keynote address, Zaharia observed that the machine learning development lifecycle is highly complex and that developers face many issues that are usually absent from a traditional software development lifecycle.
Zaharia Listed A Few Pain Points Developers Face In Building ML Models:
- The number of tools has grown: Zaharia said that, unlike traditional software development, where teams select one tool for each phase, machine learning engineers end up testing every available algorithm to see whether it improves results. As a result, developers use dozens of libraries.
- Reproducing results: Retracing the steps of a machine learning workflow to reproduce a result is extremely difficult. For example, if you have to debug a problem, it can be hard to go back to past work.
- Productionizing ML models: A big challenge developers face is moving a model to production, because there is no standard way to move models from a library to any of the available deployment tools, Zaharia emphasises in the post.
MLflow Open Source Project Provides A Standardized Format For Training & Deployment
MLflow, currently in alpha, manages the entire machine learning lifecycle and allows developers to work with any machine learning library. It offers three components: MLflow Tracking, to record and query experiments; MLflow Projects, a standardized format for packaging reusable code; and MLflow Models, a format for packaging models for deployment. Speaking on the sidelines of the release at Databricks’ Spark and AI Summit in San Francisco, Zaharia observed that MLflow standardises the training and deployment loop. “As long as developers work within the platform, if you are building models with these tools, you can deploy and productionize it thereby saving a lot of time,” he said.
Since MLflow is an open source platform, developers from across the globe can contribute to it and, if they choose to open source their code, share workflows and ML models. The platform’s open interface is a key feature here: it is built around REST APIs and simple data formats, instead of relying on a small set of built-in functionality. This means developers can easily add MLflow to their existing ML code and share code that uses any ML library in a form that others in the company can run.
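To make the idea of recording and querying experiments in simple data formats concrete, here is a minimal sketch in plain Python. It is not MLflow's API: the function names `log_run`, `query_runs`, and `demo`, the one-JSON-file-per-run layout, and the numbers are all assumptions made up for illustration.

```python
import json
import tempfile
import uuid
from pathlib import Path


def log_run(store: Path, params: dict, metrics: dict) -> str:
    """Write one experiment run to its own JSON file and return the run id.

    Hypothetical helper for illustration only, not part of MLflow.
    """
    run_id = uuid.uuid4().hex
    record = {"run_id": run_id, "params": params, "metrics": metrics}
    (store / f"{run_id}.json").write_text(json.dumps(record, indent=2))
    return run_id


def query_runs(store: Path, metric: str) -> list:
    """Load every recorded run, sorted so the lowest value of `metric` comes first."""
    runs = [json.loads(path.read_text()) for path in store.glob("*.json")]
    return sorted(runs, key=lambda run: run["metrics"][metric])


def demo() -> dict:
    """Record two runs with made-up numbers and return the one with the lowest error."""
    with tempfile.TemporaryDirectory() as tmp:
        store = Path(tmp)
        log_run(store, {"learning_rate": 0.1}, {"rmse": 0.82})
        log_run(store, {"learning_rate": 0.01}, {"rmse": 0.74})
        return query_runs(store, "rmse")[0]


if __name__ == "__main__":
    best = demo()
    print(best["params"], best["metrics"])
```

Because each run is just a small JSON record of parameters and metrics, any library or language that can read JSON could compare runs or retrace which settings produced a given result, which is the kind of library-agnostic sharing the article describes.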
Need For Standardized Open Source ML Platform
Besides open source ML libraries such as Keras and Theano, companies have developed internal ML platforms to manage the development lifecycle. For example, last year Uber Engineering introduced Michelangelo, a machine learning-as-a-service system for building and deploying models; Facebook developed FBLearner; and Google has TFX, an end-to-end general-purpose machine learning platform introduced last year. Google has already open sourced some TFX libraries. According to Zaharia, most of these platforms support only a small set of built-in algorithms or a single ML library, and are tied to each company’s infrastructure. This means developers cannot easily use other machine learning libraries.
The platform is currently offered in a hosted version, but if it takes off, it could help startups and companies consolidate their ML workflows and be a big hit with businesses. However, it faces stiff competition from TensorFlow, which, thanks to Google’s backing, is set to become the industry standard for machine learning researchers and developers. TensorFlow is backed by Jeff Dean, receives continued support from the tech giant, and is used in day-to-day operations. It also provides a visualization tool called TensorBoard, which most frameworks lack. ML practitioners also point out that the recent version of TensorFlow adds a new feature called eager execution. Databricks’ MLflow project is currently hosted on GitHub and also integrates with the company’s Unified Analytics Platform.