As machine learning (ML) datasets grow faster than ever, they are also becoming more complex. This calls for a flexible data framework that can support both the increasing volume and the complexity of these datasets. Beyond addressing these concerns, the framework should also include ML functionality, so that large datasets can be handled efficiently. Apache Spark, the popular open-source cluster-computing engine, offers a library called MLlib which supports ML applications and features.
What Is MLlib?
MLlib is Spark’s distributed ML library, which uses data parallelism to store and work with data. The library contains implementations of standard ML techniques such as classification, regression, clustering and decomposition, among others. It also provides linear algebra and optimisation routines, which are essential building blocks for ML algorithms. Written primarily in Scala, MLlib also offers application programming interfaces (APIs) in Python, R and Java.
It runs on a host of cluster computing and big data platforms such as Apache Hadoop and Amazon EC2, as well as on container technologies such as Kubernetes. Furthermore, MLlib can work with data held in a variety of database management systems such as Apache Cassandra and Apache HBase, among others.
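To make the idea of data parallelism concrete, here is a minimal sketch in plain Python. It does not use Spark or MLlib. It only assumes that the data is split into partitions, that each partition computes its own share of a gradient, and that the shares are then summed. The function names (partition, partial_gradient, fit_line) and the toy data are hypothetical and chosen purely for illustration.

```python
from functools import reduce


def partition(rows, num_partitions):
    """Split rows into roughly equal chunks, one per hypothetical worker."""
    return [rows[i::num_partitions] for i in range(num_partitions)]


def partial_gradient(rows, w, b):
    """Squared-error gradient contribution of one partition for y ~ w*x + b."""
    gw = sum((w * x + b - y) * x for x, y in rows)
    gb = sum((w * x + b - y) for x, y in rows)
    return gw, gb, len(rows)


def combine(a, b):
    """Merge two partial results by adding them component-wise."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]


def fit_line(rows, num_partitions=4, lr=0.05, steps=2000):
    """Gradient descent where every step sums per-partition gradients."""
    parts = partition(rows, num_partitions)
    w, b = 0.0, 0.0
    for _ in range(steps):
        # "Map": each partition works only on its own rows.
        partials = [partial_gradient(p, w, b) for p in parts]
        # "Reduce": the partial results are combined into one gradient.
        gw, gb, n = reduce(combine, partials)
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b


# Toy data generated from y = 2x + 1 (an assumption for this example).
data = [(float(x), 2.0 * x + 1.0) for x in range(10)]
w, b = fit_line(data)
```

In this sketch the partitions are processed one after another in a single process. On a cluster, the per-partition step is the part that can run on different machines at the same time, since each partition needs only the current values of w and b.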
Why MLlib Was Introduced In Spark
Apache Spark provides a powerful computational environment for ML at large scale, thanks to its distributed architecture. This lets ML models run more quickly and efficiently. On top of this, ML algorithms are iterative by nature, and Spark’s flexibility makes it well suited to running them. Because of these benefits, MLlib was introduced into Spark. It began as part of an ML project called MLbase at the University of California, Berkeley, and was integrated with Spark in 2012. It was made open source under an Apache 2.0 licence in 2013.
What’s more, the MLlib developer community is growing and constantly improving the library with new features. Because the project is open source, many tools with ML capabilities are being brought into the database ecosystem.
APIs And ML
ML deployments usually involve a series of steps such as data preprocessing, feature extraction, fitting a model and validating it. Single-machine libraries such as scikit-learn are not designed to carry out all of these steps in one place at the scale of large, distributed datasets. Constructing a pipeline and processing large datasets is an expensive and cumbersome task. This is where Spark’s MLlib and its API support prove their mettle. MLlib addresses these issues with a package called spark.ml, which reduces the effort of building ML workflows by providing high-level APIs. In addition, users can plug their own customised algorithms into these APIs.
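The steps listed above (preprocessing, feature extraction, fitting a model and validating it) can be chained as stages that share a common fit and transform interface. The sketch below illustrates that pattern in plain Python. It is not the spark.ml API: the class names (DropMissing, Standardise, NearestCentroid, Pipeline) and the toy data are hypothetical and only show how such a pipeline might be put together.

```python
class DropMissing:
    """Preprocessing stage: discard rows whose feature value is missing."""

    def fit(self, rows):
        return self  # nothing to learn for this stage

    def transform(self, rows):
        return [(x, y) for x, y in rows if x is not None]


class Standardise:
    """Feature stage: learn mean and spread on training data, then rescale."""

    def fit(self, rows):
        xs = [x for x, _ in rows]
        self.mean = sum(xs) / len(xs)
        var = sum((x - self.mean) ** 2 for x in xs) / len(xs)
        self.std = var ** 0.5 or 1.0  # avoid dividing by zero
        return self

    def transform(self, rows):
        return [((x - self.mean) / self.std, y) for x, y in rows]


class NearestCentroid:
    """Model stage: predict the label whose average feature value is closest."""

    def fit(self, rows):
        sums = {}
        for x, y in rows:
            s, n = sums.get(y, (0.0, 0))
            sums[y] = (s + x, n + 1)
        self.centroids = {y: s / n for y, (s, n) in sums.items()}
        return self

    def transform(self, rows):
        # Returns (prediction, true_label) pairs so they can be validated.
        def nearest(x):
            return min(self.centroids, key=lambda c: abs(x - self.centroids[c]))
        return [(nearest(x), y) for x, y in rows]


class Pipeline:
    """Runs the stages in order; each stage sees the previous stage's output."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        for stage in self.stages:
            rows = stage.fit(rows).transform(rows)
        return self

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


def accuracy(pairs):
    """Validation step: share of predictions that match the true label."""
    return sum(1 for pred, label in pairs if pred == label) / len(pairs)


# Toy data (assumed for this example): one numeric feature, two labels.
train = [(1.0, "low"), (2.0, "low"), (None, "low"), (8.0, "high"), (9.0, "high")]
held_out = [(1.5, "low"), (8.5, "high"), (None, "high")]

model = Pipeline([DropMissing(), Standardise(), NearestCentroid()]).fit(train)
predictions = model.transform(held_out)
score = accuracy(predictions)
```

The point of the design is that the whole workflow becomes a single object: the statistics learned during fitting (here the mean, spread and centroids) are reused unchanged when new data passes through, so training and validation data are guaranteed to be processed the same way.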
Since ML requires a lot of data to be cleaned and preprocessed before a model can be put to work, Spark’s integration is beneficial. Spark’s computing engine has over 80 operators for manipulating and transforming input data. Other API features, such as support for SQL, the DataFrame abstraction and GraphX, also make data processing and analysis easier. Another feature worth noting is Spark Streaming, which handles live data streams and so enables support for online learning algorithms that MLlib can build on.
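Online learning means the model is updated a little with every new batch of data instead of being retrained from scratch. As a rough illustration, the plain-Python sketch below folds mini-batches of one-dimensional points into running cluster centres. It does not use Spark Streaming or MLlib. The mini_batches generator is a hypothetical stand-in for a live stream, and the update rule (a weighted average of the old centre and the new points) is an assumption made for this example.

```python
def update_centres(centres, counts, batch):
    """Fold one mini-batch of 1-D points into the running cluster centres."""
    sums = [0.0] * len(centres)
    sizes = [0] * len(centres)
    for x in batch:
        # Assign each new point to the closest current centre.
        k = min(range(len(centres)), key=lambda i: abs(x - centres[i]))
        sums[k] += x
        sizes[k] += 1
    for k in range(len(centres)):
        if sizes[k]:
            total = counts[k] + sizes[k]
            # Weighted average of the old centre and the newly assigned points.
            centres[k] = (centres[k] * counts[k] + sums[k]) / total
            counts[k] = total
    return centres, counts


def mini_batches():
    """Hypothetical stand-in for a live stream: yields small batches in turn."""
    yield [1.0, 9.0]
    yield [2.0, 10.0, 0.0]
    yield [11.0, 1.0]


# Initial guesses for two centres; no points have been seen yet.
centres, counts = [0.0, 5.0], [0, 0]
for batch in mini_batches():
    centres, counts = update_centres(centres, counts, batch)
```

Only the centres and the running counts are kept between batches, so the memory needed does not grow with the length of the stream.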
With so many features, users need well-organised information. To that end, Spark hosts a number of user guides and programming documentation for each feature related to MLlib. There are also regular developer meetups to discuss the latest happenings in MLlib and in ML more broadly. In addition, the Spark community lets users explore various software packages and libraries through Spark Packages, a one-stop place for official and beta packages in the Spark ecosystem.
MLlib also enables faster optimisation in ML algorithms that rely heavily on advanced linear algebra. For a company that already uses Spark for big data, this makes it more practical to put large ML models in the hands of its employees. Even complex decision-tree models can be handled well when MLlib is used with Spark.
MLlib’s performance has been found to be on par with, and in some cases better than, other implementations such as Apache Mahout in benchmarks on Amazon EC2. MLlib was also found to be more scalable than the alternatives, which makes it easier to use on large projects. These comparisons were also repeated across different versions of MLlib.
All of these positive outcomes suggest that MLlib will be used more widely for both developing and deploying ML. As more stable releases arrive, it offers a way to bring well-optimised ML into the big data and analytics space.