In the traditional approach to machine learning, programmers use an integrated tool for data mining and then analyse the results. However, this approach may not work if the data is too large to fit in the RAM of a single computer. Most existing ML algorithms assume that the data is easily accessible, and they may read the same data many times.
Scalability Led To the Rise of Distributed ML
It was the challenge of handling large-scale data, and the need for learning algorithms that are scalable and efficient in their use of computational and memory resources, that gave rise to distributed ML. For example, if an algorithm's memory requirements exceed the available main memory, it will not scale well: it will either be unable to process the full training data set or fail to run at all. Distributed ML algorithms emerged to handle very large data sets with algorithms that are efficient and scalable with regard to accuracy and to computational requirements (memory, time and communication).
Distributed ML algorithms are part of large-scale learning, which has received considerable attention over the last few years thanks to its ability to spread the learning process across several workstations, using distributed computing to scale up learning algorithms. These advances make ML tasks on big data scalable, flexible and efficient. The distributed nature of these datasets leads to the two most common types of data fragmentation, and a distributed learning algorithm has to work with whichever one applies:
- Horizontal fragmentation where subsets of instances are stored at different sites
- Vertical fragmentation where subsets of attributes of instances are stored at different sites
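The two kinds of fragmentation above can be illustrated with a small sketch. The dataset, the attribute names (`age`, `bp`, `label`) and both helper functions are hypothetical and exist only for illustration; a real system would partition data in its storage layer rather than in Python lists.

```python
# Toy dataset: each dict is one instance, each key is one attribute.
records = [
    {"id": 1, "age": 34, "bp": 120, "label": 0},
    {"id": 2, "age": 51, "bp": 140, "label": 1},
    {"id": 3, "age": 29, "bp": 115, "label": 0},
    {"id": 4, "age": 62, "bp": 150, "label": 1},
]

def horizontal_fragments(rows, n_sites):
    """Split instances across sites; every site keeps all attributes."""
    return [rows[i::n_sites] for i in range(n_sites)]

def vertical_fragments(rows, attribute_groups, key="id"):
    """Split attributes across sites; every site keeps all instances.

    The shared key is kept at each site so rows can be re-joined later.
    """
    return [
        [{k: row[k] for k in (key, *group)} for row in rows]
        for group in attribute_groups
    ]

# Two sites, each holding half of the instances with all attributes.
h_sites = horizontal_fragments(records, n_sites=2)

# Two sites, each holding all instances but only some attributes.
v_sites = vertical_fragments(records, [("age",), ("bp", "label")])
```

Here `h_sites[0]` holds complete rows for instances 1 and 3, while `v_sites[0]` holds only `id` and `age` for all four instances.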
Distributed ML algorithms are most commonly deployed in fields such as healthcare and advertising, where even a simple application can accumulate a lot of data. Because the data is so large, programmers frequently re-train models on it and use parallel loading so as not to interrupt the workflow. For example, MapReduce was built to allow automatic parallelisation and distribution of large-scale special-purpose computations that process large amounts of raw data, such as crawled documents or web request logs, and compute various kinds of derived data.
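To make the MapReduce idea concrete, here is a minimal single-process sketch of the map, shuffle and reduce phases applied to web request logs. It is not the API of any real MapReduce framework: the log lines and the function names `map_phase`, `shuffle` and `reduce_phase` are assumptions made for illustration, and a real system would run the map calls in parallel on different machines.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical request log, already split into chunks as if stored on two workers.
log_chunks = [
    ["GET /home", "GET /about", "GET /home"],
    ["GET /home", "POST /login", "GET /about"],
]

def map_phase(chunk):
    """Emit a (path, 1) pair for every request line in one chunk."""
    return [(line.split()[1], 1) for line in chunk]

def shuffle(pairs):
    """Group emitted values by key, as a framework would between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    """Combine the values for each key into one derived result."""
    return {key: sum(values) for key, values in grouped.items()}

# Each chunk is mapped independently, which is what makes the step parallelisable.
mapped = chain.from_iterable(map_phase(chunk) for chunk in log_chunks)
hits_per_path = reduce_phase(shuffle(mapped))
```

The derived data here is a hit count per path. Because `map_phase` only ever looks at one chunk, the same logic would still hold if each chunk lived on a different machine.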
Several Distributed ML Platforms Are New
Two of the most widely used distributed data processing systems for ML workloads are Apache Spark MLlib and Apache Mahout. Microsoft has also released its Distributed ML Toolkit (DMTK), which contains both algorithmic and system innovations. The DMTK framework supports a unified interface for data parallelisation, a hybrid data structure for big model storage, model scheduling for big model training, and automatic pipelining for high training efficiency. Together, system innovations and ML innovations are pushing the frontiers of distributed ML.
Single Machine vs Distributed ML
- Experts emphasise that traditional ML approaches are designed to address the dataset at hand, which implies central processing of data in a single database. However, this is usually not feasible, because storing one large dataset in a single place costs more than storing it in smaller parts. Likewise, the computational cost of mining a single data repository or database is higher than that of processing smaller parts of the data.
- As opposed to a centralised approach, a distributed mining approach enables parallel processing. Distributed learning algorithms also have their foundations in ensemble learning, which builds a set of classifiers to improve on the accuracy of a single classifier. The ensemble approach fits naturally with a distributed environment, since each classifier can be trained on-site on the subset of data stored there.
- Distributed learning also provides the best solution to large-scale learning, where memory limitations and algorithm complexity are the main obstacles. Besides overcoming the problem of centralised storage, distributed learning is scalable, since growth in data can be offset by adding more processors.
- Experts also predict that, in the future, data analytics will be done primarily in a distributed environment.
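The link between ensemble learning and distributed data described above can be sketched as follows: one classifier is trained at each site on its local subset, and predictions are combined by majority vote. The nearest-centroid classifier, the sample data and all function names are assumptions chosen to keep the example small; the same pattern applies to any base classifier.

```python
from collections import Counter

def train_centroids(samples):
    """Fit a nearest-centroid classifier: one mean feature vector per label."""
    sums, counts = {}, Counter()
    for features, label in samples:
        acc = sums.setdefault(label, [0.0] * len(features))
        for i, x in enumerate(features):
            acc[i] += x
        counts[label] += 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict(centroids, features):
    """Return the label whose centroid is closest (squared Euclidean distance)."""
    def dist(centroid):
        return sum((a - b) ** 2 for a, b in zip(centroid, features))
    return min(centroids, key=lambda label: dist(centroids[label]))

def ensemble_predict(models, features):
    """Majority vote over the locally trained models."""
    votes = Counter(predict(model, features) for model in models)
    return votes.most_common(1)[0][0]

# Hypothetical horizontal fragments: each site holds a different subset of instances.
site_data = [
    [((1.0, 1.2), "a"), ((4.0, 4.1), "b")],
    [((0.8, 1.0), "a"), ((3.9, 4.3), "b")],
    [((1.1, 0.9), "a"), ((4.2, 3.8), "b")],
]

# Each model only ever sees its own site's data; raw data is never centralised.
models = [train_centroids(samples) for samples in site_data]
label = ensemble_predict(models, (1.0, 1.0))
```

Only the small per-site models need to be shared for prediction, which is one reason the ensemble approach suits data that cannot be moved to a central store.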
Disadvantages of Distributed ML Algorithms
Unfortunately, writing and running a distributed ML algorithm is highly complicated, and developing distributed ML packages is difficult because of platform dependency. In addition, there are no standardised measures for evaluating distributed algorithms. Many ML researchers say that existing measures, which are benchmarked against classical ML methods, are less reliable in this setting.
But one thing is clear: the practice of ML, which has so far concentrated on monolithic data sets from which learning algorithms generate a single model, is steadily giving way to distributed learning algorithms. The rise of big data and IoT has produced many distributed data sets, and storing these big datasets in a central repository imposes huge processing and computing requirements. That is why researchers assert that distributed processing of data is the right computing platform.