Why MapReduce Is Still A Dominant Approach For Large-Scale Machine Learning

By 2014, Google had stopped using MapReduce as its primary big data processing model. Meanwhile, development on Apache Mahout had moved on to more capable and less disk-oriented mechanisms that still incorporated the full map and reduce capabilities. Even so, it was Google's work that led to the development of Hadoop, whose core parallel processing engine is known as MapReduce.

Google introduced this new style of data processing called MapReduce to solve the challenge of large data on the web and manage its processing across large clusters of commodity servers.

What Is MapReduce?

MapReduce is a framework for writing applications that process enormous amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner.

As the name suggests, MapReduce involves the following three steps:

1. Mapping: This step involves sorting and filtering the dataset. Each worker node applies the map function to its local data and writes the output to temporary storage. A master node ensures that only one copy of redundant input data is processed.

2. Shuffling: The worker nodes redistribute data based on the output keys produced by the map function, so that all data belonging to one key is located on the same worker node.

3. Reducing: A calculation is done on the grouped data, with the worker nodes processing each group of output data, per key, in parallel.
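
The three steps above can be sketched in a few lines of plain Python. This is a minimal single-process illustration using word counting, not Hadoop's or Google's actual API; the function names `map_phase`, `shuffle_phase` and `reduce_phase` and the sample data are assumptions made for this example.

```python
from collections import defaultdict


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one worker's local chunk."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle_phase(mapped_outputs):
    """Shuffle: group values by key so that each key ends up in one place."""
    groups = defaultdict(list)
    for pairs in mapped_outputs:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: combine the values collected for each key."""
    return {key: sum(values) for key, values in groups.items()}


# Each string stands in for the data held locally by one worker node.
chunks = ["the quick brown fox", "the lazy dog", "the fox"]
mapped = [map_phase(chunk) for chunk in chunks]
word_counts = reduce_phase(shuffle_phase(mapped))
print(word_counts)
```

In a real cluster each call to `map_phase` would run on a different machine and the shuffle would move data over the network; here everything runs in one process so the data flow is easy to follow.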

These steps are not unique to any one system and can be written in any programming language; the main contribution of MapReduce lay in its implementation. MapReduce is popular primarily because it breaks a job into two steps, map and reduce, and sends the pieces out to multiple servers in a cluster so that they run in parallel.

In a 2013 O’Reilly survey, more than a quarter of data scientists said they used Hadoop regularly, but that share has dropped drastically since then.

The Workflow

MapReduce has the following workflow:

1. Processing: One block is processed by one mapper at a time. In the mapper, a developer can specify their own business logic as per the requirements. In this manner, Map runs on all the nodes of the cluster and processes the data blocks in parallel.

2. Writing to disk: The output of the mapper, also known as intermediate output, is written to the local disk. It is not stored on HDFS because it is temporary data, and writing it to HDFS would create many replicated copies.

3. Copy: The output of the mapper is copied to the reducer node. This entails the physical movement of data, which is done over the network.

4. Merging and sorting: Once all the mappers are finished and their output has been shuffled to the reducer nodes, this intermediate output is merged and sorted. The result is then provided as input to the reduce phase.

5. Reducing: Reduce is the second phase of processing, where the user can specify their own custom business logic as per the requirements. The input to a reducer is provided from all the mappers. The output of the reducer is the final output, which is written to HDFS.
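
The workflow above can be simulated in a single Python process to make the data movement concrete. This is only a sketch under stated assumptions: `NUM_REDUCERS`, `mapper`, `partition` and `run_job` are hypothetical names, lists in memory stand in for local disk and HDFS, and a CRC32 hash of the key is just one possible way of deciding which reducer receives which key.

```python
import heapq
import zlib
from itertools import groupby
from operator import itemgetter

NUM_REDUCERS = 2  # hypothetical number of reducer nodes for this sketch


def mapper(block):
    """Steps 1-2: process one block and return its sorted intermediate output.

    A real mapper would write this output to local disk, not to HDFS.
    """
    return sorted((word, 1) for word in block.split())


def partition(key, num_reducers):
    """Step 3: choose the reducer for a key using a stable hash."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def run_job(blocks, num_reducers=NUM_REDUCERS):
    # One mapper per block; in a cluster these would run in parallel.
    mapper_outputs = [mapper(block) for block in blocks]

    # Copy: every reducer receives its share of every mapper's output.
    reducer_inputs = [[] for _ in range(num_reducers)]
    for output in mapper_outputs:
        shares = [[] for _ in range(num_reducers)]
        for key, value in output:
            shares[partition(key, num_reducers)].append((key, value))
        for reducer_id, share in enumerate(shares):
            reducer_inputs[reducer_id].append(share)

    final_output = {}
    for sorted_runs in reducer_inputs:
        # Step 4: merge the already sorted runs into one sorted stream.
        merged = heapq.merge(*sorted_runs)
        # Step 5: apply the reduce logic (here a sum) to each key group.
        for key, group in groupby(merged, key=itemgetter(0)):
            final_output[key] = sum(value for _, value in group)
    return final_output


blocks = ["a b a", "b c", "a c c"]
result = run_job(blocks)
print(result)
```

Because each mapper sorts its own output, the reducer only needs a merge of sorted runs rather than a full sort, which is the reason merging and sorting appear together as one step.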

MapReduce for Machine Learning

MapReduce has a wide variety of applications in machine learning. It can aid in building systems that learn from data without the need for rigorous and explicit programming. Apart from ML, it is used in distributed searching, distributed sorting and document clustering. Another application is statistical machine translation: because a phrase or a sentence can be translated in more ways than one, the method uses statistics from previous translations to find the best fit.
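
Distributed searching is perhaps the simplest of these to picture. The sketch below is an illustrative assumption about how such a search could be phrased as map and reduce functions, not a description of any particular system: `map_search`, `reduce_search` and the two sample documents are made up for this example, and Python's standard `re` module does the matching.

```python
import re
from collections import defaultdict


def map_search(doc_id, text, pattern):
    """Map: emit (doc_id, line_number) for every line matching the pattern."""
    regex = re.compile(pattern)
    return [
        (doc_id, line_no)
        for line_no, line in enumerate(text.splitlines(), start=1)
        if regex.search(line)
    ]


def reduce_search(mapped_outputs):
    """Reduce: collect the matching line numbers for each document."""
    hits = defaultdict(list)
    for pairs in mapped_outputs:
        for doc_id, line_no in pairs:
            hits[doc_id].append(line_no)
    return dict(hits)


# Each document stands in for a file stored on a different worker node.
documents = {
    "a.txt": "map step\nshuffle step\nreduce step",
    "b.txt": "nothing here\nreduce again",
}
mapped = [map_search(doc_id, text, r"reduce") for doc_id, text in documents.items()]
matches = reduce_search(mapped)
print(matches)
```

The map function does all the filtering locally, so only the matching records travel over the network to the reducer.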

It is also used in data clustering, where it addresses the computational complexity that comes with processing large data. Other applications include distributed pattern-based searching, query processing, fraud detection and user behaviour analysis. Moreover, the MapReduce model has been adapted to several computing environments like multi-core and many-core systems, desktop grids, as well as dynamic and mobile cloud environments.
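
As an illustration of the clustering case, one iteration of k-means can be written as a map step that assigns points to their nearest centroid and a reduce step that averages each cluster. K-means is chosen here only as a familiar example; the text above does not name a specific algorithm, and the functions `nearest`, `map_assign` and `reduce_update` and the sample points are assumptions for this sketch.

```python
from collections import defaultdict


def nearest(point, centroids):
    """Return the index of the centroid closest to the point (squared distance)."""
    return min(
        range(len(centroids)),
        key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centroids[i])),
    )


def map_assign(points, centroids):
    """Map: for each local point emit (cluster_index, (point, 1))."""
    return [(nearest(point, centroids), (point, 1)) for point in points]


def reduce_update(mapped_outputs, centroids):
    """Reduce: sum the points of each cluster and divide by their count."""
    sums = {}
    counts = defaultdict(int)
    for pairs in mapped_outputs:
        for idx, (point, n) in pairs:
            if idx not in sums:
                sums[idx] = [0.0] * len(point)
            sums[idx] = [s + x for s, x in zip(sums[idx], point)]
            counts[idx] += n
    # A cluster that received no points keeps its previous centroid.
    return [
        tuple(s / counts[i] for s in sums[i]) if i in sums else centroids[i]
        for i in range(len(centroids))
    ]


# Each inner list stands in for the points stored on one worker node.
partitions = [[(1.0, 1.0), (1.0, 3.0)], [(9.0, 9.0), (11.0, 9.0)]]
centroids = [(0.0, 0.0), (10.0, 10.0)]
mapped = [map_assign(part, centroids) for part in partitions]
new_centroids = reduce_update(mapped, centroids)
print(new_centroids)
```

Each pass over the data is one map and reduce round, so an iterative algorithm like this would be run as a chain of jobs, feeding `new_centroids` back in until the centroids stop moving.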

At Google, MapReduce was used to completely regenerate Google’s index of the World Wide Web. Many high-level query languages like Hive and Pig use MapReduce as a base, which makes the MapReduce framework an obvious choice and more approachable for traditional database programmers.

Outlook

Although MapReduce is a restricted programming framework, because its tasks must be written as acyclic dataflow programs, it stands as a strong solution for many tasks in the domain of data science. It is likely to keep bringing new and more accessible programming techniques for working on massive data stores with both structured and unstructured data, and it has great potential for added features in the coming years. Every parallel technology makes claims about scalability; MapReduce and its implementations have demonstrated genuine scalability so far, and will hopefully continue to do so.

Disha Misal

Found a way to Data Science and AI through her fascination for Technology. Likes to read, watch football and has an enormous amount of affection for Astrophysics.