By 2014, Google had stopped using MapReduce as its primary big data processing model. Around the same time, development on Apache Mahout moved on to more capable, less disk-oriented mechanisms that still incorporated full map and reduce capabilities. Yet it was Google's own work that led to the development of Hadoop, whose core parallel processing engine is known as MapReduce.
Google introduced MapReduce as a new style of data processing to handle the challenge of very large web datasets and to manage their processing across large clusters of commodity servers.
What Is MapReduce?
MapReduce is a framework for writing applications that process enormous amounts of data in parallel on large clusters of commodity hardware in a reliable, fault-tolerant manner.
As the name suggests, MapReduce is built around a map step and a reduce step, with a shuffle step connecting the two:
1.Mapping: It involves sorting and filtering the dataset. In this step, each worker node applies the map function to the local data and writes the output to temporary storage. A master node ensures that only one copy of redundant input data is processed.
2.Shuffling: The worker nodes redistribute data based on the output keys produced by the map function, so that all data belonging to one key is located on the same worker node.
3.Reducing: In this step, the worker nodes process each group of output data in parallel, performing a calculation (such as a sum or a count) on each group.
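The three steps above can be sketched in a few lines of single-process Python. This is only an illustration of the model, not how Hadoop or Google implement it: the word-count task and the function names (map_words, shuffle, reduce_counts, word_count) are assumptions made for this example, and the "worker nodes" are simulated by ordinary function calls.

```python
from collections import defaultdict

def map_words(line):
    """Map step: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]

def shuffle(pairs):
    """Shuffle step: bring all values that share a key together."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_counts(key, values):
    """Reduce step: collapse the values for one key into a single result."""
    return key, sum(values)

def word_count(lines):
    # Each line plays the role of the local data held by one worker.
    mapped = [pair for line in lines for pair in map_words(line)]
    grouped = shuffle(mapped)
    return dict(reduce_counts(k, v) for k, v in grouped.items())

counts = word_count(["the quick brown fox", "the lazy dog", "the fox"])
print(counts)
```

In a real cluster the map calls and the reduce calls would run on different machines; the structure of the program stays the same.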
These steps are not unique to MapReduce, and map and reduce operations can be written in any programming language; the real contribution of MapReduce lay in its implementation. MapReduce became popular primarily because it breaks a job into two steps and sends the pieces out to multiple servers in a cluster, where they run in parallel.
In a 2013 O’Reilly survey, more than a quarter of data scientists said they used Hadoop regularly, but that share has dropped sharply since then.
The Workflow
MapReduce has the following workflow:
1.Processing: One block is processed by one mapper at a time. In the mapper, a developer can specify their own business logic as required. In this manner, Map runs on all the nodes of the cluster and processes the data blocks in parallel.
2.Writing to disk: The output of a mapper, also known as intermediate output, is written to the local disk. It is not stored on HDFS because it is temporary data, and writing it to HDFS would create many unnecessary replicated copies.
3.Copy: The output of each mapper is copied to the reducer node. This entails physical movement of data over the network.
4.Merging and sorting: Once all the mappers have finished and their output has been shuffled to the reducer nodes, the intermediate output is merged and sorted, and then provided as input to the reduce phase.
5.Reducing: Reduce is the second phase of processing, where the user can again specify custom business logic as required. The input to a reducer comes from all the mappers. The output of the reducer is the final output, which is written to HDFS.
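The workflow above can be traced in a small simulation. The sketch below is a simplified, in-memory stand-in, not Hadoop code: NUM_REDUCERS, partition, run_mapper and run_reducer are hypothetical names chosen for this example, Python lists stand in for the local disk, the network and HDFS, and the hash-based choice of reducer is an assumption about how keys are assigned.

```python
import heapq
import zlib
from itertools import groupby
from operator import itemgetter

NUM_REDUCERS = 2  # hypothetical number of reducer nodes for this sketch

def partition(key):
    """Choose the reducer for a key; crc32 gives the same answer on every run."""
    return zlib.crc32(key.encode("utf-8")) % NUM_REDUCERS

def run_mapper(block):
    """Steps 1-2: process one block and keep sorted intermediate output locally."""
    local = [[] for _ in range(NUM_REDUCERS)]  # stands in for the local disk
    for line in block:
        for word in line.split():
            local[partition(word)].append((word, 1))
    return [sorted(part) for part in local]

def run_reducer(reducer_id, mapper_outputs):
    """Steps 3-5: copy this reducer's share, merge the sorted runs, then reduce."""
    copied = [out[reducer_id] for out in mapper_outputs]  # the "network" copy
    merged = heapq.merge(*copied)                         # merge and sort
    return [(key, sum(count for _, count in group))
            for key, group in groupby(merged, key=itemgetter(0))]

blocks = [["a b a"], ["b c"], ["a c c"]]          # one block per mapper
mapper_outputs = [run_mapper(block) for block in blocks]
final_output = [run_reducer(r, mapper_outputs) for r in range(NUM_REDUCERS)]
print(final_output)
```

Because every mapper sorts its own output before it is copied, each reducer only has to merge already-sorted runs, which is why the merge step can stream through the data rather than sort it from scratch.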
MapReduce for Machine Learning
MapReduce has a wide variety of applications in machine learning. It can help build systems that learn from data without the need for rigorous, explicit programming. Apart from ML, it is used in distributed searching, distributed sorting and document clustering. Another application is statistical machine translation: because a phrase or a sentence can be translated in more than one way, the method uses statistics from previous translations to find the best fit.
It is also used in data clustering, where it helps manage the computational cost of processing large datasets. Other applications include distributed pattern-based searching, query processing, fraud detection and user behaviour analysis. Moreover, the MapReduce model has been adapted to several computing environments, such as multi-core and many-core systems, desktop grids, and dynamic and mobile cloud environments.
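To make the clustering case concrete, here is a minimal sketch of how one iteration of a clustering algorithm can be phrased as a map step and a reduce step. The choice of k-means is an assumption made for illustration, since the text does not name a specific algorithm, and the helper names (map_assign, reduce_mean, kmeans_step) and the one-dimensional sample points are hypothetical.

```python
from collections import defaultdict

def map_assign(point, centroids):
    """Map: emit (index of the nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)), key=lambda i: abs(point - centroids[i]))
    return nearest, (point, 1)

def reduce_mean(values):
    """Reduce: average all the points that were assigned to one centroid."""
    total = sum(point for point, _ in values)
    count = sum(n for _, n in values)
    return total / count

def kmeans_step(points, centroids):
    """One k-means iteration expressed as map, shuffle and reduce."""
    groups = defaultdict(list)  # shuffle: group emitted values by centroid index
    for point in points:
        key, value = map_assign(point, centroids)
        groups[key].append(value)
    # A centroid with no assigned points keeps its old position.
    return [reduce_mean(groups[i]) if groups[i] else c
            for i, c in enumerate(centroids)]

new_centroids = kmeans_step([1.0, 2.0, 3.0, 10.0, 11.0, 12.0], [0.0, 9.0])
print(new_centroids)
```

The map step touches each point independently, so it can be spread across as many workers as there are data blocks; only the small per-centroid sums have to travel to the reducers.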
At Google, MapReduce was used to completely regenerate Google’s index of the World Wide Web. Many high-level query languages, such as Hive and Pig, use MapReduce as a base, which makes the MapReduce framework more approachable, and an obvious choice, for traditional database programmers.
Outlook
Although MapReduce is a restricted programming framework, because its tasks must be written as acyclic dataflow programs, it remains a strong solution for many tasks in data science. It promises to bring new and more accessible programming techniques for working with massive data stores that hold both structured and unstructured data, and it has considerable potential for added features in the coming years. Every parallel technology makes claims about scalability; MapReduce and its various implementations have delivered genuine scalability so far, and will hopefully continue to do so.