It was the changing nature of big data technology and architectural models that wrote the story for Hadoop. Infrastructure architecture moved towards edge computing, IoT and cloud computing, and especially towards containers, where the market is seeing an increase in Kubernetes workloads. As analytical and machine learning workloads grew, so did the need for a unified analytics platform. That is exactly where Spark outperformed Hadoop: in-memory rather than disk-based processing, support for both real-time streaming and batch processing, and a layer for integrating machine learning as well.
As Apache Spark turns 10 years old, let’s look at the strong drivers that led to Spark adoption and what keeps it going. Spark has been dubbed the “in-memory replacement for MapReduce”, the disk-based computational engine at the heart of early Hadoop clusters. Spark took off because it reflects the shift in processing paradigm towards more memory-intensive pipelines: if your cluster has a decent amount of memory, processing in Spark will be faster, and its API is simpler than MapReduce’s. Spark is also fast because the processing time of most operations (including reads) decreases roughly linearly with the number of machines, since the work is distributed across all of them.
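The in-memory versus disk-based contrast can be sketched with a toy model in plain Python. This is not Spark or MapReduce code: `CountingStore`, `iterate_from_disk` and `iterate_in_memory` are hypothetical names made up for illustration, and the sketch only counts how often a multi-pass job goes back to storage, assuming the disk-based style re-reads its input on every pass.

```python
class CountingStore:
    """Hypothetical storage layer that counts how often the dataset is read."""

    def __init__(self, rows):
        self._rows = list(rows)
        self.reads = 0

    def load(self):
        # Every call stands in for one full read of the dataset from disk.
        self.reads += 1
        return list(self._rows)


def iterate_from_disk(store, passes):
    """Disk-based style (assumed): each pass re-reads its input from storage."""
    total = 0
    for _ in range(passes):
        total += sum(store.load())
    return total


def iterate_in_memory(store, passes):
    """In-memory style (assumed): read once, then reuse the cached copy."""
    cached = store.load()
    total = 0
    for _ in range(passes):
        total += sum(cached)
    return total


disk_store = CountingStore(range(1000))
memory_store = CountingStore(range(1000))

disk_result = iterate_from_disk(disk_store, passes=5)
memory_result = iterate_in_memory(memory_store, passes=5)

print(disk_result == memory_result)  # same answer either way
print(disk_store.reads, memory_store.reads)
```

Both functions compute the same result; the difference is only in how many times the data is fetched, which is where multi-pass workloads such as iterative machine learning would be expected to benefit from keeping data in memory.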
Spark is also useful for preprocessing massive datasets, and its machine learning libraries (ML and MLlib) perform well vis-a-vis libraries like LightGBM and XGBoost.
Companies Reinvent Their Business Around Spark
According to one research report, the Apache Spark market is expected to grow at a CAGR of 33.9% during the forecast period of 2018 to 2025. Key Spark vendors include Databricks, Cloudera, Unravel Data, MapR Technologies Inc and Qubole Inc. Over the last few years, these companies and their users have made Spark the leading-edge computing framework, and a recent survey indicates it is also one of the most active open source projects in big data.
As big data companies realised the impact of Spark and usage grew, companies like MapR Technologies and Qubole (a cloud-based big data processing platform) were not far behind in jumping onto the Spark wave and replacing MapReduce with the faster Spark engine. As usage grew around batch ETL, streaming ETL and machine learning workloads in their customer bases, leading big data companies, including IBM, also started adding support for Spark in their products.
According to a recent big data survey from Unravel Data, which is described as a leading APM (Application Performance Management) platform for big data, Spark is the second most deployed big data technology, with 31% of respondents deploying it, against 32% for Hadoop. Spark was also the number one big data technology that IT decision makers planned to deploy for the first time in 2019 (16%), which suggests it will soon eclipse Hadoop in popularity.
Spark is now commonly used in the healthcare, e-commerce, social media and financial sectors. Its support for multiple languages, such as Python, R and Scala, and the multitude of use cases around it are also driving its popularity as the leading big data framework. Another key point is Apache Spark’s thriving community, which has contributed to its sustained success and its applicability in machine learning and AI technologies. Many data science platforms are now built on Spark, and developers and data scientists are collaborating and building solutions on top of it.
Inside The Project That Originated From UC Berkeley
A bit of background on Spark: the project began at UC Berkeley in 2009, was open-sourced in 2010 and was donated to the Apache Software Foundation in 2013. It focuses on parallel processing across clusters. As opposed to Hadoop, the Spark engine works in memory, and data is processed using Resilient Distributed Datasets (RDDs), the basic data structure of Apache Spark, defined as an immutable distributed collection of objects. Over the years, Spark has become a dominant force in the big data world. Analysts believe speed and scale are what made it tick for the analytics world, and by filling the gaps for OLAP (Online Analytical Processing) at scale, Spark achieved the kind of success Hadoop couldn’t. Another reason for its high adoption in the industry is its support for Python, R, Scala and Java.
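To make “an immutable distributed collection of objects” more concrete, here is a toy sketch in plain Python. It is not Spark’s API: `ToyRDD` is a hypothetical class invented for this illustration, the “distribution” is just a tuple of in-process partitions, and the method names `map`, `filter` and `collect` are borrowed only because they mirror the style of RDD operations. The sketch assumes transformations return new objects rather than changing the original, and that nothing is computed until a result is requested.

```python
class ToyRDD:
    """Hypothetical stand-in for an RDD: immutable, partitioned, lazily evaluated."""

    def __init__(self, partitions, ops=(), lineage=("source",)):
        # Partitions are stored as tuples so they cannot be modified in place.
        self._partitions = tuple(tuple(p) for p in partitions)
        # Pending per-partition steps; nothing runs until collect() is called.
        self._ops = tuple(ops)
        # Record of how this dataset was derived from its source.
        self.lineage = tuple(lineage)

    def map(self, fn):
        step = lambda part: (fn(x) for x in part)
        return ToyRDD(self._partitions, self._ops + (step,), self.lineage + ("map",))

    def filter(self, pred):
        step = lambda part: (x for x in part if pred(x))
        return ToyRDD(self._partitions, self._ops + (step,), self.lineage + ("filter",))

    def collect(self):
        # Apply every pending step to each partition independently,
        # then gather the results into a single list.
        out = []
        for part in self._partitions:
            data = part
            for step in self._ops:
                data = step(data)
            out.extend(data)
        return out


numbers = ToyRDD([[1, 2, 3], [4, 5, 6]])  # two partitions
evens_squared = numbers.map(lambda x: x * x).filter(lambda x: x % 2 == 0)

print(evens_squared.collect())  # [4, 16, 36]
print(numbers.collect())        # [1, 2, 3, 4, 5, 6] (original is unchanged)
print(evens_squared.lineage)    # ('source', 'map', 'filter')
```

Because each partition is processed independently in `collect`, the same steps could in principle run on different machines, and the recorded lineage hints at how a lost partition could be recomputed from its source, which is the idea behind the word “resilient”.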
Apache Spark is now being used at scale for productionising machine learning models, and the number of AI use cases is increasing. Last year, San Francisco-based Databricks, co-founded by Apache Spark creator Matei Zaharia, announced MLflow, an open source platform that allows developers to manage the entire machine learning lifecycle, from experimentation through reproducibility to deployment. Essentially, this cloud-agnostic toolkit enables enterprises to package their code and execute it across any hardware platform. The framework also integrates with other open-source ML frameworks like TensorFlow and scikit-learn.