The hype is over, and so is big data, proclaimed Gartner’s 2015 Hype Cycle. While there was a lot of marketing buzz around the big data phenomenon, demand for Hadoop specifically is on the decline. That was 2015, and the perils are far from over: in a recent report, Gartner forecast that in 2017, 60% of big data projects would not go beyond the pilot stage.
Even in one of our recent studies, Analytics & Data Science Leaders Outlook in India, big data has declined as a growth area in the outlook of Indian analytics leaders since last year. In 2016, 1 in 2 leaders saw this space as a growth area. Today, just 1 in 3 analytics leaders in India expect ‘Big Data’ to be a growth area over the next 12 months.
Typically, big data is characterized by three Vs: volume, velocity and variety, though the list of Vs has since grown to include veracity and even value. Statistics suggest that by 2020, about 1.7 megabytes of new information will be created every second for every human being on the planet. Another key projection is that by 2020, our accumulated digital data will be around 44 zettabytes, or 44 trillion gigabytes, up from just 4.4 zettabytes today. That alone is a strong enough case for big data to be of tremendous value, let alone flourish.
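The figures quoted above can be sanity-checked with a little unit arithmetic. The sketch below assumes decimal (SI) prefixes, where one zettabyte is 10^21 bytes and one gigabyte is 10^9 bytes; the helper name is hypothetical and exists only for illustration.

```python
# Unit check for the quoted figures, assuming decimal (SI) prefixes.
BYTES_PER_GB = 10**9
BYTES_PER_ZB = 10**21
SECONDS_PER_DAY = 24 * 60 * 60


def zettabytes_to_gigabytes(zb):
    """Convert zettabytes to gigabytes (hypothetical helper for illustration)."""
    return zb * BYTES_PER_ZB / BYTES_PER_GB


# 44 ZB expressed in gigabytes; one trillion is 10**12.
projected_gb = zettabytes_to_gigabytes(44)
projected_trillion_gb = projected_gb / 10**12

# Growth implied by going from 4.4 ZB to 44 ZB.
growth_factor = 44 / 4.4

# 1.7 MB per second per person, accumulated over one day, in gigabytes.
gb_per_person_per_day = 1.7 * SECONDS_PER_DAY / 1000

print(f"44 ZB is about {projected_trillion_gb:.0f} trillion GB")
print(f"Implied growth: about {growth_factor:.0f}x")
print(f"Per person per day: about {gb_per_person_per_day:.0f} GB")
```

Under these assumptions the conversion agrees with the "44 trillion gigabytes" figure, and the per-second rate works out to roughly 147 GB per person per day.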
Hadoop’s dream of unifying data comes to an end, no longer the best data management architecture
Over the years, organizations have embraced all kinds of data: existing data, historical data, log data, social and transactional data, among other types. According to Bain research, the drive to collect and mine new data sets gained ground with the rise of social media and mobile devices. Even so, many enterprises are grappling with a data deluge and data silos, unable to make use of existing data that cannot be easily accessed, organized, linked or interrogated.
Hadoop is the most widely adopted open source distributed computing platform for big data management. Yet it has not quite lived up to enterprises’ expectations of scale, nor has it become the go-to authority on everything big data. So, is Hadoop irrelevant, and if so, where will all the unstructured data end up? At the recent Strata + Hadoop World 2017 conference, the reigning sentiment among experts was that Hadoop has outlived its role as a data hub.
Even though Hadoop had been marketed as the best data warehouse, Hortonworks CTO Scott Gnau reportedly believes it failed to deliver business value because its SQL repository and engine are inferior to those of traditional EDW vendors such as Teradata.
Pitfalls of Hadoop technology base:
- Hadoop is good at extract, transform and load (ETL), but its SQL-handling features aren’t great.
- Storage-centric technology is not apt for machine learning and other advanced analytics tasks.
- This is where streaming analytics comes into the picture, extracting information from data quickly. Big data management companies such as Cloudera, MapR and Hortonworks have already adopted streaming data pipelines in their core platforms.
- Stream processing systems fulfil the enterprise’s analytics tasks.
Is Kafka the answer to Hadoop?
Though big data technology is by and large synonymous with Hadoop, there is a slew of other open source software out there, including Apache Kafka, Apache Spark and MongoDB. According to reports, the adoption of Apache Kafka, first developed at LinkedIn, is rising significantly. Apache Kafka provides a central streaming platform in which data streams are stored, processed and sent on to any subscribers. Kafka works in tandem with Apache HBase, Apache Storm and Apache Spark for real-time analysis and rendering of streaming data. In fact, according to a Cloudera post, Kafka’s unique attributes make it best suited for integration. Technologists are increasingly promoting Apache Kafka for big data applications, since Hadoop is a complicated technology stack to build on. Beyond scalability, low latency and data partitioning, Apache Kafka can also handle a large number of consumers.
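To make the idea of a central streaming platform more concrete, here is a minimal in-memory sketch of a partitioned, append-only log with independent subscribers. It is a toy model written for this article, not Kafka's actual API: the class `ToyTopic`, its methods and the group names are all hypothetical, and the sketch ignores networking, persistence and replication.

```python
from collections import defaultdict
from zlib import crc32


class ToyTopic:
    """In-memory stand-in for a partitioned, append-only topic (not Kafka's API)."""

    def __init__(self, num_partitions=3):
        self.partitions = [[] for _ in range(num_partitions)]
        # Each subscriber group tracks its own read position per partition.
        self.offsets = defaultdict(lambda: [0] * num_partitions)

    def publish(self, key, value):
        """Append a record; records with the same key land in the same partition."""
        partition = crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[partition].append((key, value))
        return partition

    def poll(self, group):
        """Return records this group has not yet read and advance its offsets."""
        unread = []
        positions = self.offsets[group]
        for i, log in enumerate(self.partitions):
            unread.extend(log[positions[i]:])
            positions[i] = len(log)
        return unread


topic = ToyTopic(num_partitions=3)
topic.publish("user-1", "page_view")
topic.publish("user-2", "click")
topic.publish("user-1", "purchase")

analytics_records = topic.poll("analytics")  # first subscriber reads all three records
archive_records = topic.poll("archive")      # second subscriber reads them independently
nothing_new = topic.poll("analytics")        # nothing published since the last poll
```

The design point is that reading does not remove data: records stay in the log, and each subscriber group only moves its own offset. That is what lets many consumers share one stream, while key-based partitioning keeps the events for a given key in order.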
Big Data, the most hyped technology of 2016
Big data is one of the most bandied-about terms, often used interchangeably with analytics. We had earlier pointed out how big data and analytics are spoken of in the same breath and have recently become almost synonymous.
A recent survey of CIOs indicated that it was the most hyped jargon in India in 2016. While overselling is seen as a necessary evil in marketing, it leads to overexposure. Big data is one of the most popular buzzwords in the media, and the recent rise in interest can lead to unmanaged expectations. According to celebrated statistician Nate Silver, every new technology is supremely hyped to make it more mainstream, “and expectations of that tech skyrocket”.
Eventually the hype crashes into the “trough of disillusionment”, he reportedly said, adding that the hype cycle is at its peak. Silver, who is known for his astute election predictions, also noted that “big tech behemoths like Google and Facebook tend to have an iterative and unstubborn approach in terms of investing in technologies and ideas.”
In the process, small data has been lost in the melee. Experts believe that not every task requires a large amount of data, and that more data does not always add value. Smaller samples can also reveal meaningful insights. For example, small-scale studies such as product testing and car crash tests need not be skewed by large amounts of data.
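The point about small samples can be illustrated with a short simulation. The sketch below uses entirely synthetic data (the population size, mean and spread are assumptions chosen for illustration) and shows that a simple random sample of 1,000 records estimates the average of a much larger data set closely. This holds only when the sample is drawn at random; a biased sample of any size offers no such guarantee.

```python
import random
import statistics

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical "big" data set: 200,000 synthetic measurements.
population = [random.gauss(50.0, 10.0) for _ in range(200_000)]

# A simple random sample of 1,000 records, i.e. 0.5% of the data.
sample = random.sample(population, 1_000)

population_mean = statistics.fmean(population)
sample_mean = statistics.fmean(sample)

# Standard error of the sample mean: the typical size of the estimation error.
standard_error = statistics.stdev(sample) / len(sample) ** 0.5

print(f"population mean: {population_mean:.2f}")
print(f"sample mean:     {sample_mean:.2f} (standard error {standard_error:.2f})")
```

The standard error shrinks with the square root of the sample size, so going from 1,000 records to the full data set buys far less additional precision than the increase in volume might suggest.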
In fact, experts point out that though Hadoop wasn’t the better Enterprise Data Warehouse (EDW), it was initially marketed as such by Cloudera. Peter Wang, CTO and co-founder of Continuum Analytics, speaking at a recent conference, hinted that Hadoop was used as a means to hoist data analytics aspirations, but so much innovation has happened around it, including TensorFlow, Spark and Kafka, that Hadoop has been left behind.
Lack of accountability in Big Data
According to a World Economic Forum post, big data poses challenges in accountability as well. One major challenge is that big data can be gamed: results can be skewed by “Google bombing” and “spamdexing”, among other methods. More data has never been a replacement for quality data. In fact, the recent US elections pointed to glaring gaps in data quality: political opinions harvested from social media outlets and polls are by no means an indication of the result.
Silver reportedly said that when working with big data sets, one manages to get the major things right, which improves accuracy. However, in a big data environment, one can’t test the data on real-world customers.
Dark side of big data
According to data scientist Cathy O’Neil, author of Weapons of Math Destruction, mathematical models have invaded all aspects of our lives, from health insurance to loans and even evaluation. It is high time mathematical modellers started taking responsibility.
In algorithms the world believes: with algorithms ruling the universe, there is no way for people to challenge the specific results dished out by machine learning algorithms. In a recent development, the European Union has put in place measures under which people who believe they have been affected by algorithms have a “right to an explanation”.