“The popularity of data science has grown drastically in the last five years, while big data and artificial intelligence have picked up in the last two to three years. With the rise of these trends, companies are keen on leveraging data assets such as customer data and machine-generated data, and making use of them”, noted Sharad Agrawal, CEO and Founder, Sprinkledata, during Cypher 2017.
While this is true, the challenge is to store the huge amounts of data generated over the years in a way that can be accessed easily. The data can reach petabyte scale: as Agrawal explains, if an app or website has 10 billion daily active users, the activity and engagement data they generate, accumulated over 24 months, would amount to a petabyte.
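Agrawal's petabyte estimate can be sanity-checked with some back-of-envelope arithmetic (the per-user figure below is an illustrative derivation, not a number from the talk):

```python
# Back-of-envelope check of the petabyte claim:
# 10 billion daily active users, accumulating data for 24 months.
daily_active_users = 10 * 10**9
days = 24 * 30              # roughly 24 months
one_petabyte = 10**15       # bytes

user_days = daily_active_users * days          # total user-days of activity
bytes_per_user_per_day = one_petabyte / user_days

# Only about 139 bytes of engagement data per user per day is enough
# to accumulate a full petabyte over two years.
print(round(bytes_per_user_per_day))  # → 139
```

In other words, at that user scale even a tiny per-user footprint pushes the platform into petabyte territory.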
Yet even though sales and marketing analytics is understood to be very important, and can make a difference in customer retention and acquisition, only 2% of people say they are happy with the kind of analytics they have in their organisation. Agrawal states the possible reasons for this: firstly, the data is too large or fragmented, and processing it is a challenge. Secondly, the data is accessible to only a select set of people, and accessing it costs the organisation money and energy. There may be further challenges in the form of system-driven or platform challenges and people challenges.
“Platform challenges arise if you are collecting large amounts of data for pipelining: you may use big data systems such as Spark or Hadoop, feed the data into a warehouse, and apply a visualization tool on top. In such cases the warehouse may not be able to take all the data beyond terabyte scale. Here lies the mismatch in scale and speed between the systems, as they have to prune the data, which may result in the loss of data”, he said.
A few approaches for a typical analytics project to overcome these challenges:
Agrawal is quick to add that instead of using multiple systems, everything should be done in one cluster. “Data at the entry stage is large, and we need a large big data cluster for ingesting it—so the big data cluster becomes an important part of the ecosystem. If ingestion and pipelining are done on the big data cluster, a major chunk of the challenge can be overcome”, he says.
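A minimal sketch of this idea is to aggregate data in the same place it lands, instead of exporting pruned data to a separate warehouse (plain Python stands in for a big data cluster here, and the event fields are hypothetical):

```python
from collections import defaultdict

# Hypothetical raw engagement events, as they would arrive at ingestion.
events = [
    {"user": "u1", "action": "click", "count": 3},
    {"user": "u2", "action": "view",  "count": 5},
    {"user": "u1", "action": "view",  "count": 2},
]

# Aggregate on the same system the data lands on, so nothing has to be
# pruned (and no information is lost) to fit a smaller downstream store.
totals = defaultdict(int)
for event in events:
    totals[(event["user"], event["action"])] += event["count"]

print(dict(totals))
# {('u1', 'click'): 3, ('u2', 'view'): 5, ('u1', 'view'): 2}
```

On a real cluster the same shape of computation would run distributed, but the principle is identical: the full-fidelity data never leaves the system that can hold it.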
The other way could be that, instead of analysts asking for reports and going to a dashboard, they do self-serve reporting on the big data platform itself. “This would avoid the mismatch in speed and scalability, as there would be no information loss, and it would be accessible to everyone”, he said. “Self-serve is very important if we want to be a data-driven industry”, added Agrawal.
Other ways to build an analytics platform at petabyte scale include building new systems that have the agility to ask new questions, provide accessibility to data, offer an easy-to-use interface, and remove platform complexity. Most cases require a user to join data from multiple sources, so the enrichment process should be highly scalable, and the turnaround time should be fast and efficient.
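The enrichment step described above—joining data from multiple sources—can be sketched as follows (the sources, field names, and values are hypothetical):

```python
# Hypothetical example: enrich raw activity records with customer
# attributes from a second source, keyed on user id.
activity = [
    {"user_id": 1, "event": "purchase"},
    {"user_id": 2, "event": "signup"},
]
customers = {
    1: {"segment": "enterprise"},
    2: {"segment": "self-serve"},
}

# With the lookup built once, each record is enriched in a single pass,
# keeping turnaround time linear in the size of the activity data.
enriched = [
    {**record, **customers.get(record["user_id"], {})}
    for record in activity
]

print(enriched[0])
# {'user_id': 1, 'event': 'purchase', 'segment': 'enterprise'}
```

At petabyte scale the join itself would be distributed, but the design goal is the same: the enrichment path must scale with the data rather than become the bottleneck.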
Agrawal summarises by saying that in today’s world there is a tremendous rise of data science, analytics, machine learning and artificial intelligence, yet we are not able to derive value from these systems. We are capturing huge amounts of data and applying big data technologies, but the approach we are following is age-old and should be reconsidered in order to leverage data assets to the fullest.