Srihari Srinivasan: At the outset it will be good to clarify that we are essentially a custom software company, not an analytics services company. TW has been a pioneer in adopting the Agile approach to software development. As we now foray into building newer analytics solutions, our approach is largely based on adapting Agile/Lean techniques to the nuances of data engineering. Our approach can best be described as the intersection of agile/lean delivery methods, advanced statistical techniques, and distributed systems engineering, with a keen sensitivity towards data privacy.
AIM: What are the next steps/road ahead for analytics at your organization?
SS: While we have made some significant inroads over the last year or so at TW India, the Analytics practice here is still in its early stages. Given where we are in the lifecycle, one of our big focus areas is growing our internal capacity to service some interesting analytics opportunities.
We are also noticing that the adoption of several Big Data technologies is beginning to move quickly past the proof-of-concept stage in many large enterprises. In response, a key next step for us is to adapt our development, testing, deployment and operations approach to suit large-scale data engineering problems. Accomplishing this is a very interesting technical challenge for talented data scientists and engineers.
Last but not least, we also aspire to create new solutions for personalizing the online experience in consumer internet and e-commerce applications. These solutions will be specifically targeted at emerging markets, where e-commerce and the consumer internet are still burgeoning.
AIM: What are the most significant challenges you face being in the analytics space?
SS: From the perspective of analytics software and data engineering, we find ourselves tackling a fairly heterogeneous set of challenges of late:
First, there is the challenge of data quality that comes from handling data from diverse sources. This is not an entirely new issue; it has plagued the data-warehousing world for a long time. A significant amount of time and effort is still spent distinguishing issues caused by bad data from logical errors in the code. While there is no silver bullet for this, we see teams manage it by constructing different forms of anti-corruption layers as part of their data processing pipelines.
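An anti-corruption layer of this kind can be sketched as a validation stage early in the pipeline that quarantines bad records with a stated reason, so that downstream code only ever sees data it can trust. This is a minimal illustrative sketch, not any specific team's implementation; the field names and validation rules are assumptions.

```python
def validate(record):
    """Return a list of data-quality problems found in one record."""
    problems = []
    if not record.get("user_id"):
        problems.append("missing user_id")
    amount = record.get("amount")
    if not isinstance(amount, (int, float)) or amount < 0:
        problems.append("invalid amount")
    return problems

def anti_corruption_layer(records):
    """Split records into clean rows and quarantined rows with reasons."""
    clean, quarantined = [], []
    for record in records:
        problems = validate(record)
        if problems:
            quarantined.append({"record": record, "problems": problems})
        else:
            clean.append(record)
    return clean, quarantined

raw = [
    {"user_id": "u1", "amount": 25.0},
    {"user_id": "", "amount": 10.0},   # bad data: missing id
    {"user_id": "u2", "amount": -5},   # bad data: negative amount
]
clean, quarantined = anti_corruption_layer(raw)
print(len(clean), len(quarantined))  # → 1 2
```

Because every quarantined record carries its list of problems, a data-quality failure shows up as an explicit rejection at the boundary rather than as a mysterious logic error deep in the pipeline.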
Good software generally is an outcome of the collaboration between technical folks and subject matter experts. From that perspective building high quality analytics systems hinges quite a lot on the collaboration between data scientists and engineers. While data science does require some programming, data scientists are not necessarily the most adept at modern software engineering practices. Conversely, engineers too have to learn a fair amount of statistics and machine learning in order to operationalize the models produced by data scientists. The learning curve that people in either role go through on early projects still remains a big challenge.
There are benefits to be gained by capturing the digital trails of users. At the same time, it is worth noting that these trails may well end up as indelible records. In Europe, many companies are adopting a strategy of Datensparsamkeit, a term that roughly translates as "data austerity" or "data parsimony." The term comes from German privacy legislation and describes the idea of storing only as much personal information as is absolutely required by the business or applicable laws. This is certainly one way that privacy can be preserved even in the unfortunate event of a data breach. The challenge of delivering a more personalized user experience while remaining sensitive to data privacy concerns deserves special mention in the analytics context.
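In practice, Datensparsamkeit can be applied right at ingestion time: only the fields needed for the business purpose are persisted, and everything else is dropped before a record ever reaches storage. The following is a hedged sketch of that idea; the event shape and field names are illustrative assumptions, not a real system's schema.

```python
# Fields the business purpose actually requires (an assumed whitelist).
REQUIRED_FIELDS = {"order_id", "item_sku", "quantity"}

def minimize(event):
    """Keep only the whitelisted fields; discard everything else."""
    return {k: v for k, v in event.items() if k in REQUIRED_FIELDS}

event = {
    "order_id": "o-1001",
    "item_sku": "sku-42",
    "quantity": 2,
    "gps_coords": (12.97, 77.59),  # not required: never stored
    "device_id": "fp-8c1a",        # not required: never stored
}
print(minimize(event))  # only the three required fields survive
```

The design choice is that data never written cannot leak: a breach of the stored records exposes only the minimized fields, which is exactly the guarantee the strategy is after.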
AIM: How do you see Analytics evolving today in the industry as a whole? What are the most important contemporary trends that you see emerging in the Analytics space across the globe?
SS: All product development activity within the Big Data landscape can broadly be categorized into two groups: infrastructure solutions and applications. While infrastructure solutions have already reached a stage of maturity, some optimization efforts are still ongoing in this space.
Managed Platforms – Open source projects such as the Savanna platform from OpenStack, along with a host of Hadoop-as-a-Service offerings from commercial organizations, are trying to make it easier to deploy Hadoop in multi-tenant cloud environments.
SQL-on-Hadoop – The SQL-on-Hadoop trend continues to make progress with solutions like Apache Drill and Impala. These solutions aim to bring the familiar experience of working with the ubiquitous SQL language to the Hadoop platform. This is perhaps the most significant innovation driving Hadoop's adoption within enterprises.
Efficient cluster management – Soon after enterprises adopt different distributed processing frameworks, the question of utilizing the clusters more efficiently comes up. This is something the distributed systems community has been trying to address for a while now. Cluster managers such as Apache Mesos provide efficient resource isolation and sharing across a pool of machines, instead of requiring a dedicated pool of machines for each distributed processing framework.
Intelligent Data Curation – As infrastructure solutions become more mature we are beginning to see a gradual shift in the trends and investments away from infrastructure and towards applications. New data integration solutions are emerging that enable organizations to curate data from heterogeneous sources very efficiently at scale. This is a space to watch out for in the coming months.
Biography of Srihari Srinivasan
Srihari is the Head of Technology for ThoughtWorks India. He has been a developer and architect on several enterprise applications, with a focus on building large-scale systems based on service-oriented architectures, domain-specific languages, and related techniques. He is passionate about distributed systems and databases and blogs about them at www.systemswemake.com.