With the proliferation of applications and end devices – web, mobile, sensors, and more – the last few years, and the foreseeable future, promise continued explosive growth in the volume of data collected by organizations. At the same time, there is ever-growing variety in the types of data, ranging from structured data originating in application databases to semi-structured content from social media, web properties such as Wikipedia, and internal applications such as email systems and company support boards. Additionally, the systems and software stacks (most notably Apache Hadoop) that can keep up with this growth – in variety as well as in volume – remain complex to operate and far from perfect; they are still in the early stages of development and market adoption. It therefore comes as no surprise that many organizations struggle to operate, optimize, and evolve their data infrastructure to serve their data processing needs.
The complete data infrastructure solution has many components. The main ones are as follows:
- Data Collection Service for both real-time and bulk upload of data from different data sources such as applications, databases, and web crawls.
- Batch Computation Service, such as Hadoop/Hive, to process this data and transform raw data into information.
- Real-Time Computation Service to generate real-time results on data streams and data captures for time-sensitive, actionable reporting and monitoring.
- Ad Hoc Query Service to answer one-off queries, sometimes exactly and other times approximately, in a short amount of time.
- Tools and Frameworks for job dependencies, data and query discovery, SLA enforcement, monitoring, and more.
Qubole (www.qubole.com) aims to provide all of the above components (and more) in the cloud. We want to provide fast, easy, and reliable access to all of these services so that our clients can focus on their data and their algorithms while we take care of optimizing, operating, and evolving the data infrastructure for them. We want to enable data engineers, data scientists, and data analysts to work with their data and build data-driven applications, whether those are simple reporting applications or more complex targeting and recommendation applications.
In pursuit of this vision, our first offering is an Ad Hoc Query and Batch Computation Service in the cloud. This service provides Apache Hive and Apache Hadoop as a service, with close integration with Apache Oozie. It is ideal for data stored in S3 on which you want to do ad hoc analysis and build data pipelines. The service is currently available as part of an early access program; we are working with a select set of companies in this program and will make the service available to everyone by Q4 2012. The details of the program and the service appear in the subsequent sections of this white paper.
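As a sketch of what ad hoc analysis over S3-resident data looks like in Hive, the snippet below exposes a directory of log files in S3 as an external Hive table and runs a simple aggregation over it. The bucket name, path, and schema here are illustrative assumptions for this white paper, not part of the service description.

```sql
-- Hypothetical example: map tab-delimited web logs in S3 to a Hive
-- external table, then answer an ad hoc question over them.
CREATE EXTERNAL TABLE web_logs (
  ts      STRING,
  user_id STRING,
  url     STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION 's3n://example-bucket/logs/web/';

-- Ad hoc query: the ten most-viewed URLs
SELECT url, COUNT(*) AS views
FROM web_logs
GROUP BY url
ORDER BY views DESC
LIMIT 10;
```

Because the table is external, Hive only records the schema and S3 location; the data stays in S3, and the same table can then feed scheduled pipeline jobs (for example, ones orchestrated with Apache Oozie) as well as interactive queries.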
Qubole Team Background
Qubole was started by data infrastructure veterans from Facebook who conceived, built, managed, and operated the infrastructure on which almost all of Facebook's backend data processing runs. The company's co-founders, Ashish and Joydeep, are co-creators of the Apache Hive project – a very prominent platform built on top of Apache Hadoop. Under their guidance, the Hadoop and Hive clusters at Facebook grew from 80 TB of data in late 2007 to 20 PB of compressed data in late 2011. The Qubole team comprises talented engineers who have built and delivered strong products at companies such as Oracle, NetApp, and Yahoo. Qubole has raised money from Lightspeed Ventures and Charles River Ventures – two well-known VC firms in the valley. We are seeking organizations to try out an early beta version of our service.