Brian Chesky co-founded peer-to-peer room and home rental company Airbnb with Nathan Blecharczyk and Joe Gebbia in 2008. Now, almost 11 years later, Airbnb has been used by more than 300 million people in 81,000 cities in 191 countries.
At Airbnb, dashboard analytics plays a crucial role in real-time tracking and monitoring of various aspects of business and systems. As a result, the timeliness of these dashboards is critical to the daily operation of Airbnb. And, Druid serves as the best platform for ingestion, streaming and handling of large volumes of data while ensuring there are no latencies.
Handling dashboard analytics in real time comes with following challenges:
- Latency at data aggregation in the warehouse.
- A large volume of data
- Integration between systems. For example, Airbnb stores their datasets in Hadoop and use Kafka and Spark to process data streams.
So, Druid helps Airbnb in tackling these challenges and keeping the services afloat even during unwanted shutdowns.
A Brief Look At Druid’s Architecture
At Airbnb, two Druid clusters are running in production. Two separate clusters allow dedicated support for different uses, even though a single Druid cluster can handle more data sources than what we need. In total, we have 4 Brokers, 2 Overlords, 2 Coordinators, 8 Middle Managers, and 40 Historical nodes. In addition, our clusters are supported by one MySQL server and one ZooKeeper cluster with 5 nodes. Druid clusters are relatively small and low cost compared to other service clusters like HDFS and Presto.
For dashboards using systems like Hive and Presto at query time, data aggregation takes a long time. The storage format makes it difficult for repeated slicing of the data.
The dashboards built on top of Druid can be noticeably faster than those built on others systems. Compared to Hive and Presto, Druid can be an order of magnitude faster.
With high inflows of data, any breakdown will take a toll on the systems. To make the systems breakdown-ready, Druid architecture is well separated out into different components for ingestion, serving, and overall coordination.
For batch analytics, Druid’s API allows easy access of data from Hadoop. Druid’s streaming client API, Tranquility, helps in the integration of streaming engines such as Samza with any other streaming engine. Whereas, real-time analytics is implemented through Spark.
Apache Superset, an open source data viz of Airbnb serves as the interface for users to visualize the results of their analytics queries.
Querying With Druid
Druid allows individual teams and data scientists to easily define how the data their application or service produces should be aggregated and exposed as a Druid data source.
Every data segment needs to be ingested from MapReduce jobs first before it is available for queries. Druid queries are much faster than other systems is that there is a cost imposed on ingestion. This works great in a write-once-read-multiple-times model, and the framework only needs to ingest new data on a daily basis.
One of the issues we deal with is the growth in the number of segment files that are produced every day that need to be loaded into the cluster. Segment files are the basic storage unit of Druid data, that contains the pre-aggregated data ready for serving.
Some data needs recomputation, resulting in a large number of segment files that need to be loaded at once onto the cluster. Ingested segments are loaded by coordinators sequentially in a single thread, centrally. As segments pile up, the delay will be longer as the coordinator will have a hard time keeping up with the pace.
The input volume of data to produce a larger segment is so high that the Hadoop job would run for too long crunching that data, and many times would fail due to various reasons.
One possible solution that Airbnb plans on working on is the compacting of segments right after ingestion and before they are handed off to the coordinator, and increase the segment size while keeping the ingestion stable.
That said, Druid is robust and resilient to failures. Most failures are transparent and unnoticeable to users. Even if a role that is a single point of failure (like Coordinator, Overlord, or even ZooKeeper) fails, Druid cluster is still able to provide query service to the users.