Disco is a recent addition to the growing list of Big Data tools that allow parallel processing of large amounts of data. It was developed at Nokia Research Center in 2008 to address real challenges in handling massive amounts of data, and Nokia has actively developed the framework since.
Disco is used for a variety of applications, such as log analysis, probabilistic modelling, data mining, and full-text indexing. It is an open-source framework, and its use of Python makes it robust and easy to use. Because Disco, like Hadoop, combines a distributed file system with MapReduce, the two are often compared and can be seen as complementary.
The Erlang programming language helps in building massively scalable real-time systems with high-availability requirements. Because Disco builds on Erlang’s concurrency and clustering, users need nothing more than one master server to submit jobs and push files into the Disco Distributed File System (DDFS). With Disco, there is virtually no latency when starting jobs, so users can get up and running quickly.
Primary features and benefits of using the Disco framework
Several features make Disco a reliable framework. Some of them are listed below:
- The framework helps to build and query indices with billions of keys and values, using DiscoDB.
- It provides efficient data-locality-preserving IO, either over HTTP or the built-in petabyte-scale Disco Distributed Filesystem.
- Disco can be easily installed on Linux, Mac OS X, and FreeBSD.
- The framework supports profiling and debugging of MapReduce jobs.
- It can run jobs written in any language using the worker protocol.
- Disco also provides random access to data and auxiliary results through out-of-band results.
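The data-locality point in the list above can be illustrated with a small sketch. This is not Disco's scheduler: the function name `assign_tasks`, the block names, and the node names are all hypothetical, and the policy (prefer a node that already stores the block, otherwise pick the least-loaded node) is an assumed simplification of what a locality-preserving scheduler might do.

```python
from collections import defaultdict


def assign_tasks(block_locations, nodes):
    """Assign each input block to a node, preferring nodes that already hold it.

    block_locations: dict mapping block id -> list of nodes storing a replica.
    nodes: list of worker node names available in the cluster.
    Returns a dict mapping block id -> (node, is_local).
    """
    load = defaultdict(int)  # number of tasks already given to each node
    assignment = {}
    for block in sorted(block_locations):
        holders = [n for n in block_locations[block] if n in nodes]
        # Prefer a node that stores the block; otherwise any node will do.
        candidates = holders or nodes
        # Among the candidates, take the least-loaded one (ties broken by name).
        chosen = min(candidates, key=lambda n: (load[n], n))
        load[chosen] += 1
        assignment[block] = (chosen, chosen in holders)
    return assignment


if __name__ == "__main__":
    # Hypothetical replica map; "node-d" is not part of the cluster below.
    locations = {
        "blob-0": ["node-a", "node-b"],
        "blob-1": ["node-a"],
        "blob-2": ["node-c"],
        "blob-3": ["node-d"],
    }
    cluster = ["node-a", "node-b", "node-c"]
    for block, (node, local) in assign_tasks(locations, cluster).items():
        print(block, "->", node, "(local)" if local else "(remote read)")
```

In this toy version, only the block whose replicas sit outside the cluster has to be read over the network; every other task runs where its data already lives.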
Can Disco replace Hadoop?
Hadoop comes with its own drawbacks. The platform was not specifically designed for data integration, which depends on capabilities such as governance support, metadata management, data quality, and flexible data delivery styles.
Disco’s core is remarkably compact, and it is easy to understand how it works. Users can start experimenting with it or adapt it to new environments. Moreover, it is easy to add new features around the core using Python, which helps Disco respond quickly to real-world needs.
Often, the choice comes down to whether Erlang and Python suit the task better than Java. Erlang is a good match for the Disco core, which needs to handle tens of thousands of tasks in parallel. The core is written in Erlang, while MapReduce jobs are typically written in Python. Additionally, when installed by following the documentation, Disco stays close to a plain Linux environment and is much easier to set up.
Furthermore, users can run a single-node Disco cluster on a laptop to test jobs and push data into DDFS. The framework has almost zero overhead, and users don’t need to run it in a virtual machine.
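To make the map and reduce phases concrete, here is a minimal word-count sketch in plain Python. It deliberately does not use Disco's API: `map_words`, `reduce_counts`, and `run_job` are hypothetical names, and everything runs in a single process. A real Disco job would instead be submitted to the master and would typically read its input from DDFS.

```python
from collections import defaultdict


def map_words(line):
    """Map phase: emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce phase: combine all the counts emitted for one word."""
    return word, sum(counts)


def run_job(lines, map_fn, reduce_fn):
    """Run map, group the emitted values by key, then reduce each group."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_fn(line):
            groups[key].append(value)
    # In a real cluster the grouping step is where data moves between nodes.
    return dict(reduce_fn(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    sample = ["disco runs map tasks", "disco runs reduce tasks"]
    print(run_job(sample, map_words, reduce_counts))
```

The split into a per-line map function and a per-key reduce function is what lets a framework run many map tasks in parallel on different machines and merge their results afterwards.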
Should we still consider leveraging Hadoop?
The Hadoop Distributed File System (HDFS) lets users reach all their files and data from wherever they are, irrespective of the type of system they log in from. Moreover, HDFS gives users the ability to dump very large datasets (usually log files) into the distributed filesystem. From there, users can easily access the data with other tools.
Besides storing a large amount of data, HDFS is also fault-tolerant. Because it keeps copies of the data on several machines, losing a disk or a machine typically does not spell disaster for the data under consideration. HDFS has become a trusted way to store data and share it with other open-source data analysis tools.
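The fault-tolerance claim can be sketched with a toy replicated store. This is an illustration only, not how HDFS is implemented: the class `ReplicatedStore` and its methods are hypothetical, and the round-robin placement is an assumed simplification (HDFS uses its own placement policy). The replication factor of 3 mirrors the HDFS default.

```python
class ReplicatedStore:
    """Toy block store that keeps several copies of each block on different nodes."""

    def __init__(self, nodes, replicas=3):
        if replicas > len(nodes):
            raise ValueError("need at least as many nodes as replicas")
        self.nodes = {name: {} for name in nodes}  # node name -> {block id: data}
        self.replicas = replicas
        self._next = 0  # rotating offset so blocks spread across the cluster

    def put(self, block_id, data):
        alive = sorted(self.nodes)
        if len(alive) < self.replicas:
            raise RuntimeError("not enough live nodes to place all replicas")
        start = self._next % len(alive)
        # Consecutive positions are distinct nodes because replicas <= len(alive).
        for i in range(self.replicas):
            self.nodes[alive[(start + i) % len(alive)]][block_id] = data
        self._next += 1

    def fail(self, node):
        # Simulate losing a machine together with everything stored on it.
        self.nodes.pop(node, None)

    def get(self, block_id):
        # Any surviving copy is enough to serve the read.
        for blocks in self.nodes.values():
            if block_id in blocks:
                return blocks[block_id]
        raise KeyError(block_id)


if __name__ == "__main__":
    store = ReplicatedStore(["n1", "n2", "n3", "n4"], replicas=3)
    store.put("log-0", b"first chunk")
    store.put("log-1", b"second chunk")
    store.fail("n2")
    store.fail("n3")
    print(store.get("log-0"), store.get("log-1"))
```

With three copies of each block, any two machines can be lost before a block becomes unreadable, which is why a single failed disk or machine is usually a non-event.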
The future of the Disco project
Disco is an innovative and unique framework. Even so, the project aims to improve on three major aspects so that the framework can see more applications in the future.
Versatility is the first aspect: Disco will integrate with other distributed processing frameworks. This will give end users a wider range of options for deciding how their data is processed.
The second aspect revolves around user-friendliness. A web interface for issuing complicated HTTP requests is touted as a major update to Disco. For ease of use, all communication will be automated, and the end user will only have to fill in a form with the requirements. In a more advanced version, Disco will be able to deduce and suggest the best-suited frameworks on its own, based on a questionnaire and an algorithm.
Last but not least, Disco needs work on configurability. The given parameters will have to be mapped to the individual configuration that achieves the highest data throughput, in addition to specifying entirely new settings. Factors like memory management, ratio, disk space, etc. come into play here and have to be configured well to get the best performance out of the system.