
Why You Should Use Dask If You Are Into Data Science And Machine Learning


What if there were a single tool that could speed up algorithms, parallelise computing across Pandas and NumPy, and integrate with libraries like scikit-learn and XGBoost? There is, and it is called Dask.

Many of the solutions available in the market are parallelisable, but they do not map cleanly onto a single big DataFrame computation. Today, teams tend to solve these problems either by writing custom code against low-level systems like MPI, by building complex queuing systems, or by doing the heavy lifting with MapReduce or Spark.

Dask exposes low-level APIs to its internal task scheduler for executing advanced computations. This enables the building of custom parallel computing systems that use the same engine powering Dask’s arrays, DataFrames, and machine learning algorithms.
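
For instance, dask.delayed is one of these low-level APIs: it turns ordinary Python functions into lazy tasks that the scheduler runs in parallel. A minimal sketch, with a stand-in function:

from dask import delayed

@delayed
def inc(x):
    # Stand-in for real work; each call becomes one task in the graph
    return x + 1

# Build a task graph lazily; nothing has executed yet
tasks = [inc(i) for i in range(10)]
total = delayed(sum)(tasks)

# Run the whole graph in parallel and collect the result
print(total.compute())  # 55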

What Makes Dask Tick

Dask emphasises the following virtues:

  • The ability to work in parallel with NumPy arrays and Pandas DataFrame objects
  • Integration with other projects
  • Distributed computing
  • Faster operation thanks to low overhead and minimal serialisation
  • Resilient operation on clusters with thousands of cores
  • Real-time feedback and diagnostics

Dask’s three parallel collections, namely DataFrames, Bags, and Arrays, enable it to store data that is larger than RAM. Each of these collections can use data partitioned between RAM and a hard disk, as well as data distributed across multiple nodes in a cluster.
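
Dask Bags, for instance, process collections of semi-structured records in parallel. A quick sketch (the file pattern and fields are hypothetical):

import json
import dask.bag as db

# Each text file becomes one or more partitions of raw lines
lines = db.read_text("logs/2019-*.json")

# Parse and filter lazily, without loading the whole dataset into RAM
errors = lines.map(json.loads).filter(lambda r: r.get("status") == "error")
print(errors.count().compute())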

Dask enables efficient parallel computation on a single machine by leveraging multi-core CPUs and streaming data efficiently from disk, and the same code can scale up to run on a distributed cluster.

Dask also allows the user to replace the cluster with a single-machine scheduler, which brings down the overhead. These schedulers require no setup and can run entirely within the same process as the user’s session.
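
The scheduler can be picked per computation. A quick sketch with an in-process Dask array (sizes are illustrative):

import dask.array as da

# A 10,000 x 10,000 random array held as 1,000 x 1,000 chunks
x = da.random.random((10_000, 10_000), chunks=(1_000, 1_000))

# Runs on the single-machine threaded scheduler, inside this process
print((x + x.T).mean().compute(scheduler="threads"))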

Dask vs Pandas

Dask DataFrames coordinate many Pandas DataFrames/Series arranged along the index. A Dask DataFrame is partitioned row-wise, grouping rows by index value for efficiency. These Pandas objects may live on disk or on other machines.
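
The Pandas API carries over almost unchanged, except that computations are lazy and run one partition at a time. A short sketch (the file pattern and column names are hypothetical):

import dask.dataframe as dd

# Each CSV matching the glob becomes one or more Pandas partitions
df = dd.read_csv("data/trips-*.csv")

# Familiar Pandas syntax, but this only builds a task graph
mean_fare = df.groupby("passenger_count").fare_amount.mean()

# Trigger the actual parallel computation
print(mean_fare.compute())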

Dask DataFrame has the following limitations:

  1. It is expensive to set up a new index from an unsorted column (see the sketch after this list).
  2. The Pandas API is very large, and Dask DataFrame does not attempt to implement all of it.
  3. Wherever Pandas lacks speed, Dask DataFrame inherits that slowness as well.
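
The first limitation exists because setting a new index forces Dask to sort and shuffle every partition. Continuing the hypothetical sketch above:

# Unlike most Dask DataFrame operations, this repartitions (shuffles)
# the entire dataset so that rows are grouped by the new index
df = df.set_index("pickup_datetime")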

Dask For ML

Any machine learning project can suffer from one or both of the following factors:

  1. Long training times
  2. Large Datasets

Dask can address the above problems in the following ways:

  • Dask-ML makes it easy to use normal Dask workflows to prepare and set up data; it then deploys XGBoost or TensorFlow alongside Dask and hands the data over.
  • Replacing NumPy arrays with Dask arrays makes scaling algorithms easier.
  • In all cases, Dask-ML endeavours to provide a single unified interface around the familiar NumPy, Pandas, and scikit-learn APIs. Users familiar with scikit-learn should feel at home with Dask-ML.

Dask-ML also provides hyperparameter search tools modelled on scikit-learn, such as GridSearchCV and RandomizedSearchCV:

from dask_ml.datasets import make_regression
from dask_ml.model_selection import train_test_split, GridSearchCV
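
A brief sketch of how these pieces fit together (the estimator and parameter grid are illustrative choices):

from sklearn.linear_model import ElasticNet

# A Dask-backed regression dataset, split just as in scikit-learn
X, y = make_regression(n_samples=10_000, chunks=1_000, random_state=0)
Xtrain, Xtest, ytrain, ytest = train_test_split(X, y)

# The familiar scikit-learn interface, with candidate fits scheduled by Dask
search = GridSearchCV(ElasticNet(), {"alpha": [0.1, 1.0, 10.0]}, cv=3)
search.fit(Xtrain, ytrain)
print(search.best_params_)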

Here is an implementation of scikit-learn with Dask for prediction models. ParallelPostFit wraps an ordinary scikit-learn estimator so that, after fitting, prediction runs in parallel across the chunks of a Dask collection:

from sklearn.linear_model import ElasticNet
from dask_ml.wrappers import ParallelPostFit

# The fit itself is ordinary scikit-learn and happens in memory
el = ParallelPostFit(estimator=ElasticNet())
el.fit(Xtrain, ytrain)

# predict() is applied chunk-by-chunk in parallel and stays lazy
preds = el.predict(Xtest)

Implementing joblib to parallelise the workload. Note that the old entry points, import dask_ml.joblib and sklearn.externals.joblib, are deprecated; recent versions use joblib’s Dask backend directly:

import joblib
from dask.distributed import Client  # importing distributed registers the "dask" joblib backend
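
A minimal sketch of the pattern, assuming any joblib-parallelised scikit-learn estimator (the random forest and dataset here are illustrative):

import joblib
from dask.distributed import Client
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

client = Client()  # local cluster by default; pass a scheduler address to scale out

X, y = make_classification(n_samples=1_000, random_state=0)
clf = RandomForestClassifier(n_estimators=200, n_jobs=-1)

# Inside this block, joblib hands its tasks (one per tree) to Dask workers
with joblib.parallel_backend("dask"):
    clf.fit(X, y)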

Dask lets analysts handle large datasets (100 GB+) even on relatively low-powered machines, without the need for configuration or setup.

Conclusion

Pandas is still the go-to option as long as the dataset fits into the user’s RAM. For functions that don’t work with Dask DataFrame, dask.delayed offers more flexibility and can be used instead.

Dask is very selective in the way it uses the disk. It evaluates computations with a low memory footprint by pulling chunks of data in from disk, carrying out the necessary processing, and shedding intermediate values as soon as they are no longer needed.
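
To make that concrete, here is a sketch with a Dask array far larger than typical RAM (sizes are illustrative):

import dask.array as da

# About 80 GB of float64 values, held as 100 MB chunks rather than one block
x = da.ones((100_000, 100_000), chunks=(10_000, 1_250))

# Chunks are produced, reduced, and discarded a few at a time,
# so peak memory stays near a handful of chunks, not 80 GB
print(x.sum().compute())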

Dask’s active participation at the community level has shaped how it has evolved within the Python ecosystem, and it lets the rest of that ecosystem benefit from parallel and distributed computing with minimal coordination.

As a result, Dask development is pushed forward by developer communities, which should ensure that the Python ecosystem continues to evolve with consistency.

Installing Dask with pip:

pip install "dask[complete]"

Check the Dask cheatsheet here
