How to Install Dask?
To install Dask along with its optional dependencies, run the following command in the terminal:
python -m pip install "dask[complete]"
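To confirm the installation succeeded, you can import Dask and print its version:

```python
import dask

# Prints the installed Dask version, e.g. "2024.1.0"
print(dask.__version__)
```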
Let’s see an example comparing dask and pandas.
The examples below read a local file named dataset.csv; any reasonably large CSV file will work.
1. Pandas Performance: Read the dataset using pd.read_csv()
import pandas as pd
%time temp = pd.read_csv('dataset.csv', encoding='ISO-8859-1')
Output:
CPU times: user 619 ms, sys: 73.6 ms, total: 692 ms
Wall time: 705 ms
2. Dask Performance: Read the dataset using dask.dataframe.read_csv
import dask.dataframe as dd
%time df = dd.read_csv('dataset.csv', encoding='ISO-8859-1')
Output:
CPU times: user 21.7 ms, sys: 938 µs, total: 22.7 ms
Wall time: 23.2 ms
A natural question is how large datasets were handled with pandas before Dask. There are a few tricks for managing large datasets in pandas:
- Using chunksize parameter of read_csv in pandas
- Use only needed columns while reading the csv files
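Both tricks can be sketched as follows, using a small generated CSV so the example is self-contained (the file name and columns are illustrative):

```python
import csv
import os
import tempfile

import pandas as pd

# Generate a sample CSV standing in for a large dataset
path = os.path.join(tempfile.mkdtemp(), "dataset.csv")
with open(path, "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["id", "value", "notes"])
    for i in range(10_000):
        writer.writerow([i, i % 7, "x"])

# Trick 1: process the file in fixed-size chunks instead of
# loading it all into memory at once
total = 0
for chunk in pd.read_csv(path, chunksize=2_000):
    total += chunk["value"].sum()

# Trick 2: load only the columns you actually need
slim = pd.read_csv(path, usecols=["id", "value"])
print(slim.columns.tolist())  # ['id', 'value']
```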
These techniques cover most cases when reading large datasets with pandas. But when they are not enough, for example when the data does not fit in memory even after trimming columns, Dask comes in and plays a major role.
Types of Dask Schedulers
- Single-Threaded Scheduler: The single-threaded (synchronous) scheduler runs all tasks sequentially on a single thread. While that does not fulfill the potential of parallel computing, it is useful for debugging and understanding the task execution flow.
- Multi-Threaded Scheduler: The multi-threaded scheduler is beneficial for tasks that spend a significant amount of time waiting on external resources, such as reading from disk or network operations.
- Multi-Process Scheduler: The multi-process scheduler uses multiple processes to execute tasks in parallel. Each process has its own Python interpreter, enabling true parallelism and efficient use of multi-core machines.
- Distributed Scheduler: The distributed scheduler extends the multi-process approach to work across multiple machines, enabling distributed computing by managing tasks on a cluster of interconnected machines.
- Adaptive Scheduler: Used with the distributed scheduler, adaptive scaling dynamically adjusts the number of worker processes based on the workload, making it suitable for handling varying workloads.
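A scheduler can be selected per call via the scheduler keyword of dask.compute. A minimal sketch comparing the synchronous and multi-threaded schedulers on a few delayed tasks:

```python
import dask

@dask.delayed
def square(x):
    return x * x

# Build a small task graph; nothing runs yet
tasks = [square(i) for i in range(5)]

# Single-threaded (synchronous) scheduler: sequential, easiest to debug
result_sync = dask.compute(*tasks, scheduler="synchronous")

# Multi-threaded scheduler: runs tasks on a thread pool in one process
result_threads = dask.compute(*tasks, scheduler="threads")

print(result_sync)  # (0, 1, 4, 9, 16)
```

The same choice can be made globally with dask.config.set(scheduler="threads"), so that every subsequent .compute() call uses it.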
Limitations of Dask
Dask has certain limitations:
- Dask cannot parallelize computations within an individual task.
- As a distributed-computing framework, Dask enables remote execution of arbitrary code, so Dask workers should be hosted only within a trusted network.
Dask in Python
Dask is an open-source parallel computing library that can serve as a game changer, offering a flexible and user-friendly approach to managing large datasets and complex computations.
In this article, we delve into the world of Dask: how to install it and its key features.