Comparison between computational times of Pandas and cuDF

To compare the two libraries, we load the same large dataset, Data.csv (887,379 rows and 22 columns), first with Pandas and then with cuDF, and measure the time each one takes.

Using Pandas to load a Dataset:

Python3




# Loading the Dataset using Pandas Library (CPU Based)
import pandas as pd
import time

# Time only the CSV load, not the print calls that follow
start = time.time()
df = pd.read_csv("Data.csv")
end = time.time()

print("no. of rows in the dataset", df.shape[0])
print("no. of columns in the dataset", df.shape[1])
print("CPU time= ", end-start)


Output:

no. of rows in the dataset 887379
no. of columns in the dataset 22
CPU time=  2.3006720542907715

Output of loading Data.csv with Pandas.

Using cuDF to load a Dataset:

Python3




# Loading the Dataset using cuDF Library (GPU Based)
import cudf
import time

# Time only the CSV load, not the print calls that follow
start = time.time()
df = cudf.read_csv("../input/data-big/Data.csv")
end = time.time()

print("no. of rows in the dataset", df.shape[0])
print("no. of columns in the dataset", df.shape[1])
print("GPU time= ", end-start)


Output:

no. of rows in the dataset 887379
no. of columns in the dataset 22
GPU time=  0.1478710174560547

Output of loading Data.csv with cuDF.

Comparing the two runs, Pandas (CPU) takes about 2.30 seconds to load the dataset, while cuDF (GPU) takes only about 0.15 seconds, which is roughly 15 times faster. These figures come from a single run each, so the exact ratio will vary with the hardware and the dataset.
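Because a single timed run can be noisy, it may help to repeat the measurement and to compute the speedup explicitly. The following is a minimal sketch: `best_time` and `speedup` are hypothetical helper names introduced here for illustration, not part of Pandas or cuDF, and the two timings are simply the values reported in the outputs above.

```python
import time


def best_time(fn, repeats=3):
    """Return the fastest wall-clock time (in seconds) over `repeats` calls to fn."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution timer
        fn()
        timings.append(time.perf_counter() - start)
    return min(timings)


def speedup(cpu_seconds, gpu_seconds):
    """Return how many times faster the GPU run was than the CPU run."""
    return cpu_seconds / gpu_seconds


# Timings taken from the two outputs shown above (one run each)
ratio = speedup(2.3006720542907715, 0.1478710174560547)
print(f"cuDF was about {ratio:.1f}x faster in this run")
```

With the same data file available, the loaders could be timed as `best_time(lambda: pd.read_csv("Data.csv"))` and `best_time(lambda: cudf.read_csv("Data.csv"))`, assuming both libraries are installed. Taking the minimum of several runs reduces the influence of one-off effects such as a cold disk cache or GPU initialization.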

How to speed up Pandas with cuDF?

Pandas DataFrames are extremely useful in Python: they provide an easy, flexible way to work with data, along with a large number of built-in functions for handling, analyzing, and processing it. Their processing time is acceptable for many tasks, but computationally intensive operations on large datasets tend to be slow, which delays data science and ML workflows. The main reason is that Pandas runs on the CPU: most of its operations use a single core, and a CPU has only a handful of cores to begin with (often around 8), whereas a GPU has thousands. GPU acceleration of data science and machine learning workflows addresses this problem and can speed up such operations considerably.
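Since cuDF mirrors much of the Pandas API, one way to take advantage of a GPU when it is present is to pick the library at import time and keep the rest of the code unchanged. The sketch below assumes that approach; `load_dataframe_backend` and the alias `xdf` are hypothetical names used only for illustration, and the tiny DataFrame is made-up sample data.

```python
import importlib


def load_dataframe_backend():
    """Return (name, module): cuDF if it can be imported, otherwise Pandas."""
    for name in ("cudf", "pandas"):
        try:
            return name, importlib.import_module(name)
        except Exception:
            # cuDF may be missing, or may fail to import without a usable GPU
            continue
    return None, None


backend_name, xdf = load_dataframe_backend()
print("DataFrame backend:", backend_name)

if xdf is not None:
    # The same DataFrame code runs on either backend
    df = xdf.DataFrame({"group": ["a", "b", "a", "b"], "value": [1, 2, 3, 4]})
    totals = df.groupby("group")["value"].sum()
    print(totals)
```

Not every Pandas feature has a cuDF equivalent, so this pattern works best for common operations such as reading files, filtering, joins, and group-by aggregations. Recent cuDF releases also provide a `cudf.pandas` accelerator mode that aims to run existing Pandas code on the GPU without changing the imports.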
