Arrow Columnar Layout in cuDF

As stated earlier, cuDF employs Apache Arrow Columnar Layout, an in-memory columnar format used to represent structured datasets. This columnar format is fast and allows computational intensive operations to work with maximum efficiency while handling and iterating big datasets.

The following represents a sample dataset in Traditional Memory Buffer and Arrow Memory Buffer (Columnar Layout).

Traditional memory Buffer Vs Arrow Memory Buffer 

Traditional Memory Buffer the data is stored in contiguous memory locations row-wise. In contrast, in the case of Arrow Memory Buffer, the data is stored in contiguous memory locations column-wise. This is one of the contributing factors towards accelerating the speed of cuDF data frames.

Note: Since cuDF requires you to have specific RAPIDS compatible GPUs, for the sake of practice/exploring one can use Kaggle or Google Colaboratory as both these platforms provide free GPU access. However, while using Google Colabs just ensure that you’ve been allocated either of the following GPUs: Tesla T4, P4, or P100 as these are the only RAPIDS compatible GPUs on Google Colab.

Thus, it is evident that using cuDF we can employ GPU acceleration on Python data frames and make the processing of data quite fast. This holds immense significance in fields of data science and ML as data is being generated in overwhelming quantities every second – and its speedy processing is imperative. 



How to speed up Pandas with cuDF?

Pandas data frames in Python are extremely useful; they provide an easy and flexible way to deal with data and a large number of in-built functions to handle, analyze, and process the data. While Pandas data frames have a decent processing time, still in the case of computationally intensive operations, Pandas data frames tend to be slow, causing delays in data science and ML workflows. This limited speed of pandas data frames is because pandas work on CPUs that only have 8 cores. However, GPU acceleration of data science and machine learning workflows provides a solution to this problem and enhances the speed of operations at an impressive level.

Similar Reads

cuDF

cuDF (CUDA DF) is a Python GPU data frame library that helps accelerate the loading, processing, and manipulating of massive data – thus, enabling users to perform computer-intensive operations fast. cuDF is based on an apache arrow columnar layout which we will discuss later....

Comparison between computational times of Pandas and cuDF

In order to analyze the time taken in both cases, let us try to load a huge dataset data.csv – first using pandas library and then using cuDF, and compare the computational time in both the cases....

Arrow Columnar Layout in cuDF

...