Handling Large Datasets in Python
Handling large datasets is a common task in data analysis and data processing. When working with large datasets, it is important to use efficient techniques and tools so that performance stays acceptable and memory issues are avoided. In this article, we will see how to handle large datasets in Python.
Handle Large Datasets in Python
To handle large datasets in Python, we can use the following techniques: optimizing data types, processing data in chunks, and using Dask for parallel computing.
Reduce Memory Usage by Optimizing Data Types
By default, Pandas assigns data types that may not be memory-efficient. For numeric columns, consider downcasting to smaller types (e.g., int32 instead of int64, float32 instead of float64). For example, if a column only holds values like 0, 1, 2, 3, 4, 5, 6, 7, 8, or 9, int8 (8 bits) is sufficient instead of int64 (64 bits). Similarly, converting object columns to the category dtype can also save memory; a sketch of that conversion follows the numeric example below.
import pandas as pd
# Define the size of the dataset
num_rows = 1000000 # 1 million rows
# Example DataFrame with inefficient datatypes
data = {'A': [1, 2, 3, 4],
        'B': [5.0, 6.0, 7.0, 8.0]}
df = pd.DataFrame(data)
# Replicate the DataFrame to create a larger dataset
df_large = pd.concat([df] * (num_rows // len(df)), ignore_index=True)
# Check memory usage before conversion
print("Memory usage before conversion:")
print(df_large.memory_usage().sum())
# Convert to more memory-efficient datatypes
df_large['A'] = pd.to_numeric(df_large['A'], downcast='integer')
df_large['B'] = pd.to_numeric(df_large['B'], downcast='float')
# (An explicit cast such as .astype('int32') would also work, but here it
# would undo the smaller dtypes already chosen by downcasting above.)
# Check memory usage after conversion
print("Memory usage after conversion:")
print(df_large.memory_usage().sum())
# Print the resulting data types
print("\nType casting:")
print("Column 'A' dtype:", df_large['A'].dtype)
print("Column 'B' dtype:", df_large['B'].dtype)
Output
Memory usage before conversion:
16000128
Memory usage after conversion:
5000128
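The object-to-category conversion mentioned above follows the same pattern. Below is a minimal sketch, assuming a hypothetical column of repeated city names; the exact savings depend on how many distinct values the column contains.
import pandas as pd
# Hypothetical column with many repeated string labels
df = pd.DataFrame({'city': ['Delhi', 'Mumbai', 'Chennai', 'Delhi'] * 250000})
# Memory usage while the column is stored as plain Python strings (object dtype)
print("Memory usage as object:")
print(df.memory_usage(deep=True).sum())
# Convert the repeated strings to the category dtype
df['city'] = df['city'].astype('category')
# Memory usage after conversion; each row now stores a small integer code
print("Memory usage as category:")
print(df.memory_usage(deep=True).sum())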
Split Data into Chunks
Use the chunksize parameter in pd.read_csv() to read a large file in smaller chunks and process each chunk iteratively, so the entire dataset never has to sit in memory at once. The example below simulates this idea on an in-memory DataFrame by grouping rows on their index; a sketch that reads an actual CSV file in chunks follows the output.
import pandas as pd
# Create sample DataFrame
data = {'A': range(10000),
        'B': range(10000)}
df = pd.DataFrame(data)
# Process data in chunks of 1000 rows by grouping on the integer-divided index
chunk_size = 1000
for chunk in df.groupby(df.index // chunk_size):
    print(chunk)
Output
(0, A B
0 0 0
1 1 1
2 2 2
3 3 3
4 4 4
.. ... ...
995 995 995
996 996 996
997 997 997
998 998 998
999 999 999
[1000 rows x 2 columns])
(1, A B
1000 1000 1000
1001 1001 1001
1002 1002 1002
1003 1003 1003
1004 1004 1004
... ... ...
1995 1995 1995
1996 1996 1996
1997 1997 1997
1998 1998 1998
1999 1999 1999
[1000 rows x 2 columns])
(2, A B
2000 2000 2000
2001 2001 2001
2002 2002 2002
2003 2003 2003
2004 2004 2004
... ... ...
2995 2995 2995
2996 2996 2996
2997 2997 2997
2998 2998 2998
2999 2999 2999
[1000 rows x 2 columns])
(3, A B
3000 3000 3000
3001 3001 3001
3002 3002 3002
3003 3003 3003
3004 3004 3004
... ... ...
3995 3995 3995
3996 3996 3996
3997 3997 3997
3998 3998 3998
3999 3999 3999
[1000 rows x 2 columns])
(4, A B
4000 4000 4000
4001 4001 4001
4002 4002 4002
4003 4003 4003
4004 4004 4004
... ... ...
4995 4995 4995
4996 4996 4996
4997 4997 4997
4998 4998 4998
4999 4999 4999
[1000 rows x 2 columns])
(5, A B
5000 5000 5000
5001 5001 5001
5002 5002 5002
5003 5003 5003
5004 5004 5004
... ... ...
5995 5995 5995
5996 5996 5996
5997 5997 5997
5998 5998 5998
5999 5999 5999
[1000 rows x 2 columns])
(6, A B
6000 6000 6000
6001 6001 6001
6002 6002 6002
6003 6003 6003
6004 6004 6004
... ... ...
6995 6995 6995
6996 6996 6996
6997 6997 6997
6998 6998 6998
6999 6999 6999
[1000 rows x 2 columns])
(7, A B
7000 7000 7000
7001 7001 7001
7002 7002 7002
7003 7003 7003
7004 7004 7004
... ... ...
7995 7995 7995
7996 7996 7996
7997 7997 7997
7998 7998 7998
7999 7999 7999
[1000 rows x 2 columns])
(8, A B
8000 8000 8000
8001 8001 8001
8002 8002 8002
8003 8003 8003
8004 8004 8004
... ... ...
8995 8995 8995
8996 8996 8996
8997 8997 8997
8998 8998 8998
8999 8999 8999
[1000 rows x 2 columns])
(9, A B
9000 9000 9000
9001 9001 9001
9002 9002 9002
9003 9003 9003
9004 9004 9004
... ... ...
9995 9995 9995
9996 9996 9996
9997 9997 9997
9998 9998 9998
9999 9999 9999
[1000 rows x 2 columns])
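Because the example above works on a DataFrame that is already in memory, here is a minimal sketch of chunked reading with pd.read_csv(). It assumes a hypothetical file named large_data.csv containing a numeric column 'A'; only one chunk is held in memory at a time.
import pandas as pd
# Hypothetical file; replace with the path to your own CSV
file_path = 'large_data.csv'
total = 0
row_count = 0
# read_csv with chunksize returns an iterator that yields one DataFrame per chunk
for chunk in pd.read_csv(file_path, chunksize=100000):
    # Aggregate each chunk separately instead of loading the whole file
    total += chunk['A'].sum()
    row_count += len(chunk)
# Combine the per-chunk results into the overall mean of column 'A'
print("Mean of column 'A':", total / row_count)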
Use Dask for Parallel Computing
Dask is a parallel computing library that scales Pandas workflows to larger-than-memory datasets by splitting the data into partitions and processing them in parallel.
import dask.dataframe as dd
import pandas as pd
# Create sample DataFrame
data = {'A': range(10000),
        'B': range(10000)}
df = pd.DataFrame(data)
# Convert the pandas DataFrame into a Dask DataFrame split into 4 partitions
ddf = dd.from_pandas(df, npartitions=4)
# Perform parallelized operations
result = ddf.groupby('A').mean().compute()
print(result)
Output
B
A
0 0.0
1 1.0
2 2.0
3 3.0
4 4.0
... ...
9995 9995.0
9996 9996.0
9997 9997.0
9998 9998.0
9999 9999.0
[10000 rows x 1 columns]
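The example above starts from a pandas DataFrame that already fits in memory. For genuinely larger-than-memory data, Dask can read the file directly into partitions. Below is a minimal sketch, assuming a hypothetical CSV file large_data.csv with columns 'A' and 'B'.
import dask.dataframe as dd
# Read the file lazily; Dask splits it into partitions of roughly 64 MB each
ddf = dd.read_csv('large_data.csv', blocksize='64MB')
# Operations only build a task graph; compute() runs them in parallel
result = ddf.groupby('A')['B'].mean().compute()
print(result)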
Conclusion
In conclusion, handling large datasets in Python comes down to a few practical techniques: optimizing data types to reduce memory usage, processing data in chunks instead of loading everything at once, and using a parallel computing library such as Dask. These steps help us process and analyze large datasets efficiently without running into memory issues.