Shuffling
Shuffling is the random reordering of data samples at the start of every epoch. It is used to improve model performance and generalization, because the model does not see the samples in the same fixed order each time. Setting shuffle=True makes the DataLoader use a RandomSampler internally.
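The link between the shuffle flag and the sampler can be checked directly. The sketch below is only an illustration: toy_dataset is a hypothetical ten-element stand-in, not a dataset used elsewhere in this article, and the check relies on the sampler attribute that a DataLoader exposes.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, RandomSampler, SequentialSampler

# Tiny stand-in dataset (hypothetical; any map-style dataset would do)
toy_dataset = TensorDataset(torch.arange(10))

shuffled_loader = DataLoader(toy_dataset, batch_size=5, shuffle=True)
ordered_loader = DataLoader(toy_dataset, batch_size=5, shuffle=False)

# The sampler attribute holds the sampler the DataLoader built for itself
print(type(shuffled_loader.sampler).__name__)  # RandomSampler
print(type(ordered_loader.sampler).__name__)   # SequentialSampler
```

In other words, shuffle=True is a shortcut for passing a RandomSampler, and shuffle=False is a shortcut for a SequentialSampler.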
Enabling shuffling with the shuffle argument
When shuffle=True, the DataLoader randomly rearranges the data at the start of each epoch. The DataLoader returns the batched data (input features and labels) to the training loop.
In the code below, the built-in MNIST dataset is downloaded and wrapped in a DataLoader with shuffle=True. Shuffling ensures that the model encounters a diverse mix of samples in each batch.
shuffle (bool, optional) –> set to True to have the data reshuffled at every epoch (default: False).
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
# DataLoader with shuffle=True
train_loader = DataLoader(
    datasets.MNIST('data', train=True, download=True,
                   transform=transforms.Compose([
                       transforms.ToTensor(),
                       transforms.Normalize((0.1307,), (0.3081,))
                   ])),
    batch_size=64, shuffle=True)
Difference between shuffle = True & shuffle = False
To see the difference, we’ll use a dataset of integers from 0 to 99 as our data points for simplicity. The goal here is not to train a real model but to observe how the order of data points changes with and without shuffling.
import torch
from torch.utils.data import DataLoader, TensorDataset
# Create a synthetic dataset of integers from 0 to 99
data = torch.arange(0, 100)
# Create dummy targets (just for the sake of having them)
targets = torch.zeros(100)
# Create a TensorDataset
dataset = TensorDataset(data, targets)
# DataLoader with shuffle=True
dataloader_shuffle = DataLoader(dataset, batch_size=10, shuffle=True)
# DataLoader with shuffle=False
dataloader_noshuffle = DataLoader(dataset, batch_size=10, shuffle=False)
# Function to print the first batch of the dataloader
def print_first_batch(dataloader, shuffle_status):
    for batch in dataloader:
        data, _ = batch
        print(f"First batch with shuffle={shuffle_status}: {data}")
        break  # We break the loop to print only the first batch
# Print the first batch of each DataLoader to compare
print_first_batch(dataloader_shuffle, shuffle_status=True)
print_first_batch(dataloader_noshuffle, shuffle_status=False)
Output:
First batch with shuffle=True: tensor([53, 0, 56, 3, 92, 49, 72, 79, 64, 47])
First batch with shuffle=False: tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
- With shuffle=True: Each time you run this script, the “First batch with shuffle=True” will contain a different random assortment of integers from 0 to 99. This demonstrates that the DataLoader is shuffling the dataset before each epoch.
- With shuffle=False: Regardless of how many times you run the script, the “First batch with shuffle=False” will always display the first ten integers (0 to 9) in the same order. This shows that the DataLoader is serving the dataset in the same order it was given.
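The per-epoch reshuffling can also be observed within a single run. The following is a minimal sketch under some assumptions: epoch_order and make_loader are hypothetical helper names introduced here, and a seeded torch.Generator is passed through the DataLoader's generator argument so that the shuffle can be reproduced.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

toy_dataset = TensorDataset(torch.arange(100))

def epoch_order(loader):
    """Return the sample values in the order the loader yields them in one epoch."""
    return [int(x) for (batch,) in loader for x in batch]

def make_loader(seed):
    """Build a shuffling DataLoader whose randomness comes from a seeded generator."""
    g = torch.Generator()
    g.manual_seed(seed)
    return DataLoader(toy_dataset, batch_size=10, shuffle=True, generator=g)

loader = make_loader(seed=0)
epoch_1 = epoch_order(loader)
epoch_2 = epoch_order(loader)

# Every sample appears exactly once per epoch
print(sorted(epoch_1) == list(range(100)))
# The order is redrawn when the loader is iterated a second time
print(epoch_1 != epoch_2)
# A fresh loader with the same seed reproduces the first epoch's order
print(epoch_order(make_loader(seed=0)) == epoch_1)
```

Fixing the seed this way is useful when a shuffled run has to be repeated exactly, for example while debugging.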
Alternative approaches for shuffling with samplers
Shuffling can also be achieved using sampler classes. Samplers provide flexibility in how you shuffle your dataset and can be chosen based on your specific requirements. A few samplers are described below, using the built-in MNIST image dataset.
1. Random Sampler:
This sampler randomly samples elements from the dataset without replacement. It ensures that each example is sampled exactly once in an epoch.
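from torch.utils.data import DataLoader, RandomSampler
from torchvision import datasets, transforms
# Transform applied to each image (assumed here: a plain conversion to a tensor)
t = transforms.ToTensor()
dataset = datasets.MNIST(root='./data', train=False, download=True, transform=t)
random_sampler = RandomSampler(dataset)
data_loader = DataLoader(dataset, batch_size=32, sampler=random_sampler)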
This shows how to use the built-in RandomSampler to randomly shuffle the entire dataset before each epoch.
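To see what a RandomSampler actually produces, it can be iterated on its own, since a sampler yields dataset indices rather than data. The sketch below uses a hypothetical eight-element toy_dataset to avoid a download. It also shows, as a contrast, the replacement=True and num_samples arguments, which allow indices to repeat.

```python
import torch
from torch.utils.data import RandomSampler, TensorDataset

toy_dataset = TensorDataset(torch.arange(8))

# Default behaviour: replacement=False, one pass over every index
sampler = RandomSampler(toy_dataset)
indices = list(sampler)
print(indices)
print(sorted(indices) == list(range(8)))  # True: a permutation of 0..7

# Contrast: with replacement, indices may repeat and the length is num_samples
sampler_with_repl = RandomSampler(toy_dataset, replacement=True, num_samples=20)
repeated = list(sampler_with_repl)
print(len(repeated))  # 20
```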
2. Sequential Sampler:
This sampler yields elements in their original dataset order and does not shuffle, so every epoch sees the samples in the same sequence. It is the sampler a DataLoader uses when shuffle=False, and it serves as the ordered baseline to compare against the shuffling samplers.
from torch.utils.data import SequentialSampler
sequential_sampler = SequentialSampler(dataset)
data_loader = DataLoader(dataset, batch_size=32, sampler=sequential_sampler)
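A quick check of what SequentialSampler yields, sketched on a hypothetical six-element toy_dataset: the indices come out in dataset order, and the order is identical in every epoch.

```python
import torch
from torch.utils.data import DataLoader, SequentialSampler, TensorDataset

toy_dataset = TensorDataset(torch.arange(6))

seq_sampler = SequentialSampler(toy_dataset)
print(list(seq_sampler))  # [0, 1, 2, 3, 4, 5]

seq_loader = DataLoader(toy_dataset, batch_size=3, sampler=seq_sampler)
epochs = []
for epoch in range(2):
    batches = [batch.tolist() for (batch,) in seq_loader]
    epochs.append(batches)
    print(batches)  # [[0, 1, 2], [3, 4, 5]] in both epochs
```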
3. Custom Sampler:
A custom sampler is used to implement more complex shuffling strategies or sampling schemes tailored to specific requirements. Below is a simple example of a custom sampler that reshuffles the full list of indices with the random module at the start of each epoch, so every sample is still visited exactly once.
from torch.utils.data import Sampler
import random
class CustomSampler(Sampler):
    def __init__(self, data_source):
        self.data_source = data_source
        self.indices = list(range(len(data_source)))

    def __iter__(self):
        # Reshuffle in place every time a new epoch starts
        random.shuffle(self.indices)
        return iter(self.indices)

    def __len__(self):
        return len(self.indices)
custom_sampler = CustomSampler(dataset)
data_loader = DataLoader(dataset, batch_size=32, sampler=custom_sampler)
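A custom sampler can also draw only part of the dataset. As a sketch, the hypothetical HalfSubsetSampler below selects a random half of the indices without replacement each epoch. The class name and the 100-element toy_dataset are assumptions made for illustration.

```python
import torch
from torch.utils.data import DataLoader, Sampler, TensorDataset

class HalfSubsetSampler(Sampler):
    """Hypothetical sampler: yields a fresh random half of the indices each epoch."""

    def __init__(self, data_source):
        self.num_total = len(data_source)
        self.num_samples = self.num_total // 2

    def __iter__(self):
        # randperm draws without replacement; keep only the first half
        perm = torch.randperm(self.num_total).tolist()
        return iter(perm[: self.num_samples])

    def __len__(self):
        return self.num_samples

toy_dataset = TensorDataset(torch.arange(100))
half_sampler = HalfSubsetSampler(toy_dataset)
half_loader = DataLoader(toy_dataset, batch_size=10, sampler=half_sampler)

seen = [int(x) for (batch,) in half_loader for x in batch]
print(len(seen), len(set(seen)))  # 50 50: half the data, no duplicates
```

Because __len__ reports the reduced size, len(half_loader) reflects the five batches actually produced rather than the ten a full pass would give.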
PyTorch DataLoader
PyTorch’s DataLoader is a powerful tool for efficiently loading and processing data for training deep learning models. It provides functionalities for batching, shuffling, and processing data, making it easier to work with large datasets. In this article, we’ll explore how PyTorch’s DataLoader works and how you can use it to streamline your data pipeline.