Shuffling

Shuffling is the random reordering of data samples at every epoch, used to improve model performance and generalization. When you set shuffle=True, the DataLoader internally uses a RandomSampler to draw the samples.

Enabling shuffling with the shuffle argument

When shuffle=True, the DataLoader randomly rearranges the data at the start of each epoch. The DataLoader returns the batched data (input features and labels) to the training loop.

In the code below, the built-in MNIST dataset is downloaded and wrapped in a DataLoader with shuffle=True. This ensures that the model encounters a diverse mix of samples in each batch.

shuffle (bool, optional) – set to True to have the data reshuffled at every epoch (default: False).

Python3
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# DataLoader with shuffle = True
train_loader = DataLoader(datasets.MNIST('data', train=True, download=True,
                           transform=transforms.Compose([
                               transforms.ToTensor(),
                               transforms.Normalize((0.1307,), (0.3081,))
                           ])),
                           batch_size=64, shuffle=True)
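
The loader can then be consumed directly in a training loop. Below is a minimal sketch of such a loop (the actual training step is omitted, since the example above does not define a model):

Python3
# Each pass over train_loader is one epoch; with shuffle=True the batch
# composition and order differ from epoch to epoch.
for epoch in range(2):
    for images, labels in train_loader:
        # images: [batch_size, 1, 28, 28], labels: [batch_size]
        # the forward pass, loss computation, and optimizer step would go here
        pass
    print(f"Finished epoch {epoch}")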


Difference between shuffle = True & shuffle = False

To see the difference, we’ll use a dataset of integers from 0 to 99 as our data points for simplicity. The goal here is not to train a real model but to observe how the order of data points changes with and without shuffling.

Python3
import torch
from torch.utils.data import DataLoader, TensorDataset

# Create a synthetic dataset of integers from 0 to 99
data = torch.arange(0, 100)
# Create dummy targets (just for the sake of having them)
targets = torch.zeros(100)

# Create a TensorDataset
dataset = TensorDataset(data, targets)

# DataLoader with shuffle=True
dataloader_shuffle = DataLoader(dataset, batch_size=10, shuffle=True)

# DataLoader with shuffle=False
dataloader_noshuffle = DataLoader(dataset, batch_size=10, shuffle=False)

# Function to print the first batch of the dataloader
def print_first_batch(dataloader, shuffle_status):
    for batch in dataloader:
        data, _ = batch
        print(f"First batch with shuffle={shuffle_status}: {data}")
        break  # We break the loop to print only the first batch

# Print the first batch of each DataLoader to compare
print_first_batch(dataloader_shuffle, shuffle_status=True)
print_first_batch(dataloader_noshuffle, shuffle_status=False)

Output:

First batch with shuffle=True: tensor([53,  0, 56,  3, 92, 49, 72, 79, 64, 47])
First batch with shuffle=False: tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])


  • With shuffle=True: Each time you run this script, the “First batch with shuffle=True” will contain a different random selection of integers from 0 to 99. This demonstrates that the DataLoader reshuffles the dataset before each epoch (a seeded generator can make this order reproducible, as sketched after this list).
  • With shuffle=False: Regardless of how many times you run the script, the “First batch with shuffle=False” will always display the first ten integers (0 to 9) in the same order. This shows that the DataLoader is serving the dataset in the same order it was given.
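
If you need the shuffled order to be repeatable across runs (for debugging or reproducible experiments), you can pass a seeded torch.Generator to the DataLoader. A minimal sketch, reusing the TensorDataset defined above:

Python3
import torch
from torch.utils.data import DataLoader

# A seeded generator makes the shuffle order identical on every run
g = torch.Generator()
g.manual_seed(42)

dataloader_seeded = DataLoader(dataset, batch_size=10, shuffle=True, generator=g)

first_batch, _ = next(iter(dataloader_seeded))
print(f"First batch with a seeded generator: {first_batch}")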

Alternative approaches for shuffling with samplers

Shuffling can also be achieved using sampler classes. Samplers give you finer control over how the dataset is shuffled and can be chosen based on your specific requirements. Note that the sampler and shuffle arguments are mutually exclusive: when you pass a sampler, leave shuffle at its default of False. Below, a few samplers are demonstrated on the built-in MNIST image dataset.

1. Random Sampler:

This sampler randomly samples elements from the dataset without replacement. It ensures that each example is sampled exactly once in an epoch.

Python
from torch.utils.data import DataLoader, RandomSampler
from torchvision import datasets, transforms

# Load the MNIST test set with a basic tensor transform
dataset = datasets.MNIST(root='./data', train=False, download=True,
                         transform=transforms.ToTensor())
# RandomSampler draws each sample exactly once per epoch, in random order
random_sampler = RandomSampler(dataset)
data_loader = DataLoader(dataset, batch_size=32, sampler=random_sampler)

This shows how to use the built-in RandomSampler to randomly shuffle the entire dataset before each epoch.
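
RandomSampler also supports drawing with replacement through its replacement and num_samples arguments. The values below are illustrative, not from the original example:

Python
# Sample WITH replacement: draw 1000 indices per epoch, so some images may
# appear more than once and others not at all
bootstrap_sampler = RandomSampler(dataset, replacement=True, num_samples=1000)
bootstrap_loader = DataLoader(dataset, batch_size=32, sampler=bootstrap_sampler)

print(len(bootstrap_loader))  # 32 batches (ceil(1000 / 32))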

2. Sequential Sampler:

This sampler draws elements from the dataset sequentially, always in the same order, and performs no shuffling; it is what the DataLoader uses internally when shuffle=False. It is useful for evaluation or debugging, where a fixed, deterministic order is desirable.

Python
from torch.utils.data import SequentialSampler

sequential_sampler = SequentialSampler(dataset)
# Batches are served in the dataset's original order (no shuffling)
data_loader = DataLoader(dataset, batch_size=32, sampler=sequential_sampler)


3. Custom Sampler:

A custom sampler is used for implementing more complex shuffling strategies or sampling schemes based on specific requirements. Below is a simple example of a custom sampler that randomly selects half of the dataset, without replacement, on every epoch.

Python
from torch.utils.data import Sampler
import random

class CustomSampler(Sampler):
    def __init__(self, data_source):
        self.data_source = data_source
        self.indices = list(range(len(data_source)))
        # Use only half of the dataset in each epoch
        self.num_samples = len(self.indices) // 2

    def __iter__(self):
        # Draw a fresh random half of the indices (without replacement) every epoch
        return iter(random.sample(self.indices, self.num_samples))

    def __len__(self):
        return self.num_samples

custom_sampler = CustomSampler(dataset)
data_loader = DataLoader(dataset, batch_size=32, sampler=custom_sampler)
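
As a quick sanity check (a hypothetical snippet, assuming dataset is the 10,000-image MNIST test set loaded earlier), the loader built from CustomSampler should cover only half of the dataset per epoch:

Python
# With 10,000 test images and batch_size=32:
#   full dataset   -> ceil(10000 / 32) = 313 batches
#   custom sampler -> ceil(5000 / 32)  = 157 batches
print(len(data_loader))  # expected: 157

seen = sum(images.shape[0] for images, _ in data_loader)
print(seen)              # expected: 5000 (half of the dataset)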
