Shuffling

Shuffling is the random reordering of data samples at every epoch, used to improve model performance and generalization. When you set shuffle=True, the DataLoader internally uses a RandomSampler to draw the samples.

Enabling shuffling with the shuffle argument

When shuffle=True, the DataLoader randomly rearranges the data at the start of each epoch. The DataLoader returns the batched data (input features and labels) to the training loop.

In the code below, the built-in MNIST dataset is downloaded and wrapped in a DataLoader with shuffle=True. This ensures that the model encounters a diverse mix of samples in each batch.

shuffle (bool, optional) – set to True to have the data reshuffled at every epoch (default: False).

Python3
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# DataLoader with shuffle = True
train_loader = DataLoader(datasets.MNIST('data', train=True, download=True,
                           transform=transforms.Compose([
                               transforms.ToTensor(),
                               transforms.Normalize((0.1307,), (0.3081,))
                           ])),
                           batch_size=64, shuffle=True)
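
The loader can then be consumed directly in a training loop. Below is a minimal sketch of such a loop (the actual training step is omitted, since the example above does not define a model):

Python3
# Each pass over train_loader is one epoch; with shuffle=True the batch
# composition and order differ from epoch to epoch.
for epoch in range(2):
    for images, labels in train_loader:
        # images: [batch_size, 1, 28, 28], labels: [batch_size]
        # the forward pass, loss computation, and optimizer step would go here
        pass
    print(f"Finished epoch {epoch}")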


Difference between shuffle = True & shuffle = False

To see the difference, we’ll use a dataset of integers from 0 to 99 as our data points for simplicity. The goal here is not to train a real model but to observe how the order of data points changes with and without shuffling.

Python3
import torch
from torch.utils.data import DataLoader, TensorDataset

# Create a synthetic dataset of integers from 0 to 99
data = torch.arange(0, 100)
# Create dummy targets (just for the sake of having them)
targets = torch.zeros(100)

# Create a TensorDataset
dataset = TensorDataset(data, targets)

# DataLoader with shuffle=True
dataloader_shuffle = DataLoader(dataset, batch_size=10, shuffle=True)

# DataLoader with shuffle=False
dataloader_noshuffle = DataLoader(dataset, batch_size=10, shuffle=False)

# Function to print the first batch of the dataloader
def print_first_batch(dataloader, shuffle_status):
    for batch in dataloader:
        data, _ = batch
        print(f"First batch with shuffle={shuffle_status}: {data}")
        break  # We break the loop to print only the first batch

# Print the first batch of each DataLoader to compare
print_first_batch(dataloader_shuffle, shuffle_status=True)
print_first_batch(dataloader_noshuffle, shuffle_status=False)

Output:

First batch with shuffle=True: tensor([53,  0, 56,  3, 92, 49, 72, 79, 64, 47])
First batch with shuffle=False: tensor([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])


  • With shuffle=True: Each time you run this script, the “First batch with shuffle=True” will contain a different random selection of integers from 0 to 99. This demonstrates that the DataLoader reshuffles the dataset before each epoch (a seeded generator can make this order reproducible, as sketched after this list).
  • With shuffle=False: Regardless of how many times you run the script, the “First batch with shuffle=False” will always display the first ten integers (0 to 9) in the same order. This shows that the DataLoader is serving the dataset in the same order it was given.
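
If you need the shuffled order to be repeatable across runs (for debugging or reproducible experiments), you can pass a seeded torch.Generator to the DataLoader. A minimal sketch, reusing the TensorDataset defined above:

Python3
import torch
from torch.utils.data import DataLoader

# A seeded generator makes the shuffle order identical on every run
g = torch.Generator()
g.manual_seed(42)

dataloader_seeded = DataLoader(dataset, batch_size=10, shuffle=True, generator=g)

first_batch, _ = next(iter(dataloader_seeded))
print(f"First batch with a seeded generator: {first_batch}")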

Alternative approaches for shuffling with samplers

Shuffling can also be achieved using sampler classes. Samplers give you finer control over how the dataset is shuffled and can be chosen based on your specific requirements. Note that the sampler and shuffle arguments are mutually exclusive: when you pass a sampler, leave shuffle at its default of False. Below, a few samplers are demonstrated on the built-in MNIST image dataset.

1. Random Sampler:

This sampler randomly samples elements from the dataset without replacement. It ensures that each example is sampled exactly once in an epoch.

Python
from torch.utils.data import DataLoader, RandomSampler
from torchvision import datasets, transforms

# Load the MNIST test set with a basic tensor transform
dataset = datasets.MNIST(root='./data', train=False, download=True,
                         transform=transforms.ToTensor())
# RandomSampler draws each sample exactly once per epoch, in random order
random_sampler = RandomSampler(dataset)
data_loader = DataLoader(dataset, batch_size=32, sampler=random_sampler)

This shows how to use the built-in RandomSampler to randomly shuffle the entire dataset before each epoch.
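
RandomSampler also supports drawing with replacement through its replacement and num_samples arguments. The values below are illustrative, not from the original example:

Python
# Sample WITH replacement: draw 1000 indices per epoch, so some images may
# appear more than once and others not at all
bootstrap_sampler = RandomSampler(dataset, replacement=True, num_samples=1000)
bootstrap_loader = DataLoader(dataset, batch_size=32, sampler=bootstrap_sampler)

print(len(bootstrap_loader))  # 32 batches (ceil(1000 / 32))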

2. Sequential Sampler:

This sampler draws elements from the dataset sequentially, always in the same order, and performs no shuffling; it is what the DataLoader uses internally when shuffle=False. It is useful for evaluation or debugging, where a fixed, deterministic order is desirable.

Python
from torch.utils.data import SequentialSampler

sequential_sampler = SequentialSampler(dataset)
# Batches are served in the dataset's original order (no shuffling)
data_loader = DataLoader(dataset, batch_size=32, sampler=sequential_sampler)


3. Custom Sampler:

A custom sampler is used for implementing more complex shuffling strategies or sampling schemes based on specific requirements. Below is a simple example of a custom sampler that randomly selects half of the dataset, without replacement, on every epoch.

Python
from torch.utils.data import Sampler
import random

class CustomSampler(Sampler):
    def __init__(self, data_source):
        self.data_source = data_source
        self.indices = list(range(len(data_source)))
        # Use only half of the dataset in each epoch
        self.num_samples = len(self.indices) // 2

    def __iter__(self):
        # Draw a fresh random half of the indices (without replacement) every epoch
        return iter(random.sample(self.indices, self.num_samples))

    def __len__(self):
        return self.num_samples

custom_sampler = CustomSampler(dataset)
data_loader = DataLoader(dataset, batch_size=32, sampler=custom_sampler)
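
As a quick sanity check (a hypothetical snippet, assuming dataset is the 10,000-image MNIST test set loaded earlier), the loader built from CustomSampler should cover only half of the dataset per epoch:

Python
# With 10,000 test images and batch_size=32:
#   full dataset   -> ceil(10000 / 32) = 313 batches
#   custom sampler -> ceil(5000 / 32)  = 157 batches
print(len(data_loader))  # expected: 157

seen = sum(images.shape[0] for images, _ in data_loader)
print(seen)              # expected: 5000 (half of the dataset)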
