Batching

Batching is the process of grouping data samples into smaller chunks (batches) for efficient training. Automatic batching is the default behavior of DataLoader: when batch_size is specified, the DataLoader collates individual fetched data samples into batches, typically with the first dimension representing the batch dimension.

How does the DataLoader create mini-batches?

When you create a DataLoader, simply specify batch_size to enable mini-batching. During training, the DataLoader slices your dataset into mini-batches of the given size. Each batch contains multiple data points (e.g., images or text samples), and the DataLoader returns the batched data (input features and labels) to the training loop.

batch_size (int, optional): how many samples per batch to load (default: 1).

Python3
import torch
from torch.utils.data import DataLoader, TensorDataset

# sample dummy image tensors
image_data = torch.randn(1000, 3, 64, 64)
labels = torch.randint(0, 10, (1000,))

dataset = TensorDataset(image_data, labels)

# split the dataset into batches of 32
batch_size = 32
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# iterate over the batches
for batch_images, batch_labels in dataloader:
    print(f"Batch shape: {batch_images.shape}, Labels: {batch_labels}")

Output:

Batch shape: torch.Size([32, 3, 64, 64]), Labels: tensor([0, 3, 7, 1, 4, 2, 9, 7, 4, 7, 1, 8, 3, 4, 6, 3, 8, 7, 2, 8, 3, 2, 4, 9,
6, 3, 8, 0, 0, 1, 3, 0])
Batch shape: torch.Size([32, 3, 64, 64]), Labels: tensor([1, 1, 5, 5, 4, 6, 1, 1, 2, 3, 8, 3, 7, 0, 6, 3, 1, 7, 7, 9, 4, 0, 8, 0,
7, 4, 8, 1, 0, 6, 2, 5])
Batch shape: torch.Size([32, 3, 64, 64]), Labels: tensor([8, 2, 6, 2, 8, 2, 0, 0, 4, 4, 9, 4, 8, 5, 2, 7, 6, 5, 0, 5, 4, 3, 9, 8,
8, 8, 1, 7, 7, 1, 7, 9])
Batch shape: torch.Size([32, 3, 64, 64]), Labels: tensor([6, 6, 9, 3, 7, 2, 3, 6, 9, 4, 7, 8, 7, 7, 9, 4, 8, 3, 9, 9, 9, 9, 4, 2,
0, 0, 5, 4, 5, 7, 5, 7])
Batch shape: torch.Size([32, 3, 64, 64]), Labels: tensor([3, 0, 2, 3, 7, 8, 0, 6, 6, 9, 1, 4, 6, 2, 7, 9, 1, 2, 9, 8, 4, 6, 2, 3,
7, 3, 5, 3, 6, 7, 3, 1])
Batch shape: torch.Size([32, 3, 64, 64]), Labels: tensor([8, 8, 0, 4, 9, 1, 5, 4, 9, 4, 7, 5, 2, 2, 1, 6, 0, 3, 1, 1, 1, 8, 8, 5,
0, 6, 3, 3, 9, 3, 4, 7])
Batch shape: torch.Size([32, 3, 64, 64]), Labels: tensor([7, 4, 9, 6, 2, 1, 0, 0, 1, 1, 3, 3, 9, 3, 5, 7, 0, 2, 3, 5, 9, 2, 3, 0,
9, 9, 0, 1, 9, 0, 3, 2])

The above output shows that each batch contains 32 images, each with 3 channels and a size of 64×64 pixels, while batch_labels is a tensor of shape (32,) holding the class label for each image in the batch. Since 1000 samples do not divide evenly by 32, the final batch contains only the remaining 8 samples.
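
The incomplete final batch can be discarded by passing drop_last=True. A short sketch, reusing the dataset built above:

Python3
from torch.utils.data import DataLoader

# drop_last=True discards the incomplete final batch
dataloader = DataLoader(dataset, batch_size=32, shuffle=True, drop_last=True)
print(len(dataloader))  # 31 -- the leftover 8 samples are skipped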

Benefits of using mini-batches

  • Mini-batches allow parallel processing on GPUs, which speeds up computation.
  • They reduce memory usage by processing one batch at a time instead of the entire dataset at once.
  • They make training on large datasets feasible by optimizing memory use and reducing overhead.
  • They provide more stable updates to model weights than single-sample training.

Choosing the right batch size

Smaller batch sizes produce noisier gradient updates, which can slow convergence but often helps generalization. Larger batch sizes yield smoother updates and faster per-epoch computation, although excessively large batches can hurt generalization and exhaust memory. It is therefore beneficial to experiment with different batch sizes and monitor training performance to determine an optimal value. In practice, a batch size of 64 for larger models and 32 for smaller datasets is a common starting point.
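
As a minimal sketch of such an experiment, the loop below compares a few candidate batch sizes (the candidate list is illustrative) and reports how many mini-batches each produces per epoch; in a real experiment you would also track training loss or validation accuracy for each setting.

Python3
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 3, 64, 64),
                        torch.randint(0, 10, (1000,)))

# illustrative candidate batch sizes
for batch_size in [16, 32, 64, 128]:
    dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    # len(dataloader) is the number of mini-batches per epoch
    print(f"batch_size={batch_size}: {len(dataloader)} batches per epoch")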

PyTorch DataLoader

PyTorch’s DataLoader is a powerful tool for efficiently loading and processing data for training deep learning models. It provides functionalities for batching, shuffling, and processing data, making it easier to work with large datasets. In this article, we’ll explore how PyTorch’s DataLoader works and how you can use it to streamline your data pipeline.

Table of Contents

  • What is Pytorch DataLoader?
  • Importance of Batching, Shuffling, and Processing in Deep Learning
  • Batching
  • Shuffling
  • Processing Data
  • PyTorch Dataset class for Customizing data transformations

What is Pytorch DataLoader?

PyTorch DataLoader is a utility class designed to simplify loading and iterating over datasets while training deep learning models. It offers various options for iterating over datasets, such as batching, shuffling, and processing data. To use the DataLoader in PyTorch, we first import it with the following code,...
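
The import in question is the same one used in the batching example above:

Python3
from torch.utils.data import DataLoader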

Importance of Batching, Shuffling, and Processing in Deep Learning

To improve the stability, efficiency, and generalization of a model, batching, shuffling, and processing are used for effective computation during data preparation. Let's look at the importance of each technique separately,...

Shuffling

Shuffling is the random reordering of data samples at every epoch, used to improve model performance and generalization. When shuffle=True is set, the DataLoader internally uses a random sampler,...
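
A minimal sketch of the difference (the tiny dataset is illustrative): with shuffle=False the DataLoader preserves dataset order, while shuffle=True yields a different random order every epoch.

Python3
import torch
from torch.utils.data import DataLoader, TensorDataset

# tiny dataset where each sample's value equals its index
dataset = TensorDataset(torch.arange(8).float().unsqueeze(1))

sequential = DataLoader(dataset, batch_size=4, shuffle=False)
shuffled = DataLoader(dataset, batch_size=4, shuffle=True)

print([b[0].squeeze().tolist() for b in sequential])  # [[0.0, 1.0, 2.0, 3.0], [4.0, 5.0, 6.0, 7.0]]
print([b[0].squeeze().tolist() for b in shuffled])    # random order, reshuffled each epoch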

Processing Data

The DataLoader uses single-process data loading by default. In this mode, data fetching is done in the same process in which the DataLoader is initialized, so data loading may block computation. Single-process loading is preferred when the resources for sharing data among processes (e.g., shared memory, file descriptors) are limited, or when the entire dataset is small and can be loaded entirely into memory. Processing also involves applying transformations to the data, such as resizing images, normalizing pixel values, or any other preprocessing steps....
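
As a sketch, multi-process loading is enabled through the num_workers argument (the value 2 here is illustrative; 0 means single-process loading):

Python3
import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(1000, 3, 64, 64),
                        torch.randint(0, 10, (1000,)))

# num_workers > 0 spawns worker processes that fetch
# batches in the background, so loading does not block compute
dataloader = DataLoader(dataset, batch_size=32, shuffle=True, num_workers=2)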

PyTorch Dataset class for Customizing data transformations

The Dataset class in PyTorch plays a pivotal role in data handling and preprocessing, serving as a foundational building block for loading and organizing data in a way that is efficient and scalable for training deep learning models. Customizing data transformations within a Dataset class allows for flexible and dynamic data preprocessing, tailored specifically to the needs of a given model....
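
A minimal sketch of such a custom Dataset, assuming a simple per-sample normalization transform (the class name and the transform are illustrative):

Python3
import torch
from torch.utils.data import Dataset, DataLoader

class CustomImageDataset(Dataset):
    def __init__(self, images, labels, transform=None):
        self.images = images
        self.labels = labels
        self.transform = transform

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        image = self.images[idx]
        if self.transform is not None:
            image = self.transform(image)  # e.g. normalization
        return image, self.labels[idx]

# illustrative transform: zero mean, unit variance per sample
normalize = lambda x: (x - x.mean()) / (x.std() + 1e-8)

dataset = CustomImageDataset(torch.randn(100, 3, 64, 64),
                             torch.randint(0, 10, (100,)),
                             transform=normalize)
dataloader = DataLoader(dataset, batch_size=32, shuffle=True)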

Conclusion

The DataLoader significantly impacts training quality. Batching, shuffling, and preprocessing are essential for building a robust model and an efficient deep learning pipeline....