Torchtext Dataset

This section loads the demo IMDB text dataset from torchtext using PyTorch. To batch your own custom text data, wrap it in a dataset object and pass that object to torch.utils.data.DataLoader().

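Syntax: torch.utils.data.DataLoader(dataset, batch_size=batch_size, shuffle=True)

Here dataset is a dataset object (for example, one built from the files under 'path to/imdb_data'), not the path string itself.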
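
As a minimal sketch of the syntax above, the snippet below wraps a few in-memory (label, text) pairs in a dataset object and batches them with DataLoader. The class name ReviewDataset and the sample reviews are hypothetical, made up for illustration only; they are not taken from the IMDB dataset.

```python
from torch.utils.data import Dataset, DataLoader


class ReviewDataset(Dataset):
    """Hypothetical map-style dataset over (label, text) pairs."""

    def __init__(self, samples):
        self.samples = samples

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        return self.samples[idx]


# made-up sample data (an assumption, not real IMDB reviews)
samples = [
    ("pos", "a wonderful film"),
    ("neg", "dull and far too long"),
    ("pos", "great acting"),
    ("neg", "not worth watching"),
]

# batch the samples two at a time, in shuffled order
loader = DataLoader(ReviewDataset(samples), batch_size=2, shuffle=True)

# each batch holds two labels and the two matching texts
for labels, texts in loader:
    print(labels, texts)
```

Because the samples are plain strings, the default batching simply groups the labels together and the texts together. Real pipelines usually convert text to tensors in a custom collate function.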

Code Explanation:

  • The procedure is almost the same as loading the image and audio data.
  • Here, torchtext is imported instead of torchvision.
  • Access the dataset through torchtext.datasets, followed by the dataset name (IMDB).
  • Pass the split argument (for example, split='train') to choose between the train and test portions of the dataset.
  • Define a function that splits a line into separate tokens, then iterate over the corpus and apply it to each line, as shown. In this way, we can easily load text data using PyTorch.

Python3

# import the torch and torchtext packages.
import torch
import torchtext

# access the dataset in the torchtext package
# using .datasets followed by the dataset name;
# split='train' selects the training portion.
text_data = torchtext.datasets.IMDB(split='train')

# define a function to tokenize a line
# by splitting it on whitespace
def tokenize(line):
    return line.split()


# define an empty list to store
# the tokenized words
tokens = []

# iterate over text_data, which yields
# (label, line) pairs, tokenize each line
# and store the result in the list tokens
for label, line in text_data:
    tokens.extend(tokenize(line))

print('The total no. of tokens in the IMDB train split is',
      len(tokens))
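
To check the tokenization loop without downloading IMDB, the same logic can be run on a couple of in-memory lines. This is only a sketch: sample_corpus is a made-up stand-in for the (label, line) pairs, and str.split() is a plain whitespace tokenizer, so punctuation stays attached to the words.

```python
# whitespace tokenizer, as in the listing above
def tokenize(line):
    return line.split()


# hypothetical stand-in for the (label, line) pairs of a text dataset
sample_corpus = [
    ("pos", "A wonderful film, truly."),
    ("neg", "Dull and far too long."),
]

tokens = []
for label, line in sample_corpus:
    tokens.extend(tokenize(line))

print(len(tokens))   # 9
print(tokens[:4])    # ['A', 'wonderful', 'film,', 'truly.']
```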



Loading Data in PyTorch

In this article, we will discuss how to load different kinds of data in PyTorch.

For demonstration purposes, the PyTorch domain libraries torchaudio, torchvision, and torchtext each provide demo datasets. We can use these demo datasets to understand how to load audio, image, and text data using PyTorch.
