This article walks through building and preprocessing a custom dataset for training machine learning models with PyTorch, covering data cleaning, annotation, and transformations.
In this lesson, we will guide you through building and preprocessing a custom dataset to train a machine learning model. Whether you are working with images, text, audio, or any other data modality, this step-by-step tutorial covers data cleaning, annotation creation, dataset splitting, versioning, applying transformations, and ultimately preparing PyTorch datasets along with DataLoaders.
First, we load the dataset and visualize the images to verify that they meet the training requirements.
Viewing your dataset before training helps identify any images that do not belong to your target classes.
```python
# View all images in our dataset
import glob

# Get a list of images with a .jpg extension in each class subdirectory
images_list = glob.glob("images/*/*.jpg")
```
Next, we display each image along with its file name. This step assists in detecting any out-of-scope images.
```python
# Display each image in our dataset
import glob

import matplotlib.pyplot as plt
from PIL import Image

# Get a list of images with a .jpg extension from the class subdirectories
images_list = glob.glob("images/*/*.jpg")

for image in images_list:
    plt.title(image)  # Use the file path as the plot title
    img = Image.open(image)
    plt.imshow(img)
    plt.axis("on")
    plt.show()
```
In our dataset, the expected images are of cats and dogs. However, the initial exploration might reveal images of a frog or horse, which are irrelevant for a cat-versus-dog classification task.
After examining the images, it’s crucial to remove any that do not match the target classes. In this example, we remove the horse and frog images from the dataset and then generate an annotations CSV file.
```python
# Print the image list before cleaning
print(images_list)

# Remove images that shouldn't be in our dataset
images_list.remove('images/cat/horse-1.jpg')  # Remove the horse image
images_list.remove('images/cat/frog-1.jpg')   # Remove the frog image

# Verify the cleaned list
print(images_list)
```
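Hard-coding each path to remove works for a handful of files but becomes tedious on larger datasets. As a minimal sketch, assuming file names follow the `<class>-<n>.jpg` pattern seen in this example (an assumption that will not hold for every dataset), a small helper can flag files whose name does not match their parent folder. The function name `find_mismatched` is hypothetical.

```python
import os


def find_mismatched(paths):
    """Return paths whose file name does not start with '<parent folder>-'."""
    mismatched = []
    for path in paths:
        folder = os.path.basename(os.path.dirname(path))
        name = os.path.basename(path)
        if not name.startswith(folder + "-"):
            mismatched.append(path)
    return mismatched


# Illustrative paths only; in the tutorial this would be images_list
example_paths = [
    "images/cat/cat-1.jpg",
    "images/cat/horse-1.jpg",
    "images/cat/frog-1.jpg",
    "images/dog/dog-1.jpg",
]
suspects = find_mismatched(example_paths)
print(suspects)  # ['images/cat/horse-1.jpg', 'images/cat/frog-1.jpg']
```

A check like this only surfaces candidates; visually confirming them before removal is still advisable.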
Now, create the annotations CSV file where each record maps the image file path to its class label (extracted from the directory structure).
```python
# Write the cleaned image list to a CSV file for annotations
import os

import pandas as pd

data = []
for file_path in images_list:
    # Extract the class label from the path (e.g., 'dog' or 'cat')
    label = os.path.basename(os.path.dirname(file_path))
    data.append({"file_path": file_path, "label": label})

# Save the annotations as CSV
df = pd.DataFrame(data)
df.to_csv("image_data.csv", index=False)
```
The resulting CSV (image_data.csv) should look similar to:

```
file_path,label
images/cat/cat-4.jpg,cat
images/cat/cat-5.jpg,cat
images/cat/cat-2.jpg,cat
images/cat/cat-3.jpg,cat
images/dog/dog-4.jpg,dog
images/dog/dog-1.jpg,dog
images/dog/dog-3.jpg,dog
images/dog/dog-5.jpg,dog
```
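Before moving on, it can help to sanity-check the annotations. The sketch below is one possible check, not part of the original workflow: it loads the sample rows shown above from an in-memory string (so it runs without the image files), counts the rows per label, and confirms that each label matches the folder in its path.

```python
import io

import pandas as pd

# The sample annotations from above, inlined so the example is self-contained
csv_text = """file_path,label
images/cat/cat-4.jpg,cat
images/cat/cat-5.jpg,cat
images/cat/cat-2.jpg,cat
images/cat/cat-3.jpg,cat
images/dog/dog-4.jpg,dog
images/dog/dog-1.jpg,dog
images/dog/dog-3.jpg,dog
images/dog/dog-5.jpg,dog
"""
annotations = pd.read_csv(io.StringIO(csv_text))

# Count rows per class to spot imbalance early
label_counts = annotations["label"].value_counts()
print(label_counts)

# The label should equal the folder name that contains the file
folder_names = annotations["file_path"].str.split("/").str[-2]
mismatches = annotations[folder_names != annotations["label"]]
print("Rows with mismatched labels:", len(mismatches))
```

With a real file, `pd.read_csv("image_data.csv")` would replace the in-memory string.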
Next, we create an initial PyTorch dataset class that reads our annotations CSV file and returns the image path along with its label. This forms the basis for our training pipeline.
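The class itself is not reproduced here, so the following is a minimal sketch of what such a dataset could look like, not the original implementation. The class name `ImagePathDataset` is an assumption; the `img_labels` attribute name is chosen to match how the annotations are accessed later in this lesson.

```python
import pandas as pd
from torch.utils.data import Dataset


class ImagePathDataset(Dataset):
    """Minimal dataset returning (file_path, label) pairs from an annotations CSV."""

    def __init__(self, annotations_file):
        # Expected columns: file_path, label
        self.img_labels = pd.read_csv(annotations_file)

    def __len__(self):
        return len(self.img_labels)

    def __getitem__(self, idx):
        row = self.img_labels.iloc[idx]
        return row["file_path"], row["label"]
```

Under these assumptions, `dataset = ImagePathDataset("image_data.csv")` would create the `dataset` object that the splitting step operates on.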
It is important to split the dataset into training, validation, and testing subsets. Here, we randomly partition the data into 70% for training, 15% for validation, and 15% for testing.
```python
from torch.utils.data import random_split

# Define the sizes of each split
train_size = int(0.7 * len(dataset))
val_size = int(0.15 * len(dataset))
test_size = len(dataset) - train_size - val_size

# Split the dataset
train_dataset, val_dataset, test_dataset = random_split(
    dataset, [train_size, val_size, test_size]
)
print(train_dataset.indices, val_dataset.indices, test_dataset.indices)
```
The indices printed represent the positions of images in the original dataset. You can further verify the allocation by printing rows from dataset.img_labels:
```python
# Check the annotation of an image from the training set using its index
print(dataset.img_labels.loc[train_dataset.indices[0]])
```
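Because `random_split` shuffles indices randomly, each run produces a different partition unless the random state is fixed. As a hedged sketch, passing a seeded `torch.Generator` through the `generator` argument makes the split repeatable. The helper name `split_indices` and the seed value are illustrative assumptions, and `range(n_items)` stands in for the real dataset.

```python
import torch
from torch.utils.data import random_split


def split_indices(n_items, seed=42):
    """Split range(n_items) into 70/15/15 parts using a seeded generator."""
    train_size = int(0.7 * n_items)
    val_size = int(0.15 * n_items)
    test_size = n_items - train_size - val_size
    generator = torch.Generator().manual_seed(seed)
    subsets = random_split(
        range(n_items), [train_size, val_size, test_size], generator=generator
    )
    return [subset.indices for subset in subsets]


first = split_indices(20)
second = split_indices(20)
print(first == second)  # True: the same seed gives the same partition
```

The same `generator=` argument can be added to the `random_split(dataset, ...)` call used in this lesson.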
Versioning your data annotations is a best practice for reproducibility. By saving separate CSV files for each subset (training, validation, and testing), you can easily track and reproduce your training experiments.

For example, to generate annotations for the training set:
```python
import pandas as pd

data = []

# Create annotations for the training set
for idx in train_dataset.indices:
    img_path = dataset.img_labels['file_path'].loc[idx]
    label = dataset.img_labels['label'].loc[idx]
    data.append({"file_path": img_path, "label": label})

df = pd.DataFrame(data)
df.to_csv("training_data.csv", index=False)
```
Similarly, create annotation files for the validation and testing sets:
```python
# Annotations for the validation set
data = []
for idx in val_dataset.indices:
    img_path = dataset.img_labels['file_path'].loc[idx]
    label = dataset.img_labels['label'].loc[idx]
    data.append({"file_path": img_path, "label": label})

df = pd.DataFrame(data)
df.to_csv("validation_data.csv", index=False)

# Annotations for the test set
data = []
for idx in test_dataset.indices:
    img_path = dataset.img_labels['file_path'].loc[idx]
    label = dataset.img_labels['label'].loc[idx]
    data.append({"file_path": img_path, "label": label})

df = pd.DataFrame(data)
df.to_csv("testing_data.csv", index=False)
```
This separation helps maintain clear records of which images are used during each phase of training.
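The three loops above differ only in the indices and the output file, so they could be folded into one helper. The sketch below is an optional refactoring, not the original code: `save_split_annotations` is a hypothetical name, and it assumes the annotations DataFrame has `file_path` and `label` columns with the default integer index.

```python
import pandas as pd


def save_split_annotations(annotations, indices, out_path):
    """Write the rows of `annotations` selected by `indices` to a CSV file."""
    subset = annotations.loc[list(indices), ["file_path", "label"]]
    subset.to_csv(out_path, index=False)
    return subset


# Intended usage with the objects from this lesson (requires the real dataset):
# save_split_annotations(dataset.img_labels, train_dataset.indices, "training_data.csv")
# save_split_annotations(dataset.img_labels, val_dataset.indices, "validation_data.csv")
# save_split_annotations(dataset.img_labels, test_dataset.indices, "testing_data.csv")
```

Selecting rows with `.loc` keeps the order of the split indices, so each file lists images in the same order as the corresponding subset.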
Data transformations and augmentations are key to preparing your images for model training. Typically, training data benefits from a variety of augmentations, while validation data should remain consistent.
In this example, we use TorchVision’s v2 transforms to resize images, perform random cropping and horizontal flipping, convert to tensors, and apply normalization.
```python
import torch
from torchvision.transforms import v2

train_transform = v2.Compose([
    v2.Resize((128, 128)),                  # Resize the image to 128x128
    v2.RandomCrop(size=(75, 75)),           # Take a random 75x75 crop
    v2.RandomHorizontalFlip(p=0.7),         # Flip horizontally with 70% probability
    v2.ToImage(),                           # Convert the PIL image to a tensor image
    v2.ToDtype(torch.float32, scale=True),  # Cast to float32 and scale values to [0, 1]
    v2.Normalize(mean=[0.485, 0.456, 0.406],
                 std=[0.229, 0.224, 0.225]),  # Normalize each channel
])
```
7. Constructing a Custom Image Dataset and DataLoaders
We now build a custom dataset class that incorporates our annotations, image directory, and transformation pipelines. Additionally, we use a label encoding strategy to convert categorical labels into numerical format.
```python
import os

import pandas as pd
from PIL import Image
from torch.utils.data import Dataset


class CustomImageDataset(Dataset):
    def __init__(self, annotations_file, img_dir, transform, target_transform):
        self.img_labels = pd.read_csv(annotations_file)
        self.img_dir = img_dir
        self.transform = transform
        # Convert label strings to numerical values using the mapping provided
        self.target_transform = lambda y: target_transform[y]

    def __len__(self):
        return len(self.img_labels)

    def __getitem__(self, idx):
        # Construct the full image path
        img_path = os.path.join(self.img_dir, self.img_labels.iloc[idx, 0])
        # Load as RGB so the three-channel Normalize step always applies
        image = Image.open(img_path).convert("RGB")
        label = self.img_labels.iloc[idx, 1]
        # Apply the transformation and encode the label
        image = self.transform(image)
        label = self.target_transform(label)
        return image, label


# Define label encoding mapping
label_encoding = {"cat": 0, "dog": 1}

# Create the training dataset using the custom dataset class
train_dataset = CustomImageDataset(
    annotations_file='training_data.csv',
    img_dir='./',
    transform=train_transform,
    target_transform=label_encoding
)

print("Encoded label for 'dog':", train_dataset.target_transform('dog'))
```
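The mapping above is written by hand, which is fine for two classes. For datasets with more classes, one possible alternative (a sketch, not the approach used in this lesson) is to derive the mapping from the label column itself. Sorting the distinct labels keeps the assignment stable between runs. The helper name `build_label_encoding` is hypothetical.

```python
def build_label_encoding(labels):
    """Map each distinct label to an integer, assigned in sorted order."""
    return {label: index for index, label in enumerate(sorted(set(labels)))}


# Illustrative labels; in practice these would come from the annotations CSV
labels = ["cat", "cat", "dog", "cat", "dog"]
label_encoding = build_label_encoding(labels)
print(label_encoding)  # {'cat': 0, 'dog': 1}
```

Building the mapping once from the full annotations, rather than per split, avoids different splits assigning different integers to the same class.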
DataLoaders help batch and shuffle data during training and evaluation.
```python
from torch.utils.data import DataLoader

# DataLoader for the training dataset
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)

# Inspect the shape of one training batch
features, labels = next(iter(train_loader))
print(f"Training features shape: {features.size()}")

# DataLoader for the validation dataset (no shuffling).
# val_dataset is assumed here to be a CustomImageDataset built from
# validation_data.csv with a validation transform, not the earlier
# random_split subset, which returns file paths rather than tensors.
val_loader = DataLoader(val_dataset, batch_size=32, shuffle=False)

# Inspect the shape of one validation batch
features, labels = next(iter(val_loader))
print(f"Validation features shape: {features.size()}")
```
Because the training transformations include a random crop, training batches have spatial dimensions of 75×75, whereas validation images keep their fixed 128×128 size.
In this lesson, you learned how to:
- Build an initial PyTorch dataset and split it into training, validation, and testing subsets.
- Implement data versioning for reproducibility.
- Define and apply transformation pipelines for data augmentation.
- Develop custom PyTorch datasets and DataLoaders.
With these foundational steps, you are well-equipped to proceed with model training using your custom data. For further guidance, consider reviewing resources like PyTorch Documentation and TorchVision Transforms. Happy coding!