4. Custom Datasets

Dataset Requirements for BaseNetwork Architectures

In order to train a network using some data, a PyTorch dataset has to be created in a way that is compatible with BaseNetwork.
The main requirement is that the __getitem__ method, which retrieves a single sample from the dataset, returns a tuple of four elements: the data ID, a low-dimensional tensor (e.g. a label), a high-dimensional tensor (e.g. an image), and any extra information for the sample.
If labels are unknown, such as in unsupervised learning, a placeholder tensor can be used instead and remain unused in the architecture.
BaseDataset from netloader.data can be used to easily create a BaseNetwork compatible dataset.
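
For reference, the same contract can be satisfied without BaseDataset by subclassing torch.utils.data.Dataset directly. The following is a minimal sketch of the four-element __getitem__ return described above; the class name, tensors, and placeholder extra value are illustrative and not part of the library:

import torch
from torch import Tensor
from torch.utils.data import Dataset


class MinimalDataset(Dataset):
    def __init__(self, images: Tensor, labels: Tensor):
        # Images with shape (N,...) and labels with shape (N,...)
        self.images = images
        self.labels = labels

    def __len__(self) -> int:
        return len(self.images)

    def __getitem__(self, idx: int) -> tuple[int, Tensor, Tensor, Tensor]:
        # Data ID, low-dimensional label, high-dimensional image, and extra data
        # (an empty tensor as a placeholder since there is no extra data here)
        return idx, self.labels[idx], self.images[idx], torch.empty(0)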

BaseDataset

BaseDataset is a parent class that allows the easy creation of datasets for training with BaseNetwork; a minimal subclass is sketched after the attribute list below.

Attributes:

  • extra: list[Any] | ndarray | None = None, additional data for each sample in the dataset of length N with shape (N,...) and type Any
  • idxs: ndarray = np.arange(len(self.high_dim)), index for each sample in the dataset with shape (N) and type int
  • low_dim: ndarray | Tensor | None = None, low dimensional data for each sample in the dataset with shape (N)
  • high_dim: ndarray | Tensor | object = UNSET, high dimensional data for each sample in the dataset with shape (N), this is required
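
As a sketch of how these attributes might be set, a hypothetical subclass could assign them in its initialiser; the random arrays and shapes below are placeholders only:

import numpy as np

from netloader.data import BaseDataset


class RandomDataset(BaseDataset):
    def __init__(self):
        # high_dim is required; low_dim and extra are optional
        self.high_dim = np.random.rand(100, 1, 32, 32)
        self.low_dim = np.random.randint(0, 10, size=100)
        self.extra = None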

Methods of BaseDataset

Public Methods:

  • get_low_dim: Gets a low dimensional sample of the given index
    • idx: int, sample index
    • return: ndarray | Tensor, Low dimensional sample
  • get_high_dim: Gets a high dimensional sample of the given index
    • idx: int, sample index
    • return: ndarray | Tensor, High dimensional sample
  • get_extra: Gets extra data for the sample of the given index
    • idx: int, sample index
    • return: ndarray | Tensor, Sample extra data

Magic Methods:

  • __len__: Length of the dataset
    • return: int, number of samples in the dataset
  • __getitem__: Gets a sample from the dataset at the given index
    • idx: int, sample index
    • return: tuple[int, ndarray | Tensor, ndarray | Tensor, Any], sample index, low dimensional data, high dimensional data, and extra data
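
As an illustration, assuming dataset is an instance of a BaseDataset subclass (such as the one created in the example below), these methods can be called as follows:

# Number of samples via __len__
n_samples = len(dataset)

# Full sample via __getitem__: data ID, low dimensional, high dimensional, and extra data
idx, low_dim, high_dim, extra = dataset[0]

# Individual parts of a sample via the public methods
low_dim = dataset.get_low_dim(0)
high_dim = dataset.get_high_dim(0)
extra = dataset.get_extra(0)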

loader_init

loader_init is a function that initialises data loaders from subsets of the dataset, split according to the given ratios (an example follows the argument list below). Arguments and return:

  • dataset: Dataset, dataset to create data loaders from
  • batch_size: int = 64, batch size when sampling from the data loaders
  • ratios: tuple[float, ...] | None = (0.8, 0.2), ratios to split up the dataset into sub-datasets
  • idxs: ndarray | None = None, dataset indexes for creating the subsets; if the length of idxs does not equal the length of the dataset, any remaining indexes will be assigned to the last subset
  • **kwargs: optional keyword arguments to pass to DataLoader
  • return: tuple[DataLoader, ...], data loaders for each subset given by the number of ratios
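
For example, a hypothetical three-way train/validation/test split can be created in a single call, with additional keyword arguments such as num_workers forwarded to DataLoader; the ratios and values below are illustrative only:

from netloader.data import loader_init


# dataset is any BaseNetwork compatible dataset (see the example below)
train_loader, val_loader, test_loader = loader_init(
    dataset,
    batch_size=128,
    ratios=(0.7, 0.2, 0.1),
    num_workers=4,
)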

Example

As an example, a dataset with images as the high-dimensional data and class labels as the low-dimensional data will be created.

Inherit BaseDataset from netloader.data and create the initialisation method, which will load the dataset:

import pickle

from netloader.data import BaseDataset


class CustomDataset(BaseDataset):
    def __init__(self, data_path: str):
        # Load dataset; the file is assumed to contain a pickled (low_dim, high_dim) tuple
        with open(data_path, 'rb') as file:
            self.low_dim, self.high_dim = pickle.load(file)

The __len__ and __getitem__ magic methods are already defined in BaseDataset, as is the idxs attribute.
However, if you need to perform more complex sample fetching, such as loading data from a file when the whole dataset cannot be loaded into RAM, then you can override get_low_dim, get_high_dim, and/or get_extra:

import os
import pickle
import numpy as np
from torch import Tensor
from numpy import ndarray
from netloader.data import BaseDataset


class CustomDataset(BaseDataset):
    def __init__(self, data_path: str):
        # File paths to load, joined with the directory so they can be opened later
        self.high_dim = np.array(
            [os.path.join(data_path, name) for name in os.listdir(data_path)]
        )

    def get_high_dim(self, idx: int) -> ndarray | Tensor:
        # Each file is assumed to contain a pickled (low_dim, high_dim) pair
        with open(self.high_dim[idx], 'rb') as file:
            return pickle.load(file)[1]

    def get_low_dim(self, idx: int) -> ndarray | Tensor:
        with open(self.high_dim[idx], 'rb') as file:
            return pickle.load(file)[0]

Finally, we can use loader_init from netloader.data to randomly split the dataset and create the training and validation data loaders:

from netloader.data import loader_init


# Create dataset using the Dataset class we created before
dataset = CustomDataset('/path/to/data')

# Create train and validation data loaders
loaders = loader_init(dataset, ratios=(0.8, 0.2))
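
Each batch drawn from these loaders mirrors the four-element tuple returned by __getitem__. As a sketch, assuming the default PyTorch collation handles the extra field for this dataset, the training loader (the first returned loader) can be iterated as:

# Iterate over the training loader; each batch follows the __getitem__ structure
for idxs, low_dim, high_dim, extra in loaders[0]:
    # low_dim is a batch of class labels and high_dim is a batch of images
    print(idxs.shape, low_dim.shape, high_dim.shape)
    break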
