4. Custom Datasets

Dataset Requirements for BaseNetwork Architectures

In order to train a network using some data, a PyTorch dataset has to be created in a way that is compatible with BaseNetwork.
The main requirement is that the __getitem__ method, which retrieves a single sample from the dataset, returns a tuple of four elements: the data ID, a low-dimensional tensor (e.g. a label), a high-dimensional tensor (e.g. an image), and any extra information for the sample.
If labels are unknown, such as in unsupervised learning, a placeholder tensor can be used instead and remain unused in the architecture.
BaseDataset from netloader.data can be used to easily create a BaseNetwork compatible dataset.
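
For reference, the same contract can be satisfied without BaseDataset by subclassing torch.utils.data.Dataset directly. The following is a minimal sketch of the four-element __getitem__ return described above; the class name, tensors, and placeholder extra value are illustrative and not part of the library:

import torch
from torch import Tensor
from torch.utils.data import Dataset


class MinimalDataset(Dataset):
    def __init__(self, images: Tensor, labels: Tensor):
        # Images with shape (N,...) and labels with shape (N,...)
        self.images = images
        self.labels = labels

    def __len__(self) -> int:
        return len(self.images)

    def __getitem__(self, idx: int) -> tuple[int, Tensor, Tensor, Tensor]:
        # Data ID, low-dimensional label, high-dimensional image, and extra data
        # (an empty tensor as a placeholder since there is no extra data here)
        return idx, self.labels[idx], self.images[idx], torch.empty(0)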

BaseDataset

BaseDataset is a parent class that allows the easy creation of datasets for training with BaseNetwork; a minimal subclass is sketched after the attribute list below.

Attributes:

  • extra: list[Any] | ndarray | None = None, additional data for each sample in the dataset of length N with shape (N,...) and type Any
  • idxs: ndarray = np.arange(len(self.high_dim)), index for each sample in the dataset with shape (N) and type int
  • low_dim: ndarray | Tensor | None = None, low dimensional data for each sample in the dataset with shape (N)
  • high_dim: ndarray | Tensor | object = UNSET, high dimensional data for each sample in the dataset with shape (N), this is required
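
As a sketch of how these attributes might be set, a hypothetical subclass could assign them in its initialiser; the random arrays and shapes below are placeholders only:

import numpy as np

from netloader.data import BaseDataset


class RandomDataset(BaseDataset):
    def __init__(self):
        # high_dim is required; low_dim and extra are optional
        self.high_dim = np.random.rand(100, 1, 32, 32)
        self.low_dim = np.random.randint(0, 10, size=100)
        self.extra = None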

Methods of BaseDataset

Public Methods:

  • get_low_dim: Gets a low dimensional sample of the given index
    • idx: int, sample index
    • return: ndarray | Tensor, Low dimensional sample
  • get_high_dim: Gets a high dimensional sample of the given index
    • idx: int, sample index
    • return: ndarray | Tensor, High dimensional sample
  • get_extra: Gets extra data for the sample of the given index
    • idx: int, sample index
    • return: ndarray | Tensor, Sample extra data

Magic Methods:

  • __len__: Length of the dataset
    • return: int, number of samples in the dataset
  • __getitem__: Gets a sample from the dataset at the given index
    • idx: int, sample index
    • return: tuple[int, ndarray | Tensor, ndarray | Tensor, Any], sample index, low dimensional data, high dimensional data, and extra data
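
As an illustration, assuming dataset is an instance of a BaseDataset subclass (such as the one created in the example below), these methods can be called as follows:

# Number of samples via __len__
n_samples = len(dataset)

# Full sample via __getitem__: data ID, low dimensional, high dimensional, and extra data
idx, low_dim, high_dim, extra = dataset[0]

# Individual parts of a sample via the public methods
low_dim = dataset.get_low_dim(0)
high_dim = dataset.get_high_dim(0)
extra = dataset.get_extra(0)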

loader_init

loader_init is a function that initialises data loaders from subsets of the dataset, split according to the given ratios (an example follows the argument list below). Arguments and return:

  • dataset: Dataset, dataset to create data loaders from
  • batch_size: int = 64, batch size when sampling from the data loaders
  • ratios: tuple[float, ...] | None = (0.8, 0.2), ratios to split up the dataset into sub-datasets
  • idxs: ndarray | None = None, dataset indexes for creating the subsets; if the length of idxs does not equal the length of the dataset, any remaining indexes will be assigned to the last subset
  • **kwargs: optional keyword arguments to pass to DataLoader
  • return: tuple[DataLoader, ...], data loaders for each subset given by the number of ratios
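
For example, a hypothetical three-way train/validation/test split can be created in a single call, with additional keyword arguments such as num_workers forwarded to DataLoader; the ratios and values below are illustrative only:

from netloader.data import loader_init


# dataset is any BaseNetwork compatible dataset (see the example below)
train_loader, val_loader, test_loader = loader_init(
    dataset,
    batch_size=128,
    ratios=(0.7, 0.2, 0.1),
    num_workers=4,
)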

Example

As an example, a dataset with images as the high-dimensional data and class labels as the low-dimensional data will be created.

Inherit BaseDataset from netloader.data and create the initialisation method, which will load the dataset:

import pickle

from netloader.data import BaseDataset


class CustomDataset(BaseDataset):
    def __init__(self, data_path: str):
        # Load dataset; the file is assumed to contain a pickled (low_dim, high_dim) tuple
        with open(data_path, 'rb') as file:
            self.low_dim, self.high_dim = pickle.load(file)

The __len__ and __getitem__ magic methods are already defined in BaseDataset, as is the idxs attribute.
However, if you need to perform more complex sample fetching, such as loading data from a file when the whole dataset cannot be loaded into RAM, then you can override get_low_dim, get_high_dim, and/or get_extra:

import os
import pickle
import numpy as np
from torch import Tensor
from numpy import ndarray
from netloader.data import BaseDataset


class CustomDataset(BaseDataset):
    def __init__(self, data_path: str):
        # File paths to load, joined with the directory so they can be opened later
        self.high_dim = np.array(
            [os.path.join(data_path, name) for name in os.listdir(data_path)]
        )

    def get_high_dim(self, idx: int) -> ndarray | Tensor:
        # Each file is assumed to contain a pickled (low_dim, high_dim) pair
        with open(self.high_dim[idx], 'rb') as file:
            return pickle.load(file)[1]

    def get_low_dim(self, idx: int) -> ndarray | Tensor:
        with open(self.high_dim[idx], 'rb') as file:
            return pickle.load(file)[0]

Finally, we can use loader_init from netloader.data to randomly split the dataset and create the training and validation data loaders:

from netloader.data import loader_init


# Create dataset using the Dataset class we created before
dataset = CustomDataset('/path/to/data')

# Create train and validation data loaders
loaders = loader_init(dataset, ratios=(0.8, 0.2))
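
Each batch drawn from these loaders mirrors the four-element tuple returned by __getitem__. As a sketch, assuming the default PyTorch collation handles the extra field for this dataset, the training loader (the first returned loader) can be iterated as:

# Iterate over the training loader; each batch follows the __getitem__ structure
for idxs, low_dim, high_dim, extra in loaders[0]:
    # low_dim is a batch of class labels and high_dim is a batch of images
    print(idxs.shape, low_dim.shape, high_dim.shape)
    break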
