Toys: The data science toolbox

Toys is a toolbox for data science, built with PyTorch, and designed for rapid research.

Documentation

Datasets

The Dataset protocol is borrowed from PyTorch and is the boundary between preprocessing and the model. The protocol is quite easy to implement: a dataset need only provide the methods __len__() and __getitem__() with integer indexing. Most simple collections can be used as datasets, including list and ndarray.
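
For illustration, a hypothetical class that satisfies the protocol (not part of the library):

import numpy as np

class SquaresDataset:
    '''Row i is the pair (i, i**2); each row has two scalar columns.'''

    def __len__(self):
        return 100

    def __getitem__(self, index):
        return np.float64(index), np.float64(index ** 2)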

We use the following vocabulary when discussing datasets:

Row: The value at dataset[i] is called the ith row of the dataset. Each row must be a sequence of arrays and/or scalars, and each array may have a different shape.
Column: The positions within a row are called the columns. The jth column of the dataset is the sequence formed by the jth column of every row.
Supervised: A supervised dataset has at least two columns; the last column is designated as the target column and the rest as feature columns. In unsupervised datasets, all columns are considered feature columns.
Feature: The data in any one feature column of a row is called a feature of that row.
Target: Likewise, the data in the target column of a row is called the target of that row.
Instance: The features of a row are collectively called an instance.
Shape: The shape of a row or instance is the sequence of shapes of its columns. The shape of a dataset is the shape of its rows. Note that the shape of a dataset does not include its length.

For example, the CIFAR10 dataset is a supervised dataset with two columns. The feature column contains 32x32 pixel RGB images, and the target column contains integer class labels. The shape of the feature is (32, 32, 3), and the shape of the target is () (i.e. the target is a scalar). The shape of the CIFAR10 dataset is thus ((32, 32, 3), ()).

Note

Unlike arrays, columns need not have the same shape across all rows. In fact, the same column may have a different number of dimensions in different rows, and the rows may even have different numbers of columns altogether. While most estimators expect some consistency, this freedom allows us to efficiently represent, e.g., variable sequence lengths. A dataset shape (as opposed to a row or instance shape) may use None to represent a variable aspect of its shape.
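
For illustration, a hypothetical dataset of variable-length sequences (whether toys.shape() reports the variable dimension as None is an assumption here):

import numpy as np
import toys

# Three rows, each a (sequence, label) pair; the sequences have different lengths.
rows = [
    (np.arange(3), 0),
    (np.arange(7), 1),
    (np.arange(5), 0),
]
toys.shape(rows)   # expected to be ((None,), ()) -- the first column has variable length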

Batching and iteration

The function toys.batches() iterates over mini-batches of a dataset by delegating to PyTorch’s DataLoader class. The batches() function forwards all of its arguments to the DataLoader constructor, but it allows the dataset to recommend default values through the Dataset.hints attribute. This allows the dataset to, e.g., specify an appropriate collate function or sampling strategy.

The most common arguments are:

batch_size: The maximum number of rows per batch.
shuffle: A boolean; set to true to sample batches at random without replacement.
collate_fn: A function to merge a list of samples into a mini-batch. This is required if the shape of the dataset is variable, e.g. to pad or pack variable-length sequences.
pin_memory: If true, batches are loaded into CUDA pinned memory. Unlike vanilla PyTorch, this defaults to true whenever CUDA is available.

Note

Most estimators require an explicit batch_size argument when it can affect model performance. The batch_size hint provided by the dataset is therefore more influential for scoring functions than for estimators, so the hinted value should be chosen with scoring in mind and can be quite large.

See also

See torch.utils.data.DataLoader for a full description of all possible arguments.

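For example, a minimal sketch of batch iteration (the dataset and argument values here are illustrative):

import numpy as np
import toys

# 100 rows; each row is (feature vector of shape (5,), scalar target).
dataset = [(np.random.random(5), i % 2) for i in range(100)]

# Up to 32 rows per batch, shuffled each epoch. Arguments not given here fall
# back to the dataset's hints and then to the DataLoader defaults.
for features, target in toys.batches(dataset, batch_size=32, shuffle=True):
    ...   # one collated tensor per column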

Creating and combining datasets

The primary functions for combining datasets are toys.concat() and toys.zip(), which concatenate datasets by rows and columns, respectively.

Of these, toys.zip() is the more commonly used. It allows us to, e.g., combine the features and target from separate datasets:

>>> import numpy as np
>>> import toys
>>> features = np.random.random(size=(100, 1, 5))  # 100 rows, 1 column of shape (5,)
>>> target = np.prod(features, axis=-1)            # 100 rows, 1 scalar column
>>> dataset = toys.zip(features, target)           # 100 rows, 2 columns
>>> toys.shape(features)
((5,),)
>>> toys.shape(target)
((),)
>>> toys.shape(dataset)
((5,), ())
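
toys.concat() is the row-wise counterpart; the following sketch assumes both datasets share the same column shapes:

>>> first = np.random.random(size=(60, 1, 5))   # 60 rows, 1 column of shape (5,)
>>> second = np.random.random(size=(40, 1, 5))  # 40 rows, same column shape
>>> combined = toys.concat(first, second)       # 100 rows, 1 column of shape (5,)
>>> len(combined)
100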

Most estimators will automatically zip datasets if you pass more than one:

>>> from toys.supervised import LeastSquares
>>> estimator = LeastSquares()
>>> model = estimator(dataset)           # Each of these calls
>>> model = estimator(features, target)  # is equivalent to the other

Style Guide

All Python code should follow the Google Python Style Guide with the following exceptions and additions.

Doc strings

Doc strings should use triple single-quotes (''').

All values in return, yield, attributes, arguments, and keyword arguments sections must include both names and type annotations. (See the following section on type annotations).

The descriptions of arguments and return values start on the line following their name and type annotation. This is more visually appealing when long type annotations are used, so we require it globally for consistency.

E.g.:

def torch_dtype(dtype):
    '''Casts dtype to a PyTorch tensor class.

    The input may be a conventional name, like 'float' and 'double', or an
    explicit name like 'float32' or 'float64'. If the input is a known
    tensor class, it is returned as-is.

    Args:
        dtype (str or TorchDtype):
            A conventional name, explicit name, or known tensor class.

    Returns:
        cls (TorchDtype):
            The tensor class corresponding to `dtype`.
    '''
    ...

Type Annotations

Type hints are useful for both documentation and static analysis tooling, but they can be syntactically distracting. As a compromise, always include PEP 484 compliant type hints in docstrings for arguments, return values, and yield values. Don’t include type annotations in code.

The following sugar is allowed, given in order of precedence:

  • Union[A, B] may be written as A or B.
  • Callable[A, B] may be written as A -> B.

Note that Optional[T] is equivalent to Union[T, None]. The preferred notation for optional types is T or None.
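
For example, a hypothetical docstring using this sugar:

def find(key, items, default=None):
    '''Returns the first item for which the key holds.

    Args:
        key (str -> bool):
            A predicate over items; sugar for Callable[[str], bool].
        items (Sequence[str]):
            The items to search.
        default (str or None):
            Returned when nothing matches; sugar for Optional[str].

    Returns:
        item (str or None):
            The first matching item, or `default`.
    '''
    return next(filter(key, items), default)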

When types become complex, create an alias, e.g.:

CrossValSplitter = Callable[[Dataset], Iterable[Tuple[Dataset, Dataset]]]

Imports

Use relative imports for anything under the current package, and use absolute imports for everything else. This allows packages to be moved without modifying their contents in the common case (other cases are a bad smell).

Import classes directly, using from pkg import MyClass.

Group imports by dependency, and separate each group by a single blank line. The first import in a group should be the top-level package of the dependency. Sort groups by dependency name, except where that conflicts with the following rules.

Reserve the first group for the Python standard library.

Reserve the second group for the SciPy stack, e.g. numpy, scipy, matplotlib, and pandas. Other general-purpose data handling tools, like dask and xarray, may be included in this group. Use simple import statements in this group. If you find yourself writing many imports from the same package, use a dedicated group instead.

Place relative imports last in their own group.

Within each group, sort bare import ... statements before from ... import ... statements. Otherwise sort imports lexicographically.

Always import the top-level package for each dependency. Import all objects used in docstrings, and refer to objects in docstrings as they are imported. Otherwise avoid dead imports.

E.g.:

from typing import Any, Mapping, Sequence

import numpy as np
import scipy
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd

import torch
from torch.nn import DataParallel, Module
from torch.optim import Optimizer

import toys
from toys import Dataset
from toys.metrics import Mean

from .cross_val import KFold

Code layout

Code is divided into packages (folders) and modules (*.py files). By default, all code in modules is considered private. Public objects should be reexported by the package’s __init__.py file. Other than comments and a package-level docstring, each __init__.py file should only contain relative import statements for the public objects in submodules of the package.

Do not use __all__. The rules above serve the same purpose.
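
For illustration, an __init__.py following these rules might look like the example below (the submodule names are hypothetical):

# toys/metrics/__init__.py
'''Metrics for evaluating models.'''

from .classification import Accuracy, FScore, Precision, Recall
from .core import Accumulator, Mean, Metric, Sum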

toys

Core protocols

The core protocols are pure abstract classes. They provide no functionality and exist for documentation purposes only. There is no requirement to subclass them; however, doing so provides certain runtime protections through Python’s abstract base class (abc) functionality.
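
For example, subclassing the Dataset protocol opts a class in to isinstance checks (a sketch; the protocol itself adds no behavior):

import toys

class Zeros(toys.Dataset):
    '''A trivial dataset; every row is a single scalar column equal to 0.'''

    def __len__(self):
        return 10

    def __getitem__(self, index):
        return (0,)

assert isinstance(Zeros(), toys.Dataset)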

Dataset
Estimator
Model

Common classes

BaseEstimator
TorchModel

Dataset utilities

toys.batches
toys.zip
toys.concat
toys.flatten
toys.subset
toys.shape

Argument parsers

parse_activation
parse_initializer
parse_optimizer
parse_loss
parse_dtype
parse_metric

Type aliases

These type aliases exist to aid in documentation and static analysis. They are irrelevant at runtime.

class toys.ColumnShape = Optional[Tuple[Optional[int], ...]]

The shape of a single datum in a column. None is used for dimensions of variable length, and when the total number of dimensions is variable.

Note that the shape of a column does not include the index dimension.

class toys.RowShape = Optional[Tuple[ColumnShape, ...]]

The shape of a row is the sequence of (possibly variable) shapes of its columns. The row shape may be None to indicate that the number of columns is variable.

For example, the CIFAR10 dataset has two columns. The first contains 32x32 RGB images; its shape is (32, 32, 3). The second contains scalar class labels; its shape is (). The shape of the whole row is thus ((32, 32, 3), ()).

>>> from toys.datasets import CIFAR10
>>> cifar = CIFAR10()
>>> toys.shape(cifar)
((32, 32, 3), ())

toys.datasets

CIFAR

See https://www.cs.toronto.edu/~kriz/cifar.html.

CIFAR10
CIFAR20
CIFAR100

Simulated datasets

SimulatedLinear
SimulatedPolynomial

toys.layers

Linear layers

Dense

Convolution layers

Conv2d
MaxPool2d

toys.metrics

Core classes

Metric
Accumulator
Mean
Sum
MultiMetric

Classification metrics

Accuracy
TruePositives
FalsePositives
TrueNegatives
FalseNegatives
Precision
Recall
FScore

Regression metrics

MeanSquaredError
NegMeanSquaredError

toys.model_selection

Functions

combinations

Cross validation splitting

KFold

Type aliases

These type aliases exist to aid in documentation and static analysis. They are irrelevant at runtime.

class CrossValSplitter = Callable[[Dataset], Iterable[Fold]]

A function that takes a dataset and returns an iterable over folds of the dataset. These can be used by meta-estimators, like GridSearchCV, to test how estimators generalize to unseen data.

class Fold = Tuple[Dataset, Dataset]

A fold is the partitioning of a dataset into two disjoint subsets, (train, test).
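
A sketch of consuming a splitter; only the call signature from the CrossValSplitter alias is assumed, and the KFold constructor arguments are omitted for illustration:

import numpy as np
import toys
from toys.model_selection import KFold

dataset = np.random.random(size=(100, 1, 5))
splitter = KFold()                 # a CrossValSplitter

for train, test in splitter(dataset):
    ...                            # fit on train, score on test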

class ParamGrid = Mapping[str, Sequence]
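
Judging from the alias, a ParamGrid maps parameter names to the sequence of values to try, e.g.:

param_grid = {
    'learning_rate': [1e-2, 1e-3, 1e-4],
    'batch_size': [32, 64],
}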

toys.networks

VGG

VGG11
VGG13
VGG16
VGG19

toys.supervised

GradientDescent
LeastSquares

Contributing

All are welcome to contribute, but because the project is so young, coordination is key. Please reach out on the issue tracker (or in person if you are around UGA) if you are interested in contributing.

The contributing file contains style guides and other useful guidelines for contributing to the project.