Dataset and data loader utilities #
This module defines a small in-memory dataset wrapper together with a deterministic (seeded) loader. The shuffling logic is pure and reproducible.
Design boundary:
- this is not a replacement for high-throughput framework data loaders;
- it is a small, auditable loader for examples, tests, and proof-adjacent training loops;
- large or streaming datasets should sit above this layer and feed already-collated tensors into the runtime.
Dataset #
Construct a dataset from a list.
Convert a dataset to a list.
Number of samples in the dataset.
Return true iff the dataset has no samples.
Map a function over all samples in the dataset.
Append two datasets.
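To make the API above concrete, here is a minimal Lean sketch that matches the docstrings, assuming the dataset is simply a structure wrapping a List. The field name samples and these exact definitions are illustrative; the module's real representation may differ.

```lean
structure Dataset (a : Type) where
  samples : List a  -- assumed representation: a plain in-memory list

namespace Dataset

variable {a b : Type}

/-- Construct a dataset from a list. -/
def ofList (xs : List a) : Dataset a := ⟨xs⟩

/-- Convert a dataset to a list. -/
def toList (d : Dataset a) : List a := d.samples

/-- Number of samples in the dataset. -/
def length (d : Dataset a) : Nat := d.samples.length

/-- Return `true` iff the dataset has no samples. -/
def isEmpty (d : Dataset a) : Bool := d.samples.isEmpty

/-- Map a function over all samples in the dataset. -/
def map (f : a → b) (d : Dataset a) : Dataset b := ⟨d.samples.map f⟩

/-- Append two datasets, keeping sample order. -/
def append (d₁ d₂ : Dataset a) : Dataset a := ⟨d₁.samples ++ d₂.samples⟩

end Dataset
```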
Deterministic shuffle #
We use a small shuffle based on a linear congruential generator (LCG), so shuffling is reproducible and requires no IO.
One step of the simple LCG used to generate a deterministic pseudo-random stream.
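As a sketch, one LCG step and a shuffle built on it might look as follows. The multiplier and increment are the classic Numerical Recipes constants, used here only as placeholders since this page does not show the module's actual parameters, and seededShuffle is one possible pure shuffle, not necessarily the module's own.

```lean
/-- One step of a 32-bit LCG. Constants are illustrative placeholders. -/
def lcgStep (s : Nat) : Nat :=
  (1664525 * s + 1013904223) % 2 ^ 32

/-- Deterministic shuffle sketch: draw an index from the LCG stream,
move that element to the output, and repeat. Pure, so equal seeds
always give equal permutations. -/
partial def seededShuffle {a : Type} (seed : Nat) (xs : List a) : List a :=
  go seed xs []
where
  go (s : Nat) (rest acc : List a) : List a :=
    match rest with
    | [] => acc
    | _ =>
      let s' := lcgStep s
      let i := s' % rest.length
      match rest[i]? with
      | some x => go s' (rest.eraseIdx i) (x :: acc)
      | none => acc  -- unreachable: `i < rest.length`

#eval seededShuffle 42 [1, 2, 3, 4, 5]  -- same output on every run
```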
Batching #
These helpers return batches as lists for easy use with TapeM utilities.
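As a sketch of the batching shape, assuming the helpers chunk a list into consecutive runs of batchSize with an optional dropLast (the helper name batches is hypothetical):

```lean
/-- Split `xs` into consecutive chunks of `batchSize`. When `dropLast`
is `true`, a final chunk shorter than `batchSize` is discarded. -/
partial def batches {a : Type} (batchSize : Nat) (dropLast : Bool)
    (xs : List a) : List (List a) :=
  if batchSize == 0 then [] else go xs
where
  go : List a → List (List a)
    | [] => []
    | xs =>
      let b := xs.take batchSize
      if dropLast && b.length < batchSize then []
      else b :: go (xs.drop batchSize)

#eval batches 2 false [1, 2, 3, 4, 5]  -- [[1, 2], [3, 4], [5]]
#eval batches 2 true  [1, 2, 3, 4, 5]  -- [[1, 2], [3, 4]]
```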
DataLoader #
DataLoader.epoch optionally shuffles and returns a list of batches.
Deterministic data loader configuration.
epoch threads the seed and (optionally) shuffles to produce a list of batches. This is a deliberately small, pure analogue of a PyTorch DataLoader.
- dataset : Dataset a
Dataset to batch. The loader keeps this value pure and explicit.
- batchSize : ℕ
Number of samples per batch.
- shuffle : Bool
If true, run the deterministic seeded shuffle before each epoch.
- seed : ℕ
Seed threaded through deterministic shuffles.
- dropLast : Bool
Drop the final partial batch when it is shorter than batchSize.
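Put together as Lean code, the configuration record might look like this, reusing the Dataset sketch above (ℕ is Lean's Nat):

```lean
/-- Deterministic data loader configuration (sketch). -/
structure DataLoader (a : Type) where
  /-- Dataset to batch; kept pure and explicit. -/
  dataset   : Dataset a
  /-- Number of samples per batch. -/
  batchSize : Nat
  /-- If `true`, run the deterministic seeded shuffle before each epoch. -/
  shuffle   : Bool
  /-- Seed threaded through deterministic shuffles. -/
  seed      : Nat
  /-- Drop the final partial batch when it is shorter than `batchSize`. -/
  dropLast  : Bool
```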
Run one epoch: optionally shuffle and return the list of batches.
The returned DataLoader has its seed updated, so you can call epoch repeatedly to get different but reproducible shuffles.
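A sketch of epoch in terms of the earlier sketches. The docstring pins down the behaviour (optional shuffle, batching, seed advanced in the returned loader); the exact signature below is an assumption.

```lean
/-- Run one epoch: optionally shuffle, batch, and return the batches
together with a loader whose seed has been advanced. -/
def DataLoader.epoch {a : Type} (dl : DataLoader a) :
    List (List a) × DataLoader a :=
  let xs :=
    if dl.shuffle then seededShuffle dl.seed dl.dataset.toList
    else dl.dataset.toList
  let bs := batches dl.batchSize dl.dropLast xs
  -- Advance the seed so the next call yields a different,
  -- still reproducible, permutation.
  (bs, { dl with seed := lcgStep dl.seed })
```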
Like epoch, but return each batch as an Array.
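A plausible reading of this variant, written against the epoch sketch above:

```lean
/-- Like `epoch`, but convert each batch to an `Array` (sketch). -/
def DataLoader.epochArrays {a : Type} (dl : DataLoader a) :
    List (Array a) × DataLoader a :=
  let (bs, dl') := dl.epoch
  (bs.map List.toArray, dl')
```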
Run one epoch and collate each batch into a single value.
This is the main building block for minibatch training where your model expects tensors with a leading batch axis, but your dataset is stored as individual samples.
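A sketch of the collating variant, again built on the epoch sketch; collate here stands in for whatever stacks samples along the leading batch axis, and the signature is an assumption.

```lean
/-- Run one epoch and collate each batch into a single value (sketch). -/
def DataLoader.epochCollate {a b : Type} (dl : DataLoader a)
    (collate : List a → b) : List b × DataLoader a :=
  let (bs, dl') := dl.epoch
  (bs.map collate, dl')

-- Example: collate each batch of numbers into its sum.
#eval
  let dl : DataLoader Nat :=
    { dataset := Dataset.ofList [1, 2, 3, 4, 5],
      batchSize := 2, shuffle := false, seed := 0, dropLast := false }
  (dl.epochCollate (·.foldl (· + ·) 0)).fst
-- [3, 7, 5]
```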