Dataset and data loader utilities #
This module defines a small in-memory dataset wrapper together with a deterministic (seeded) loader. The shuffling logic is pure and reproducible.
Design boundary:
- this is not a replacement for high-throughput framework data loaders;
- it is a small, auditable loader for examples, tests, and proof-adjacent training loops;
- large or streaming datasets should sit above this layer and feed already-collated tensors into the runtime.
Dataset #
Construct a dataset from a list.
Convert a dataset to a list.
Number of samples in the dataset.
Return true iff the dataset has no samples.
Map a function over all samples in the dataset.
Append two datasets.
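To make the API above concrete, here is a minimal Lean sketch that matches the docstrings, assuming the dataset is simply a structure wrapping a List. The field name samples and these exact definitions are illustrative; the module's real representation may differ.

```lean
structure Dataset (a : Type) where
  samples : List a  -- assumed representation: a plain in-memory list

namespace Dataset

variable {a b : Type}

/-- Construct a dataset from a list. -/
def ofList (xs : List a) : Dataset a := ⟨xs⟩

/-- Convert a dataset to a list. -/
def toList (d : Dataset a) : List a := d.samples

/-- Number of samples in the dataset. -/
def length (d : Dataset a) : Nat := d.samples.length

/-- Return `true` iff the dataset has no samples. -/
def isEmpty (d : Dataset a) : Bool := d.samples.isEmpty

/-- Map a function over all samples in the dataset. -/
def map (f : a → b) (d : Dataset a) : Dataset b := ⟨d.samples.map f⟩

/-- Append two datasets, keeping sample order. -/
def append (d₁ d₂ : Dataset a) : Dataset a := ⟨d₁.samples ++ d₂.samples⟩

end Dataset
```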
Deterministic shuffle #
We use a small shuffle based on a linear congruential generator (LCG), so shuffling is reproducible and requires no IO.
One step of the simple LCG used to generate a deterministic pseudo-random stream.
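As a sketch, one LCG step and a shuffle built on it might look as follows. The multiplier and increment are the classic Numerical Recipes constants, used here only as placeholders since this page does not show the module's actual parameters, and seededShuffle is one possible pure shuffle, not necessarily the module's own.

```lean
/-- One step of a 32-bit LCG. Constants are illustrative placeholders. -/
def lcgStep (s : Nat) : Nat :=
  (1664525 * s + 1013904223) % 2 ^ 32

/-- Deterministic shuffle sketch: draw an index from the LCG stream,
move that element to the output, and repeat. Pure, so equal seeds
always give equal permutations. -/
partial def seededShuffle {a : Type} (seed : Nat) (xs : List a) : List a :=
  go seed xs []
where
  go (s : Nat) (rest acc : List a) : List a :=
    match rest with
    | [] => acc
    | _ =>
      let s' := lcgStep s
      let i := s' % rest.length
      match rest[i]? with
      | some x => go s' (rest.eraseIdx i) (x :: acc)
      | none => acc  -- unreachable: `i < rest.length`

#eval seededShuffle 42 [1, 2, 3, 4, 5]  -- same output on every run
```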
Batching #
These helpers return batches as lists for easy use with TapeM utilities.
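As a sketch of the batching shape, assuming the helpers chunk a list into consecutive runs of batchSize with an optional dropLast (the helper name batches is hypothetical):

```lean
/-- Split `xs` into consecutive chunks of `batchSize`. When `dropLast`
is `true`, a final chunk shorter than `batchSize` is discarded. -/
partial def batches {a : Type} (batchSize : Nat) (dropLast : Bool)
    (xs : List a) : List (List a) :=
  if batchSize == 0 then [] else go xs
where
  go : List a → List (List a)
    | [] => []
    | xs =>
      let b := xs.take batchSize
      if dropLast && b.length < batchSize then []
      else b :: go (xs.drop batchSize)

#eval batches 2 false [1, 2, 3, 4, 5]  -- [[1, 2], [3, 4], [5]]
#eval batches 2 true  [1, 2, 3, 4, 5]  -- [[1, 2], [3, 4]]
```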
DataLoader #
DataLoader.epoch optionally shuffles and returns a list of batches.
Deterministic data loader configuration.
epoch threads the seed and (optionally) shuffles to produce a list of batches. This is a deliberately small, pure analogue of a PyTorch DataLoader.
- dataset : Dataset a
Dataset to batch. The loader keeps this value pure and explicit.
- batchSize : ℕ
Number of samples per batch.
- shuffle : Bool
If true, run the deterministic seeded shuffle before each epoch.
- seed : ℕ
Seed threaded through deterministic shuffles.
- dropLast : Bool
Drop the final partial batch when it is shorter than batchSize.
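Put together as Lean code, the configuration record might look like this, reusing the Dataset sketch above (ℕ is Lean's Nat):

```lean
/-- Deterministic data loader configuration (sketch). -/
structure DataLoader (a : Type) where
  /-- Dataset to batch; kept pure and explicit. -/
  dataset   : Dataset a
  /-- Number of samples per batch. -/
  batchSize : Nat
  /-- If `true`, run the deterministic seeded shuffle before each epoch. -/
  shuffle   : Bool
  /-- Seed threaded through deterministic shuffles. -/
  seed      : Nat
  /-- Drop the final partial batch when it is shorter than `batchSize`. -/
  dropLast  : Bool
```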
Run one epoch: optionally shuffle and return the list of batches.
The returned DataLoader has its seed updated, so you can call epoch repeatedly to get different but reproducible shuffles.
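A sketch of epoch in terms of the earlier sketches. The docstring pins down the behaviour (optional shuffle, batching, seed advanced in the returned loader); the exact signature below is an assumption.

```lean
/-- Run one epoch: optionally shuffle, batch, and return the batches
together with a loader whose seed has been advanced. -/
def DataLoader.epoch {a : Type} (dl : DataLoader a) :
    List (List a) × DataLoader a :=
  let xs :=
    if dl.shuffle then seededShuffle dl.seed dl.dataset.toList
    else dl.dataset.toList
  let bs := batches dl.batchSize dl.dropLast xs
  -- Advance the seed so the next call yields a different,
  -- still reproducible, permutation.
  (bs, { dl with seed := lcgStep dl.seed })
```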
Like epoch, but return each batch as an Array.
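A plausible reading of this variant, written against the epoch sketch above:

```lean
/-- Like `epoch`, but convert each batch to an `Array` (sketch). -/
def DataLoader.epochArrays {a : Type} (dl : DataLoader a) :
    List (Array a) × DataLoader a :=
  let (bs, dl') := dl.epoch
  (bs.map List.toArray, dl')
```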
Run one epoch and collate each batch into a single value.
This is the main building block for minibatch training where your model expects tensors with a leading batch axis, but your dataset is stored as individual samples.
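A sketch of the collating variant, again built on the epoch sketch; collate here stands in for whatever stacks samples along the leading batch axis, and the signature is an assumption.

```lean
/-- Run one epoch and collate each batch into a single value (sketch). -/
def DataLoader.epochCollate {a b : Type} (dl : DataLoader a)
    (collate : List a → b) : List b × DataLoader a :=
  let (bs, dl') := dl.epoch
  (bs.map collate, dl')

-- Example: collate each batch of numbers into its sum.
#eval
  let dl : DataLoader Nat :=
    { dataset := Dataset.ofList [1, 2, 3, 4, 5],
      batchSize := 2, shuffle := false, seed := 0, dropLast := false }
  (dl.epochCollate (·.foldl (· + ·) 0)).fst
-- [3, 7, 5]
```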