Dataset and data loader utilities #
This module defines an in-memory dataset wrapper together with a deterministic (seeded) loader. The shuffling logic is pure and reproducible.
Design boundary:
- this layer owns deterministic in-memory sampling and batching;
- streaming datasets and framework-scale input pipelines can sit above it; and
- once tensors are collated, the same training loops consume them through the runtime loader API.
Dataset #
Construct a dataset from a list.
Instances For
Convert a dataset to a list.
Instances For
Number of samples in the dataset.
Instances For
Return true iff the dataset has no samples.
Instances For
Map a function over all samples in the dataset.
Instances For
Concatenate two datasets while preserving sample order.
Instances For
Deterministic shuffle #
We use a small LCG-based shuffle for reproducibility without IO.
One step of the simple LCG used to generate a deterministic pseudo-random stream.
Instances For
Batching #
These helpers return batches as lists for easy use with TapeM utilities.
DataLoader #
DataLoader.epoch optionally shuffles and returns a list of batches.
Deterministic data loader configuration.
epoch threads the seed and (optionally) shuffles to produce a list of batches. This is a
pure, local analogue of a PyTorch DataLoader.
- dataset : Dataset a
Dataset to batch. The loader keeps this value pure and explicit.
- batchSize : ℕ
Batch size.
- shuffle : Bool
If true, run the deterministic seeded shuffle before each epoch.
- seed : ℕ
Seed threaded through deterministic shuffles.
- dropLast : Bool
Drop the final partial batch when it is shorter than
batchSize.
Instances For
Run one epoch: optionally shuffle and return the list of batches.
The returned DataLoader has its seed updated so you can call epoch repeatedly to get different
but reproducible shuffles.
Instances For
Like epoch, but return each batch as an Array.
Instances For
Run one epoch and collate each batch into a single value.
This is the main building block for minibatch training where your model expects tensors with a leading batch axis, but your dataset is stored as individual samples.