Datasets, Loaders, and File Sources #
This module is TorchLean's public data layer. The intended workflow is:
- Convert outside-world datasets to canonical
.npytensors or small numeric CSV files. - Describe those files with
TensorSource,SupervisedSource, orLabeledSource. - Load them into shape-typed TorchLean tensors and datasets.
- Train with
batchLoader/BatchLoader.epoch, the publictrainer.trainpath, orTorchLean.Trainer.trainDatasetwhen an advanced runner loop is still the right tool.
We keep the implementation small and predictable:
- datasets are in-memory and pure (often backed by
List) - loader shuffling is seed-driven and reproducible
.npyis the canonical numeric interchange format- CSV is supported for small tabular data
- MATLAB
.mat, PyTorch.pt/.pth, NumPy.npz, and image folders should be converted to.npywithscripts/datasets/torchlean_data_convert.py - there are no multiprocessing workers, memory maps, or pinned-memory support
PyTorch Mapping #
This is inspired by torch.utils.data:
- Dataset, DataLoader:
https://pytorch.org/docs/stable/data.html - TensorDataset:
https://pytorch.org/docs/stable/data.html#torch.utils.data.TensorDataset - DataLoader:
https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader
TorchLean’s key difference is that samples typically carry type-level shapes (via TensorPack),
so many helpers here are shape-aware by construction.
Main Entry Points #
TensorSource: one file plus expected dimensions.SupervisedSource: two batched tensors,X : (N, xDims...)andY : (N, yDims...).LabeledSource: batched inputs plus integer class labels, one-hot encoded on load.TabularSupervisedSource: one CSV table with input columns followed by target columns.batchLoader: deterministic, typed minibatching.
For examples and conversion commands, see NN/Examples/Data/README.md.
Typed analogue of PyTorch's TensorDataset.
In TorchLean, a sample is usually a TensorPack α shapes, i.e. a shape-tracked tuple of tensors.
Instances For
Build a dataset from an explicit list of samples.
Instances For
Require that all paths exist, otherwise raise a user-facing error with a shared hint.
Instances For
Require one named data file to exist.
Instances For
Require paired supervised input/target files to exist.
Instances For
Write a small CSV file, creating the parent directory if needed.
Instances For
Write a one-dimensional prediction probe CSV.
Rows are i,x,input,target,prediction, where x = i/(n-1) for n > 1.
This writes the compact prediction table used by plotting examples such as 1D operator learning.
Instances For
Materialize a dataset as a list.
Instances For
Number of elements in the dataset.
Instances For
Whether the dataset is empty.
Instances For
Build a cycling index function for a nonempty dataset.
cycleDataset ds h i returns ds[i % ds.size].
This is the dataset analogue of cycleList. It avoids per-step Option handling in fixed-step
training loops.
Instances For
Like cycleDataset, but fail with a message if the dataset is empty.
This is the preferred helper for “PyTorch-style” fixed-step loops over in-memory datasets.
Instances For
Map a dataset elementwise (pure, deterministic).
Instances For
Split a dataset into equal-sized minibatches (as lists), dropping the final partial batch.
This is a low-level helper; ordinary loader code should use DataLoader.epoch or
Data.batchedSupervised.
Instances For
Untyped analogue of PyTorch's torch.utils.data.DataLoader.
This is the deterministic, purely-functional loader provided by the TorchLean runtime.
Instances For
Construct a RawDataLoader from a dataset.
If shuffle := true, shuffling is deterministic w.r.t. seed.
If dropLast := true, incomplete final batches are discarded.
Instances For
Run one epoch worth of minibatching and return:
- an updated loader (with the new seed), and
- the list of minibatches.
Instances For
Like epoch, but apply a user-provided collate function to each minibatch.
This is the TorchLean analogue of PyTorch's collate_fn= option.
Instances For
Typed wrapper around RawDataLoader for supervised samples.
The batch size n is reflected in the type, and BatchLoader.epoch returns fully-collated
dim n minibatches (so dropLast=true is required).
- raw : RawDataLoader (SupervisedSample α σ τ)
Raw underlying data.
Instances For
Existential wrapper for loaders when the batch size is chosen at runtime.
Instances For
Note on default arguments:
The underlying CSV loaders take an opts : CsvOptions := {} argument.
If we write abbrev fromCsvRows := readCsvFloatRows, Lean will apply the default argument
and fromCsvRows will no longer accept opts.
So we eta-expand here to keep the public surface configurable.
Read a CSV file as a list of rows of floats.
Instances For
Read a CSV file as (x, y) float pairs.
Instances For
Read a CSV file as length-n float vectors.
Instances For
Read a .npy file into a TorchLean dataset.
Instances For
Read a .npy file as a vector dataset.
Instances For
Read a .npy file as a matrix dataset.
Instances For
Read the row count from an .npy file and check its trailing shape.
For a batched tensor with shape (N, d₁, ..., dₖ), this returns N when the trailing dimensions
match tailShape.
Instances For
Convert a list of (x, y) float tensors into a dataset of TorchLean supervised samples.
This casts float data into the selected scalar backend α and packs it into a
TensorPack α [σ, τ].
Instances For
Convert a list of (x, label) pairs into a dataset of one-hot classification samples.
Labels are given as Nat and converted to one-hot targets of shape Vec classes.
Instances For
TensorDataset (dim0 batching) #
PyTorch's TensorDataset concept is: given one or more tensors that share the same size(0),
build a dataset of samples by slicing each tensor along dimension 0.
In TorchLean we do the same thing, but with shapes tracked in the type:
Slice a batched TensorPack along dimension 0.
If a sample is represented as a shape-indexed tuple TensorPack β ss, then a minibatch of size n
is TensorPack β (ss.map (fun s => .dim n s)). This function picks a batch index i : Fin n and returns
the corresponding single sample.
Instances For
Convert a shape-indexed TensorPack of Float tensors to the runtime scalar type α.
Instances For
Build a dataset by slicing a batched TensorPack along dim0.
This is the TorchLean analogue of PyTorch's TensorDataset(t1, t2, ...).
Instances For
Float-to-α variant of tensorDatasetDim0, for data loaded from disk.
Instances For
Supervised dataset from two batched tensors X : (n, σ) and Y : (n, τ) by slicing dim0.
This is the common regression/supervised-learning case: the TorchLean analogue of
TensorDataset(X, Y) in PyTorch.
Instances For
Float-to-α variant of supervisedDim0, for data loaded from disk.
Instances For
Higher-level loaders (PyTorch-style ergonomics) #
These are convenience helpers on top of the low-level CSV/NPY readers so example code can stay "data first" without re-implementing row splitting and casting at every call site.
Load an N-D tensor from a .npy file, checking the on-disk shape matches dims.
Instances For
Load an N-D tensor from a .npy file, allowing the file to contain more rows on dim 0.
This is the dataset-loader analogue of taking tensor[:n] in PyTorch. The rank and trailing
dimensions must still match exactly; only the leading dimension may be larger than requested.
We use this for dataset sources rather than the stricter fromNpyTensorND because an exported
dataset usually has a fixed full size, while local runs often request a bounded prefix. For example,
a CIFAR file may have shape (50000, 3, 32, 32) while an example command asks for n = 80; the
resulting TorchLean tensor has type-level shape
(80, 3, 32, 32).
This is still a checked loader, not an implicit reshape:
- rank must agree;
- all trailing dimensions must agree;
- the file must contain at least the requested number of rows;
- only C-order NPY files can be prefix-loaded efficiently by the low-level parser.
Instances For
Load an image tensor from a .npy file, checking it has shape (C, H, W).
Instances For
Load a batch of images from a .npy file, checking it has shape (N, C, H, W).
Instances For
Labeled dataset from a batched tensor X : (n, σ) and a label vector y : (n,).
Labels are stored as floats (common when exporting from NumPy); we validate each label is an
integer in [0, classes), then one-hot encode it.
Instances For
Load a supervised dataset from two .npy files containing batched arrays:
and we build a dataset by slicing along dim0.
Instances For
Load a labeled classification dataset from two .npy files:
and we build a dataset by slicing along dim0 and one-hot encoding the labels.
Instances For
Load a supervised dataset from a CSV with inDim + outDim columns per row:
x1, ..., x_inDim, y1, ..., y_outDim.
Instances For
Load a labeled dataset from a CSV with inDim + 1 columns per row:
x1, ..., x_inDim, label where label is in {0, ..., classes-1}.
Instances For
Unified file-source layer #
The lower-level helpers above stay close to file formats (fromNpyTensorND,
fromCsvRows, fromNpySupervised, ...). The definitions below give examples and applications a
single scheme:
- describe each tensor as a
TensorSource; - load it as a typed TorchLean tensor;
- build supervised/labeled datasets by slicing dim0, just like PyTorch
TensorDataset.
Policy for external ecosystems:
- NumPy
.npyis the canonical interchange format for numeric tensors. - CSV is supported for small tabular data.
- MATLAB
.mat, PyTorch checkpoints, HDF5, Parquet, and image archives should be converted by a small preparation script into.npytensors plus metadata. The Lean runtime loader intentionally handles a small deterministic interchange format rather than every external binary format.
File formats supported directly by the Lean-side unified data-source loader.
- npy : TensorFormat
NumPy
.npy, supporting the subset decoded byfromNpyTensorND. - csv : TensorFormat
Numeric CSV table. CSV sources are interpreted as 2D tensors
[rows, cols].
Instances For
Human-facing extension used by messages and examples.
Instances For
Description of one tensor stored on disk.
dims is the expected tensor shape. NPY can load any rank supported by tensorND; CSV is treated
as a numeric table and therefore expects dims = [rows, cols].
- path : System.FilePath
Path to the file.
Expected dimensions.
- format : TensorFormat
Direct Lean-side format. External formats should be preconverted to
.npy. - csvOptions : CsvOptions
Instances For
Load a numeric CSV table as a tensor.
Supported shapes:
Instances For
Load a Float tensor from a path/format/dimension tuple.
Instances For
Load a Float tensor, allowing NPY files to contain more rows than requested on dim 0.
TensorSource.loadFloatAs is exact: the file shape must equal dims. This prefix variant is for
dataset-style sources where dims starts with the number of rows requested by the current run. CSV
sources remain exact because CSV has no binary prefix contract; NPY sources use
fromNpyTensorNDPrefixDim0.
Instances For
Load a TensorSource as a Float tensor with the statically reflected shapeOfDims src.dims.
Instances For
Two tensor sources representing supervised data:
- n : ℕ
Number of samples along dim0.
Per-sample input dimensions.
Per-sample target dimensions.
- x : TensorSource
Source for the batched input tensor.
- y : TensorSource
Source for the batched target tensor.
Instances For
Construct a supervised source from paths using the same file format for x and y.
Instances For
Load a supervised dataset by slicing dim0 from the two tensors.
This is the preferred public loader for regression/operator-learning examples, regardless of
whether the backing files are .npy or small numeric CSV tables.
Instances For
Paired .npy source for supervised regression or operator-learning datasets.
Instances For
Load paired .npy files as concrete Float supervised samples.
This is useful for reporting, custom evaluation loops, and native kernels that need concrete
Float tensors outside the public trainer facade.
Instances For
Two tensor sources representing labeled classification data:
- n : ℕ
Number of samples along dim0.
Per-sample input dimensions.
- classes : ℕ
Number of classes for one-hot targets.
- x : TensorSource
Source for the batched input tensor.
- y : TensorSource
Source for the label vector.
Instances For
Construct a labeled source from paths using the same file format for x and y.
Instances For
Load a labeled classification dataset by slicing dim0 and one-hot encoding labels.
For CSV label vectors, store labels as a single-column table with dims = [n, 1] and use a custom
TensorSource if needed; the path constructor above is aimed at .npy label vectors.
Instances For
Single-table supervised CSV source.
Use this when one CSV row contains both input and target columns:
x1, ..., x_inDim, y1, ..., y_outDim.
- path : System.FilePath
CSV file path.
- inDim : ℕ
Number of input feature columns.
- outDim : ℕ
Number of target columns.
- csvOptions : CsvOptions
CSV parsing options.
Instances For
Load a single-table supervised CSV source.
Instances For
Build a supervised dataset from two matrices X : n×inDim and Y : n×outDim by pairing rows.
This is the TorchLean analogue of PyTorch's TensorDataset(X, Y) for simple regression.
Instances For
Collate a length-n supervised batch into a single sample with a leading batch axis.
If your samples are (x : σ, y : τ), the collated sample is:
xBatch : (n × σ)andyBatch : (n × τ)
In shapes: TensorPack α [dim n σ, dim n τ].
Instances For
Turn a per-sample supervised dataset into a dataset of fixed-size minibatches.
This is useful for metrics (meanLossDataset, accuracy, etc.) when your model expects a leading
batch axis.
Notes:
- This drops the final partial batch (PyTorch
drop_last=Truebehavior). - Batches are formed in dataset order (shuffling is the loader's job).
Instances For
Extract the underlying per-sample dataset from a typed BatchLoader.
Instances For
The batch size n carried in the type of a BatchLoader.
Instances For
Whether the loader is configured to shuffle samples each epoch.
Instances For
RNG seed used for shuffling (if enabled).
Instances For
Materialize the dataset as a dataset of full minibatches (dropping any final partial batch).
Instances For
Run one epoch: return the updated loader state and a list of typed minibatches.
Instances For
Like epoch, but post-process each minibatch with a user-supplied collate/transform f.
Instances For
Run one epoch and require at least one full typed minibatch.
This is the shared checked boundary for examples that need a nonempty list of full batches. It keeps the "drop partial batches, but fail if nothing remains" policy with the loader API rather than repeating it in each dataset-specific helper.
Instances For
Run one epoch and return its first full typed minibatch.
Instances For
Public loader API: supervised datasets become fixed-size minibatch loaders by default.
The underlying dataset still stores individual samples; the loader batches them and epoch
returns tensors with a leading dim0 batch axis. Because the batch size is reflected in the type,
the public batched path requires full batches, so dropLast defaults to true.
Instances For
Load a numeric supervised CSV and immediately wrap it as a typed minibatch loader.
The CSV convention is the same as TabularSupervisedSource: each row contains inDim feature
columns followed by outDim target columns. This belongs in the data API rather than in an
individual model file because tabular examples, benchmarks, and downstream users all need the same
operation: CSV -> typed dataset -> shuffled minibatch loader.
Instances For
Build a batch loader when the batch size is only known at runtime.