Datasets, Loaders, and File Sources #
This module is TorchLean's public data layer. The intended workflow is:
- Convert outside-world datasets to canonical .npy tensors or small numeric CSV files.
- Describe those files with TensorSource, SupervisedSource, or LabeledSource.
- Load them into shape-typed TorchLean tensors and datasets.
- Train with batchLoader / BatchLoader.epoch or the higher-level train.fit* helpers.
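As a concrete sketch of this workflow (purely illustrative: the record fields and loader signatures below are assumptions, not the authoritative API):

```lean
-- Illustrative sketch only. `SupervisedSource`, `TensorSource`, `batchLoader`,
-- and `epoch` are the names described on this page; the exact fields and
-- signatures here are assumptions.
def workflow : IO Unit := do
  -- Step 1 happens outside Lean: export X/Y as .npy arrays.
  -- Step 2: describe the files.
  let src : SupervisedSource :=
    { n := 1000, xDims := [8], yDims := [1]
      x := { path := "data/x.npy", dims := [1000, 8], format := .npy }
      y := { path := "data/y.npy", dims := [1000, 1], format := .npy } }
  -- Step 3: load them as a typed in-memory dataset of (x, y) samples.
  let ds ← src.load
  -- Step 4: minibatch deterministically and train.
  let loader := batchLoader (n := 32) ds (shuffle := true) (seed := 0)
  let (_, batches) := loader.epoch
  IO.println s!"minibatches: {batches.length}"
```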
We keep the implementation small and predictable:
- datasets are in-memory and pure (often backed by List)
- loader shuffling is seed-driven and reproducible
- .npy is the canonical numeric interchange format
- CSV is supported for small tabular data
- MATLAB .mat, PyTorch .pt/.pth, NumPy .npz, and image folders should be converted to .npy with scripts/datasets/torchlean_data_convert.py
- there are no multiprocessing workers, memory maps, or pinned-memory support
PyTorch Mapping #
This is inspired by torch.utils.data:
- Dataset, DataLoader: https://pytorch.org/docs/stable/data.html
- TensorDataset: https://pytorch.org/docs/stable/data.html#torch.utils.data.TensorDataset
- DataLoader: https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader
TorchLean’s key difference is that samples typically carry type-level shapes (via TList),
so many helpers here are shape-aware by construction.
Main Entry Points #
- TensorSource: one file plus expected dimensions.
- SupervisedSource: two batched tensors, X : (N, xDims...) and Y : (N, yDims...).
- LabeledSource: batched inputs plus integer class labels, one-hot encoded on load.
- TabularSupervisedSource: one CSV table with input columns followed by target columns.
- batchLoader: deterministic, typed minibatching.
For examples and conversion commands, see NN/Examples/Data/README.md.
Typed analogue of PyTorch's TensorDataset.
In TorchLean, a "sample" is usually a TList α shapes, i.e. a tuple of tensors whose shapes are
tracked by the type-level list shapes.
Build a dataset from an explicit list of samples.
Require that all paths exist, otherwise raise a user-facing error with a shared hint.
Write a small CSV file, creating the parent directory if needed.
Write a one-dimensional prediction probe CSV.
Rows are i,x,input,target,prediction, where x = i/(n-1) for n > 1.
This is intentionally simple and meant for plotting examples such as 1D operator learning.
Materialize a dataset as a list.
Number of elements in the dataset.
Whether the dataset is empty.
Like cycleDataset, but fail with a message if the dataset is empty.
This is the preferred helper for “PyTorch-style” fixed-step loops over in-memory datasets.
Map a dataset elementwise (pure, deterministic).
Concatenate two datasets.
Split a dataset into equal-sized minibatches (as lists), dropping the final partial batch.
This is a low-level helper; most users should use DataLoader.epoch or Data.batchedSupervised.
Untyped analogue of PyTorch's torch.utils.data.DataLoader.
This is the deterministic, purely-functional loader provided by the TorchLean runtime.
Construct a RawDataLoader from a dataset.
If shuffle := true, shuffling is deterministic w.r.t. seed.
If dropLast := true, incomplete final batches are discarded.
Run one epoch worth of minibatching and return:
- an updated loader (with the new seed), and
- the list of minibatches.
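A sketch of threading the returned loader through successive epochs (assuming epoch is pure, as the loader description above states, and that a minibatch is a list of samples):

```lean
-- Sketch: run k epochs, threading the reseeded loader so that each
-- epoch reshuffles deterministically with a fresh derived seed.
def runEpochs {s : Type} : RawDataLoader s → Nat → List (List s)
  | _, 0 => []
  | loader, k + 1 =>
    let (loader', batches) := loader.epoch
    batches ++ runEpochs loader' k
```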
Like epoch, but apply a user-provided collate function to each minibatch.
This is the TorchLean analogue of PyTorch's collate_fn= option.
Typed wrapper around RawDataLoader for supervised samples.
The batch size n is reflected in the type, and BatchLoader.epoch returns fully-collated
dim n minibatches (so dropLast=true is required).
- raw : RawDataLoader (sample.Supervised α σ τ)
Raw underlying data.
Existential wrapper for loaders when the batch size is chosen at runtime.
Note on default arguments:
The underlying CSV loaders take an opts : CsvOptions := {} argument.
If we write abbrev fromCsvRows := readCsvFloatRows, Lean will apply the default argument
and fromCsvRows will no longer accept opts.
So we eta-expand here to keep the public surface configurable.
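The pitfall can be reproduced with a generic self-contained example (toy names, not the real CSV loader):

```lean
-- Toy reproduction of the default-argument pitfall described above.
def readRows (path : String) (opts : Nat := 0) : Nat :=
  path.length + opts

-- Point-free: Lean eagerly inserts the default for `opts`, so `bad`
-- only accepts `path`.
abbrev bad := readRows

-- Eta-expanded: `opts` remains part of the public surface.
abbrev good (path : String) (opts : Nat := 0) : Nat :=
  readRows path opts
```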
Read a CSV file as a list of rows of floats.
Read a CSV file as (x, y) float pairs.
Read a CSV file as length-n float vectors.
Read a .npy file into a TorchLean dataset.
Read a .npy file as a vector dataset.
Read a .npy file as a matrix dataset.
Convert a list of (x, y) float tensors into a dataset of TorchLean supervised samples.
This casts float data into the selected scalar backend α and packs it into a TList α [σ, τ].
Convert a list of (x, label) pairs into a dataset of one-hot classification samples.
Labels are given as Nat and converted to one-hot targets of shape Vec classes.
TensorDataset (dim0 batching) #
PyTorch's TensorDataset concept is: given one or more tensors that share the same size(0),
build a dataset of samples by slicing each tensor along dimension 0.
In TorchLean we do the same thing, but with shapes tracked in the type:
Slice a batched TList along dimension 0.
If a sample is represented as a shape-indexed tuple TList β ss, then a minibatch of size n is
TList β (ss.map (fun s => .dim n s)). This function picks a batch index i : Fin n and returns
the corresponding single sample.
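The shape contract described above can be written out as a signature sketch (the function's actual name is not shown here, so sliceDim0 is a placeholder):

```lean
-- Placeholder signature: extract sample i from a dim0-batched TList.
-- A batch of n samples with per-sample shapes ss has shapes
-- ss.map (fun s => .dim n s); slicing undoes the batching for one index.
def sliceDim0 {n : Nat} (i : Fin n) :
    TList β (ss.map (fun s => .dim n s)) → TList β ss :=
  fun batch => sorry -- implementation elided; only the types matter here
```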
Convert a shape-indexed TList of Float tensors to the runtime scalar type α.
Build a dataset by slicing a batched TList along dim0.
This is the TorchLean analogue of PyTorch's TensorDataset(t1, t2, ...).
Float-to-α variant of tensorDatasetDim0, for data loaded from disk.
Supervised dataset from two batched tensors X : (n, σ) and Y : (n, τ) by slicing dim0.
This is the common regression/supervised-learning case: the TorchLean analogue of
TensorDataset(X, Y) in PyTorch.
Float-to-α variant of supervisedDim0, for data loaded from disk.
Higher-level loaders (PyTorch-style ergonomics) #
These are convenience helpers on top of the low-level CSV/NPY readers so example code can stay "data first" without re-implementing row splitting and casting at every call site.
Load an N-D tensor from a .npy file, checking the on-disk shape matches dims.
Load an N-D tensor from a .npy file, allowing the file to contain more rows on dim 0.
This is the dataset-loader analogue of taking tensor[:n] in PyTorch. The rank and trailing
dimensions must still match exactly; only the leading dimension may be larger than requested.
We use this for dataset sources rather than the stricter fromNpyTensorND because a real exported
dataset usually has a fixed full size, while tutorials often ask for a small prefix during smoke
tests or quick CUDA checks. For example, a CIFAR file may have shape (50000, 3, 32, 32) while a
demo command asks for n = 80; the resulting TorchLean tensor has type-level shape
(80, 3, 32, 32).
This is intentionally still a checked loader, not an implicit reshape:
- rank must agree;
- all trailing dimensions must agree;
- the file must contain at least the requested number of rows;
- only C-order NPY files can be prefix-loaded efficiently by the low-level parser.
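Following the CIFAR example above, a hedged usage sketch (the loader name is from this page; the result-type spelling is an assumption):

```lean
-- Sketch: prefix-load the first 80 rows of a (50000, 3, 32, 32) export.
-- Rank and trailing dims must match the file; only dim0 may be larger on disk.
def cifarSmoke : IO (TensorND Float [80, 3, 32, 32]) :=
  fromNpyTensorNDPrefixDim0 "data/cifar_train_x.npy" [80, 3, 32, 32]
```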
Load an image tensor from a .npy file, checking it has shape (C, H, W).
Load a batch of images from a .npy file, checking it has shape (N, C, H, W).
Labeled dataset from a batched tensor X : (n, σ) and a label vector y : (n,).
Labels are stored as floats (common when exporting from NumPy); we validate each label is an
integer in [0, classes), then one-hot encode it.
Load a supervised dataset from two .npy files containing batched arrays (inputs X : (n, σ...) and targets Y : (n, τ...)); the dataset is built by slicing along dim0.
Load a labeled classification dataset from two .npy files (batched inputs and a label vector); the dataset is built by slicing along dim0 and one-hot encoding the labels.
Load a supervised dataset from a CSV with inDim + outDim columns per row:
x1, ..., x_inDim, y1, ..., y_outDim.
Load a labeled dataset from a CSV with inDim + 1 columns per row:
x1, ..., x_inDim, label where label is in {0, ..., classes-1}.
Unified file-source layer #
The lower-level helpers above intentionally stay close to file formats (fromNpyTensorND,
fromCsvRows, fromNpySupervised, ...). The definitions below give examples and applications a
single scheme:
- describe each tensor as a TensorSource;
- load it as a typed TorchLean tensor;
- build supervised/labeled datasets by slicing dim0, just like PyTorch TensorDataset.
Policy for external ecosystems:
- NumPy .npy is the canonical interchange format for numeric tensors.
- CSV is supported for small tabular data.
- MATLAB .mat, PyTorch checkpoints, HDF5, Parquet, and image archives should be converted by a small preparation script into .npy tensors plus metadata. This keeps the Lean runtime loader small, deterministic, and auditable instead of embedding every external binary format parser.
File formats supported directly by the Lean-side unified data-source loader.
- npy : TensorFormat
NumPy .npy, supporting the subset decoded by fromNpyTensorND.
- csv : TensorFormat
Numeric CSV table. CSV sources are interpreted as 2D tensors [rows, cols].
Human-facing extension used by messages and examples.
Description of one tensor stored on disk.
dims is the expected tensor shape. NPY can load any rank supported by tensorND; CSV is treated
as a numeric table and therefore expects dims = [rows, cols].
- path : System.FilePath
Path to the file.
Expected dimensions.
- format : TensorFormat
Direct Lean-side format. External formats should be preconverted to .npy.
- csvOptions : CsvOptions
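For example, a small CSV table could be described like this (field names are from the listing above; csvOptions := {} assumes CsvOptions has a usable default):

```lean
-- CSV sources are numeric 2D tables, so dims must be [rows, cols].
def irisX : TensorSource :=
  { path := "data/iris_x.csv"
    dims := [150, 4]
    format := .csv
    csvOptions := {} }
```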
Load a numeric CSV table as a tensor.
Supported shapes: 2D numeric tables only, i.e. dims = [rows, cols].
Load a Float tensor from a path/format/dimension tuple.
Load a Float tensor, allowing NPY files to contain more rows than requested on dim 0.
TensorSource.loadFloatAs is exact: the file shape must equal dims. This prefix variant is for
dataset-style sources where dims starts with the number of rows requested by the current run. CSV
sources remain exact because CSV has no cheap binary prefix contract; NPY sources use
fromNpyTensorNDPrefixDim0.
Load a TensorSource as a Float tensor with the statically reflected shapeOfDims src.dims.
Two tensor sources representing supervised data:
- n : ℕ
Number of samples along dim0.
Per-sample input dimensions.
Per-sample target dimensions.
- x : TensorSource
Source for the batched input tensor.
- y : TensorSource
Source for the batched target tensor.
Construct a supervised source from paths using the same file format for x and y.
Load a supervised dataset by slicing dim0 from the two tensors.
This is the preferred public loader for regression/operator-learning examples, regardless of
whether the backing files are .npy or small numeric CSV tables.
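A hedged sketch showing the same source shape over small CSV tables (xDims/yDims are assumed field spellings; the unnamed fields above do not show them):

```lean
-- Sketch: supervised source backed by two small CSV tables instead of .npy.
-- CSV tensors are 2D tables, so dims = [rows, cols] on both sources.
def csvRegression : SupervisedSource :=
  { n := 100, xDims := [2], yDims := [1]
    x := { path := "data/x.csv", dims := [100, 2], format := .csv }
    y := { path := "data/y.csv", dims := [100, 1], format := .csv } }
```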
Two tensor sources representing labeled classification data:
- n : ℕ
Number of samples along dim0.
Per-sample input dimensions.
- classes : ℕ
Number of classes for one-hot targets.
- x : TensorSource
Source for the batched input tensor.
- y : TensorSource
Source for the label vector.
Construct a labeled source from paths using the same file format for x and y.
Load a labeled classification dataset by slicing dim0 and one-hot encoding labels.
For CSV label vectors, store labels as a single-column table with dims = [n, 1] and use a custom
TensorSource if needed; the path constructor above is aimed at .npy label vectors.
Single-table supervised CSV source.
Use this when one CSV row contains both input and target columns:
x1, ..., x_inDim, y1, ..., y_outDim.
- path : System.FilePath
CSV file path.
- inDim : ℕ
Number of input feature columns.
- outDim : ℕ
Number of target columns.
- csvOptions : CsvOptions
CSV parsing options.
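For instance, a table whose rows are x1, x2, x3, y could be described as (a sketch using the fields listed above):

```lean
-- One CSV row = 3 input feature columns followed by 1 target column.
def housing : TabularSupervisedSource :=
  { path := "data/housing.csv"
    inDim := 3
    outDim := 1
    csvOptions := {} }
```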
Load a single-table supervised CSV source.
Build a supervised dataset from two matrices X : n×inDim and Y : n×outDim by pairing rows.
This is the TorchLean analogue of PyTorch's TensorDataset(X, Y) for simple regression.
Collate a length-n supervised batch into a single sample with a leading batch axis.
If your samples are (x : σ, y : τ), the collated sample is:
xBatch : (n × σ) and yBatch : (n × τ)
In shapes: TList α [dim n σ, dim n τ].
Turn a per-sample supervised dataset into a dataset of fixed-size minibatches.
This is useful for metrics (meanLossDataset, accuracy, etc.) when your model expects a leading
batch axis.
Notes:
- This drops the final partial batch (PyTorch drop_last=True behavior).
- Batches are formed in dataset order (shuffling is the loader's job).
Extract the underlying per-sample dataset from a typed BatchLoader.
The batch size n carried in the type of a BatchLoader.
Whether the loader is configured to shuffle samples each epoch.
RNG seed used for shuffling (if enabled).
Materialize the dataset as a dataset of full minibatches (dropping any final partial batch).
Run one epoch: return the updated loader state and a list of typed minibatches.
Like epoch, but post-process each minibatch with a user-supplied collate/transform f.
Public loader API: supervised datasets become fixed-size minibatch loaders by default.
The underlying dataset still stores individual samples; the loader batches them and epoch
returns tensors with a leading dim0 batch axis. Because the batch size is reflected in the type,
the public batched path requires full batches, so dropLast defaults to true.
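A minimal usage sketch (parameter names for batchLoader are assumptions; epoch and dropLast behave as described above):

```lean
-- Sketch: build the public typed loader and run one epoch. Because the
-- batch size n = 64 is in the type and dropLast defaults to true, every
-- returned minibatch carries a full leading dim0 of 64.
def oneEpoch (ds : Dataset (sample.Supervised Float σ τ)) :=
  let loader := batchLoader (n := 64) ds (shuffle := true) (seed := 42)
  loader.epoch
```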
Build a batch loader when the batch size is only known at runtime.