TorchLean API

NN.API.Data

Datasets, Loaders, and File Sources #

This module is TorchLean's public data layer. The intended workflow is:

  1. Convert outside-world datasets to canonical .npy tensors or small numeric CSV files.
  2. Describe those files with TensorSource, SupervisedSource, or LabeledSource.
  3. Load them into shape-typed TorchLean tensors and datasets.
  4. Train with batchLoader / BatchLoader.epoch or the higher-level train.fit* helpers.
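A minimal sketch of that workflow, using only names documented on this page (the argument layout and the `.npy` constructor of `TensorFormat` are assumptions):

```lean
open NN.API.Data in
def workflowSketch : IO Unit := do
  -- Step 1 happens offline: export X.npy / Y.npy with NumPy.
  -- Step 2: describe the two files as one supervised source.
  let _src := SupervisedSource.ofPaths .npy "data/X.npy" "data/Y.npy"
                (n := 1000) (xDims := [4]) (yDims := [1])
  -- Steps 3–4: load the source into a shape-typed dataset, wrap it in a
  -- loader, and hand the minibatches to the `train.fit*` helpers.
  pure ()
```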

We keep the implementation small and predictable.

PyTorch Mapping #

This module's design is inspired by torch.utils.data.

TorchLean’s key difference is that samples typically carry type-level shapes (via TList), so many helpers here are shape-aware by construction.

Main Entry Points #

For examples and conversion commands, see NN/Examples/Data/README.md.

@[reducible, inline]

Typed analogue of PyTorch's TensorDataset.

In TorchLean, a "sample" is usually a TList α shapes, i.e. a tuple of tensors whose shapes are tracked by the type-level list shapes.

Instances For
    def NN.API.Data.fromList {a : Type} (xs : List a) :

    Build a dataset from an explicit list of samples.

    Instances For
      def NN.API.Data.requireFiles (exeName : String) (paths : List System.FilePath) (hint : String := "") :

      Require that all paths exist, otherwise raise a user-facing error with a shared hint.

      Instances For
        def NN.API.Data.writeCsv (path : System.FilePath) (header : List String) (rows : List (List String)) :

        Write a small CSV file, creating the parent directory if needed.

        Instances For
          def NN.API.Data.writePredictionCsv1D {n : ℕ} (path : System.FilePath) (input target prediction : Spec.Tensor Float (Tensor.Shape.Vec n)) :

          Write a one-dimensional prediction probe CSV.

          Rows are i,x,input,target,prediction, where x = i/(n-1) for n > 1. This is intentionally simple and meant for plotting examples such as 1D operator learning.

          Instances For
            def NN.API.Data.toList {a : Type} (ds : Dataset a) :

            Materialize a dataset as a list.

            Instances For
              @[simp]
              theorem NN.API.Data.toList_fromList {a : Type} (xs : List a) :
              toList (fromList xs) = xs

              Converting a list to a dataset and back yields the original list.

              def NN.API.Data.size {a : Type} (ds : Dataset a) :

              Number of elements in the dataset.

              Instances For
                @[simp]
                theorem NN.API.Data.size_fromList {a : Type} (xs : List a) :

                The size of a dataset built from a list is the list length.

                def NN.API.Data.isEmpty {a : Type} (ds : Dataset a) :

                Whether the dataset is empty.

                Instances For
                  def NN.API.Data.cycleList {a : Type} (xs : List a) (h : xs ≠ []) :
                  ℕ → a

                  Build a cycling index function for a nonempty list.

                  cycleList xs h i returns xs[i % xs.length].

                  This is useful in small in-memory demos where you want a fixed-step “PyTorch-like” loop without repeated Option handling.
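A sketch of the intended use, assuming `cycleList` behaves exactly as specified above (whether `by simp` discharges the nonemptiness side condition depends on the actual statement of `h`):

```lean
def xs : List Float := [0.1, 0.2, 0.3]

-- Prove nonemptiness once; `pick` is then total and wraps around,
-- e.g. `pick 4 = xs[4 % 3] = 0.2` per the specification above.
def pick : Nat → Float := NN.API.Data.cycleList xs (by simp)
```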

                  Instances For
                    def NN.API.Data.cycleListOrError {a : Type} (xs : List a) (err : String := "empty list") :
                    Except String (ℕ → a)

                    Like cycleList, but fail with a message if the list is empty.

                    This is designed to keep tutorial code tidy: check emptiness once, then index without Option.

                    Instances For
                      def NN.API.Data.cycleDataset {a : Type} (ds : Dataset a) (h : ds.data.size ≠ 0) :
                      ℕ → a

                      Build a cycling index function for a nonempty dataset.

                      cycleDataset ds h i returns ds[i % ds.size].

                      This is the dataset analogue of cycleList. It avoids per-step Option handling in small demos.

                      Instances For
                        def NN.API.Data.cycleDatasetOrError {a : Type} (ds : Dataset a) (err : String := "empty dataset") :
                        Except String (ℕ → a)

                        Like cycleDataset, but fail with a message if the dataset is empty.

                        This is the preferred helper for “PyTorch-style” fixed-step loops over in-memory datasets.
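The fixed-step loop this helper is designed for might look like the following sketch (the training step itself is elided; only `Dataset` and `cycleDatasetOrError` are taken from this page):

```lean
def runSteps {α : Type} (ds : NN.API.Data.Dataset α) (steps : Nat) :
    Except String Unit := do
  -- Check emptiness once; afterwards indexing needs no `Option` handling.
  let sampleAt ← NN.API.Data.cycleDatasetOrError ds
  for i in [0:steps] do
    let _sample := sampleAt i
    -- ... one optimizer step on `_sample` ...
    pure ()
  pure ()
```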

                        Instances For
                          def NN.API.Data.get? {a : Type} (ds : Dataset a) (i : ℕ) :

                          Safe indexing into a dataset.

                          Instances For
                            def NN.API.Data.map {a b : Type} (f : a → b) (ds : Dataset a) :

                            Map a dataset elementwise (pure, deterministic).

                            Instances For
                              def NN.API.Data.append {a : Type} (x y : Dataset a) :

                              Concatenate two datasets.

                              Instances For
                                  def NN.API.Data.splitAt {a : Type} (n : ℕ) (ds : Dataset a) :

                                Split a dataset at position n (prefix, suffix).

                                Instances For
                                  def NN.API.Data.shuffle {a : Type} (seed : ℕ) (ds : Dataset a) :

                                  Shuffle a dataset deterministically, returning the updated RNG seed and the shuffled dataset.

                                  This is used to implement DataLoader.shuffle behavior in a purely functional way.

                                  Instances For
                                    def NN.API.Data.shuffled {a : Type} (seed : ℕ) (ds : Dataset a) :

                                    Convenience wrapper around shuffle that discards the updated seed.

                                    Instances For
                                      def NN.API.Data.randomSplitAt {a : Type} (seed n : ℕ) (ds : Dataset a) :

                                      Shuffle once and then split at n.

                                      This is a small building block for train/val splits.
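For example, a deterministic 80/20 train/validation split can be built directly from `size` and `randomSplitAt` (sketch):

```lean
def split80 {α : Type} (seed : Nat) (ds : NN.API.Data.Dataset α) :
    NN.API.Data.Dataset α × NN.API.Data.Dataset α :=
  -- Shuffle deterministically w.r.t. `seed`, then cut at 80% of the size.
  NN.API.Data.randomSplitAt seed (ds.size * 8 / 10) ds
```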

                                      Instances For
                                        def NN.API.Data.batches {a : Type} (tag : String) (batchSize : ℕ) (ds : Dataset a) :

                                        Split a dataset into equal-sized minibatches (as lists), dropping the final partial batch.

                                        This is a low-level helper; most users should use DataLoader.epoch or Data.batchedSupervised.

                                        Instances For
                                          def NN.API.Data.batchesArray {a : Type} (tag : String) (batchSize : ℕ) (ds : Dataset a) :

                                          Like batches, but return each minibatch as an Array instead of a List.

                                          Instances For
                                            @[reducible, inline]

                                            Untyped analogue of PyTorch's torch.utils.data.DataLoader.

                                            This is the deterministic, purely-functional loader provided by the TorchLean runtime.

                                            Instances For
                                              def NN.API.Data.loader {a : Type} (ds : Dataset a) (batchSize : ℕ) (shuffle : Bool := false) (seed : ℕ := 0) (dropLast : Bool := false) :

                                              Construct a RawDataLoader from a dataset.

                                              If shuffle := true, shuffling is deterministic w.r.t. seed. If dropLast := true, incomplete final batches are discarded.
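As a sketch, the TorchLean counterpart of PyTorch's `DataLoader(ds, batch_size=32, shuffle=True, drop_last=True)` would be:

```lean
def mkLoader {α : Type} (ds : NN.API.Data.Dataset α) : NN.API.Data.RawDataLoader α :=
  -- Shuffling is reproducible: the same `seed` yields the same batch order.
  NN.API.Data.loader ds 32 (shuffle := true) (seed := 42) (dropLast := true)
```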

                                              Instances For

                                                Run one epoch worth of minibatching and return:

                                                • an updated loader (with the new seed), and
                                                • the list of minibatches.
                                                Instances For
                                                  def NN.API.Data.epochCollate {a b : Type} (name : String) (dl : RawDataLoader a) (collate : List a → Except String b) :

                                                  Like epoch, but apply a user-provided collate function to each minibatch.

                                                  This is the TorchLean analogue of PyTorch's collate_fn option.
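A sketch of the intended shape, where `stackBatch` stands in for a user-written collate function (hypothetical; only `epochCollate` and `RawDataLoader` are taken from this page):

```lean
def runEpoch {α β : Type} (dl : NN.API.Data.RawDataLoader α)
    (stackBatch : List α → Except String β) :=
  -- `stackBatch` plays the role of PyTorch's `collate_fn`.
  NN.API.Data.epochCollate "train" dl stackBatch
```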

                                                  Instances For
                                                    structure NN.API.Data.BatchLoader (α : Type) (n : ℕ) (σ τ : Spec.Shape) :

                                                    Typed wrapper around RawDataLoader for supervised samples.

                                                    The batch size n is reflected in the type, and BatchLoader.epoch returns fully-collated dim n minibatches (so dropLast=true is required).

                                                    Instances For
                                                      @[reducible, inline]

                                                      Existential wrapper for loaders when the batch size is chosen at runtime.

                                                      Instances For

                                                        Note on default arguments:

                                                        The underlying CSV loaders take an opts : CsvOptions := {} argument. If we write abbrev fromCsvRows := readCsvFloatRows, Lean will apply the default argument and fromCsvRows will no longer accept opts.

                                                        So we eta-expand here to keep the public surface configurable.
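The same phenomenon with a toy definition:

```lean
def greet (name : String := "world") : String := s!"hello {name}"

-- A bare alias elaborates `greet` with its default argument already applied,
-- so `short` is a plain `String` rather than a configurable function.
abbrev short := greet

-- Eta-expanding restates the default, keeping the surface configurable.
abbrev greet' (name : String := "world") : String := greet name
```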

                                                        @[reducible, inline]

                                                        Read a CSV file as a list of rows of floats.

                                                        Instances For
                                                          @[reducible, inline]

                                                          Read a CSV file as (x, y) float pairs.

                                                          Instances For
                                                            @[reducible, inline]

                                                            Read a CSV file as length-n float vectors.

                                                            Instances For
                                                              @[reducible, inline]

                                                              Read a .npy file into a TorchLean dataset.

                                                              Instances For
                                                                @[reducible, inline]

                                                                Read a .npy file as a vector dataset.

                                                                Instances For
                                                                  @[reducible, inline]

                                                                  Read a .npy file as a matrix dataset.

                                                                  Instances For

                                                                    Convert a list of (x, y) float tensors into a dataset of TorchLean supervised samples.

                                                                    This casts float data into the selected scalar backend α and packs it into a TList α [σ, τ].

                                                                    Instances For

                                                                      Convert a list of (x, label) pairs into a dataset of one-hot classification samples.

                                                                      Labels are given as Nat and converted to one-hot targets of shape Vec classes.

                                                                      Instances For

                                                                        TensorDataset (dim0 batching) #

                                                                        PyTorch's TensorDataset concept is: given one or more tensors that share the same size(0), build a dataset of samples by slicing each tensor along dimension 0.

                                                                        In TorchLean we do the same thing, but with shapes tracked in the type:

                                                                        Slice a batched TList along dimension 0.

                                                                        If a sample is represented as a shape-indexed tuple TList β ss, then a minibatch of size n is TList β (ss.map (fun s => .dim n s)). This function picks a batch index i : Fin n and returns the corresponding single sample.
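In other words, the shape bookkeeping looks like this (sketch; names per this page):

```lean
-- batch : TList β (ss.map (fun s => .dim n s))   -- n stacked samples
-- i     : Fin n
-- sliceDim0 batch i : TList β ss                 -- one sample, per-sample shapes restored
```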

                                                                        Instances For

                                                                          Convert a shape-indexed TList of Float tensors to the runtime scalar type α.

                                                                          Instances For

                                                                            Build a dataset by slicing a batched TList along dim0.

                                                                            This is the TorchLean analogue of PyTorch's TensorDataset(t1, t2, ...).

                                                                            Instances For

                                                                              Float-to-α variant of tensorDatasetDim0, for data loaded from disk.

                                                                              Instances For

                                                                                Supervised dataset from two batched tensors X : (n, σ) and Y : (n, τ) by slicing dim0.

                                                                                This is the common regression/supervised-learning case: the TorchLean analogue of TensorDataset(X, Y) in PyTorch.

                                                                                Instances For

                                                                                  Float-to-α variant of supervisedDim0, for data loaded from disk.

                                                                                  Instances For

                                                                                    Higher-level loaders (PyTorch-style ergonomics) #

                                                                                    These are convenience helpers on top of the low-level CSV/NPY readers so example code can stay "data first" without re-implementing row splitting and casting at every call site.

                                                                                    Load an N-D tensor from a .npy file, checking the on-disk shape matches dims.

                                                                                    Instances For

                                                                                      Load an N-D tensor from a .npy file, allowing the file to contain more rows on dim 0.

                                                                                      This is the dataset-loader analogue of taking tensor[:n] in PyTorch. The rank and trailing dimensions must still match exactly; only the leading dimension may be larger than requested.

                                                                                      We use this for dataset sources rather than the stricter fromNpyTensorND because a real exported dataset usually has a fixed full size, while tutorials often ask for a small prefix during smoke tests or quick CUDA checks. For example, a CIFAR file may have shape (50000, 3, 32, 32) while a demo command asks for n = 80; the resulting TorchLean tensor has type-level shape (80, 3, 32, 32).
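The CIFAR case described above would look roughly like this (the exact signature is an assumption; only the loader name is from this page):

```lean
-- File on disk: (50000, 3, 32, 32); requested prefix: 80 rows.
-- The resulting tensor's type-level shape is (80, 3, 32, 32).
-- let xs ← fromNpyTensorNDPrefixDim0 "data/cifar_train_x.npy" [80, 3, 32, 32]
```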

                                                                                      This is intentionally still a checked loader, not an implicit reshape:

                                                                                      • rank must agree;
                                                                                      • all trailing dimensions must agree;
                                                                                      • the file must contain at least the requested number of rows;
                                                                                      • only C-order NPY files can be prefix-loaded efficiently by the low-level parser.
                                                                                      Instances For

                                                                                        Load an image tensor from a .npy file, checking it has shape (C, H, W).

                                                                                        Instances For

                                                                                          Load a batch of images from a .npy file, checking it has shape (N, C, H, W).

                                                                                          Instances For

                                                                                            Parse a float-encoded class label as a Nat in [0, classes).

                                                                                            Instances For

                                                                                              Labeled dataset from a batched tensor X : (n, σ) and a label vector y : (n,).

                                                                                              Labels are stored as floats (common when exporting from NumPy); we validate each label is an integer in [0, classes), then one-hot encode it.

                                                                                              Instances For

                                                                                                Load a supervised dataset from two .npy files containing batched arrays:

                                                                                                • X.npy has shape (n, xDims...)
                                                                                                • Y.npy has shape (n, yDims...)

                                                                                                and we build a dataset by slicing along dim0.

                                                                                                Instances For

                                                                                                  Load a labeled classification dataset from two .npy files:

                                                                                                  • X.npy has shape (n, xDims...)
                                                                                                  • y.npy has shape (n,) with float-encoded integer labels in [0, classes)

                                                                                                  and we build a dataset by slicing along dim0 and one-hot encoding the labels.

                                                                                                  Instances For

                                                                                                    Load a supervised dataset from a CSV with inDim + outDim columns per row:

                                                                                                    x1, ..., x_inDim, y1, ..., y_outDim.

                                                                                                    Instances For

                                                                                                      Load a labeled dataset from a CSV with inDim + 1 columns per row:

                                                                                                      x1, ..., x_inDim, label where label is in {0, ..., classes-1}.

                                                                                                      Instances For

                                                                                                        Unified file-source layer #

                                                                                                        The lower-level helpers above intentionally stay close to file formats (fromNpyTensorND, fromCsvRows, fromNpySupervised, ...). The definitions below give examples and applications a single scheme:

                                                                                                        1. describe each tensor as a TensorSource;
                                                                                                        2. load it as a typed TorchLean tensor;
                                                                                                        3. build supervised/labeled datasets by slicing dim0, just like PyTorch TensorDataset.
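A sketch of step 1, assuming `TensorSource` is a structure with `path`, `format`, and `dims` fields (the field names are an assumption based on the descriptions below):

```lean
def xSrc : NN.API.Data.TensorSource :=
  { path := "data/X.npy", format := .npy, dims := [1000, 4] }

-- Steps 2–3: load it as a typed Float tensor and slice dim0 into samples,
-- e.g. via `TensorSource.loadFloat` and the `supervisedDim0`-style helpers.
```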

                                                                                                        Policy for external ecosystems:

                                                                                                        File formats supported directly by the Lean-side unified data-source loader.

                                                                                                        Instances For

                                                                                                          Human-facing extension used by messages and examples.

                                                                                                          Instances For

                                                                                                            Description of one tensor stored on disk.

                                                                                                            dims is the expected tensor shape. NPY can load any rank supported by tensorND; CSV is treated as a numeric table and therefore expects dims = [rows, cols].

                                                                                                            Instances For

                                                                                                              Load a numeric CSV table as a tensor.

                                                                                                              Supported shapes:

                                                                                                              • [rows, cols]: ordinary numeric table,
                                                                                                              • [n]: either one column with n rows or one row with n columns.
                                                                                                              Instances For

                                                                                                                Load a Float tensor from a path/format/dimension tuple.

                                                                                                                Instances For

                                                                                                                  Load a Float tensor, allowing NPY files to contain more rows than requested on dim 0.

                                                                                                                  TensorSource.loadFloatAs is exact: the file shape must equal dims. This prefix variant is for dataset-style sources where dims starts with the number of rows requested by the current run. CSV sources remain exact because CSV has no cheap binary prefix contract; NPY sources use fromNpyTensorNDPrefixDim0.

                                                                                                                  Instances For

                                                                                                                    Load a TensorSource as a Float tensor with the statically reflected shapeOfDims src.dims.

                                                                                                                    Instances For

                                                                                                                      Two tensor sources representing supervised data:

                                                                                                                      • x must have shape (n, xDims...),
                                                                                                                      • y must have shape (n, yDims...).
                                                                                                                       • n : ℕ

                                                                                                                        Number of samples along dim0.

                                                                                                                       • xDims : List ℕ

                                                                                                                        Per-sample input dimensions.

                                                                                                                       • yDims : List ℕ

                                                                                                                        Per-sample target dimensions.

                                                                                                                      • Source for the batched input tensor.

                                                                                                                      • Source for the batched target tensor.

                                                                                                                      Instances For
                                                                                                                        def NN.API.Data.SupervisedSource.ofPaths (format : TensorFormat) (xPath yPath : System.FilePath) (n : ℕ) (xDims yDims : List ℕ) (csvOptions : CsvOptions := { }) :

                                                                                                                        Construct a supervised source from paths using the same file format for x and y.

                                                                                                                        Instances For

                                                                                                                          Load a supervised dataset by slicing dim0 from the two tensors.

                                                                                                                          This is the preferred public loader for regression/operator-learning examples, regardless of whether the backing files are .npy or small numeric CSV tables.

                                                                                                                          Instances For

                                                                                                                            Two tensor sources representing labeled classification data:

                                                                                                                            • x must have shape (n, xDims...),
                                                                                                                            • y must have shape (n,) and contain integer-valued labels.
                                                                                                                            • n : ℕ

                                                                                                                              Number of samples along dim0.

                                                                                                                            • xDims : List ℕ

                                                                                                                              Per-sample input dimensions.

                                                                                                                            • classes : ℕ

                                                                                                                              Number of classes for one-hot targets.

                                                                                                                            • Source for the batched input tensor.

                                                                                                                            • Source for the label vector.

                                                                                                                            Instances For
                                                                                                                              def NN.API.Data.LabeledSource.ofPaths (format : TensorFormat) (xPath yPath : System.FilePath) (n : ℕ) (xDims : List ℕ) (classes : ℕ) (csvOptions : CsvOptions := { }) :

                                                                                                                              Construct a labeled source from paths using the same file format for x and y.

                                                                                                                              Instances For

                                                                                                                                Load a labeled classification dataset by slicing dim0 and one-hot encoding labels.

                                                                                                                                For CSV label vectors, store labels as a single-column table with dims = [n, 1] and use a custom TensorSource if needed; the path constructor above is aimed at .npy label vectors.

                                                                                                                                Instances For

                                                                                                                                  Single-table supervised CSV source.

                                                                                                                                  Use this when one CSV row contains both input and target columns: x1, ..., x_inDim, y1, ..., y_outDim.

                                                                                                                                  • CSV file path.

                                                                                                                                   • inDim : ℕ

                                                                                                                                    Number of input feature columns.

• outDim : ℕ

                                                                                                                                    Number of target columns.

                                                                                                                                  • csvOptions : CsvOptions

                                                                                                                                    CSV parsing options.

                                                                                                                                  Instances For

                                                                                                                                    Build a supervised dataset from two matrices X : n×inDim and Y : n×outDim by pairing rows.

                                                                                                                                    This is the TorchLean analogue of PyTorch's TensorDataset(X, Y) for simple regression.

                                                                                                                                    Instances For

                                                                                                                                      Collate a length-n supervised batch into a single sample with a leading batch axis.

                                                                                                                                      If your samples are (x : σ, y : τ), the collated sample is:

                                                                                                                                      • xBatch : (n × σ) and
                                                                                                                                      • yBatch : (n × τ)

                                                                                                                                      In shapes: TList α [dim n σ, dim n τ].

                                                                                                                                      Instances For
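As a concrete instance of the shapes above, a small sketch (the `Vec`/`dim` shape constructors follow this page's other signatures; the specific sizes are assumptions):

```lean
-- With n := 32 samples, σ := Tensor.Shape.Vec 3 (three input features)
-- and τ := Tensor.Shape.Vec 1 (one target), the collated sample is a
-- two-entry typed tuple whose entries carry a leading batch axis:
--
--   TList Float [dim 32 (Tensor.Shape.Vec 3),  -- xBatch, a 32 × 3 tensor
--                dim 32 (Tensor.Shape.Vec 1)]  -- yBatch, a 32 × 1 tensor
```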
def NN.API.Data.chunkN {a : Type} (n : ℕ) (xs : List a) :

                                                                                                                                        Split a list into consecutive length-n chunks, dropping any final short chunk.

                                                                                                                                        Instances For
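A small sketch of the documented behavior (assuming this module is imported; the expected value follows directly from the docstring):

```lean
open NN.API.Data in
#eval chunkN 2 [1, 2, 3, 4, 5]
-- [[1, 2], [3, 4]]  -- the short trailing chunk [5] is dropped
```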
def NN.API.Data.chunkN.go {a : Type} (n : ℕ) (xs : List a) (fuel : ℕ) :
                                                                                                                                          Instances For

                                                                                                                                            Turn a per-sample supervised dataset into a dataset of fixed-size minibatches.

                                                                                                                                            This is useful for metrics (meanLossDataset, accuracy, etc.) when your model expects a leading batch axis.

                                                                                                                                            Notes:

                                                                                                                                            • This drops the final partial batch (PyTorch drop_last=True behavior).
                                                                                                                                            • Batches are formed in dataset order (shuffling is the loader's job).
                                                                                                                                            Instances For
def NN.API.Data.BatchLoader.dataset {α : Type} {n : ℕ} {σ τ : Spec.Shape} (dl : BatchLoader α n σ τ) :

                                                                                                                                              Extract the underlying per-sample dataset from a typed BatchLoader.

                                                                                                                                              Instances For
def NN.API.Data.BatchLoader.batchSize {α : Type} {n : ℕ} {σ τ : Spec.Shape} (_dl : BatchLoader α n σ τ) :

                                                                                                                                                The batch size n carried in the type of a BatchLoader.

                                                                                                                                                Instances For
def NN.API.Data.BatchLoader.shuffled {α : Type} {n : ℕ} {σ τ : Spec.Shape} (dl : BatchLoader α n σ τ) :

                                                                                                                                                  Whether the loader is configured to shuffle samples each epoch.

                                                                                                                                                  Instances For
def NN.API.Data.BatchLoader.seed {α : Type} {n : ℕ} {σ τ : Spec.Shape} (dl : BatchLoader α n σ τ) :

                                                                                                                                                    RNG seed used for shuffling (if enabled).

                                                                                                                                                    Instances For
def NN.API.Data.BatchLoader.batchDataset {α : Type} {n : ℕ} {σ τ : Spec.Shape} (dl : BatchLoader α n σ τ) :

                                                                                                                                                      Materialize the dataset as a dataset of full minibatches (dropping any final partial batch).

                                                                                                                                                      Instances For
def NN.API.Data.BatchLoader.epoch {α : Type} {n : ℕ} {σ τ : Spec.Shape} (name : String) (dl : BatchLoader α n σ τ) :
                                                                                                                                                        Except String (BatchLoader α n σ τ × List (sample.Batch α n σ τ))

                                                                                                                                                        Run one epoch: return the updated loader state and a list of typed minibatches.

                                                                                                                                                        Instances For
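A minimal epoch-loop sketch, assuming the namespaces below match this page (`BatchLoader.epoch` and `sample.Batch` are referenced as in the signatures above; the `step` callback is hypothetical):

```lean
open NN.API.Data in
def runEpoch {α : Type} {n : Nat} {σ τ : Spec.Shape}
    (dl : BatchLoader α n σ τ)
    (step : sample.Batch α n σ τ → Except String Unit) :
    Except String (BatchLoader α n σ τ) := do
  -- One pass over the data: `epoch` yields the updated loader state
  -- (e.g. advanced shuffle RNG) plus the typed minibatches.
  let (dl', batches) ← dl.epoch "train"
  for b in batches do
    step b            -- e.g. forward/backward on one minibatch (hypothetical)
  pure dl'            -- thread the updated loader into the next epoch
```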
def NN.API.Data.BatchLoader.epochCollate {α β : Type} {n : ℕ} {σ τ : Spec.Shape} (name : String) (dl : BatchLoader α n σ τ) (f : sample.Batch α n σ τ → Except String β) :
                                                                                                                                                          Except String (BatchLoader α n σ τ × List β)

                                                                                                                                                          Like epoch, but post-process each minibatch with a user-supplied collate/transform f.

                                                                                                                                                          Instances For
def NN.API.Data.batchLoader {α : Type} {σ τ : Spec.Shape} (ds : Dataset (sample.Supervised α σ τ)) (batchSize : ℕ) (shuffle : Bool := false) (seed : ℕ := 0) (dropLast : Bool := true) :
                                                                                                                                                            BatchLoader α batchSize σ τ

                                                                                                                                                            Public loader API: supervised datasets become fixed-size minibatch loaders by default.

                                                                                                                                                            The underlying dataset still stores individual samples; the loader batches them and epoch returns tensors with a leading dim0 batch axis. Because the batch size is reflected in the type, the public batched path requires full batches, so dropLast defaults to true.

                                                                                                                                                            Instances For
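A usage sketch (only `batchLoader` and the types on this page are assumed; the dataset binder `ds` is a placeholder):

```lean
open NN.API.Data in
example (σ τ : Spec.Shape) (ds : Dataset (sample.Supervised Float σ τ)) :
    BatchLoader Float 32 σ τ :=
  -- Shuffle each epoch with a fixed seed; dropLast is already true by
  -- default because the batch size 32 is part of the loader's type.
  batchLoader ds 32 (shuffle := true) (seed := 42)
```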
def NN.API.Data.loaderAny {α : Type} {σ τ : Spec.Shape} (ds : Dataset (sample.Supervised α σ τ)) (batchSize : ℕ) (shuffle : Bool := false) (seed : ℕ := 0) (dropLast : Bool := true) :

                                                                                                                                                              Build a batch loader when the batch size is only known at runtime.

                                                                                                                                                              Instances For