TorchLean API

NN.Runtime.Autograd.Train.Dataset

Dataset and data loader utilities #

This module defines a small in-memory dataset wrapper together with a deterministic (seeded) loader. The shuffling logic is pure and reproducible.

Design boundary:

Dataset #

A lightweight in-memory dataset wrapper backed by an Array.

  • data : Array a

    Samples stored in insertion order.

Construct a dataset from a list.

Convert a dataset to a list.

@[simp]

Round-tripping a list through Dataset preserves order.
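The wrapper and its round-trip can be sketched as follows; the definitions mirror the documented behaviour but are illustrative, not the module's source:

```lean
-- Illustrative sketch, not the module's source: an Array-backed
-- wrapper with list conversions and the documented round-trip lemma.
structure DatasetSketch (a : Type) where
  data : Array a

def DatasetSketch.ofList {a : Type} (xs : List a) : DatasetSketch a :=
  ⟨xs.toArray⟩

def DatasetSketch.toList {a : Type} (ds : DatasetSketch a) : List a :=
  ds.data.toList

-- Round-tripping a list through the wrapper preserves order.
@[simp] theorem DatasetSketch.toList_ofList {a : Type} (xs : List a) :
    (DatasetSketch.ofList xs).toList = xs := by
  simp [ofList, toList]
```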

Number of samples in the dataset.

@[simp]

ofList preserves the list length as dataset size.

Return true iff the dataset has no samples.

@[simp]

A dataset built from a nonempty cons list is not empty.

Safe indexing into the dataset.
def Runtime.Autograd.Train.Dataset.map {a b : Type} (f : a → b) (ds : Dataset a) : Dataset b

Map a function over all samples in the dataset.

Append two datasets.

Split a dataset at index n (preserving order).
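On the Array-backed representation these combinators reduce to standard Array operations; a hedged sketch with illustrative names:

```lean
-- Illustrative sketches of map/append/splitAt over the underlying
-- Array; the module's actual definitions may differ.
def mapSketch {a b : Type} (f : a → b) (xs : Array a) : Array b :=
  xs.map f

def appendSketch {a : Type} (xs ys : Array a) : Array a :=
  xs ++ ys

-- Split at index n, preserving order: the first component holds the
-- first n samples, the second holds the rest.
def splitAtSketch {a : Type} (n : Nat) (xs : Array a) : Array a × Array a :=
  (xs.extract 0 n, xs.extract n xs.size)
```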

Deterministic shuffle #

We use a small LCG-based shuffle for reproducibility without IO.

One step of the simple LCG used to generate a deterministic pseudo-random stream.

Pair each element with a deterministic pseudo-random key. We then sort by the key to obtain a deterministic permutation.

Deterministically shuffle a dataset.

Returns the next seed along with the shuffled dataset, so repeated epochs can thread the seed and remain pure/replayable.
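The key-sort approach can be sketched as follows; the LCG constants and helper names are illustrative assumptions, not the module's actual choices:

```lean
-- One LCG step (constants are illustrative; any full-period LCG works).
def lcgStepSketch (seed : Nat) : Nat :=
  (6364136223846793005 * seed + 1442695040888963407) % (2 ^ 64)

-- Pair each element with a pseudo-random key drawn from the seeded
-- stream, sort by key, and return the next seed for the caller to
-- thread into the following epoch.
def shuffleSketch {a : Type} (seed : Nat) (xs : Array a) : Nat × Array a :=
  let (keyed, seed') := xs.foldl
    (fun (acc : Array (Nat × a) × Nat) x =>
      let s := lcgStepSketch acc.2
      (acc.1.push (s, x), s))
    (#[], seed)
  let sorted := keyed.qsort (fun p q => p.1 < q.1)
  (seed', sorted.map Prod.snd)
```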

Batching #

These helpers return batches as lists for easy use with TapeM utilities.

def Runtime.Autograd.Train.Dataset.batches {a : Type} (tag : String) (batchSize : ℕ) (ds : Dataset a) :

Split a dataset into non-empty batches of size at most batchSize.

Errors if batchSize = 0 or the dataset is empty.

def Runtime.Autograd.Train.Dataset.batchesArray {a : Type} (tag : String) (batchSize : ℕ) (ds : Dataset a) :

Like batches, but return each batch as an Array.
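A list-level sketch of the batching rule; `Except String` as the error channel is an assumption inferred from the documented failure cases, not the module's actual return type:

```lean
-- Fold samples into chunks of exactly batchSize, keeping any final
-- partial chunk; error on the documented invalid inputs, so every
-- returned batch is non-empty with size at most batchSize.
def batchesSketch {a : Type} (tag : String) (batchSize : Nat)
    (xs : List a) : Except String (List (List a)) :=
  if batchSize == 0 then
    .error s!"{tag}: batchSize must be positive"
  else if xs.isEmpty then
    .error s!"{tag}: dataset is empty"
  else
    let step := fun (acc : List (List a) × List a) x =>
      let cur := acc.2 ++ [x]
      if cur.length == batchSize then (acc.1 ++ [cur], []) else (acc.1, cur)
    let (done, rest) := xs.foldl step ([], [])
    .ok (if rest.isEmpty then done else done ++ [rest])
```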

DataLoader #

DataLoader.epoch optionally shuffles and returns a list of batches.

Deterministic data loader configuration.

epoch threads the seed and (optionally) shuffles to produce a list of batches. This is a deliberately small, pure analogue of a PyTorch DataLoader.

  • dataset : Dataset a

    Dataset to batch. The loader keeps this value pure and explicit.

  • batchSize : ℕ

    Batch size.

  • shuffle : Bool

    If true, run the deterministic seeded shuffle before each epoch.

  • seed : ℕ

    Seed threaded through deterministic shuffles.

  • dropLast : Bool

    Drop the final partial batch when it is shorter than batchSize.
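Written out as a structure, the documented configuration looks roughly like this (with `Array a` standing in for `Dataset a` to keep the sketch self-contained):

```lean
-- Field names follow the documentation above; types written ℕ in the
-- docs appear as Nat here. Field order is an assumption.
structure DataLoaderSketch (a : Type) where
  dataset   : Array a  -- stands in for `Dataset a`
  batchSize : Nat
  shuffle   : Bool
  seed      : Nat
  dropLast  : Bool
```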

Run one epoch: optionally shuffle and return the list of batches.

The returned DataLoader has its seed updated so you can call epoch repeatedly to get different but reproducible shuffles.

Like epoch, but return each batch as an Array.

Run one epoch and collate each batch into a single value.

This is the main building block for minibatch training where your model expects tensors with a leading batch axis, but your dataset is stored as individual samples.
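The seed-threading contract can be shown in miniature; this toy loader only advances its seed, but it demonstrates why returning the updated loader keeps epochs reproducible. All names and the LCG constants are illustrative:

```lean
structure TinyLoader where
  seed : Nat

-- Each epoch reports the current seed and returns a loader whose seed
-- has advanced by one LCG step.
def TinyLoader.epoch (dl : TinyLoader) : Nat × TinyLoader :=
  let next := (6364136223846793005 * dl.seed + 1442695040888963407) % (2 ^ 64)
  (dl.seed, { dl with seed := next })

-- Running n epochs from the same initial seed always replays the same
-- sequence, which is the reproducibility property epoch relies on.
def runEpochs : Nat → TinyLoader → List Nat
  | 0, _ => []
  | n + 1, dl =>
    let (s, dl') := dl.epoch
    s :: runEpochs n dl'
```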