TorchLean API

NN.API.Public.Facade.Data.Text

TorchLean Public Text Data

Causal language-model sample and dataset constructors.

def TorchLean.Data.regression2to1Grid (lo hi : Float) (count : ℕ) (target : Float → Float → Float) :
def TorchLean.Data.causalLmOneHotSample {α : Type} [Runtime.SemanticScalar α] [Runtime.Scalar α] (batch seqLen vocab : ℕ) (tokens : List ℕ) (padId : ℕ := 0) :
SupervisedSample α (NN.Tensor.shapeOfDims [batch, seqLen, vocab]) (NN.Tensor.shapeOfDims [batch, seqLen, vocab])

Build a batched one-hot causal-language-model sample by repeating one token window across every batch row.

The token list is treated as a window of seqLen + 1 tokens. The causal-LM construction pads shorter lists and truncates longer ones.
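To make the seqLen + 1 window concrete, here is a minimal sketch in core Lean 4 (using Nat, which Mathlib displays as ℕ). It is not TorchLean's implementation: `fitWindow` and `shiftPair` are hypothetical helpers, and the sketch assumes padding is appended on the right with padId, that inputs are the first seqLen tokens, and that targets are the same window shifted by one position.

```lean
/-- Hypothetical helper: pad on the right with `padId`, or truncate,
    so the window holds exactly `seqLen + 1` tokens. -/
def fitWindow (seqLen : Nat) (tokens : List Nat) (padId : Nat := 0) : List Nat :=
  let w := tokens.take (seqLen + 1)
  w ++ List.replicate (seqLen + 1 - w.length) padId

/-- Hypothetical helper: split a fitted window into next-token inputs and targets. -/
def shiftPair (seqLen : Nat) (window : List Nat) : List Nat × List Nat :=
  (window.take seqLen, window.drop 1)

#eval shiftPair 3 (fitWindow 3 [5, 6])           -- ([5, 6, 0], [6, 0, 0])
#eval shiftPair 3 (fitWindow 3 [1, 2, 3, 4, 5])  -- ([1, 2, 3], [2, 3, 4])
```

Under these assumptions, a window of seqLen + 1 tokens yields exactly seqLen input positions and seqLen target positions, which matches the identical input and target shapes in the signature.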

def TorchLean.Data.causalLmOneHotSampleRows {α : Type} [Runtime.SemanticScalar α] [Runtime.Scalar α] (batch seqLen vocab : ℕ) (tokensAt : Fin batch → List ℕ) (padId : ℕ := 0) :
SupervisedSample α (NN.Tensor.shapeOfDims [batch, seqLen, vocab]) (NN.Tensor.shapeOfDims [batch, seqLen, vocab])

Build a batched one-hot causal-language-model sample from one token window per batch row.

Use this for GPT-style examples that already know the (seqLen + 1)-token window each batch row should see.
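The per-row argument is a function indexed by Fin batch, so every row index is in range by construction. A small sketch of such a function in core Lean 4 follows; the name `tokensAt` mirrors the parameter, and the token values are made up for illustration (batch = 2, seqLen = 3).

```lean
-- Hypothetical per-row windows: two rows, four tokens each (seqLen + 1 = 4).
def tokensAt : Fin 2 → List Nat := fun row =>
  if row.val = 0 then [1, 2, 3, 4] else [4, 3, 2, 1]

#eval tokensAt 0  -- [1, 2, 3, 4]
#eval tokensAt 1  -- [4, 3, 2, 1]
```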

def TorchLean.Data.causalLmOneHotSampleRowsFromArray {α : Type} [Runtime.SemanticScalar α] [Runtime.Scalar α] (batch seqLen vocab : ℕ) (windows : Array (List ℕ)) (fallback : List ℕ) (padId : ℕ := 0) :
SupervisedSample α (NN.Tensor.shapeOfDims [batch, seqLen, vocab]) (NN.Tensor.shapeOfDims [batch, seqLen, vocab])

Build a batched one-hot causal-language-model sample from an array of per-row token windows.

Rows past the end of the array use the explicit fallback window, so partial-batch behavior stays visible at the call site.
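The fallback rule can be sketched with an optional array lookup. This is an illustration in core Lean 4 only; `rowWindow` is a hypothetical helper, not a TorchLean function.

```lean
/-- Hypothetical helper: the window for `row`, or `fallback` when the
    array has no entry at that index. -/
def rowWindow (windows : Array (List Nat)) (fallback : List Nat) (row : Nat) : List Nat :=
  (windows[row]?).getD fallback

-- Two windows supplied for a batch of three rows: the last row uses the fallback.
#eval (List.range 3).map (rowWindow #[[1, 2, 3], [4, 5, 6]] [0, 0, 0])
-- [[1, 2, 3], [4, 5, 6], [0, 0, 0]]
```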

def TorchLean.Data.causalLmOneHotSampleRowsFromTokenArray {α : Type} [Runtime.SemanticScalar α] [Runtime.Scalar α] (batch seqLen vocab : ℕ) (tokens : Array ℕ) (seed step : ℕ) (padId : ℕ := 0) :
SupervisedSample α (NN.Tensor.shapeOfDims [batch, seqLen, vocab]) (NN.Tensor.shapeOfDims [batch, seqLen, vocab])

Build a batched one-hot causal-language-model sample from a token array by choosing one deterministic (seqLen + 1)-token window per batch row.

Use this for GPT-style trainers that keep a tokenized corpus in memory and derive each batch from the same (tokens, seed, step) rule.
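The documentation does not state how the window offset is derived, so the sketch below only illustrates what "deterministic" means here: the same (tokens, seed, step, row) inputs always select the same window. The mixing formula and the helper name `pickWindow` are assumptions made for illustration, not TorchLean's rule.

```lean
/-- Hypothetical selection rule: derive a start offset from `seed`, `step`,
    and `row`, then slice `seqLen + 1` tokens from the corpus. -/
def pickWindow (tokens : Array Nat) (seqLen seed step row : Nat) : List Nat :=
  -- Number of start offsets used here; 0 when the corpus is too short.
  let span := tokens.size - seqLen
  -- Arbitrary illustrative mixing constants (an assumption, not the library's).
  let mix := seed + 31 * step + 17 * row
  let start := if span = 0 then 0 else mix % span
  (tokens.extract start (start + seqLen + 1)).toList

def corpus : Array Nat := #[10, 11, 12, 13, 14, 15, 16, 17]

#eval pickWindow corpus 3 7 0 0  -- [12, 13, 14, 15]
#eval pickWindow corpus 3 7 0 1  -- [14, 15, 16, 17]
```

Because the rule is a pure function of its arguments, re-running a training step with the same seed and step reproduces the same batch.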

def TorchLean.Data.causalLmOneHotMatSample {α : Type} [Runtime.SemanticScalar α] [Runtime.Scalar α] (seqLen vocab : ℕ) (tokens : List ℕ) :
SupervisedSample α (Shape.mat seqLen vocab) (Shape.mat seqLen vocab)

Build one unbatched one-hot causal-language-model sample directly from a token list.

The token list is treated as a window of seqLen + 1 tokens. The causal-LM construction pads shorter lists and truncates longer ones.
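For readers unfamiliar with the one-hot layout, the sketch below builds a seqLen × vocab matrix as nested lists in core Lean 4. It uses 0/1 naturals instead of the scalar type α, and `oneHot` and `oneHotRows` are hypothetical helpers. How TorchLean treats a token id that is not below vocab is not documented here; in this sketch such a token simply produces an all-zero row.

```lean
/-- Hypothetical helper: a length-`vocab` row with a single 1 at index `tok`. -/
def oneHot (vocab tok : Nat) : List Nat :=
  (List.range vocab).map fun j => if j = tok then 1 else 0

/-- Hypothetical helper: one one-hot row per token. -/
def oneHotRows (vocab : Nat) (toks : List Nat) : List (List Nat) :=
  toks.map (oneHot vocab)

#eval oneHotRows 4 [2, 0]  -- [[0, 0, 1, 0], [1, 0, 0, 0]]
```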

def TorchLean.Data.textCausalSample {α : Type} [Runtime.SemanticScalar α] [Runtime.Scalar α] (seqLen vocab : ℕ) (input : String) :
SupervisedSample α (Shape.mat seqLen vocab) (Shape.mat seqLen vocab)

Build one unbatched one-hot causal-language-model sample from a text corpus string.

This takes one window of seqLen + 1 bytes from the UTF-8 encoding of input, converts it to one-hot x/y matrices, and casts the result into the runtime-selected scalar type.
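Because tokens are UTF-8 bytes rather than characters, a non-ASCII character contributes more than one token. The sketch below shows the byte-level view in core Lean 4; `byteTokens` is a hypothetical helper, not part of TorchLean.

```lean
/-- Hypothetical helper: the UTF-8 bytes of a string as natural-number token ids. -/
def byteTokens (s : String) : List Nat :=
  s.toUTF8.data.toList.map (·.toNat)

#eval byteTokens "hi"  -- [104, 105]
#eval byteTokens "hé"  -- [104, 195, 169]
```

Byte values range from 0 to 255, so a vocab of 256 covers every possible token under this reading.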

def TorchLean.Data.textCausalBatchSample {α : Type} [Runtime.SemanticScalar α] [Runtime.Scalar α] (batch seqLen vocab : ℕ) (input : String) :
SupervisedSample α (Spec.Shape.dim batch (Shape.mat seqLen vocab)) (Spec.Shape.dim batch (Shape.mat seqLen vocab))

Build one fixed-batch one-hot causal-language-model sample from a text corpus string by repeating the same text window across every batch row.
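Repeating one window across the batch axis amounts to replicating a row. A minimal core Lean 4 sketch, with `repeatRows` as a hypothetical helper operating on token ids instead of one-hot tensors:

```lean
/-- Hypothetical helper: the same row copied `batch` times. -/
def repeatRows (batch : Nat) (row : List Nat) : List (List Nat) :=
  List.replicate batch row

#eval repeatRows 3 [104, 105]  -- [[104, 105], [104, 105], [104, 105]]
```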

def TorchLean.Data.textCausalDataset (seqLen vocab : ℕ) (input : String) :
Trainer.Dataset (Shape.mat seqLen vocab) (Shape.mat seqLen vocab)

Build a runtime-polymorphic dataset containing one unbatched causal-language-model sample from a text corpus string.

def TorchLean.Data.textCausalBatchDataset (batch seqLen vocab : ℕ) (input : String) :
Trainer.Dataset (Spec.Shape.dim batch (Shape.mat seqLen vocab)) (Spec.Shape.dim batch (Shape.mat seqLen vocab))

Build a runtime-polymorphic dataset containing one causal-language-model sample repeated across a fixed batch axis.

Use this when the model itself owns the batch dimension but the example naturally starts from one text window.
