TorchLean Public Text Data

Causal language-model sample and dataset constructors.

def TorchLean.Data.regression2to1Grid (lo hi : Float) (count : ℕ) (target : Float → Float → Float) :

Trainer.Dataset (Shape.vec 2) (Shape.vec 1)

def TorchLean.Data.causalLmOneHotSample {α : Type} [Runtime.SemanticScalar α] [Runtime.Scalar α] (batch seqLen vocab : ℕ) (tokens : List ℕ) (padId : ℕ := 0) :

SupervisedSample α (NN.Tensor.shapeOfDims [batch, seqLen, vocab]) (NN.Tensor.shapeOfDims [batch, seqLen, vocab])

Build a batched one-hot causal-language-model sample by repeating one token window across every batch row.

The token list represents a seqLen + 1 window. Shorter lists are padded and longer lists are truncated by the causal-LM construction.
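
The padding, truncation, and input/target split described above can be sketched in plain Lean. This is an illustration rather than TorchLean's implementation: the helper names `fitWindow`, `shiftPair`, and `oneHot` are hypothetical, and it assumes padding is appended at the end with `padId`, that targets are the inputs shifted by one position, and that token ids outside the vocabulary yield an all-zero row.

```lean
/-- Hypothetical helper: pad with `padId` or truncate so the window
    holds exactly `seqLen + 1` tokens (padding position is an assumption). -/
def fitWindow (seqLen padId : Nat) (tokens : List Nat) : List Nat :=
  let w := tokens.take (seqLen + 1)
  w ++ List.replicate (seqLen + 1 - w.length) padId

/-- Hypothetical helper: split the fitted window into inputs and
    next-token targets, each of length `seqLen`. -/
def shiftPair (seqLen padId : Nat) (tokens : List Nat) : List Nat × List Nat :=
  let w := fitWindow seqLen padId tokens
  (w.take seqLen, w.drop 1)

/-- Hypothetical helper: one-hot row of width `vocab`.
    Ids at or above `vocab` give an all-zero row in this sketch. -/
def oneHot (vocab tok : Nat) : List Nat :=
  (List.range vocab).map fun j => if j = tok then 1 else 0

#eval shiftPair 3 0 [5, 6]           -- ([5, 6, 0], [6, 0, 0])
#eval shiftPair 3 0 [1, 2, 3, 4, 5]  -- ([1, 2, 3], [2, 3, 4])
#eval oneHot 4 2                     -- [0, 0, 1, 0]
```

Both halves of the pair have length `seqLen`, which matches the identical input and target shapes in the signature once each token becomes a one-hot row of width `vocab`.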

def TorchLean.Data.causalLmOneHotSampleRows {α : Type} [Runtime.SemanticScalar α] [Runtime.Scalar α] (batch seqLen vocab : ℕ) (tokensAt : Fin batch → List ℕ) (padId : ℕ := 0) :

SupervisedSample α (NN.Tensor.shapeOfDims [batch, seqLen, vocab]) (NN.Tensor.shapeOfDims [batch, seqLen, vocab])

Build a batched one-hot causal-language-model sample from one token window per batch row.

Use this for GPT-style examples where the caller already has the (seqLen + 1) token window that each batch row should see.

def TorchLean.Data.causalLmOneHotSampleRowsFromArray {α : Type} [Runtime.SemanticScalar α] [Runtime.Scalar α] (batch seqLen vocab : ℕ) (windows : Array (List ℕ)) (fallback : List ℕ) (padId : ℕ := 0) :

SupervisedSample α (NN.Tensor.shapeOfDims [batch, seqLen, vocab]) (NN.Tensor.shapeOfDims [batch, seqLen, vocab])

Build a batched one-hot causal-language-model sample from an array of per-row token windows.

Rows past the end of the array use the explicit fallback window, so partial-batch behavior stays visible at the call site.
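
The fallback rule can be pictured with a small plain-Lean sketch. The names `rowWindow` and `batchWindows` are hypothetical and only model which window each row receives; they do not build tensors and are not part of the TorchLean API.

```lean
/-- Hypothetical helper: row `i` uses `windows[i]` when it exists,
    otherwise the explicit `fallback` window. -/
def rowWindow (windows : Array (List Nat)) (fallback : List Nat) (i : Nat) : List Nat :=
  windows.getD i fallback

/-- Hypothetical helper: the window chosen for each of the `batch` rows. -/
def batchWindows (batch : Nat) (windows : Array (List Nat)) (fallback : List Nat) :
    List (List Nat) :=
  (List.range batch).map (rowWindow windows fallback)

-- Two windows supplied for a batch of three: the last row gets the fallback.
#eval batchWindows 3 #[[1, 2, 3], [4, 5, 6]] [0, 0, 0]
-- [[1, 2, 3], [4, 5, 6], [0, 0, 0]]
```

Because the fallback is an explicit argument, a short final batch is filled with a window the caller chose rather than with silently repeated data.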

def TorchLean.Data.causalLmOneHotSampleRowsFromTokenArray {α : Type} [Runtime.SemanticScalar α] [Runtime.Scalar α] (batch seqLen vocab : ℕ) (tokens : Array ℕ) (seed step : ℕ) (padId : ℕ := 0) :

SupervisedSample α (NN.Tensor.shapeOfDims [batch, seqLen, vocab]) (NN.Tensor.shapeOfDims [batch, seqLen, vocab])

Build a batched one-hot causal-language-model sample from a token array by choosing one deterministic (seqLen + 1) window per batch row.

Use this for GPT-style trainers that keep a tokenized corpus in memory and derive each batch from the same (tokens, seed, step) rule.
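
The source does not spell out the selection rule, so the following is only a sketch of what "deterministic" means here: the window for a row is a pure function of `(tokens, seed, step)` and the row index. The formula in `windowStart` and both helper names are assumptions made for illustration, not TorchLean's actual rule.

```lean
/-- Hypothetical rule: start offset as a pure function of its arguments.
    `size - seqLen` counts the valid starts for a window of `seqLen + 1`
    tokens (natural-number subtraction gives 0 when the corpus is too short). -/
def windowStart (size seqLen batch seed step row : Nat) : Nat :=
  let span := size - seqLen
  if span = 0 then 0 else (seed + step * batch + row) % span

/-- Hypothetical helper: the `seqLen + 1` tokens beginning at the chosen offset. -/
def pickWindow (tokens : Array Nat) (seqLen batch seed step row : Nat) : List Nat :=
  let start := windowStart tokens.size seqLen batch seed step row
  (tokens.extract start (start + seqLen + 1)).toList

def corpus : Array Nat := #[10, 11, 12, 13, 14, 15, 16, 17]

#eval pickWindow corpus 3 2 0 0 0  -- [10, 11, 12, 13]
#eval pickWindow corpus 3 2 0 0 1  -- [11, 12, 13, 14]
#eval pickWindow corpus 3 2 0 1 0  -- [12, 13, 14, 15]
```

Re-evaluating with the same arguments always returns the same window, which is the property that lets a trainer reproduce any batch from its step number.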

def TorchLean.Data.causalLmOneHotMatSample {α : Type} [Runtime.SemanticScalar α] [Runtime.Scalar α] (seqLen vocab : ℕ) (tokens : List ℕ) :

SupervisedSample α (Shape.mat seqLen vocab) (Shape.mat seqLen vocab)

Build one unbatched one-hot causal-language-model sample directly from a token list.

The token list represents a seqLen + 1 window. Shorter lists are padded and longer lists are truncated by the causal-LM construction.

def TorchLean.Data.textCausalSample {α : Type} [Runtime.SemanticScalar α] [Runtime.Scalar α] (seqLen vocab : ℕ) (input : String) :

SupervisedSample α (Shape.mat seqLen vocab) (Shape.mat seqLen vocab)

Build one unbatched one-hot causal-language-model sample from a text corpus string.

This takes one (seqLen + 1) byte window from the UTF-8 bytes of input, converts it to one-hot x/y matrices, and casts the result into the runtime-selected scalar.
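
To make the byte-level tokenization concrete, here is a plain-Lean sketch of turning a string into token ids through its UTF-8 bytes. The helper name `byteTokens` is hypothetical, and the sketch leaves out how the function picks its window or handles `vocab`, since the source does not specify either.

```lean
/-- Hypothetical helper: the UTF-8 bytes of `input` as natural-number token ids. -/
def byteTokens (input : String) : List Nat :=
  input.toUTF8.toList.map UInt8.toNat

#eval byteTokens "hi"  -- [104, 105]
#eval byteTokens "é"   -- [195, 169]
```

A non-ASCII character contributes more than one token, so a window of (seqLen + 1) bytes can cover fewer than (seqLen + 1) characters. A `vocab` of 256 would cover every possible byte value, though the source does not state that as a requirement.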

def TorchLean.Data.textCausalBatchSample {α : Type} [Runtime.SemanticScalar α] [Runtime.Scalar α] (batch seqLen vocab : ℕ) (input : String) :

SupervisedSample α (Spec.Shape.dim batch (Shape.mat seqLen vocab)) (Spec.Shape.dim batch (Shape.mat seqLen vocab))

Build one fixed-batch one-hot causal-language-model sample from a text corpus string by repeating the same text window across every batch row.

def TorchLean.Data.textCausalDataset (seqLen vocab : ℕ) (input : String) :

Trainer.Dataset (Shape.mat seqLen vocab) (Shape.mat seqLen vocab)

Build a runtime-polymorphic dataset containing one unbatched causal-language-model sample from a text corpus string.

def TorchLean.Data.textCausalBatchDataset (batch seqLen vocab : ℕ) (input : String) :

Trainer.Dataset (Spec.Shape.dim batch (Shape.mat seqLen vocab)) (Spec.Shape.dim batch (Shape.mat seqLen vocab))

Build a runtime-polymorphic dataset containing one causal-language-model sample repeated across a fixed batch axis.

Use this when the model itself owns the batch dimension but the example naturally starts from one text window.

NN.API.Public.Facade.Data.Text