TorchLean API

Docs Home Guide Examples Graphs

NN.API.Text

API Text #

Small text / NLP helpers for TorchLean examples.

TorchLean’s executable runtime expects inputs as floating tensors, so runtime and autograd code can handle them with the same typed tensor APIs as parameters. For language models this means we commonly represent token ids as one-hot / token-distribution tensors of shape:

(batch × seqLen × vocab)

and implement “token embeddings” as a matrix multiply against an embedding table.

This module provides:

a tokenizer interface (with a byte-level tokenizer),
helpers to turn token streams into one-hot tensors,
small “next-token prediction” sample builders used by GPT-style demos,
display helpers for turning model logits back into readable token predictions.

Tokenizers #

structure NN.API.text.Tokenizer :

Tokenizer interface (encode/decode).

vocabSize : ℕ
Vocabulary size (token ids are expected to be in [0, vocabSize)).
encode : String → List ℕ
Encode a string into token ids.
decode : List ℕ → String
Decode token ids back into a string.

Instances For

def NN.API.text.Tokenizer.byteArrayOfIds (ids : List ℕ) :

Convert token ids to bytes, truncating each id modulo 256.

Instances For

def NN.API.text.Tokenizer.decodeByteIds (ids : List ℕ) :

Decode byte tokens as UTF-8 when possible, falling back to a byte-wise display mode for invalid generated byte streams. This keeps decode (encode s) = s for arbitrary UTF-8 strings while remaining total for demos that sample invalid byte sequences.

Instances For

def NN.API.text.Tokenizer.byte :

Byte-level UTF-8 tokenizer: each byte is one token in [0,256).

Instances For

def NN.API.text.Tokenizer.ofAlphabet (alphabet : Array Char) (unkId : ℕ := 0) (unkChar : Char := '?') :

Build a character-level tokenizer from an explicit alphabet.

This is the TorchLean analogue of the stoi/itos tables used in many educational GPT demos (including Karpathy's "char-gpt" / minGPT walkthroughs): encode maps characters to ids 0..alphabet.size-1, and decode maps ids back to characters.

Notes:

This tokenizer is deterministic given alphabet; callers are responsible for choosing how to construct the alphabet (e.g. sorted(set(data))).
Characters not present in the alphabet map to unkId (default 0). This keeps encode total.
Ids outside [0, vocabSize) decode to the unkChar (default ?).

Instances For

def NN.API.text.Tokenizer.encodeVec (t : Tokenizer) (n : ℕ) (s : String) (padId : ℕ := 0) :

Encode and pad/truncate to a fixed length, returning a length-indexed Vector.

Instances For

def NN.API.text.Tokenizer.encodeBatchVec (t : Tokenizer) (batch seqLen : ℕ) (ss : List String) (padId : ℕ := 0) :

Vector (Vector ℕ seqLen) batch

Encode a batch of strings, padding/truncating each to length seqLen.

Instances For

One-Hot Token Tensors #

def NN.API.text.oneHotTokenFloat (vocab tokenId : ℕ) :

Tensor.Tensor Float (Shape.Vec vocab)

One-hot vector for a single token id (Vec vocab). Out-of-range ids map to all-zeros.

Instances For

def NN.API.text.tokensToOneHotMatFloat {seqLen vocab : ℕ} (tokens : Vector ℕ seqLen) :

Tensor.Tensor Float (Shape.Mat seqLen vocab)

One-hot encode a fixed-length token sequence as a matrix (seqLen × vocab).

Instances For

def NN.API.text.tokensToOneHotBatchFloat {batch seqLen vocab : ℕ} (tokens : Vector (Vector ℕ seqLen) batch) :

Tensor.Tensor Float (Spec.Shape.dim batch (Shape.Mat seqLen vocab))

One-hot encode a fixed-size batch of token sequences as (batch × seqLen × vocab).

Instances For

Causal LM Samples #

def NN.API.text.causalLmXYOneHotMatFloat (seqLen vocab : ℕ) (tokens : List ℕ) (padId : ℕ := 0) :

Tensor.Tensor Float (Shape.Mat seqLen vocab) × Tensor.Tensor Float (Shape.Mat seqLen vocab)

Build a (x, y) pair for next-token prediction from a token stream.

x[t] = oneHot(tokens[t]) y[t] = oneHot(tokens[t+1])

If the stream is too short, we pad with padId.

Instances For

def NN.API.text.causalLmXYOneHotBatchRowsFloat (batch seqLen vocab : ℕ) (tokensAt : Fin batch → List ℕ) (padId : ℕ := 0) :

Tensor.Tensor Float (Spec.Shape.dim batch (Shape.Mat seqLen vocab)) × Tensor.Tensor Float (Spec.Shape.dim batch (Shape.Mat seqLen vocab))

Build a batched causal-LM (x, y) pair from one token window per batch row.

This is the text analogue of image/tabular minibatching:

row i receives its own token window tokensAt i;
x[i,t] is tokensAt i[t];
y[i,t] is tokensAt i[t+1];
short rows are padded with padId.

GPT-style examples share this batching logic. The contract is explicit: a text batch is a typed tensor of shape (batch, seqLen, vocab), just like the vision loader collates rows into (batch, C, H, W).

Instances For

def NN.API.text.causalLmXOneHotBatch {α : Type} [Semantics.Scalar α] [Runtime.Scalar α] (batch seqLen vocab : ℕ) (tokens : List ℕ) (padId : ℕ := 0) :

Tensor.Tensor α (Spec.Shape.dim batch (Shape.Mat seqLen vocab))

One-hot encode a causal-LM input window as a batched tensor.

This is the reusable version of the hand-written helpers used by the GPT-style demos: token ids are read from tokens, missing positions use padId, and every batch row receives the same window.

Instances For

def NN.API.text.causalLmSampleOneHotBatch {α : Type} [Semantics.Scalar α] [Runtime.Scalar α] (batch seqLen vocab : ℕ) (tokens : List ℕ) (padId : ℕ := 0) :

sample.Supervised α (Spec.Shape.dim batch (Shape.Mat seqLen vocab)) (Spec.Shape.dim batch (Shape.Mat seqLen vocab))

Build a batched supervised next-token sample from a token stream.

The target is shifted by one position: x[t] = tokens[t], y[t] = tokens[t+1]. This is deliberately small and deterministic, which makes it useful for examples such as byte-level GPT, BPE smoke tests, and the minGPT-style addition task.

Instances For

def NN.API.text.causalLmSampleOneHotBatchRows {α : Type} [Semantics.Scalar α] [Runtime.Scalar α] (batch seqLen vocab : ℕ) (tokensAt : Fin batch → List ℕ) (padId : ℕ := 0) :

sample.Supervised α (Spec.Shape.dim batch (Shape.Mat seqLen vocab)) (Spec.Shape.dim batch (Shape.Mat seqLen vocab))

Build a batched supervised causal-LM sample from one token window per batch row.

Use this for real GPT-style minibatches. causalLmSampleOneHotBatch is still useful for prompt probes or smoke tests where every batch row intentionally repeats the same window; this function is the one model trainers should call when they want distinct windows inside the batch.

Instances For

Byte-Corpus Windows #

def NN.API.text.byteAtD (bytes : ByteArray) (i : ℕ) (padId : ℕ := 0) :

Read one byte token from a raw corpus, returning padId past the end.

This is intentionally byte-level rather than BPE-level: examples can train small causal language models directly from a text file without depending on an external tokenizer artifact. GPT-2 BPE support lives in NN.API.Text.Bpe.

Instances For

def NN.API.text.byteTokenWindow (bytes : ByteArray) (n : ℕ) (offset padId : ℕ := 0) :

Extract a fixed-length byte-token window from a raw corpus.

offset is measured in bytes, not Unicode characters. That is the right behavior for byte-level causal language modeling and avoids hidden UTF-8 slicing assumptions.

Instances For

Corpus Helpers #

def NN.API.text.Corpus.readUtf8File (exeName : String) (path : System.FilePath) (missingHint : String) :

Read a UTF-8 text file with a caller-supplied preparation hint.

The examples pass their executable name and a concrete hint so failures point users to the exact download/conversion command for that demo.

Instances For

def NN.API.text.Corpus.readByteFile (exeName : String) (path : System.FilePath) (allowSmallData : Bool) (minBytes seqLen : ℕ) :

Read a raw byte corpus and optionally enforce a minimum size.

allowSmallData is for smoke tests. Serious corpus-training examples can set minBytes to a large value and require users to pass an explicit opt-in flag for small local files.

Instances For

partial def NN.API.text.Corpus.takeUtf8Input (exeName : String) (defaultPath : System.FilePath) (aliases : List (String × System.FilePath)) (missingHint : String) :

List String → IO (String × List String)

Parse a text-corpus flag set and return (text, remainingArgs).

Supported forms:

--data-file PATH
any named alias in aliases, such as ("--tiny-shakespeare", path)
no data flag, which uses defaultPath

def NN.API.text.Corpus.byteOffset (bytes : ByteArray) (i seqLen : ℕ) :

Deterministic sliding-window offset for a byte corpus.

Instances For

def NN.API.text.Corpus.tokenOffset (tokens : Array ℕ) (i seqLen : ℕ) :

Deterministic sliding-window offset for an already-tokenized corpus.

Instances For

def NN.API.text.Corpus.usableTokenStarts (tokenCount seqLen : ℕ) :

Number of legal start positions for a (seqLen + 1) next-token window.

We return at least one start position so small smoke corpora stay total; callers still usually validate corpus size separately before real training.

Instances For

def NN.API.text.Corpus.tokenArrayWindow (tokens : Array ℕ) (n offset : ℕ) (padId : ℕ := 0) :

Extract a fixed token window from an array-backed token corpus.

Instances For

def NN.API.text.Corpus.randomBatchOffsets (tokenCount seqLen batch seed step : ℕ) :

Fin batch → ℕ

Deterministic minGPT-style random offsets for one training batch.

The result is a function Fin batch → Nat: one corpus start offset per row. We derive the random key from (seed, step) and then draw row offsets by the row index, so the run is reproducible without using ambient IO randomness. This is the text equivalent of a shuffled DataLoader epoch.

Instances For

def NN.API.text.Corpus.randomBatchTokenWindows (tokens : Array ℕ) (batch seqLen seed step : ℕ) (padId : ℕ := 0) :

Fin batch → List ℕ

Build token windows for one deterministic random text batch.

Each row gets seqLen + 1 ids so downstream causal-LM helpers can form both x and shifted y. This helper is intentionally token-array based: byte, character, BPE, and synthetic tokenizers can all produce an Array Nat and reuse the same batching semantics.

Instances For

def NN.API.text.Corpus.startsWithAt (xs pat : Array ℕ) (off : ℕ) :

Check whether pat occurs in xs at offset off.

Instances For

def NN.API.text.Corpus.findWindow? (xs pat : Array ℕ) :

Find the first offset where pat appears in xs.

Instances For

def NN.API.text.Corpus.promptAwareOffsets (tokenCount seqLen windows : ℕ) (promptOffset? : Option ℕ) :

Choose training-window offsets, biased toward a prompt occurrence when the corpus contains it.

This makes tutorial generation probes less random: if the prompt is present in the corpus, a chunk of the sampled windows will cover nearby text.

Instances For

Causal LM Display Helpers #

def NN.API.text.tokenWindow (t : Tokenizer) (n : ℕ) (input : String) (offset padId : ℕ := 0) :

Return a fixed-length token window from a text string.

offset = 0 is the model prompt window; offset = 1 is the usual next-token target window for causal language modeling. Missing tokens are padded with padId, matching causalLmXYOneHotMatFloat.

Instances For

def NN.API.text.decodeWindow (t : Tokenizer) (n : ℕ) (input : String) (offset padId : ℕ := 0) :

Decode a fixed token window extracted by tokenWindow.

Instances For

def NN.API.text.escapeForDisplay (s : String) :

Escape a short text fragment for one-line demo output.

This is intentionally display-only: it does not change tokenizer semantics. It just keeps examples readable when the predicted byte sequence contains quotes, backslashes, tabs, or newlines.

Instances For

Simple Sampling Helpers (Top-k) #

def NN.API.text.topKIndices (scores : Array Float) (k : ℕ) :

Return the indices of the top k scores (largest first).

This is a small, deterministic utility used by the GPT-style examples. It is intentionally simple: k is small in typical demo settings, and the vocab sizes are modest enough that O(k*vocab) is fine for interactive sampling.

Instances For

def NN.API.text.greedyIndex (scores : Array Float) :

Greedy argmax index.

Instances For

def NN.API.text.penalizeRepeats (scores : Array Float) (recent : List ℕ) (repeatPenalty : Float) :

Apply a lightweight repetition penalty by subtracting repeatPenalty * count(token) for tokens appearing in recent.

This is a deliberately simple heuristic; it is not the same as presence/frequency penalties in commercial APIs, but it is enough to make tutorial sampling less degenerate.

Instances For

def NN.API.text.restrictScores (scores : Array Float) (allowId : ℕ → Bool) :

Mask scores by an allow-list predicate (disallowed ids get a very negative score).

This is mainly used by byte-level demos to optionally restrict output to printable ASCII.

Instances For

def NN.API.text.sampleTopKIndex (scores : Array Float) (temperature : Float) (topK seed counter : ℕ) :

Sample one token id from scores using temperature + top-k sampling.

The randomness is deterministic given (seed, counter). This is important for examples: we can re-run a tutorial with the same flags and get the same sampled text.

Instances For

def NN.API.text.argmaxTokenIdsFromLogits {α : Type} [LT α] [DecidableRel fun (x1 x2 : α) => x1 > x2] {seqLen vocab : ℕ} (logits : Tensor.Tensor α (Shape.Mat seqLen vocab)) :

Decode a matrix of token logits by taking argmax independently at each sequence position.

The shape is (seqLen × vocab), i.e. one logits vector per token position. This helper is for inspection/debugging and is not differentiable.

Instances For

def NN.API.text.decodeArgmaxLogits {α : Type} [LT α] [DecidableRel fun (x1 x2 : α) => x1 > x2] (t : Tokenizer) {seqLen vocab : ℕ} (logits : Tensor.Tensor α (Shape.Mat seqLen vocab)) :

Decode (seqLen × vocab) logits as text using a tokenizer.

Instances For

def NN.API.text.argmaxTokenIdsFromBatchLogits {α : Type} [LT α] [DecidableRel fun (x1 x2 : α) => x1 > x2] {batch seqLen vocab : ℕ} (logits : Tensor.Tensor α (Spec.Shape.dim batch (Shape.Mat seqLen vocab))) (batchIdx : Fin batch) :

Extract batchIdx from batched logits and return the per-position argmax token ids.

Instances For

def NN.API.text.decodeArgmaxBatchLogits {α : Type} [LT α] [DecidableRel fun (x1 x2 : α) => x1 > x2] (t : Tokenizer) {batch seqLen vocab : ℕ} (logits : Tensor.Tensor α (Spec.Shape.dim batch (Shape.Mat seqLen vocab))) (batchIdx : Fin batch) :

Decode one batch row of (batch × seqLen × vocab) logits as text.

Instances For

def NN.API.text.causalMask (seqLen : ℕ) :

Tensor.Tensor Bool (Spec.Shape.dim seqLen (Spec.Shape.dim seqLen Spec.Shape.scalar))

Causal (autoregressive) attention mask of shape (seqLen × seqLen).

Entry (i, j) is true iff j ≤ i, meaning position i may attend to itself and earlier positions but not to future positions.

Instances For