TorchLean Public Text Data
Causal language-model sample and dataset constructors.
Build a batched one-hot causal-language-model sample by repeating one token window across every batch row.
The token list represents a seqLen + 1 window. Shorter lists are padded and longer lists are
truncated by the causal-LM construction.
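As a rough illustration of the padding, truncation, and row repetition described above, here is a minimal Python sketch. It is not TorchLean code: the helper names `fit_window` and `repeat_window`, the pad token `0`, and padding on the right are all assumptions made for illustration, and the library's actual padding rule may differ. The sketch works on token ids and leaves out the one-hot step.

```python
def fit_window(tokens, seq_len, pad_id=0):
    """Pad or truncate `tokens` to exactly seq_len + 1 entries.

    pad_id=0 and right-padding are assumptions, not the library's documented rule.
    """
    width = seq_len + 1
    fitted = list(tokens[:width])                      # truncate longer lists
    return fitted + [pad_id] * (width - len(fitted))   # pad shorter lists


def repeat_window(tokens, batch, seq_len, pad_id=0):
    """Split one window into inputs/targets and copy it onto every batch row."""
    window = fit_window(tokens, seq_len, pad_id)
    inputs, targets = window[:-1], window[1:]          # targets are inputs shifted by one
    xs = [list(inputs) for _ in range(batch)]
    ys = [list(targets) for _ in range(batch)]
    return xs, ys


xs, ys = repeat_window([5, 6, 7], batch=2, seq_len=3)
```

Every row of `xs` and `ys` is identical here, which is the point of this constructor: the batch axis exists, but all rows carry the same window.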
Build a batched one-hot causal-language-model sample from one token window per batch row.
Use this for GPT-style examples that already have the (seqLen + 1) token window each batch row
should see.
Build a batched one-hot causal-language-model sample from an array of per-row token windows.
Rows past the end of the array use the explicit fallback window, so partial-batch behavior stays
visible at the call site.
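The fallback behavior can be sketched in Python as follows. This is an illustration under assumptions rather than the library's implementation: `rows_with_fallback` and `split_rows` are hypothetical names, and the sketch operates on token ids without the one-hot encoding.

```python
def rows_with_fallback(windows, batch, fallback):
    """Return exactly `batch` windows; rows past the end of `windows` use `fallback`."""
    return [
        list(windows[i]) if i < len(windows) else list(fallback)
        for i in range(batch)
    ]


def split_rows(rows):
    """Turn each (seq_len + 1) window into an input row and a shifted target row."""
    xs = [row[:-1] for row in rows]
    ys = [row[1:] for row in rows]
    return xs, ys


# Two real windows for a batch of three: the last row is the explicit fallback.
rows = rows_with_fallback([[1, 2, 3], [4, 5, 6]], batch=3, fallback=[0, 0, 0])
xs, ys = split_rows(rows)
```

Because the caller passes the fallback explicitly, a partial batch never silently reuses or drops data; the filler row is chosen where the batch is built.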
Build a batched one-hot causal-language-model sample from a token array by choosing one
deterministic (seqLen + 1) window per batch row.
Use this for GPT-style trainers that keep a tokenized corpus in memory and derive each batch from
the same (tokens, seed, step) rule.
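One way to picture a deterministic (tokens, seed, step) rule is the Python sketch below. The page does not state how the window offset is actually derived, so the arithmetic used here (`(seed + step * batch + row) % n_starts`) is purely an assumption, as is the helper name `deterministic_windows`.

```python
def deterministic_windows(tokens, batch, seq_len, seed, step):
    """Pick one (seq_len + 1) window per batch row from an in-memory token array.

    The offset formula is a stand-in chosen for illustration; the only property
    it is meant to show is that the same (tokens, seed, step) gives the same batch.
    """
    width = seq_len + 1
    n_starts = len(tokens) - width + 1   # number of valid window start positions
    if n_starts <= 0:
        raise ValueError("token array is shorter than one window")
    windows = []
    for row in range(batch):
        start = (seed + step * batch + row) % n_starts
        windows.append(list(tokens[start:start + width]))
    return windows


corpus = list(range(10))
batch_a = deterministic_windows(corpus, batch=2, seq_len=3, seed=1, step=2)
batch_b = deterministic_windows(corpus, batch=2, seq_len=3, seed=1, step=2)
```

Since no hidden state is involved, a trainer can rebuild the batch for any step from the corpus, the seed, and the step number alone.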
Build one unbatched one-hot causal-language-model sample directly from a token list.
The token list represents a seqLen + 1 window. Shorter lists are padded and longer lists are
truncated by the causal-LM construction.
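For the unbatched case, the one-hot x/y construction might look like the Python sketch below. The names `one_hot` and `causal_lm_sample`, the explicit `vocab_size` parameter, and the seq_len × vocab_size row layout are assumptions for illustration; the window is taken to be already padded or truncated to seqLen + 1 tokens.

```python
def one_hot(token, vocab_size):
    """Return a length-vocab_size row with 1.0 at index `token` and 0.0 elsewhere."""
    row = [0.0] * vocab_size
    row[token] = 1.0
    return row


def causal_lm_sample(window, vocab_size):
    """Build one-hot input and target matrices from a (seq_len + 1) token window.

    x encodes tokens 0..seq_len-1 and y encodes tokens 1..seq_len, so row i of y
    is the next-token target for row i of x.
    """
    x = [one_hot(t, vocab_size) for t in window[:-1]]
    y = [one_hot(t, vocab_size) for t in window[1:]]
    return x, y


x, y = causal_lm_sample([2, 0, 1], vocab_size=3)
```

With a three-token window, `x` and `y` each have two rows (seqLen = 2), and `y` is `x` shifted forward by one position.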
Build one unbatched one-hot causal-language-model sample from a text corpus string.
This takes one window of (seqLen + 1) bytes from the UTF-8 encoding of input, converts it to one-hot
x/y matrices, and casts the result to the runtime-selected scalar type.
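The byte-level windowing can be sketched in Python as below. The helper name `byte_window` and its `start` parameter are hypothetical, and taking the window from offset 0 is an assumption; the one-hot encoding and scalar cast are omitted. Treating each byte as a token implies a vocabulary of at most 256 entries.

```python
def byte_window(text, seq_len, start=0):
    """Take a (seq_len + 1)-byte window from the UTF-8 encoding of `text`.

    `start` is an illustrative parameter; the page does not say which offset is used.
    """
    data = text.encode("utf-8")
    return list(data[start:start + seq_len + 1])


# "é" encodes to two bytes (195, 169), so it occupies two token positions.
window = byte_window("héllo", seq_len=3)
inputs, targets = window[:-1], window[1:]
```

Note that the window is counted in bytes, not characters, so non-ASCII text yields fewer characters per window than seqLen + 1.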
Build one fixed-batch one-hot causal-language-model sample from a text corpus string by repeating the same text window across every batch row.
Build a runtime-polymorphic dataset containing one unbatched causal-language-model sample from a text corpus string.
Build a runtime-polymorphic dataset containing one causal-language-model sample repeated across a fixed batch axis.
Use this when the model itself owns the batch dimension but the example naturally starts from one text window.