TorchLean API

NN.API.Models.Gpt2

GPT-2-Style Model Helpers (API)

This module collects compact, reusable GPT-2-style building blocks for TorchLean examples.

These helpers live in the API layer so that runnable examples can stay focused on data preparation, training loops, and text decoding, rather than repeating the same embedding → positional embedding → Transformer stack → LayerNorm → linear boilerplate.

Important scope note:

Configuration for a small GPT-2-style causal language model over one-hot token inputs.

The model has the common GPT-2 “shape”:

embedding → learned positional embedding → (masked self-attention + FFN)×layers → LayerNorm → linear

The input and output shapes are (batch × seqLen × vocab) one-hot/logit tensors.

  • batch :
  • seqLen :
  • vocab :
  • numHeads :
  • headDim :
  • ffnHidden :
  • layers :
  • seedStride :

    Seed stride used when initializing repeated blocks.
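The "masked self-attention" stage in the pipeline above relies on a causal mask: a query position may attend only to itself and to earlier positions. A minimal sketch of that rule follows; the names `causalAllowed` and `causalMask` are hypothetical and are not part of this module.

```lean
/-- Hypothetical helper: a query at position `q` may read key position `k`
    only when `k` is not in the future. -/
def causalAllowed (q k : Nat) : Bool := decide (k ≤ q)

/-- Hypothetical helper: the full `seqLen × seqLen` mask, one row per query. -/
def causalMask (seqLen : Nat) : List (List Bool) :=
  (List.range seqLen).map fun q =>
    (List.range seqLen).map fun k => causalAllowed q k

-- Lower-triangular pattern:
-- [[true, false, false], [true, true, false], [true, true, true]]
#eval causalMask 3
```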


    Transformer width implied by numHeads * headDim.

      @[reducible, inline]

      Input/output tensor shape (batch × seqLen × vocab) for a one-hot causal LM.

        @[reducible, inline]

        Embedded-token tensor shape (batch × seqLen × dModel).
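To make the relationship between the configuration fields and the two shapes concrete, here is a self-contained sketch. It assumes every field is a `Nat` (the field types are not shown on this page) and uses hypothetical names (`SketchConfig`, `ioDims`, `embedDims`, `tiny`) rather than the module's own declarations; plain dimension lists stand in for the real shape type.

```lean
/-- Hypothetical stand-in for the configuration record.
    Assumption: all fields are natural numbers. -/
structure SketchConfig where
  batch : Nat
  seqLen : Nat
  vocab : Nat
  numHeads : Nat
  headDim : Nat
  ffnHidden : Nat
  layers : Nat
  seedStride : Nat

namespace SketchConfig

/-- Transformer width implied by `numHeads * headDim`. -/
def dModel (cfg : SketchConfig) : Nat := cfg.numHeads * cfg.headDim

/-- Dimensions of the one-hot input and of the logit output. -/
def ioDims (cfg : SketchConfig) : List Nat :=
  [cfg.batch, cfg.seqLen, cfg.vocab]

/-- Dimensions of the embedded-token tensor. -/
def embedDims (cfg : SketchConfig) : List Nat :=
  [cfg.batch, cfg.seqLen, cfg.dModel]

end SketchConfig

/-- An arbitrary small configuration, chosen only for illustration. -/
def tiny : SketchConfig :=
  { batch := 2, seqLen := 8, vocab := 16, numHeads := 2, headDim := 4,
    ffnHidden := 32, layers := 2, seedStride := 1000 }

example : tiny.dModel = 8 := rfl
example : tiny.ioDims = [2, 8, 16] := rfl
example : tiny.embedDims = [2, 8, 8] := rfl
```

Only the last dimension differs between the two shapes: `vocab` at the model boundary, `dModel` inside the Transformer body.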

          def NN.API.nn.models.causalTransformerFromEmbeddings (cfg : CausalOneHotConfig) (h_seqLen : cfg.seqLen 0 := by decide) (h_dModel : cfg.dModel 0 := by decide) :

          GPT-2-style causal Transformer body after token embeddings have already been computed.

          This is the shared body used by both one-hot-token models and indexed-token experiments. Keeping it separate avoids duplicating the Transformer stack when callers use a different token representation: the input boundary changes, while positional embeddings, masked self-attention blocks, layer norm, and the language-model head stay the same.
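The signature above gives both proof arguments a `by decide` default, so callers with concrete sizes do not have to write the proofs themselves. The sketch below illustrates that pattern in isolation. It assumes the obligation is a nonzero check (the relation symbol is not legible in the signature as shown here), and `firstPosition` is a hypothetical function, not part of this module.

```lean
/-- Hypothetical example of an argument whose proof defaults to `by decide`.
    The proof is needed to show that index `0` is in range. -/
def firstPosition (seqLen : Nat) (h : seqLen ≠ 0 := by decide) : Fin seqLen :=
  ⟨0, Nat.pos_of_ne_zero h⟩

-- With a literal size, `decide` discharges `4 ≠ 0` automatically.
#eval firstPosition 4

-- With a variable size, `decide` cannot evaluate the claim,
-- so the caller passes the proof explicitly.
example (n : Nat) (h : n ≠ 0) : Fin n := firstPosition n h
```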

            def NN.API.nn.models.causalTransformerOneHot (cfg : CausalOneHotConfig) (h_seqLen : cfg.seqLen 0 := by decide) (h_dModel : cfg.dModel 0 := by decide) :

            Build a GPT-2-style causal language model over one-hot tokens.

            This is the shared constructor used by the runnable GPT-2 examples. It stays in nn.M so it composes with the rest of the API-layer model-building interface.


              Scalar loss for causal language modeling with integer token ids.

              The public one-hot constructor above is useful for small teaching examples because the input is an ordinary Float tensor. File-backed tokenized datasets instead use the representation typical of language-model training systems: token ids are Nats, the embedding table is a trainable Float parameter, and the loss gathers the target classes directly instead of building one-hot targets.
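The phrase "gathers the target classes directly" can be illustrated with a small sketch. For one position, the negative log-likelihood of the target class equals the log-sum-exp of the logit row minus the single target logit, so no one-hot target vector is needed. The code below is an illustration only: it uses plain `Array Float` rows instead of tensors, and the names `logSumExp`, `gatherNll`, and `meanLoss` are hypothetical, not this module's API.

```lean
/-- Log-sum-exp of one logit row, shifted by the row maximum for stability. -/
def logSumExp (row : Array Float) : Float :=
  let m := row.foldl (fun (acc : Float) x => if acc < x then x else acc)
    (row.getD 0 0.0)
  m + Float.log (row.foldl (fun (acc : Float) x => acc + Float.exp (x - m))
    (0.0 : Float))

/-- Negative log-likelihood of class `target`: reads one logit by index
    instead of multiplying by a one-hot vector. -/
def gatherNll (row : Array Float) (target : Nat) : Float :=
  logSumExp row - row.getD target 0.0

/-- Mean loss over positions; `logits` holds one row per position.
    Assumes at least one position (an empty input divides by zero). -/
def meanLoss (logits : Array (Array Float)) (targets : Array Nat) : Float :=
  let total := (logits.zip targets).foldl
    (fun (acc : Float) p => acc + gatherNll p.1 p.2) (0.0 : Float)
  total / logits.size.toFloat

-- Two equal logits: each class has probability 1/2, so the loss is log 2.
#eval gatherNll #[0.0, 0.0] 0
```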

              tokens and targets are flattened (batch * seqLen) vectors. This matches the backend gather ops and keeps dataset storage simple; the embedding helper reshapes gathered rows back to (batch, seqLen, dModel) before running the Transformer body.
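The flattening described above amounts to simple index arithmetic. The sketch below assumes a row-major layout, where all positions of one sequence are stored contiguously; the page does not state the layout, and `flatIndex` and `unflatIndex` are hypothetical names.

```lean
/-- Flat index of token `t` in sequence `b`, assuming row-major layout. -/
def flatIndex (seqLen b t : Nat) : Nat := b * seqLen + t

/-- Recover `(b, t)` from a flat index under the same assumption. -/
def unflatIndex (seqLen i : Nat) : Nat × Nat := (i / seqLen, i % seqLen)

-- With `seqLen = 8`, token 3 of sequence 1 is stored at flat position 11.
example : flatIndex 8 1 3 = 11 := rfl
example : unflatIndex 8 11 = (1, 3) := rfl
```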
