Vit #
Vision Transformer (ViT) model.
This is a compact "ViT-style" specification:

- patch embedding via `Conv2D` (kernel = patch size),
- flatten patches into a token sequence,
- add a learnable positional encoding,
- run a Transformer encoder,
- mean-pool tokens and apply a linear classifier head.
Notes:

- PyTorch mental model: this corresponds to the core dataflow of `torchvision.models.vit_*`, but written without batching: tensors are `(C, H, W)` images and `(T, D)` token sequences.
- This file provides both mean-pool (`ViTSpec`) and CLS-token (`ViTClsSpec`) variants. The CLS-token variant prepends one learnable token before the encoder and pools by taking token `0`.
- We intentionally keep the patch embedding as a `Conv2d` with `kernel_size = (patchH, patchW)`. When `stride = (patchH, patchW)` and `padding = 0`, that matches the usual "non-overlapping patches" embedding used in many ViT implementations.
Output height of the patch-embedding convolution in ViT.
Instances For
Output width of the patch-embedding convolution in ViT.
Instances For
Number of patch tokens T = outH*outW produced by the patch embedding.
Instances For
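The three definitions above follow standard convolution output-size arithmetic. A small Python sketch (helper names `conv_out` and `vit_patch_count` are illustrative; the Lean definitions above are the authoritative ones):

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    # Standard convolution output-size formula:
    # floor((size + 2*padding - kernel) / stride) + 1
    return (size + 2 * padding - kernel) // stride + 1

def vit_patch_count(in_h, in_w, patch_h, patch_w, stride, padding):
    # T = outH * outW, the number of patch tokens
    out_h = conv_out(in_h, patch_h, stride, padding)
    out_w = conv_out(in_w, patch_w, stride, padding)
    return out_h * out_w

# Non-overlapping 16x16 patches on a 224x224 image: a 14x14 grid, so 196 tokens
print(vit_patch_count(224, 224, 16, 16, 16, 0))  # 196
```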
Configuration #
We keep ViT architectural hyperparameters in a dedicated config record so the model definition does not hide numeric choices in its types. This mirrors the usual config-object pattern in PyTorch/torchvision model-zoo code.
ViT architectural hyperparameters (spec layer).
- patchH : ℕ
Patch height (kernel height for the patch-embedding conv).
- patchW : ℕ
Patch width (kernel width for the patch-embedding conv).
- stride : ℕ
Stride for the patch-embedding conv (typical: equal to patch size for non-overlapping patches).
- padding : ℕ
Padding for the patch-embedding conv (typical: `0`).
- embedDim : ℕ
Transformer embedding dimension (`d_model`).
- hiddenDim : ℕ
Feed-forward hidden dimension of each encoder layer.
- headCount : ℕ
Number of attention heads.
- numLayers : ℕ
Number of encoder layers.
- numClasses : ℕ
Output classes for the classifier head.
Instances For
Well-formedness conditions for ViTConfig (the nonzero facts needed by some layer specs).
Instances For
Classic ViT-Base/16-ish hyperparameters (mean-pool variant; spec layer).
Instances For
vitBasePatch16Config satisfies ViTConfig.WF.
Classic ViT-Large/16-ish hyperparameters (mean-pool variant; spec layer).
Instances For
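For orientation, the widely quoted ViT-Base/16 paper values can be written out as a hedged Python sketch (the Lean records above are the authoritative spec; the dict below just restates the classic numbers and the kind of nonzero facts `ViTConfig.WF` tracks):

```python
# Classic ViT-Base/16 hyperparameters (well-known paper values; the exact
# record in this file may differ, e.g. in numClasses).
vit_base_patch16 = {
    "patchH": 16, "patchW": 16,
    "stride": 16, "padding": 0,   # non-overlapping patches
    "embedDim": 768,              # d_model
    "hiddenDim": 3072,            # feed-forward (MLP) dimension
    "headCount": 12,
    "numLayers": 12,
    "numClasses": 1000,           # ImageNet-1k head
}

# Well-formedness sketch: the nonzero facts some layer specs need.
for k in ("patchH", "patchW", "stride", "embedDim",
          "headCount", "numLayers", "numClasses"):
    assert vit_base_patch16[k] > 0

# Multi-head attention additionally needs embedDim divisible by headCount.
assert vit_base_patch16["embedDim"] % vit_base_patch16["headCount"] == 0
```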
ViT parameter bundle (patch embedding + positional encoding + transformer + head).
- posEnc : Spec.PositionalEncodingSpec (ViTPatchCount inH inW cfg.patchH cfg.patchW cfg.stride cfg.padding) cfg.embedDim α
- encoder : Spec.TransformerEncoder cfg.numLayers cfg.headCount cfg.embedDim cfg.hiddenDim α
- head : Spec.LinearSpec α cfg.embedDim cfg.numClasses
Instances For
Forward pass (patches -> tokens -> encoder -> head) #
The forward is the standard ViT dataflow, but with explicit shape transforms so it stays obvious what each axis means:
- `conv2d_spec` produces a feature map `(embedDim, outH, outW)`.
- We flatten `(outH, outW)` into a single token axis `tokN = outH*outW`.
- We swap to token-major layout `(tokN, embedDim)` (this is the usual transformer convention).
- We add positional embeddings and run the transformer encoder.
- We mean-pool tokens and apply a final linear classifier.

PyTorch analogy (no batch axis here):

- patch embedding: `Conv2d(inC, embedDim, kernel_size=patch, stride=stride, padding=padding)`
- flatten: `x.flatten(1).transpose(0, 1)` to get `(T, D)`, depending on your convention
- encoder: `TransformerEncoder(...)`
- pooling + head: `encoded.mean(dim=0)`, then `Linear(embedDim, numClasses)`
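The shape bookkeeping of this dataflow can be traced with a small pure-Python sketch (shapes only, no numerics; `vit_forward_shapes` is an illustrative name, not part of the spec):

```python
def vit_forward_shapes(in_c, in_h, in_w, cfg):
    """Trace the tensor shapes through the (batch-free) ViT forward pass."""
    # patch embedding: Conv2d(inC, embedDim, kernel=patch, stride, padding)
    out_h = (in_h + 2 * cfg["padding"] - cfg["patchH"]) // cfg["stride"] + 1
    out_w = (in_w + 2 * cfg["padding"] - cfg["patchW"]) // cfg["stride"] + 1
    feat = (cfg["embedDim"], out_h, out_w)   # (embedDim, outH, outW)
    tok_n = out_h * out_w
    tokens = (tok_n, cfg["embedDim"])        # flatten + swap: token-major (T, D)
    encoded = tokens                         # pos-enc add and encoder keep the shape
    pooled = (cfg["embedDim"],)              # mean over the token axis
    logits = (cfg["numClasses"],)            # Linear(embedDim, numClasses)
    return feat, tokens, encoded, pooled, logits
```

On a 224x224 RGB image with 16x16 non-overlapping patches and `embedDim = 768`, this yields a `(768, 14, 14)` feature map, a `(196, 768)` token sequence, and `(numClasses,)` logits.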
Gradients for the compact ViT spec (matching ViTSpec).
- d_patch_kernel : Spec.Tensor α (Spec.Shape.dim cfg.embedDim (Spec.Shape.dim inC (Spec.Shape.dim cfg.patchH (Spec.Shape.dim cfg.patchW Spec.Shape.scalar))))
- d_patch_bias : Spec.Tensor α (Spec.Shape.dim cfg.embedDim Spec.Shape.scalar)
- d_pos : Spec.Tensor α (Spec.Shape.dim (ViTPatchCount inH inW cfg.patchH cfg.patchW cfg.stride cfg.padding) (Spec.Shape.dim cfg.embedDim Spec.Shape.scalar))
- d_encoder : List (Spec.TransformerEncoderLayerGrads cfg.headCount cfg.embedDim cfg.hiddenDim α)
- d_head_W : Spec.Tensor α (Spec.Shape.dim cfg.numClasses (Spec.Shape.dim cfg.embedDim Spec.Shape.scalar))
- d_head_b : Spec.Tensor α (Spec.Shape.dim cfg.numClasses Spec.Shape.scalar)
Instances For
ViT forward pass (patch embedding → tokens → transformer encoder → pool → head).
Instances For
Backward pass #
This is a fully explicit reverse-mode spec (no meta-autograd):

- patch embedding: `Conv2D` backward gives `∂kernel`, `∂bias`, and `∂image`,
- positional encoding: addition splits the gradient (`∂pos = ∂tokens`),
- transformer encoder: `TransformerEncoder.backward` (in `NN/Spec/Models/Transformer.lean`),
- mean pooling over tokens: broadcast + scale by `1/tokN`,
- classifier head: `linear_backward_spec`.
We recompute intermediates locally; this keeps the spec self-contained and avoids adding a global "tape" type for every model.
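The mean-pool step is the easiest piece to see concretely: the forward averages `tokN` token vectors, so the backward broadcasts the incoming gradient to every token and scales it by `1/tokN`. A minimal pure-Python sketch (illustrative names, lists standing in for tensors):

```python
def mean_pool(tokens):
    # tokens: T vectors of length D -> one vector of length D
    t, d = len(tokens), len(tokens[0])
    return [sum(tok[j] for tok in tokens) / t for j in range(d)]

def mean_pool_backward(d_pooled, tok_n):
    # d(mean)/d(token_i) = 1/T, so every token receives d_pooled / T
    return [[g / tok_n for g in d_pooled] for _ in range(tok_n)]

tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
pooled = mean_pool(tokens)                      # [3.0, 4.0]
grads = mean_pool_backward([1.0, 1.0], len(tokens))
# each of the 3 tokens receives [1/3, 1/3]
```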
Fully explicit reverse-mode backward pass for ViTSpec.forward.
Instances For
CLS-token ViT variant (classic pooling) #
Many ViT implementations (including the original ViT paper and `torchvision.models.vit_*`) use a learnable CLS token:

- prepend `clsToken` to the patch-token sequence,
- use positional encodings of length `tokN + 1`,
- run the encoder on a sequence of length `tokN + 1`,
- take token `0` after the encoder as the pooled representation, then apply the head.

We keep the existing mean-pool `ViTSpec` unchanged; this is a separate parameter bundle and explicit backward pass.
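The CLS bookkeeping can be sketched the same way (illustrative helper names). Note how the pooling backward routes gradients: only slot `0` receives the head's input gradient, and after the encoder backward, slot `0` of the token gradient corresponds to `d_clsToken` while slots `1..tokN` are the patch-token gradients:

```python
def prepend_cls(cls_token, patch_tokens):
    # (T, D) patch tokens -> (T+1, D) sequence with the CLS token at index 0
    return [list(cls_token)] + [list(t) for t in patch_tokens]

def cls_pool(encoded):
    # pooled representation = token 0 after the encoder
    return encoded[0]

def cls_pool_backward(d_pooled, tok_n_plus_1, d):
    # only token 0 receives gradient; every other token gets zeros
    grads = [[0.0] * d for _ in range(tok_n_plus_1)]
    grads[0] = list(d_pooled)
    return grads
```

For example, prepending a CLS token to two patch tokens gives a length-3 sequence, and `cls_pool_backward([1.0, 1.0], 3, 2)` puts `[1.0, 1.0]` at slot 0 and zeros elsewhere.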
ViT parameter bundle with a learnable CLS token (classic ViT variant).
- clsToken : Spec.Tensor α (Spec.Shape.dim cfg.embedDim Spec.Shape.scalar)
Learnable CLS token embedding (prepended as token 0).
- posEnc : Spec.PositionalEncodingSpec (ViTPatchCount inH inW cfg.patchH cfg.patchW cfg.stride cfg.padding + 1) cfg.embedDim α
- encoder : Spec.TransformerEncoder cfg.numLayers cfg.headCount cfg.embedDim cfg.hiddenDim α
- head : Spec.LinearSpec α cfg.embedDim cfg.numClasses
Instances For
Gradients for the CLS-token ViT spec (matching ViTClsSpec).
- d_patch_kernel : Spec.Tensor α (Spec.Shape.dim cfg.embedDim (Spec.Shape.dim inC (Spec.Shape.dim cfg.patchH (Spec.Shape.dim cfg.patchW Spec.Shape.scalar))))
- d_patch_bias : Spec.Tensor α (Spec.Shape.dim cfg.embedDim Spec.Shape.scalar)
- d_clsToken : Spec.Tensor α (Spec.Shape.dim cfg.embedDim Spec.Shape.scalar)
- d_pos : Spec.Tensor α (Spec.Shape.dim (ViTPatchCount inH inW cfg.patchH cfg.patchW cfg.stride cfg.padding + 1) (Spec.Shape.dim cfg.embedDim Spec.Shape.scalar))
- d_encoder : List (Spec.TransformerEncoderLayerGrads cfg.headCount cfg.embedDim cfg.hiddenDim α)
- d_head_W : Spec.Tensor α (Spec.Shape.dim cfg.numClasses (Spec.Shape.dim cfg.embedDim Spec.Shape.scalar))
- d_head_b : Spec.Tensor α (Spec.Shape.dim cfg.numClasses Spec.Shape.scalar)
Instances For
CLS-token ViT forward pass (prepend CLS → transformer encoder → take token 0 → head).
Instances For
Fully explicit reverse-mode backward pass for ViTClsSpec.forward.