TorchLean API


NN.Runtime.Autograd.Engine.Cuda.Ops.Attention

CUDA Tape Operations: Attention

Multi-head self-attention

The forward pass has the same structure as Spec.MultiHeadAttention.forward:

Q = x @ Wq, K = x @ Wk, V = x @ Wv
reshape Q, K, V into heads of shape (numHeads, n, headDim)
attention per head (batched): softmax(Q Kᵀ / sqrt(headDim)) @ V
combine heads, then apply the output projection (@ Wo)
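
The four steps above can be sketched in plain Python as a reference for the arithmetic. This is not TorchLean code and not the CUDA kernel: the names `matmul`, `softmax`, and `multi_head_attention` are hypothetical, matrices are lists of lists, and the sketch assumes each head occupies a contiguous block of `head_dim` columns in the projected Q, K, V matrices (the source does not state the reshape layout).

```python
import math

def matmul(a, b):
    """Multiply two row-major matrices given as lists of lists."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def multi_head_attention(x, wq, wk, wv, wo, num_heads):
    """Reference forward pass.

    x is (n, d_model); wq, wk, wv are (d_model, num_heads * head_dim);
    wo is (num_heads * head_dim, d_model).
    """
    head_dim = len(wq[0]) // num_heads
    # Step 1: input projections.
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = 1.0 / math.sqrt(head_dim)
    heads = []
    for h in range(num_heads):
        # Step 2: slice out head h (assumed layout: contiguous column block).
        cols = slice(h * head_dim, (h + 1) * head_dim)
        qh = [row[cols] for row in q]
        kh = [row[cols] for row in k]
        vh = [row[cols] for row in v]
        # Step 3: softmax(Q K^T / sqrt(head_dim)) @ V for this head.
        kh_t = [list(c) for c in zip(*kh)]
        logits = [[s * scale for s in row] for row in matmul(qh, kh_t)]
        weights = [softmax(row) for row in logits]
        heads.append(matmul(weights, vh))
    # Step 4: concatenate heads along the feature axis, then project with wo.
    combined = [[val for head in heads for val in head[i]] for i in range(len(x))]
    return matmul(combined, wo)
```

With identity matrices for all four weights and `num_heads=2`, a 2×2 identity input makes each head a single column, which is convenient for checking the softmax weights by hand.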

Masking:

If a mask is provided, we upload it as a float32 {0,1} matrix and apply the same semantics as the spec: masked logits are replaced by -1000.0 before softmax, and gradients through masked entries are zeroed.
Because the mask is a host-side Tensor Bool, this incurs a host-to-device copy.
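
The masking rule can be illustrated with a small plain-Python sketch. It is an illustration of the stated semantics only, not the CUDA implementation. The helper names are hypothetical, and the mask polarity is an assumption: here a `True` entry (uploaded as 1.0) means the position is kept and `False` (0.0) means it is masked, which the text above does not specify.

```python
import math

MASKED_LOGIT = -1000.0  # fill value stated in the masking description

def mask_to_float(mask_bool):
    """Stand-in for the upload step: Bool matrix -> {0.0, 1.0} matrix."""
    return [[1.0 if b else 0.0 for b in row] for row in mask_bool]

def masked_softmax(logits, keep):
    """Row-wise softmax after replacing logits where keep == 0.0 by MASKED_LOGIT."""
    out = []
    for row, keep_row in zip(logits, keep):
        filled = [v if k == 1.0 else MASKED_LOGIT for v, k in zip(row, keep_row)]
        m = max(filled)
        exps = [math.exp(v - m) for v in filled]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out

def masked_logit_grad(grad_filled, keep):
    """Backward of the fill step: gradients through masked entries are zeroed."""
    return [[g * k for g, k in zip(g_row, k_row)]
            for g_row, k_row in zip(grad_filled, keep)]
```

Because the fill value is a finite -1000.0 rather than negative infinity, a row whose entries are all masked produces a uniform distribution instead of NaN.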

def Runtime.Autograd.Cuda.Tape.multiHeadAttention {n numHeads dModel headDim : ℕ} (h1 : n ≠ 0) (t : Tape) (wqId wkId wvId woId xId : ℕ) (mask : Option (Spec.Tensor Bool (Spec.Shape.dim n (Spec.Shape.dim n Spec.Shape.scalar))) := none) (useFlash : Bool := true) : Result (Tape × ℕ)
