TorchLean API

NN.Runtime.Autograd.Engine.Cuda.Ops.Attention

CUDA Tape Operations: Attention

Multi-head self-attention

Forward structure matches Spec.MultiHeadAttention.forward:

  1. Q = x @ Wq, K = x @ Wk, V = x @ Wv
  2. reshape to heads (numHeads, n, headDim)
  3. attention per head (batched): softmax(Q Kᵀ / sqrt(headDim)) @ V
  4. combine heads, then output projection @ Wo
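The four steps above can be sketched as a small CPU reference in plain Python. This is an illustrative sketch only, not the TorchLean implementation: the function names (`matmul`, `softmax`, `split_heads`, `multi_head_attention`) are hypothetical, and the weight shapes are assumptions (Wq, Wk, Wv of shape dModel x (numHeads * headDim), Wo of shape (numHeads * headDim) x dModel). Masking and the flash-attention path are deliberately left out.

```python
import math

# Hypothetical reference sketch; none of these names are part of TorchLean.

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def split_heads(m, num_heads, head_dim):
    """Reshape (n, num_heads * head_dim) into (num_heads, n, head_dim)."""
    return [[row[h * head_dim:(h + 1) * head_dim] for row in m]
            for h in range(num_heads)]

def multi_head_attention(x, wq, wk, wv, wo, num_heads):
    # ASSUMPTION: projection width is num_heads * head_dim.
    head_dim = len(wq[0]) // num_heads

    # 1. Q = x @ Wq, K = x @ Wk, V = x @ Wv
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)

    # 2. reshape to heads: (numHeads, n, headDim)
    qh, kh, vh = (split_heads(m, num_heads, head_dim) for m in (q, k, v))

    # 3. attention per head: softmax(Q K^T / sqrt(headDim)) @ V
    scale = math.sqrt(head_dim)
    heads = []
    for qi, ki, vi in zip(qh, kh, vh):
        k_t = [list(col) for col in zip(*ki)]
        scores = [[s / scale for s in row] for row in matmul(qi, k_t)]
        weights = [softmax(row) for row in scores]
        heads.append(matmul(weights, vi))

    # 4. combine heads (concatenate along the feature axis), then @ Wo
    combined = [[val for head in heads for val in head[i]] for i in range(len(x))]
    return matmul(combined, wo)
```

Calling `multi_head_attention(x, wq, wk, wv, wo, num_heads=2)` with an n x dModel input returns an n x dModel output under the shape assumptions above. A batched GPU kernel would fuse or reorder these steps, so the sketch is useful mainly as a readable statement of what the forward pass computes.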

Masking:

def Runtime.Autograd.Cuda.Tape.multiHeadAttention {n numHeads dModel headDim : } (h1 : n 0) (t : Tape) (wqId wkId wvId woId xId : ) (mask : Option (Spec.Tensor Bool (Spec.Shape.dim n (Spec.Shape.dim n Spec.Shape.scalar))) := none) (useFlash : Bool := true) :