CUDA Tape Operations: Attention
Multi-head self-attention
Forward structure matches Spec.MultiHeadAttention.forward:
- Q = x @ Wq, K = x @ Wk, V = x @ Wv
- reshape to heads (numHeads, n, headDim)
- attention per head (batched): softmax(Q Kᵀ / sqrt(headDim)) @ V
- combine heads, then output projection @ Wo
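
The forward steps above can be sketched as a small CPU reference in plain Python. This is only an illustration, not the CUDA tape implementation: the function names are hypothetical, and the weight shapes (Wq, Wk, Wv of shape (dModel, numHeads * headDim) and Wo of shape (numHeads * headDim, dModel)) are assumptions that the text does not state.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one row of logits."""
    top = max(row)
    exps = [math.exp(v - top) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def multi_head_attention(x, wq, wk, wv, wo, num_heads):
    """Reference forward pass; x has shape (n, dModel).

    Assumed shapes (not given in the source): wq, wk, wv are
    (dModel, numHeads * headDim) and wo is (numHeads * headDim, dModel).
    """
    head_dim = len(wq[0]) // num_heads
    # Q = x @ Wq, K = x @ Wk, V = x @ Wv
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    head_outputs = []
    for h in range(num_heads):
        # "Reshape to heads": take this head's slice of the feature columns.
        cols = slice(h * head_dim, (h + 1) * head_dim)
        qh = [row[cols] for row in q]
        kh = [row[cols] for row in k]
        vh = [row[cols] for row in v]
        kh_t = [list(c) for c in zip(*kh)]
        # softmax(Q K^T / sqrt(headDim)) @ V for this head
        scores = matmul(qh, kh_t)
        weights = [softmax([s / math.sqrt(head_dim) for s in row]) for row in scores]
        head_outputs.append(matmul(weights, vh))
    # Combine heads by concatenating along the feature axis, then project.
    combined = [sum((ho[i] for ho in head_outputs), []) for i in range(len(x))]
    return matmul(combined, wo)
```

The loop over heads stands in for the batched per-head computation; a GPU implementation would normally run all heads in one batched kernel rather than one at a time.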
Masking:
- If mask is provided, we upload it as a float32 {0,1} matrix and apply the same semantics as the spec: masked logits are replaced by -1000.0 before softmax, and gradients through masked entries are zeroed.
- This incurs a host-to-device copy for the mask (since the mask is a host Tensor Bool).
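
The masking rule can be illustrated on a single row of attention logits. The sketch below is a plain-Python reference under stated assumptions, not the CUDA kernel: the helper names are hypothetical, and it assumes that 1.0 in the float mask marks a masked position. The source only says the mask is a {0,1} matrix, so the polarity may be the opposite in the actual code.

```python
import math

MASK_FILL = -1000.0  # replacement logit named in the text

def masked_softmax_row(logits, mask_row):
    """Forward: replace masked logits by MASK_FILL, then apply softmax.

    Assumption: mask_row[j] == 1.0 means position j is masked.
    """
    filled = [MASK_FILL if m == 1.0 else z for z, m in zip(logits, mask_row)]
    top = max(filled)
    exps = [math.exp(z - top) for z in filled]
    total = sum(exps)
    return [e / total for e in exps]

def masked_softmax_row_backward(probs, grad_out, mask_row):
    """Backward: softmax vector-Jacobian product, then zero masked entries."""
    dot = sum(p * g for p, g in zip(probs, grad_out))
    grad_logits = [p * (g - dot) for p, g in zip(probs, grad_out)]
    return [0.0 if m == 1.0 else g for g, m in zip(grad_logits, mask_row)]
```

One consequence of using a finite fill value rather than negative infinity is that a fully masked row still produces a well-defined (uniform) softmax instead of NaN.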
def Runtime.Autograd.Cuda.Tape.multiHeadAttention
{n numHeads dModel headDim : ℕ}
(h1 : n ≠ 0)
(t : Tape)
(wqId wkId wvId woId xId : ℕ)
(mask : Option (Spec.Tensor Bool (Spec.Shape.dim n (Spec.Shape.dim n Spec.Shape.scalar))) := none)
(useFlash : Bool := true)
: