# CUDA DGEMM FFI
Foreign-function declaration for the host `FloatArray` FP64 matrix-multiply path, backed by
`cublasDgemm` when CUDA is enabled and by a CPU stub otherwise. The float32 buffer matmul path lives
in `NN.Runtime.Autograd.Engine.Cuda.Kernels`.
This intentionally stays in its own small module instead of `Cuda.Kernels`:

- `Cuda.Kernels` is the float32 `Cuda.Buffer` surface used by the CUDA eager tape.
- `DGemm` is a host `FloatArray → FloatArray` bridge for Lean `Float` tensors and the `FastKernels` CPU-tape acceleration path.
- It links through a separate native archive (`torchlean_dgemm_cuda`) because the implementation is a cuBLAS DGEMM wrapper rather than a tensor-buffer kernel.
@[extern "torchlean_dgemm_cuda"]
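The full declaration is not shown above. As a hedged sketch only: the function name `dgemm`, its parameter names, and the row-major dimension convention below are assumptions for illustration, not taken from the source; only the extern symbol and the `FloatArray → FloatArray` shape of the bridge come from the description above.

```lean
-- Hypothetical sketch of the extern binding; the real signature may differ.
-- Computes C := A * B, with A of shape m×k and B of shape k×n, where all
-- matrices are flat FP64 `FloatArray`s (row-major assumed here).
@[extern "torchlean_dgemm_cuda"]
opaque dgemm (m n k : USize) (a b : FloatArray) : FloatArray
```

Under this sketch, the native symbol resolves to the cuBLAS `cublasDgemm` wrapper in the `torchlean_dgemm_cuda` archive when CUDA is enabled, and to the CPU stub otherwise, matching the linking story described above.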