# CUDA DGEMM FFI
Foreign-function declaration for the host `FloatArray` FP64 matrix-multiply path, backed by
`cublasDgemm` when CUDA is enabled and by a CPU stub otherwise. The float32 buffer matmul path lives
in `NN.Runtime.Autograd.Engine.Cuda.Kernels`.
This intentionally stays in its own small module instead of `Cuda.Kernels`:

- `Cuda.Kernels` is the float32 `Cuda.Buffer` surface used by the CUDA eager tape.
- `DGemm` is a host `FloatArray → FloatArray` bridge for Lean `Float` tensors and the `FastKernels` CPU-tape acceleration path.
- It links through a separate native archive (`torchlean_dgemm_cuda`) because the implementation is a cuBLAS DGEMM wrapper rather than a tensor-buffer kernel.
@[extern "torchlean_dgemm_cuda"]
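The full declaration is not shown above. As a hedged sketch only: the function name `dgemm`, its parameter names, and the row-major dimension convention below are assumptions for illustration, not taken from the source; only the extern symbol and the `FloatArray → FloatArray` shape of the bridge come from the description above.

```lean
-- Hypothetical sketch of the extern binding; the real signature may differ.
-- Computes C := A * B, with A of shape m×k and B of shape k×n, where all
-- matrices are flat FP64 `FloatArray`s (row-major assumed here).
@[extern "torchlean_dgemm_cuda"]
opaque dgemm (m n k : USize) (a b : FloatArray) : FloatArray
```

Under this sketch, the native symbol resolves to the cuBLAS `cublasDgemm` wrapper in the `torchlean_dgemm_cuda` archive when CUDA is enabled, and to the CPU stub otherwise, matching the linking story described above.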