7.1. Modern Models and Training
The model zoo is where the typed API meets the messier parts of ML systems: image patches, attention masks, scan state, stochastic schedules, spectral kernels, and environment rollouts. A two-layer MLP is enough to explain typed tensors and autograd. It is not enough to test whether the framework handles the shapes that modern ML actually uses.
The examples in this section are best read as semantic stress tests. Each model family exercises a different part of the stack: KAN-style edge-basis expansion, residual branches, patch tokens, causal masks, scan state, diffusion schedules, spectral kernels, RL trajectories, and CUDA runtime paths.
7.1.1. What Each Family Stresses
The examples are compact, but each one touches a real source of ML complexity:
| Model family | What it stresses | |---|---| | MLP / KAN / CNN | typed tensors, edge bases, losses, optimizers, data loaders | | Residual / ResNet specs | residual shape agreement and branch joins | | ViT | image patches, token dimensions, attention blocks | | GPT-style models | causal windows, token ids, masks, save/load | | Mamba | recurrent state, selective scan, prefix causality | | Diffusion | stochastic schedules, denoising objectives, sampling artifacts | | FNO | spectral convolution, scientific data, cuFFT boundary | | PPO / RL | trajectories, environment boundary, policy/value losses |
KANs are included as a model family rather than a task wrapper: the model supplies learned one-dimensional edge functions, while the public trainer still chooses regression, classification, or a custom loss. The current built-in basis is triangular and piecewise linear; future spline families should fit the same edge-family slot instead of adding task-specific KAN constructors.
A formal ML library that only works for one MLP is easy to make look clean. The harder test is whether the same ideas survive edge-basis models, residual sharing, attention masks, token windows, recurrent state, stochastic sampling, spectral transforms, and external environments. That is what the model zoo is for.
7.1.2. Start With The Runner
Most model examples go through one command shape:
lake exe torchlean <example> [flags...]
The subcommands are registered in the model runner API, and the model implementations are collected by the model examples API.
The runner is there so users can learn one command shape instead of remembering twenty separate
lean --run invocations.
7.1.3. A Small Tour
The best first run is still a small one:
lake exe torchlean mlp --cpu --steps 10
Once CUDA has been built, the same style works on GPU:
lake build -R -K cuda=true lake exe -K cuda=true torchlean mlp --cuda --steps 20
That one command runs through the public API, the eager tape, the optimizer, and the selected runtime backend. The rest of the zoo grows from that same shape.
7.1.4. Feedforward, Convolutional, And Vision Models
The classical supervised examples are useful because their expected behavior is easy to recognize.
lake exe torchlean mlp --cpu --steps 10 lake exe -K cuda=true torchlean cnn --cuda --n-total 1 --steps 1 lake exe -K cuda=true torchlean vit --cuda --n-total 1 --steps 1
What these demonstrate:
-
the same training loop works for vectors, images, residual blocks, and patch/attention-style vision models;
-
model parameters are ordinary TorchLean parameters, not hidden Python objects;
-
the examples can use prepared real data under
data/realwhen available.
The commands above are compact. They are runtime checks, not leaderboard experiments.
The model code follows the same pattern as the running example:
import NN
open TorchLean
def smallVisionHead : nn.M (nn.Sequential (Shape.vec 64) (Shape.vec 10)) :=
nn.Sequential![
nn.Linear 64 32,
nn.ReLU,
nn.Linear 32 10
]
The larger CNN and ViT commands replace this small head with convolution or patch/attention blocks. Residual blocks use the same typed model-building surface, even though the maintained model-zoo runtime command set currently keeps vision training to CNN and ViT.
For residual models, the semantic shape is:
\operatorname{block}(x)=x+F_\theta(x)
The theorem burden, when we prove one, is to show both branches have the same shape and that the runtime graph really computes that sum rather than merely producing a tensor with compatible size.
7.1.5. Sequence Models: RNN, LSTM, Transformer
The sequence examples give a gentle path from recurrent state to attention:
lake exe -K cuda=true torchlean rnn --cuda --tiny-shakespeare --steps 1 lake exe -K cuda=true torchlean lstm --cuda --tiny-shakespeare --steps 1 lake exe -K cuda=true torchlean transformer --cuda --tiny-shakespeare --steps 1
A one-step run does not learn language. The useful fact is that the full data path exists: text is loaded, token windows are constructed, tensors are built in Lean, and the model trains through TorchLean's runtime.
7.1.6. GPT-Style Language Models
TorchLean has two GPT-facing examples:
-
gpt2a small GPT model suitable for runtime checks; -
text_gpt2a file-backed corpus trainer, including a path that can use GPT-2 BPE vocabulary and merges.
Typical commands:
lake exe -K cuda=true torchlean gpt2 --cuda --steps 1 lake exe -K cuda=true torchlean text_gpt2 --cuda \ --data-file data/real/text/tinystories_valid.txt --allow-small-data --steps 100
The text_gpt2 example is explicit about scale. It is a TorchLean-native miniature
language model. It can overfit small windows and exercise a real tokenizer/data path, but it is not
OpenAI GPT-2-small and should not be described as pretrained GPT-2.
The important data path is:
-
read a corpus file;
-
tokenize into bounded ids;
-
form causal windows;
-
train next-token prediction with integer labels;
-
optionally save parameters;
-
reload the parameter file and prompt the saved model.
That is the part worth studying. A tiny local run will not chat like a large pretrained model, but it does exercise the same kind of boundary TorchLean needs for serious sequence-model work.
The language model objective is the familiar next token map:
\operatorname{logits}_t =
\operatorname{model}_\theta(tokens_{<t}),
\qquad
L = -\sum_t \log p_\theta(tokens_t \mid tokens_{<t})
The compact examples make token ids, causal windows, parameter files, and generation traces explicit objects so later correctness work has something concrete to talk about.
The API supports two token representations. One-hot tokens are convenient for bounded examples and teaching. Integer token ids are the representation used by file-backed language-model runs. The GPT helper layer therefore separates the embedding boundary from the Transformer body:
nn.causalTransformerFromEmbeddings nn.causalTransformerOneHot nn.causalTransformerTokenScalarModuleDef
The integer-token loss uses row-wise class labels:
TorchLean.F.embeddingBatchSeqNat TorchLean.Loss.crossEntropyRowsNat
So a language-model batch can store flattened token ids and targets instead of materializing a large one-hot target tensor. The model body is the same causal Transformer shape; only the input boundary changes.
7.1.7. Mamba-Style State-Space Models
The Mamba example exercises selective-scan style computation:
lake exe -K cuda=true torchlean mamba --cuda --tiny-shakespeare --steps 1 --windows 1 --generate 0
Under the CUDA backend, TorchLean includes native selective-scan kernels for the float32 path. The model is small, but the architectural lesson is real: not every modern sequence model is attention, and TorchLean's runtime has begun to cover that broader operator family.
The state space shape is:
h_{t+1}=A_t h_t+B_t x_t,
\qquad
y_t=C_t h_t+D_t x_t
That equation is why scan order and prefix causality matter. The state space proof chapters state that future tokens do not change prefix outputs for the supported scan definitions.
7.1.8. Diffusion
The diffusion example is a compact denoising training loop:
lake exe -K cuda=true torchlean diffusion --cuda --dataset cifar10 --n-total 1 --steps 1 --hidden-c 1 --T 2
Diffusion is included because it stresses a different style of ML program: a stochastic-noise schedule, a denoising objective, and a generative sampling path. It is a compact reference implementation for the runtime and training API.
A useful way to read the diffusion example is as a four-step pipeline:
-
load real image tensors or a small fallback dataset;
-
sample a timestep and add noise according to the schedule;
-
train a denoiser to predict the noise or clean signal;
-
run a sampler such as DDIM-style reverse steps to produce an image artifact.
Data and logging belong beside the tutorial because a poor sample usually raises ordinary ML debugging questions first: dataset scale, resolution, schedule length, model capacity, training time, and sampler settings.
7.1.9. FNO And Burgers Operator Learning
The FNO example is the best current example of TorchLean training on a real scientific-ML dataset. The Python helper only downloads/converts data and plots the result; the model and training loop are native TorchLean.
Prepare data with the Burgers data helper:
python3 NN/Examples/Data/prepare_fno1d_burgers.py --download --grid 32 --ntrain 128 --ntest 32
Train on CUDA:
lake exe -K cuda=true torchlean fno1d_burgers --cuda --fast-kernels --steps 700 --lr 0.003 \ --plot-csv data/real/fno/predictions.csv
Plot one held-out prediction with the plot helper:
python3 NN/Examples/Data/plot_fno1d_burgers.py --csv data/real/fno/predictions.csv
CUDA mode uses the fused cuFFT-backed spectralConv1dRfft autograd primitive. CPU mode keeps a
portable dense-DFT reference path. We use that split because one path is convenient for
inspection and the other is practical for training.
The operator learning claim has a different type from classification:
\mathcal{G}_\theta :
\operatorname{function\ samples\ on\ a\ grid}
\to
\operatorname{function\ samples\ on\ a\ grid}
The FNO example is valuable because it exercises scientific data, spectral transforms, and CUDA interop while still using the same typed tensor and runtime layer.
7.1.10. Reinforcement Learning
The PPO examples show that TorchLean is not limited to supervised losses:
lake exe -K cuda=true torchlean ppo_gridworld --cuda --updates 1 --eval-every 1 --eval-episodes 1 --eval-max-steps 8 lake exe -K cuda=true torchlean ppo_cartpole --cuda --updates 1 --eval-every 1 --eval-episodes 1 --eval-max-steps 8
The GridWorld example is written in Lean and has proof hooks. The Gymnasium examples cross a Python environment boundary and therefore use a runtime contract to check observations, actions, rewards, and termination flags before they enter the learner.
For RL, the useful TorchLean idea is not "PPO exists." It is that an episode can be treated as data: observations, actions, rewards, done flags, log probabilities, and value estimates are explicit records that can be replayed, logged, checked, and visualized.
7.1.11. Data Is Part Of The Example
Several examples use real or semi-real data paths:
-
CSV and
.npyloaders for tabular and tensor datasets; -
CIFAR-style prepared shards under
data/real; -
Tiny Shakespeare and TinyStories text;
-
Burgers
.matconversion into.npytensors.
The example-data helper prepares common tiny datasets:
python3 scripts/datasets/download_example_data.py --tiny-shakespeare --tinystories-valid --cifar10
TorchLean examples should make the data source visible. If a file is generated, downloaded, or converted, put the command beside the example or in the module header. Reproducibility starts before the first tensor is allocated.
7.1.12. Scope Of The Example Runs
A successful model zoo run establishes a concrete execution fact:
-
the code builds;
-
the selected runtime path executes;
-
the loss and optimizer connect correctly;
-
for training examples, the loss usually moves in the expected direction.
Architectural optimality, GPU-kernel agreement, and mathematical model theorems are separate claims. Those belong in Verification, Floating-Point Semantics, and GPU and CUDA.
7.1.13. References
-
He et al., Deep Residual Learning for Image Recognition, 2015.
-
Vaswani et al., Attention Is All You Need, 2017.
-
Radford et al., GPT-2 technical report / model card line, 2019.
-
Dosovitskiy et al., An Image is Worth 16x16 Words, 2020.
-
Ho et al., Denoising Diffusion Probabilistic Models, 2020.
-
Li et al., Fourier Neural Operator for Parametric Partial Differential Equations, 2020/2021.
-
Gu and Dao, Mamba: Linear-Time Sequence Modeling with Selective State Spaces, 2023.
-
Schulman et al., Proximal Policy Optimization Algorithms, 2017.