7.3. Generative Models and ML Theory
Generative models are a useful stress test for TorchLean because they refuse to stay in one layer of the system. A diffusion example wants stochastic schedules, noising kernels, a denoising network, and a sampler. A VAE wants a runtime encoder/decoder but also a statement of the ELBO and the diagonal Gaussian KL that theorem statements can use. A VQ-VAE wants a runnable reconstruction path, plus a discrete codebook objective. A GAN wants two networks, two losses, and an adversarial interpretation that is easy to overclaim without a clear boundary.
We built the generative examples with that boundary in mind. The runnable examples show that the TorchLean API and runtime can train compact programs shaped like diffusion models, autoencoders, VAEs, VQ-VAEs, and GANs. The generative theory API records the pieces we can presently cite as checked Lean statements: Gaussian forward noising, sampler Lipschitz facts, ELBO algebra, KL nonnegativity, nearest code facts, and generator/discriminator objective decompositions.
7.3.1. Runtime Examples And Theory Statements
The generative stack is split deliberately. The runnable examples exercise the training and data path; the theorem declarations state the objective or sampler facts that can be checked in Lean.
- Diffusion: the diffusion command has timestep sampling, epsilon prediction, logs, and PPM samples; the theory side records a forward Gaussian law plus sampler Lipschitz and contraction facts.
- VAE: the compact VAE example runs over flattened image tensors; the theory side records ELBO algebra, beta weighting, and diagonal Gaussian KL nonnegativity.
- VQ-VAE: the VQ-VAE example gives a reconstruction path; the theory side records codebook loss decomposition and nearest-code facts.
- GAN / LSGAN: the GAN warmup runs a small generator/discriminator objective; the theory side packages generator and discriminator losses.
- Autoencoder / MAE: the autoencoder and masked autoencoder demos connect to reconstruction and masked objective algebra.
This division is important. A successful lake exe torchlean diffusion --steps 5 shows that the
program was typechecked, linked, and executed through the selected runtime. It does not prove that
the learned distribution is close to the data distribution. Conversely, a theorem such as
betaVae_loss_eq_weightedTwoTerm is a checked algebraic fact about the named objective; it is not a
claim that the tiny CIFAR example learned a useful latent representation.
7.3.2. Diffusion
Diffusion is the most visibly generative example in the current zoo. The executable is exposed by the diffusion example API, and the API model constructor is in NN.API.Models.Diffusion API. The example uses a compact residual convolutional epsilon predictor because it gives users a real noising and denoising training loop with an inspectable sampling artifact.
A small CPU smoke run looks like:
lake exe torchlean diffusion --cpu --steps 5
The richer path uses prepared image arrays:
python3 scripts/datasets/download_example_data.py --cifar10
lake exe -K cuda=true torchlean diffusion --cuda --fast-kernels \
  --dataset cifar10 --n-total 128 --steps 50 --hidden-c 8
For 64 by 64 local image folders, the example accepts converted arrays in the ImageNet style:
python3 scripts/datasets/torchlean_data_convert.py image-folder \
  --input /path/to/imagenet/train \
  --x-output data/real/imagenet64/imagenet64_train_X.npy \
  --y-output data/real/imagenet64/imagenet64_train_y.npy \
  --height 64 --width 64 --labels-from-dirs --limit 2000
lake exe -K cuda=true torchlean diffusion --cuda --fast-kernels \
  --dataset imagenet64 --n-total 800 --steps 200 --hidden-c 8 \
  --sample-ppm data/model_zoo/imagenet64_sample.ppm
The theory layer covers a different, more mathematical part of the stack:
- The forward Gaussian API defines forwardGaussian and proves forwardGaussian_isGaussian, so the forward noising distribution is recorded as a Gaussian law in Lean.
- The diffusion samplers API states base case identities for DDPM, DDIM, and probability flow Euler samplers, then proves Lipschitz and contraction style facts such as eulerStep_l2_distance_bound, eulerStep_l2_lipschitz_of_rhs_lipschitz, ddimStepSystem_contracts_of_step_contracts, and pfOdeEulerSystem_contracts_of_step_lipschitz.
Those are local sampler facts, not a full diffusion convergence theorem. The executable still owns the empirical ML workflow: data loading, timestep sampling, optimizer steps, logs, and image artifacts. The theory layer owns citeable statements about the mathematical objects that the runtime example is named after.
The informal definition is:
\operatorname{forwardGaussian}(c_0,c_1,x_0)
= \mathcal{L}\!\left(c_0x_0+c_1Z\right),
\qquad Z\sim\mathcal{N}(0,I)
and the theorem forwardGaussian_isGaussian says that this law is Gaussian with the expected mean
and covariance transformation. Sampler facts then talk about one-step maps:
x_{t-1} = \operatorname{step}(t,x_t,\operatorname{scoreOrNoise})
under Lipschitz or contraction hypotheses. The proof does not inspect image quality; it proves properties of the mathematical step map.
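Both objects have simple numerical shadows. The sketch below is plain NumPy, independent of the TorchLean runtime, and the names forward_gaussian_sample and euler_step are hypothetical illustrations rather than the Lean definitions. It checks the forward law's moments by Monte Carlo and the (1 + dt L)-Lipschitz bound on an Euler step map, which is the shape of eulerStep_l2_distance_bound:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_gaussian_sample(c0, c1, x0, rng):
    """Draw from the law of c0 * x0 + c1 * Z with Z ~ N(0, I)."""
    return c0 * x0 + c1 * rng.standard_normal(x0.shape)

# Monte Carlo check: the law is Gaussian with mean c0 * x0 and covariance c1^2 * I.
c0, c1, x0 = 0.8, 0.6, np.array([1.0, -2.0])
samples = np.stack([forward_gaussian_sample(c0, c1, x0, rng) for _ in range(20_000)])
assert np.allclose(samples.mean(axis=0), c0 * x0, atol=0.05)
assert np.allclose(samples.var(axis=0), c1 ** 2, atol=0.05)

def euler_step(x, t, rhs, dt):
    """One explicit Euler step: x + dt * rhs(t, x)."""
    return x + dt * rhs(t, x)

# If rhs is L-Lipschitz in x, the one-step map is (1 + dt * L)-Lipschitz.
L, dt = 0.5, 0.1
rhs = lambda t, x: -L * x  # a toy L-Lipschitz right-hand side
x, y = np.array([1.0, 2.0]), np.array([0.5, -1.0])
d_before = np.linalg.norm(x - y)
d_after = np.linalg.norm(euler_step(x, 0, rhs, dt) - euler_step(y, 0, rhs, dt))
assert d_after <= (1 + dt * L) * d_before
```

None of this is a substitute for the Lean proofs; it only makes the statements being proved concrete.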
The usual paper lineage is DDPM by Ho, Jain, and Abbeel, "Denoising Diffusion Probabilistic Models" (NeurIPS 2020, https://arxiv.org/abs/2006.11239), DDIM by Song, Meng, and Ermon, "Denoising Diffusion Implicit Models" (ICLR 2021, https://arxiv.org/abs/2010.02502), and score based SDEs by Song et al., "Score-Based Generative Modeling through Stochastic Differential Equations" (ICLR 2021, https://arxiv.org/abs/2011.13456).
7.3.3. VAE And The ELBO Boundary
The VAE example is intentionally modest. In NN.Examples.Models.Generative.Vae API, the model trains on flattened CIFAR images with a supervised target that keeps reconstruction channels near the input and latent mean/log variance proxy channels near zero. That is a runnable beta VAE style path through the TorchLean model API; it is not a full stochastic variational inference experiment.
The proof statements are sharper:
- NN.Spec.Models.Vae API names the VAE forward pass and loss shape.
- NN.MLTheory.Generative.Latent.VAE API proves objective decompositions such as betaVae_loss_eq_weightedTwoTerm, betaVae_loss_eq_reconstruction_of_zero_kl, and betaVae_loss_mono_beta_of_kl_nonneg.
- The same file proves diagonal Gaussian KL facts, including coordinateKlToStandard_nonneg, diagonalGaussianKlToStandardReal_nonneg, and diagonalGaussianKlToStandardReal_eq_zero_iff.
- It also records the reparameterization side with scalar_reparameterization_law and diagonal_reparameterization_coordinate_law, plus named ELBO helpers negativeElbo and betaNegativeElbo.
The precise VAE claim is: the runtime example trains a compact model shaped like a VAE, while the ML theory declarations prove algebraic and distributional facts about the VAE objective and diagonal reparameterization. That distinction is healthier than claiming that a tiny CIFAR smoke run formalizes all of variational inference.
The basic objective shape is:
\operatorname{negativeElbo}(terms)
= \operatorname{reconstruction}(terms)+\operatorname{KL}(terms),
\qquad
\operatorname{betaNegativeElbo}(\beta,terms)
= \operatorname{reconstruction}(terms)+\beta\,\operatorname{KL}(terms)
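The decompositions above are plain arithmetic, and the facts named by betaVae_loss_eq_weightedTwoTerm, betaVae_loss_eq_reconstruction_of_zero_kl, and betaVae_loss_mono_beta_of_kl_nonneg can be mirrored outside Lean; the helper below is a hypothetical illustration, not the TorchLean definition:

```python
def beta_negative_elbo(reconstruction, kl, beta):
    """Beta-weighted negative ELBO: reconstruction + beta * KL."""
    return reconstruction + beta * kl

# With zero KL the objective collapses to the reconstruction term,
# and for nonnegative KL it is monotone in beta.
assert beta_negative_elbo(0.7, 0.0, 4.0) == 0.7
assert beta_negative_elbo(0.7, 0.2, 2.0) <= beta_negative_elbo(0.7, 0.2, 3.0)
```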
The KL theorem diagonalGaussianKlToStandardReal_nonneg is the guarantee to remember: the diagonal
Gaussian KL to the standard normal is nonnegative, and it is zero exactly in the standard normal case
stated by diagonalGaussianKlToStandardReal_eq_zero_iff.
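The closed form behind those theorems is the coordinatewise expression 0.5 (mu^2 + sigma^2 - 1 - log sigma^2). A numerical sketch (hypothetical helper name, not the Lean definition) makes both facts easy to spot-check:

```python
import numpy as np

def diagonal_gaussian_kl_to_standard(mu, log_var):
    """KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over coordinates."""
    var = np.exp(log_var)
    return 0.5 * np.sum(mu ** 2 + var - 1.0 - log_var)

# Nonnegative for arbitrary parameters ...
kl = diagonal_gaussian_kl_to_standard(np.array([0.3, -1.2]), np.array([0.5, -0.1]))
assert kl >= 0.0
# ... and exactly zero at the standard normal (mu = 0, log_var = 0).
assert diagonal_gaussian_kl_to_standard(np.zeros(4), np.zeros(4)) == 0.0
```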
The paper anchor is Kingma and Welling, "Auto-Encoding Variational Bayes" (ICLR 2014, https://arxiv.org/abs/1312.6114). For beta-VAE terminology, the common citation is Higgins et al., "beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework" (ICLR 2017, https://openreview.net/forum?id=Sy2fzU9gl).
7.3.4. VQ-VAE
The VQ-VAE runtime file, NN.Examples.Models.Generative.VqVae API, uses a small vector reconstruction model with a narrow bottleneck. That shape keeps the example runnable while making the intended VQ-VAE connection explicit in the nearby spec and theory modules.
The theorem surface is in:
- NN.Spec.Models.VqVae API, which states quantization and loss equations for a codebook model.
- NN.MLTheory.Generative.Latent.VQVAE API, which proves vqvae_loss_eq_weightedThreeTerm, vqvae_loss_eq_reconstruction_of_zero_quantization, vqvae_loss_mono_beta_of_commitment_nonneg, and nearest code facts such as exactCodeMatch_isNearestCode and nearestCode_minimizes_quantization_loss.
Here again, the boundary is precise. The runnable example exercises TorchLean's training surface for a reconstruction model shaped like a VQ-VAE. The Lean theory proves that the named VQ-VAE objective splits into reconstruction, codebook, and beta weighted commitment terms, and that selected nearest codes minimize the stated finite squared distance objective.
The nearest code predicate is the key mathematical object:
\begin{aligned}
\operatorname{IsNearestCode}(codebook,z,k)
:= \forall j,\;&
\operatorname{squaredL2}(z,codebook(k))\\
&\le \operatorname{squaredL2}(z,codebook(j))
\end{aligned}
From that, nearestCode_minimizes_quantization_loss turns the finite codebook choice into the
stated quantization loss minimum.
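A direct numerical reading of the predicate (illustrative NumPy over a toy codebook, not the Lean development):

```python
import numpy as np

def nearest_code(codebook, z):
    """Index of the codebook row minimizing squared L2 distance to z."""
    d2 = np.sum((codebook - z) ** 2, axis=1)
    return int(np.argmin(d2))

codebook = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]])
z = np.array([0.9, 0.8])
k = nearest_code(codebook, z)

# IsNearestCode: the chosen index beats every other code in squared L2.
assert all(np.sum((z - codebook[k]) ** 2) <= np.sum((z - codebook[j]) ** 2)
           for j in range(len(codebook)))
```

Because the codebook is finite, the argmin exists and realizes the quantization loss minimum, which is the content the Lean theorem pins down.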
The paper anchor is van den Oord, Vinyals, and Kavukcuoglu, "Neural Discrete Representation Learning" (NeurIPS 2017, https://arxiv.org/abs/1711.00937).
7.3.5. GAN And LSGAN
GANs are especially easy to overstate. The NN.Examples.Models.Generative.Gan API is a small LSGAN style executable path. It trains a generator from latent noise toward a CIFAR minibatch as a stable warm up objective, and it trains a discriminator on real CIFAR images and deterministic noise images. It is not a full alternating adversarial training recipe with all of the empirical care that production GANs require.
The formal side is captured by two declaration groups:
- NN.Spec.Models.Gan API, which names generation, fake scores, and discriminator losses.
- NN.MLTheory.Generative.Latent.GAN API, which proves objective packaging lemmas such as generatorLoss_eq_weightedTwoTerm and discriminatorLoss_eq_weightedThreeTerm, plus zero loss facts generatorLoss_zero_of_fake_score_real and discriminatorLoss_zero_of_perfect_scores.
This gives us a clean citation line: TorchLean contains checked algebra for LSGAN style scalar objectives and a runnable generator/discriminator demo, but not a theorem that adversarial training converges or that generated samples match the CIFAR distribution.
The objective packaging is deliberately simple:
\begin{aligned}
\operatorname{generatorLoss}
&= \text{weighted two-term objective},\\
\operatorname{discriminatorLoss}
&= \text{weighted three-term objective}
\end{aligned}
\begin{aligned}
\text{perfect discriminator scores}
&\Longrightarrow \operatorname{discriminatorLoss}=0,\\
\text{fake score marked real}
&\Longrightarrow \operatorname{generatorLoss}=0
\end{aligned}
Those are small theorems, but they matter because they pin down which scalar objective the demo is claiming to optimize.
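The concrete least-squares form makes the zero-loss facts easy to see. The sketch below uses the textbook LSGAN objectives with real label 1 and fake label 0; the exact weights packaged by the Lean lemmas may differ, so treat the helper names as hypothetical:

```python
import numpy as np

def lsgan_generator_loss(fake_scores):
    """Least-squares generator loss: push fake scores toward the 'real' label 1."""
    return 0.5 * np.mean((fake_scores - 1.0) ** 2)

def lsgan_discriminator_loss(real_scores, fake_scores):
    """Least-squares discriminator loss: real scores toward 1, fake toward 0."""
    return 0.5 * np.mean((real_scores - 1.0) ** 2) + 0.5 * np.mean(fake_scores ** 2)

# Zero-loss facts: fake scores marked real zero the generator loss,
# and perfect discriminator scores zero the discriminator loss.
assert lsgan_generator_loss(np.ones(8)) == 0.0
assert lsgan_discriminator_loss(np.ones(8), np.zeros(8)) == 0.0
```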
The original GAN reference is Goodfellow et al., "Generative Adversarial Nets" (NeurIPS 2014, https://arxiv.org/abs/1406.2661). The least squares variant is Mao et al., "Least Squares Generative Adversarial Networks" (ICCV 2017, https://arxiv.org/abs/1611.04076).
7.3.6. Autoencoders And Masked Autoencoders
The plain autoencoder in the autoencoder example API is the simplest reconstruction baseline: load a CIFAR vector batch, run an encoder/decoder, and optimize MSE. It is useful because it removes the probabilistic and adversarial interpretation and lets us inspect the runtime path itself.
The masked autoencoder in the MAE example API is more architectural. It loads CIFAR image tensors, masks deterministic patches, embeds patch tokens with a small ViT style encoder, and trains a decoder head to reconstruct a flattened prefix of the image. This is a deliberately tiny ViT-MAE style path: real masking and patch tokens, but not a large asymmetric MAE pretraining run.
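Deterministic patch masking itself is a small, checkable operation. A minimal sketch, assuming square non-overlapping patches and a deterministic prefix mask; this is an illustration of the idea, not the MAE example's actual masking code:

```python
import numpy as np

def mask_patches(image, patch, mask_ratio):
    """Zero out a deterministic prefix of non-overlapping square patches."""
    h, w = image.shape
    ph, pw = h // patch, w // patch
    n_masked = int(ph * pw * mask_ratio)
    out = image.copy()
    for idx in range(n_masked):
        r, c = divmod(idx, pw)
        out[r * patch:(r + 1) * patch, c * patch:(c + 1) * patch] = 0.0
    return out

img = np.ones((8, 8))
masked = mask_patches(img, patch=4, mask_ratio=0.5)
# Two of the four 4x4 patches (a deterministic prefix) are zeroed.
assert masked.sum() == img.sum() / 2
```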
The paper anchor for MAE is He et al., "Masked Autoencoders Are Scalable Vision Learners" (CVPR 2022, https://arxiv.org/abs/2111.06377).
7.3.7. What Is Proved Vs What Is Run
For generative models, the safest citation discipline is:
- For a command that trains or samples through TorchLean, cite the corresponding file in NN/Examples/Models/Generative.
- For the diffusion forward law being Gaussian, cite forwardGaussian_isGaussian in the forward Gaussian API.
- For a sampler step being Lipschitz or contractive under hypotheses, cite the named theorems in the diffusion samplers API.
- For a VAE objective decomposing into reconstruction plus beta-weighted KL, cite betaVae_loss_eq_weightedTwoTerm.
- For diagonal Gaussian KL nonnegativity, cite diagonalGaussianKlToStandardReal_nonneg.
- For a VQ-VAE loss splitting into reconstruction, codebook, and commitment terms, cite vqvae_loss_eq_weightedThreeTerm.
- For nearest-code minimization in a finite codebook, cite nearestCode_minimizes_quantization_loss.
- For LSGAN generator and discriminator losses as weighted objectives, cite generatorLoss_eq_weightedTwoTerm and discriminatorLoss_eq_weightedThreeTerm.
The current development does not prove global distribution matching, convergence of SGD, image quality, FID, or a complete equivalence between every executable training step and every theorem in the ML theory layer. That boundary is part of the claim: TorchLean can run programs shaped like ML systems, and we can grow the formal island around the parts whose meaning we are ready to state precisely.