# Masked Autoencoder Objective Semantics
This module formalizes the finite patch/token core of a masked autoencoder (MAE):

- an input is a finite collection of patches `Fin n → Patch`;
- an encoder/decoder stack is abstracted to a reconstruction function;
- the objective is a sum of per-patch losses over the masked indices.
The formalization is intentionally small. It captures the semantics that examples and future model helpers should preserve, while leaving ViT blocks, convolutional patch embeddings, and image IO in the executable API layer.
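The shape of this core can be sketched in Lean. The names below (`Patches`, `reconstruct`, `maeLoss`) are illustrative assumptions about the module's declarations, not necessarily its actual identifiers:

```lean
import Mathlib.Data.Real.Basic

-- Illustrative sketch only: these names are hypothetical stand-ins
-- for the module's actual declarations.

/-- A finite patch collection: `n` patches indexed by `Fin n`. -/
def Patches (Patch : Type) (n : ℕ) : Type := Fin n → Patch

/-- Reconstruct every patch by applying a reconstruction function pointwise. -/
def reconstruct {Patch : Type} {n : ℕ} (f : Patch → Patch)
    (x : Patches Patch n) : Patches Patch n :=
  fun i => f (x i)

/-- MAE-style loss: sum of per-patch losses over an explicit masked-index list. -/
def maeLoss {Patch : Type} {n : ℕ} (loss : Patch → Patch → ℝ)
    (x xhat : Patches Patch n) (masked : List (Fin n)) : ℝ :=
  (masked.map fun i => loss (x i) (xhat i)).sum
```

Keeping the masked set as an explicit `List` makes the ordering-invariance theorems below meaningful: the list is a serialization of a set, and the theorems show the objective does not see the serialization order.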
Paper anchor: “Masked Autoencoders Are Scalable Vision Learners” (He, Chen, Xie, Li, Dollár, Girshick, 2021), arXiv:2111.06377. The key objective-level fact we encode is that the reconstruction loss is taken over the masked patch set. The objective therefore must not depend on an arbitrary ordering of masked patch indices; `maeLoss_reverse` is the small finite theorem capturing that property.
A finite patch collection.
Reconstruct every patch using a reconstruction function.
Exact reconstruction predicate for all patches.
MAE-style masked reconstruction loss over an explicit masked-index list.
The list is the serialized representation of the masked set. The theorems below prove that the objective behaves like a set sum for basic reorderings/decompositions.
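One of the basic decompositions can be sketched as follows. This is a self-contained illustration that redeclares a simplified `maeLoss` (per-patch losses precomputed as `perPatch`) so the snippet stands alone; splitting the masked-index list splits the loss, because `List.sum` distributes over `++`:

```lean
import Mathlib.Data.Real.Basic

-- Simplified, hypothetical `maeLoss` over precomputed per-patch losses.
def maeLoss {n : ℕ} (perPatch : Fin n → ℝ) (masked : List (Fin n)) : ℝ :=
  (masked.map perPatch).sum

/-- Decomposition: the loss over a concatenation is the sum of the losses. -/
theorem maeLoss_append {n : ℕ} (perPatch : Fin n → ℝ) (l₁ l₂ : List (Fin n)) :
    maeLoss perPatch (l₁ ++ l₂) = maeLoss perPatch l₁ + maeLoss perPatch l₂ := by
  simp [maeLoss, List.map_append, List.sum_append]
```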
The MAE loss is invariant under reversing the order of the masked-index list. This is the small formal version of “masked reconstruction is a set objective, not an ordering objective.”
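A proof sketch of this invariance, again over a simplified stand-in `maeLoss` with precomputed per-patch losses so the snippet is self-contained, reduces to `List.sum_reverse` after pushing `map` through `reverse`:

```lean
import Mathlib.Data.Real.Basic

-- Simplified, hypothetical `maeLoss` over precomputed per-patch losses.
def maeLoss {n : ℕ} (perPatch : Fin n → ℝ) (masked : List (Fin n)) : ℝ :=
  (masked.map perPatch).sum

/-- Reversing the masked-index list does not change the loss. -/
theorem maeLoss_reverse {n : ℕ} (perPatch : Fin n → ℝ) (masked : List (Fin n)) :
    maeLoss perPatch masked.reverse = maeLoss perPatch masked := by
  simp [maeLoss, List.map_reverse, List.sum_reverse]
```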
If every selected patch has zero reconstruction loss, the masked MAE loss is zero.
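This zero-loss fact can be sketched the same way (hypothetical simplified `maeLoss`, redeclared so the snippet stands alone), via `List.sum_eq_zero`:

```lean
import Mathlib.Data.Real.Basic

-- Simplified, hypothetical `maeLoss` over precomputed per-patch losses.
def maeLoss {n : ℕ} (perPatch : Fin n → ℝ) (masked : List (Fin n)) : ℝ :=
  (masked.map perPatch).sum

/-- If every masked patch has zero loss, the summed loss is zero. -/
theorem maeLoss_eq_zero {n : ℕ} (perPatch : Fin n → ℝ) (masked : List (Fin n))
    (h : ∀ i ∈ masked, perPatch i = 0) :
    maeLoss perPatch masked = 0 := by
  simp only [maeLoss]
  exact List.sum_eq_zero fun x hx => by
    obtain ⟨i, hi, rfl⟩ := List.mem_map.mp hx
    exact h i hi
```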
Reconstructing with the identity decoder/prediction is exact.
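With a pointwise `reconstruct` (the hypothetical sketch from above, redeclared here), identity reconstruction is exact definitionally:

```lean
-- Hypothetical pointwise reconstruction over a finite patch collection.
def reconstruct {Patch : Type} {n : ℕ} (f : Patch → Patch)
    (x : Fin n → Patch) : Fin n → Patch :=
  fun i => f (x i)

/-- The identity decoder/prediction reconstructs every patch exactly. -/
theorem reconstruct_id {Patch : Type} {n : ℕ} (x : Fin n → Patch) :
    reconstruct id x = x :=
  rfl
```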