Global pooling (spec layer) #
Global pooling reduces the spatial dimensions (H×W) either to 1×1 (retaining the channel axis) or
to a flat vector of length `inC`. This file provides both average and max variants, together with
explicit backward rules.
We tried to mimic PyTorch closely:
- The common pattern is `AdaptiveAvgPool2d((1,1))` / `AdaptiveMaxPool2d((1,1))`, then flatten to a length-`C` vector before a classifier.
- We usually work with a single image `(C,H,W)` (no batch dimension) here to keep the API small.
The forward pass generalizes cleanly (and the code is intentionally structured that way):
- Global pooling is "reduce each channel over (H,W)".
- The helpers `global_pool2d_1x1` and `global_pool2d_flat` already capture the reusable shape and indexing discipline; the only thing that changes between avg/max/min/etc. is the `reduce : Image inH inW α → Tensor α .scalar` (see the sketch below).
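To make the abstraction concrete, here is a minimal Lean sketch under simplified types: a channel is an `Array (Array Float)` (H rows of W entries) rather than the file's `Image inH inW α`, and the names `avgReduce`, `maxReduce`, and `globalPool2dFlatSketch` are hypothetical stand-ins, not the file's definitions.

```lean
/-- Hypothetical average reduce: sum every entry, divide by `H*W`. -/
def avgReduce (ch : Array (Array Float)) : Float :=
  let sum := ch.foldl (fun acc row => row.foldl (· + ·) acc) 0.0
  let n   := ch.size * (ch.getD 0 #[]).size
  sum / n.toFloat

/-- Hypothetical max reduce: fold with a max step, starting from `-inf`. -/
def maxReduce (ch : Array (Array Float)) : Float :=
  ch.foldl (fun acc row => row.foldl (fun m x => if m ≤ x then x else m) acc)
    (-(1.0 / 0.0))

/-- Generic forward `(C,H,W) → (C)`: the `reduce` is the only moving part. -/
def globalPool2dFlatSketch (reduce : Array (Array Float) → Float)
    (img : Array (Array (Array Float))) : Array Float :=
  img.map reduce

#eval globalPool2dFlatSketch avgReduce #[#[#[1.0, 3.0], #[2.0, 2.0]]]  -- #[2.000000]
```

Swapping `avgReduce` for `maxReduce` (or a min reduce) changes nothing about the shape or indexing, which is the point of the helpers.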
Max-pooling subtlety:
- If there are multiple spatial positions achieving the same maximum, the backward pass needs a
tie-breaking convention. This file provides both:
- a "mask all max positions" rule (sending the full gradient to every max), and
- a "distributed" rule (split the gradient evenly among max positions). PyTorch's exact tie behavior is an implementation detail; the important thing is to make the choice explicit in the spec.
Why the backward does not unify for free:
- Different reductions have genuinely different adjoints. Average pooling sends the upstream gradient uniformly to every spatial position; max/min pooling routes gradients only to the argmax/argmin set and must choose a tie convention.
- So while the forward can be abstracted over a `reduce`, a fully generic backward would need extra structure (essentially a reduce paired with its VJP; see the sketch below). That is why we keep explicit backward specs for the concrete ops we care about.
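A hypothetical way to package that extra structure, continuing the simplified types above (this record is not part of the file; it just names what a generic backward would need):

```lean
/-- A reduction bundled with its vector-Jacobian product. -/
structure ReduceWithVJP where
  reduce : Array (Array Float) → Float
  vjp    : Array (Array Float) → Float → Array (Array Float)
```

An avg instance would pair `avgReduce` with a uniform `vjp`; a max instance would pair `maxReduce` with one of the tie-breaking rules sketched earlier.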
Layer tags #
Global pooling has no trainable parameters. We still keep a compact "layer spec" record so call sites can carry a tag (and so the API matches the style of other layer files).
Tag structure for global average pooling (no trainable parameters).
Tag structure for global max pooling (no trainable parameters).
Output shape for global pooling that keeps a 1×1 spatial grid: (C,H,W) -> (C,1,1).
Output shape for global pooling that flattens spatial dims away: (C,H,W) -> (C).
Helper: reduce a single channel over its spatial grid #
This is the shared "walk the (H,W) grid" loop used by avg/max pooling.
Reduce a single channel `Image inH inW α` down to a scalar using a fold over (H,W).
Alias for `reduce_spatial` (kept to make call sites read like "reduce this channel").
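Under the simplified types from the earlier sketches, the shared loop is just a nested fold (hypothetical `reduceSpatialSketch`, mirroring `reduce_spatial`):

```lean
/-- Fold every (h, w) entry of a channel in row-major order. -/
def reduceSpatialSketch (f : Float → Float → Float) (init : Float)
    (ch : Array (Array Float)) : Float :=
  ch.foldl (fun acc row => row.foldl f acc) init

#eval reduceSpatialSketch (· + ·) 0.0 #[#[1.0, 2.0], #[3.0, 4.0]]  -- 10.000000
```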
Helper: "wrap a scalar result back into an image" #
PyTorch mental picture: after pooling you conceptually have one scalar per channel; these helpers put that scalar back into the desired output shape.
Broadcast a scalar into a 1×1 image.
Generic global pooling helper producing (C,1,1).
Generic global pooling helper producing (C).
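Continuing the sketch, the 1×1 wrapper and the (C,1,1) helper look like this (hypothetical names mirroring the helpers above, same simplified types):

```lean
/-- Put a pooled scalar back into a 1×1 spatial grid. -/
def scalarToImage1x1 (x : Float) : Array (Array Float) := #[#[x]]

/-- Generic forward `(C,H,W) → (C,1,1)`, again parameterized by `reduce`. -/
def globalPool2d1x1Sketch (reduce : Array (Array Float) → Float)
    (img : Array (Array (Array Float))) : Array (Array (Array Float)) :=
  img.map fun ch => scalarToImage1x1 (reduce ch)
```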
Forward specs #
These are the layer-level forward meanings, written to mirror the corresponding PyTorch layers.
Global average pooling: (C,H,W) -> (C,1,1).
Global max pooling: (C,H,W) -> (C,1,1).
Global average pooling (flattened): (C,H,W) -> (C).
Global max pooling (flattened): (C,H,W) -> (C).
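As a usage sketch, the four forwards differ only in the reduce and the output wrapper. This builds on the hypothetical helpers from the earlier sketches:

```lean
def demoImg : Array (Array (Array Float)) :=
  #[#[#[1.0, 2.0], #[3.0, 4.0]],   -- channel 0
    #[#[5.0, 5.0], #[5.0, 5.0]]]   -- channel 1

#eval globalPool2dFlatSketch avgReduce demoImg  -- #[2.500000, 5.000000]
#eval globalPool2dFlatSketch maxReduce demoImg  -- #[4.000000, 5.000000]
#eval globalPool2d1x1Sketch  avgReduce demoImg  -- #[#[#[2.500000]], #[#[5.000000]]]
```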
Backward/VJP specs #
These are reverse-mode rules that match the intended math:
- avg pooling: distribute the upstream gradient evenly over all (H,W) positions (see the sketch below);
- max pooling: route the upstream gradient to the max locations (with a tie convention).
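The avg rule is a one-liner under the simplified types (hypothetical `avgBackwardSketch`): every position receives `g / (H*W)`.

```lean
/-- Avg-pool VJP for one channel: spread `g` uniformly over the grid. -/
def avgBackwardSketch (ch : Array (Array Float)) (g : Float) : Array (Array Float) :=
  let n := ch.size * (ch.getD 0 #[]).size
  ch.map fun row => row.map fun _ => g / n.toFloat

#eval avgBackwardSketch #[#[1.0, 2.0], #[3.0, 4.0]] 1.0
-- #[#[0.250000, 0.250000], #[0.250000, 0.250000]]
```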
Backward/VJP for global average pooling (C,1,1) output.
Backward/VJP for flattened global average pooling (C) output.
Backward/VJP for global max pooling (C,1,1) output.
Tie convention: every spatial position equal to the maximum receives the full upstream gradient.
Backward/VJP for flattened global max pooling (C) output.
Tie convention: every spatial position equal to the maximum receives the full upstream gradient.
Alternative max-pooling backward that splits the gradient evenly across max positions.
This is often a nicer mathematical choice when the max is not unique.
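For example, with the hypothetical `maxBackwardSplit` from the earlier sketch, two tied maxima each receive half of the upstream gradient:

```lean
#eval maxBackwardSplit #[#[1.0, 3.0], #[3.0, 0.0]] 1.0
-- #[#[0.000000, 0.500000], #[0.500000, 0.000000]]
```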