Gradient boosted trees (spec model) #
This is a math/reference specification of gradient boosting using decision trees.
Important caveat:
- Many computations here are written in a straightforward, proof-friendly style rather than as a tuned implementation.
References (classical):
- CART: Breiman, Friedman, Olshen, Stone, "Classification and Regression Trees", 1984.
- Gradient boosting: Friedman, "Greedy Function Approximation: A Gradient Boosting Machine", 2001.
- XGBoost: Chen and Guestrin, "XGBoost: A Scalable Tree Boosting System", 2016.
- LightGBM: Ke et al., "LightGBM: A Highly Efficient Gradient Boosting Decision Tree", 2017.
Tree representation #
We represent a decision tree as a small inductive datatype:
- `leaf value` stores the prediction for that leaf
- `split feature threshold left right` branches on a single feature
This is deliberately compact: it is easy to interpret (forward pass) and easy to fit with a simple greedy CART-style algorithm (implemented below).
Note on comparisons:
The spec layer’s scalar interface (`Context α`) gives us a decidable `>` (via `Context.decidable_gt`) but does not promise a decidable `<` for every backend. To stay portable, the tree uses the rule
`goRight := (x_feature > threshold)`
and goes left otherwise (which matches “≤ threshold” for the usual numeric orders).
A regression-tree node for the typed GBDT specification.
- `leaf value` stores the prediction for that leaf.
- `split feature threshold left right` branches on a single feature using the rule `goRight := (x_feature > threshold)`.
This keeps the representation small and easy to interpret.
- leaf {α : Type} (value : α) : TreeNode α
- split {α : Type} (feature : ℕ) (threshold : α) (left right : TreeNode α) : TreeNode α
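As a minimal Lean sketch, the datatype described above is just the following (mirroring the constructor signatures listed here, with `Nat` written for ℕ):

```lean
-- Two-constructor regression tree: a leaf stores a prediction; a split
-- branches on a single feature against a threshold.
inductive TreeNode (α : Type) : Type where
  | leaf (value : α) : TreeNode α
  | split (feature : Nat) (threshold : α) (left right : TreeNode α) : TreeNode α
```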
Decision-tree spec wrapper with an explicit max-depth budget.
The max_depth field is redundant (it matches the type index) but is convenient in downstream
code that wants a value-level knob.
Gradient boosted tree ensemble (regression-style) specification.
We keep the model as an explicit tensor of trees plus a shrinkage parameter learning_rate and an
initial_prediction bias term.
- trees : Tensor (DecisionTreeSpec α maxDepth) (Shape.dim nTrees Shape.scalar)
The ensemble's trees, one per boosting round.
- learning_rate : α
Shrinkage factor applied to the trees' summed contribution.
- initial_prediction : α
Bias term added before any tree contributions.
Forward pass #
All forward passes in this file are explicit about the feature dimension nFeatures. Tree depth
(maxDepth) limits how many splits a tree may contain; it has nothing to do with how many input
features exist.
Forward pass for a single decision tree on an input vector of nFeatures features.
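As an illustration, here is a minimal evaluation sketch over the `TreeNode` sketch above, specialized to `Float` and with a bare `Nat → Float` function standing in for the spec's sized feature tensor (both simplifications are ours):

```lean
-- Walk the tree with the rule goRight := (x_feature > threshold).
def TreeNode.eval (x : Nat → Float) : TreeNode Float → Float
  | .leaf v => v
  | .split f t l r => if x f > t then r.eval x else l.eval x
```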
Batched forward pass for a single decision tree.
Forward pass for a gradient boosted ensemble on a single input.
This simply accumulates initial_prediction + learning_rate * sum(tree_i(x)).
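A corresponding ensemble sketch, with a plain list standing in for the tensor of trees (the structure below is illustrative, not the spec's definition):

```lean
-- List-of-trees simplification of the ensemble; `forward` computes
-- initial_prediction + learning_rate * sum(tree_i(x)).
structure GBDTSketch where
  trees : List (TreeNode Float)
  learning_rate : Float
  initial_prediction : Float

def GBDTSketch.forward (m : GBDTSketch) (x : Nat → Float) : Float :=
  m.initial_prediction
    + m.learning_rate * m.trees.foldl (fun acc t => acc + t.eval x) 0
```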
Batched forward pass for a gradient boosted ensemble.
"Gradient" w.r.t. a tree's prediction.
Decision trees are piecewise-constant in the inputs, so we do not attempt to define meaningful derivatives through their internal decisions here. For boosting, this convention makes the intended dataflow explicit: gradient information is used to fit subsequent trees, not to differentiate through split predicates.
Approximate gradient w.r.t. input features for a tree.
In this spec we return 0 gradients (trees are treated as non-differentiable).
Zero input-gradient convention for the ensemble.
This file treats boosted trees as a classical model: we do not backpropagate through tree
structure. Instead, residuals/gradients are used to fit new trees. This helper is intentionally
not wired into an OpSpec; callers should not mistake it for a differentiable surrogate.
Classical training: CART-style regression trees (MSE) #
Tree-based models here are intended as baselines and reference points. For neural models we implement reverse-mode explicitly; for trees we instead provide a classical (non-gradient) training routine.
This section implements a small greedy CART-like procedure for regression:
- choose the split `(feature, threshold)` that minimizes `SSE(left) + SSE(right)` (sum of squared errors around each side’s mean)
- recurse until depth runs out or the split becomes degenerate
This is deterministic by construction:
- thresholds are chosen from the observed feature values
- ties are broken by the “first best” encountered during folding
This implementation prioritizes clarity and determinism over performance.
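A self-contained sketch of the scoring step over plain `Float` lists (the sample representation is illustrative; the spec's helpers defined below are the authoritative versions):

```lean
-- Mean with a 0 default for the empty list, and SSE around the mean.
def mean (ys : List Float) : Float :=
  if ys.isEmpty then 0 else ys.foldl (· + ·) 0 / ys.length.toFloat

def sse (ys : List Float) : Float :=
  let m := mean ys
  ys.foldl (fun acc y => acc + (y - m) * (y - m)) 0

-- Score a candidate (feature, threshold) split as SSE(left) + SSE(right);
-- `none` marks a degenerate split where one side is empty. A sample is a
-- (features, target) pair; out-of-range feature indices default to 0.
def scoreSplit (samples : List (List Float × Float)) (f : Nat) (t : Float) :
    Option Float :=
  let (r, l) := samples.partition fun s => s.1.getD f 0 > t
  if l.isEmpty || r.isEmpty then none
  else some (sse (l.map (·.2)) + sse (r.map (·.2)))
```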
A single regression training example: feature vector x and scalar target y.
- x : Tensor α (Shape.dim nFeatures Shape.scalar)
The feature vector.
- y : α
The regression target.
Sum all elements of a list.
Mean of a list, with 0 as a convenient default for the empty list.
Sum of squared deviations from the mean (SSE).
Decide whether a sample goes to the right branch for a (feature, threshold) split.
Partition samples into (left, right) for a given split.
Extract the regression targets from a list of examples.
Score a candidate split by sum of squared errors (SSE).
Returns none for degenerate splits (all samples go to one side).
Find the best split (feature, threshold) by exhaustive search over observed thresholds.
Leaf prediction value for regression: the mean target.
Fit a regression tree by greedy CART-style splitting (MSE/SSE), with a depth budget.
depthLeft counts how many splits we are still allowed to make.
Fit a regression decision tree from a batched dataset.
Classical training: CART-style classification trees (Gini impurity) #
For classification we often want the leaf prediction to be a label (e.g. String or Nat),
while split thresholds remain numeric. To avoid forcing labels into the numeric scalar type α,
we define a separate classifier tree type parameterized by the label type β.
The training algorithm mirrors the regression case:
- enumerate candidate thresholds from observed feature values
- pick the split that minimizes weighted Gini impurity
- recurse until depth runs out or no improvement is possible
- leaf prediction is the majority class (deterministic tie-breaking via first occurrence)
Why β is separate from α:
- Splits compare numeric features (`α`), so they need ordering/decidable comparison.
- Leaf predictions are usually discrete (`β`), and we do not want to pretend that labels form a numeric scalar domain.
PyTorch / sklearn analogies:
- This is closest in spirit to `sklearn.tree.DecisionTreeClassifier` with `criterion="gini"`, expressed as a small pure spec.
- The boosting semantics (adding many trees sequentially) matches the high-level idea of `sklearn.ensemble.GradientBoostingClassifier`, but TorchLean does not try to reproduce all of sklearn’s engineering details (regularization knobs, histogram binning, etc.).
A classifier tree node: numeric splits, label-valued leaves.
- leaf {α β : Type} (label : β) : ClassifierTreeNode α β
- split {α β : Type} (feature : ℕ) (threshold : α) (left right : ClassifierTreeNode α β) : ClassifierTreeNode α β
Specification wrapper for a classification decision tree (numeric splits, label-valued leaves).
- root : ClassifierTreeNode α β
The root node of the tree.
- max_depth : ℕ
The maximum depth of the tree.
Forward pass for a classifier decision tree on an input vector of nFeatures features.
Branching convention:
- go right iff `x[feature] > threshold`,
- otherwise go left.
This mirrors a common "≤ goes left / > goes right" convention, but avoids needing a decidable `<` for every `Context α` backend.
Count how many times lbl appears in ys.
Majority label with deterministic tie-breaking.
If there is a tie, we keep the earlier winner from the fold. This is intentional: it avoids non-determinism and keeps the spec stable across backends.
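A list-based sketch of that rule (assuming only `BEq` on the label type): the fold replaces the running winner only on a strictly larger count, so the earlier label keeps ties:

```lean
-- Deterministic majority vote: earlier occurrences win ties.
def majority {β : Type} [BEq β] : List β → Option β
  | [] => none
  | y :: ys =>
    let all := y :: ys
    some <| all.foldl
      (fun best c => if all.count c > all.count best then c else best) y
```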
Gini impurity of a multiset of labels.
gini(ys) = 1 - Σ_c p(c)^2 where p(c) is the empirical class frequency.
This is the standard CART impurity used by many tree classifiers.
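A direct `Float` sketch of the formula (assuming `BEq` labels; `eraseDups` supplies the distinct classes):

```lean
-- gini(ys) = 1 - Σ_c p(c)^2, with 0 for the empty list.
def gini {β : Type} [BEq β] (ys : List β) : Float :=
  if ys.isEmpty then 0
  else
    let n := ys.length.toFloat
    1 - ys.eraseDups.foldl (fun acc c =>
      let p := (ys.count c).toFloat / n
      acc + p * p) 0
```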
Weighted Gini impurity: |ys| * gini(ys).
Extract the labels from a list of classification examples.
Decide whether a classification sample goes right for a (feature, threshold) split.
Partition classification samples into (left, right) for a candidate split.
Score a candidate classification split (feature, threshold) by weighted Gini impurity.
Returns none when the split is degenerate (one side is empty); otherwise returns the score and
the (left, right) partitions.
Find the best classification split (feature, threshold) by exhaustive search.
Thresholds are drawn from the observed feature values in the dataset.
Fit a classification tree node by greedy CART-style splitting (Gini impurity).
depthLeft counts how many more splits we are allowed to make.
Fit a classification decision tree (CART-style) using Gini impurity.
Labels are supplied as a list of length `batch`.
- If `y` is shorter than `batch`, missing entries use `default` (so the function remains total).
- If `y` is longer than `batch`, extra labels are ignored.
This “list-based labels” API is meant for demos and small experiments. If you have labels already as a tensor or an index type, it is usually better to convert them to a list explicitly at the boundary of your program so the conversion is visible.
Mean squared error (MSE) loss for regression, reduced to a scalar by averaging over the batch.
Binary cross-entropy loss for classification (with a sigmoid), reduced to a scalar by averaging.
This is a direct probability-space loss helper. For numerically sensitive classification pipelines, prefer a stable "BCE with logits" implementation in the runtime/training layer.
Gradient of MSE loss w.r.t. predictions (elementwise).
Gradient of sigmoid binary cross-entropy loss w.r.t. predictions (elementwise).
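For reference, the standard identity behind this gradient, sketched elementwise over `Float` (assuming the predictions are pre-sigmoid logits):

```lean
-- With p = sigmoid z, the derivative of BCE(p, y) w.r.t. z is p - y.
def sigmoid (z : Float) : Float := 1 / (1 + Float.exp (-z))

def bceGradAt (z y : Float) : Float := sigmoid z - y
```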
Residual computation for gradient boosting.
For squared-error regression, the residual is target - prediction.
One gradient-boosting "add a tree" step, given a pre-fit new_tree.
This returns the current loss and the updated model with new_tree appended.
The residuals computed here are illustrative; the "fit a tree to residuals" variant below is
usually the more self-contained baseline.
Gradient boosting: a "fit-one-more-tree" step #
The original gradient_boosted_trees_train_step_spec expects a pre-fit new_tree. For a more
complete baseline, we also provide a deterministic step that fits that tree to the residuals.
Fit a new tree to residuals and append it to the ensemble.
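A sketch of that step in terms of the `GBDTSketch` above, with the tree fitter passed in as a function (a stand-in for the CART routine in this file):

```lean
-- Fit a tree to the residuals y - F(x), then append it to the ensemble.
def boostStep (m : GBDTSketch) (xs : List (Nat → Float)) (ys : List Float)
    (fitTree : List ((Nat → Float) × Float) → TreeNode Float) : GBDTSketch :=
  let residuals := (xs.zip ys).map fun (x, y) => y - m.forward x
  { m with trees := m.trees ++ [fitTree (xs.zip residuals)] }
```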
Increment a single feature counter by 1 inside a length-nFeatures vector.
This is used by the split-count feature-importance computation below.
Count how many times each feature index appears in split nodes of a tree.
This mirrors a very common "split count" importance heuristic.
Simple split-count feature importance for an ensemble.
This mirrors the common "how often was a feature used in a split?" heuristic. It is not the same as gain-based importance in XGBoost/LightGBM, but it is deterministic and easy to interpret.
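A sketch of the heuristic over the `TreeNode` sketch above: collect the feature index of every split node, then count per feature:

```lean
-- Feature indices used by split nodes, in traversal order.
def TreeNode.splitFeatures {α : Type} : TreeNode α → List Nat
  | .leaf _ => []
  | .split f _ l r => f :: (l.splitFeatures ++ r.splitFeatures)

-- Per-feature split counts for feature indices 0, …, nFeatures - 1.
def splitCountImportance {α : Type} (nFeatures : Nat) (t : TreeNode α) :
    List Nat :=
  (List.range nFeatures).map fun f => t.splitFeatures.count f
```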
Coefficient of determination (R^2) for regression.
This uses the standard formula 1 - ss_res / ss_tot, written as (ss_tot - ss_res) / ss_tot
to avoid an explicit 1 - ... when working in an abstract scalar context.
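Sketched over `Float` lists, reusing the `mean` helper from the regression sketch above:

```lean
-- R^2 = (ss_tot - ss_res) / ss_tot, the form described above.
def r2 (preds targets : List Float) : Float :=
  let m := mean targets
  let ssRes := (targets.zip preds).foldl
    (fun acc (y, p) => acc + (y - p) * (y - p)) 0
  let ssTot := targets.foldl (fun acc y => acc + (y - m) * (y - m)) 0
  (ssTot - ssRes) / ssTot
```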
Mean absolute error (MAE) for regression.
Root mean squared error (RMSE) for regression.
Loss-margin early-stopping predicate for gradient boosting.
This compares a training loss and validation loss with a margin min_delta.
The caller is responsible for tracking the patience counter; this predicate only checks one
train/validation loss pair.
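One plausible reading of such a predicate, as a `Float` sketch; the comparison direction here is our assumption, not the spec's:

```lean
-- Hypothetical: stop when validation loss exceeds training loss by more
-- than min_delta.
def earlyStopSignal (trainLoss valLoss minDelta : Float) : Bool :=
  valLoss > trainLoss + minDelta
```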
Adjust the ensemble learning rate (shrinkage) while keeping the same trees.
Deterministic prefix selection used as a proof-friendly stand-in for stochastic subsampling.
Real stochastic GBDT implementations sample rows using randomness. This helper instead takes the
first newBatch rows and uses h_new_batch to make that access total, so it is deterministic and
does not silently pad with zeros.
XGBoost-style squared-error proxy with an L2-shaped scalar penalty.
This objective is a typed loss for an already-materialized ensemble. It is not a full XGBoost split-gain objective; tree-builder policies such as histogram binning and split search are represented elsewhere by the tree-fitting routines.
LightGBM-style squared-error proxy with L1/L2-shaped scalar penalties.
This objective is deterministic because it operates on a fixed ensemble and batch. It records the loss shape used by examples rather than the full LightGBM histogram/split objective.