Online Learning Basics
OnlineCML is built on online machine learning — a paradigm where models are updated one observation at a time rather than fitting on a complete dataset. This page explains the core concepts.
The Online Learning Loop
for x, w, y, true_cate in stream:
# 1. Predict BEFORE learning (no look-ahead bias)
cate_hat = model.predict_one(x)
# 2. Update with the new observation
model.learn_one(x, w, y)
This predict-then-learn protocol is fundamental to online causal inference. It ensures that the model's prediction for observation \(i\) is based only on observations \(1, \ldots, i-1\), approximating the cross-fitting that batch methods use to avoid look-ahead bias in pseudo-outcomes.
Welford's Algorithm
All running statistics in OnlineCML use Welford's (1962) online algorithm for numerically stable one-pass mean and variance estimation:
Sample variance: \(\hat{\sigma}^2 = S_n / (n-1)\)
This runs in O(1) space and O(1) time per observation, with superior
numerical stability to the naive two-pass algorithm. See RunningStats.
Hoeffding Bound
The Hoeffding bound is used by CausalHoeffdingTree to decide when there
is enough evidence to split a node. For a random variable in range \([a, b]\),
after \(n\) observations the bound is:
With probability \(1 - \delta\), the true mean of any function of the data does not deviate from its sample mean by more than \(\varepsilon\). A split is triggered when the best split beats the second-best by more than \(\varepsilon\).
ADWIN (Adaptive Windowing)
ConceptDriftMonitor uses River's ADWIN (Bifet & Gavalda, 2007) algorithm.
ADWIN maintains a sliding window of variable length and detects drift when the
means of two sub-windows are statistically different, using the Hoeffding bound
as the test statistic.
ADWIN provides theoretical guarantees: - False positive rate bounded by \(\delta\) per test - Detection delay proportional to \(1 / \Delta\mu^2\) where \(\Delta\mu\) is the magnitude of the drift
Online Bagging
OnlineCausalForest uses Oza's (2001) online bagging: each observation is
presented to tree \(k\) exactly \(\text{Poisson}(\lambda)\) times, where \(\lambda\)
is the subsampling rate. When \(\lambda = 1\), this simulates bootstrap resampling
in the online setting and gives approximately the same variance reduction as
batch random forests.
References
- Welford, B.P. (1962). Note on a method for calculating corrected sums of squares and products. Technometrics, 4(3), 419–420.
- Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301), 13–30.
- Bifet, A. and Gavalda, R. (2007). Learning from time-changing data with adaptive windowing. Proceedings of SDM, 443–448.
- Oza, N.C. (2001). Online bagging and boosting. Proceedings of the American Statistical Association, 229–234.
- Domingos, P. and Hulten, G. (2000). Mining high-speed data streams. Proceedings of KDD, 71–80.