Skip to content

Online Learning Basics

OnlineCML is built on online machine learning — a paradigm where models are updated one observation at a time rather than fitting on a complete dataset. This page explains the core concepts.

The Online Learning Loop

for x, w, y, true_cate in stream:
    # 1. Predict BEFORE learning (no look-ahead bias)
    cate_hat = model.predict_one(x)

    # 2. Update with the new observation
    model.learn_one(x, w, y)

This predict-then-learn protocol is fundamental to online causal inference. It ensures that the model's prediction for observation \(i\) is based only on observations \(1, \ldots, i-1\), approximating the cross-fitting that batch methods use to avoid look-ahead bias in pseudo-outcomes.

Welford's Algorithm

All running statistics in OnlineCML use Welford's (1962) online algorithm for numerically stable one-pass mean and variance estimation:

\[M_n = M_{n-1} + \frac{x_n - M_{n-1}}{n}$$ $$S_n = S_{n-1} + (x_n - M_{n-1})(x_n - M_n)\]

Sample variance: \(\hat{\sigma}^2 = S_n / (n-1)\)

This runs in O(1) space and O(1) time per observation, with superior numerical stability to the naive two-pass algorithm. See RunningStats.

Hoeffding Bound

The Hoeffding bound is used by CausalHoeffdingTree to decide when there is enough evidence to split a node. For a random variable in range \([a, b]\), after \(n\) observations the bound is:

\[\varepsilon(n, \delta) = \sqrt{\frac{(b-a)^2 \ln(1/\delta)}{2n}}\]

With probability \(1 - \delta\), the true mean of any function of the data does not deviate from its sample mean by more than \(\varepsilon\). A split is triggered when the best split beats the second-best by more than \(\varepsilon\).

ADWIN (Adaptive Windowing)

ConceptDriftMonitor uses River's ADWIN (Bifet & Gavalda, 2007) algorithm. ADWIN maintains a sliding window of variable length and detects drift when the means of two sub-windows are statistically different, using the Hoeffding bound as the test statistic.

ADWIN provides theoretical guarantees: - False positive rate bounded by \(\delta\) per test - Detection delay proportional to \(1 / \Delta\mu^2\) where \(\Delta\mu\) is the magnitude of the drift

Online Bagging

OnlineCausalForest uses Oza's (2001) online bagging: each observation is presented to tree \(k\) exactly \(\text{Poisson}(\lambda)\) times, where \(\lambda\) is the subsampling rate. When \(\lambda = 1\), this simulates bootstrap resampling in the online setting and gives approximately the same variance reduction as batch random forests.

References

  • Welford, B.P. (1962). Note on a method for calculating corrected sums of squares and products. Technometrics, 4(3), 419–420.
  • Hoeffding, W. (1963). Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58(301), 13–30.
  • Bifet, A. and Gavalda, R. (2007). Learning from time-changing data with adaptive windowing. Proceedings of SDM, 443–448.
  • Oza, N.C. (2001). Online bagging and boosting. Proceedings of the American Statistical Association, 229–234.
  • Domingos, P. and Hulten, G. (2000). Mining high-speed data streams. Proceedings of KDD, 71–80.