Evaluation
onlinecml.evaluation.progressive.progressive_causal_score(stream, model, metrics, step=100)
Evaluate a causal model progressively over a streaming dataset.
Implements the predict-before-learn protocol: for each observation, the model is scored before it sees the label, then updated. This gives an unbiased estimate of online generalisation performance.
At every `step` observations, each metric's current score is recorded.
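The protocol can be sketched in a few lines. This is an illustrative rewrite, not the library source: the `learn_one` method name and the per-metric result keys are assumptions; only the documented `predict_one`, metric `update(...)`, and `score` interfaces are taken from this page.

```python
# Hedged sketch of the predict-before-learn loop.
# `learn_one` and the result-dict layout are assumptions for illustration.
def progressive_sketch(stream, model, metrics, step=100):
    results = {"steps": [], **{type(m).__name__: [] for m in metrics}}
    for i, (x, w, y, true_cate) in enumerate(stream, start=1):
        cate_hat = model.predict_one(x)      # score BEFORE the model learns
        for m in metrics:
            m.update(x, w, y, true_cate, cate_hat, model)
        model.learn_one(x, w, y)             # then update the model
        if i % step == 0:                    # checkpoint every `step` obs
            results["steps"].append(i)
            for m in metrics:
                results[type(m).__name__].append(m.score)
    return results
```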
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `stream` | iterable of `(x, treatment, outcome, true_cate)` | Any OnlineCML dataset or iterable yielding 4-tuples. | required |
| `model` | `BaseOnlineEstimator` | An unfitted (or partially fitted) causal estimator implementing the `BaseOnlineEstimator` interface. | required |
| `metrics` | `list` | List of metric objects (e.g., `ATEError()`, `PEHE()`). | required |
| `step` | `int` | Record metric scores every `step` observations. | `100` |
Returns:

| Name | Type | Description |
|---|---|---|
| `results` | `dict` | Dictionary with key `"steps"` (the observation counts at each checkpoint) and, for each metric, its recorded score history. |
Examples:
>>> from onlinecml.datasets import LinearCausalStream
>>> from onlinecml.reweighting import OnlineIPW
>>> from onlinecml.evaluation import progressive_causal_score
>>> from onlinecml.evaluation.metrics import ATEError, PEHE
>>>
>>> results = progressive_causal_score(
... stream = LinearCausalStream(n=500, seed=0),
... model = OnlineIPW(),
... metrics = [ATEError(), PEHE()],
... step = 100,
... )
>>> len(results["steps"])
5
Source code in onlinecml/evaluation/progressive.py
Metrics
onlinecml.evaluation.metrics.ATEError
Running absolute error between the estimated and true ATE.
Accumulates the true CATE via a running mean (so it works for both
constant-ATE and heterogeneous streams) and computes
|model.predict_ate() - mean(true_cate)| at each checkpoint.
Examples:
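A minimal sketch of the documented computation, assuming the described running-mean logic; this re-implements the metric in plain Python rather than calling the library, and the simplified class is illustrative only:

```python
# ATEError sketch: running mean of true_cate vs. model.predict_ate().
# Not the library source; same idea as the documented metric.
class ATEErrorSketch:
    def __init__(self):
        self.n = 0
        self.mean_true = 0.0       # running mean of the true CATE
        self._ate_hat = 0.0

    def update(self, x, w, y, true_cate, cate_hat, model):
        self.n += 1
        self.mean_true += (true_cate - self.mean_true) / self.n
        self._ate_hat = model.predict_ate()

    @property
    def score(self):
        # |ATE_hat - ATE_true|; 0.0 before any data
        return 0.0 if self.n == 0 else abs(self._ate_hat - self.mean_true)
```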
Source code in onlinecml/evaluation/metrics.py
score
property
Current |ATE_hat - ATE_true|.
Returns 0.0 before any data has been seen.
reset()
update(x, w, y, true_cate, cate_hat, model)
Accumulate one observation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `x` | `dict` | Covariate dict (unused by this metric). | required |
| `w` | `int` | Treatment indicator (unused by this metric). | required |
| `y` | `float` | Observed outcome (unused by this metric). | required |
| `true_cate` | `float` | True CATE for this unit. Used to build a running mean of the population ATE. | required |
| `cate_hat` | `float` | Predicted CATE (unused by this metric; uses `model.predict_ate()`). | required |
| `model` | | The causal estimator. Must implement `predict_ate()`. | required |
Source code in onlinecml/evaluation/metrics.py
onlinecml.evaluation.metrics.PEHE
Running Precision in Estimation of Heterogeneous Effects.
Computes sqrt(mean((cate_hat - cate_true)^2)) incrementally using
Welford's algorithm. The cate_hat passed by progressive_causal_score
is the predict-before-learn CATE prediction from model.predict_one(x).
References
Hill, J. (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1), 217-240.
Examples:
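The documented recurrence can be sketched as a running mean of squared CATE errors; this simplified class (with a reduced two-argument `update` for brevity) is illustrative, not the library source:

```python
import math

# PEHE sketch: incremental sqrt(mean((cate_hat - cate_true)^2)).
# Simplified signature; the real metric takes the full 6-argument update.
class PEHESketch:
    def __init__(self):
        self.n = 0
        self.mean_sq = 0.0

    def update(self, cate_hat, true_cate):
        self.n += 1
        err_sq = (cate_hat - true_cate) ** 2
        self.mean_sq += (err_sq - self.mean_sq) / self.n  # running mean

    @property
    def score(self):
        return math.sqrt(self.mean_sq) if self.n else 0.0
```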
Source code in onlinecml/evaluation/metrics.py
score
property
Current PEHE (sqrt of running mean squared CATE error).
Returns 0.0 before any data has been seen.
reset()
update(x, w, y, true_cate, cate_hat, model)
Accumulate one observation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `x` | `dict` | Covariate dict (unused by this metric). | required |
| `w` | `int` | Treatment indicator (unused by this metric). | required |
| `y` | `float` | Observed outcome (unused by this metric). | required |
| `true_cate` | `float` | True CATE for this unit. | required |
| `cate_hat` | `float` | Predicted CATE from `model.predict_one(x)`. | required |
| `model` | | The causal estimator (unused by this metric). | required |
Source code in onlinecml/evaluation/metrics.py
onlinecml.evaluation.metrics.UpliftAUC
Area under the uplift curve (AUUC).
Accumulates (cate_hat, treatment, outcome) triples. At each call to
score, it sorts the buffer by predicted CATE descending, computes the
cumulative uplift curve, and returns the area via the trapezoidal rule,
normalized to [0, 1].
The uplift at depth `k` is:

    uplift(k) = mean_outcome_treated(top_k) - mean_outcome_control(top_k)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `max_buffer` | `int` | Maximum number of recent observations to retain. Older observations are dropped to keep memory bounded. | `5000` |
Examples:
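The curve-and-area computation described above can be sketched as follows. This is a plain-Python illustration of the documented steps (sort by predicted CATE descending, compute cumulative uplift, integrate by the trapezoidal rule over the depth fraction); the library's exact `[0, 1]` normalization is not reproduced here:

```python
# AUUC sketch over (cate_hat, treatment, outcome) triples.
# Illustrative only; the library's normalization may differ.
def uplift_curve(buffer):
    buf = sorted(buffer, key=lambda t: -t[0])   # rank by cate_hat desc
    nt = nc = 0
    st = sc = 0.0
    curve = []
    for _, w, y in buf:
        if w:
            nt += 1; st += y
        else:
            nc += 1; sc += y
        # uplift(k) = mean treated outcome - mean control outcome in top k
        curve.append(st / nt - sc / nc if nt and nc else 0.0)
    return curve

def auuc(buffer):
    curve = uplift_curve(buffer)
    if len(curve) < 2:
        return 0.0
    # trapezoidal rule over depth fraction in [0, 1]
    return sum((a + b) / 2 for a, b in zip(curve, curve[1:])) / (len(curve) - 1)
```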
Source code in onlinecml/evaluation/metrics.py
score
property
Current AUUC, normalized to [0, 1].
Returns 0.0 when fewer than two observations have been seen or
when all units are in one arm.
reset()
update(x, w, y, true_cate, cate_hat, model)
Accumulate one observation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `x` | `dict` | Covariate dict (unused by this metric). | required |
| `w` | `int` | Treatment indicator (0 or 1). | required |
| `y` | `float` | Observed outcome. | required |
| `true_cate` | `float` | True CATE (unused by this metric). | required |
| `cate_hat` | `float` | Predicted CATE used to rank units. | required |
| `model` | | The causal estimator (unused by this metric). | required |
Source code in onlinecml/evaluation/metrics.py
onlinecml.evaluation.metrics.QiniCoefficient
Qini coefficient (area under the Qini curve).
The Qini curve plots cumulative incremental gains vs. cumulative population fraction when units are ranked by predicted CATE descending. The Qini coefficient is the area under this curve minus the area under the random policy line, normalized by the maximum achievable Qini.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `max_buffer` | `int` | Maximum number of recent observations to retain. | `5000` |
References
Radcliffe, N.J. (2007). Using control groups to target on predicted lift. Direct Marketing Analytics Journal, 14-21.
Examples:
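The Qini construction can be sketched in plain Python under the standard definition of cumulative incremental gain, `Q(k) = st_k - sc_k * nt_k / nc_k`; subtracting the random-policy line gives the unnormalized coefficient. The library's final normalization by the maximum achievable Qini is not reproduced in this illustration:

```python
# Qini sketch over (cate_hat, treatment, outcome) triples; illustrative only.
def qini_curve(buffer):
    buf = sorted(buffer, key=lambda t: -t[0])   # rank by cate_hat desc
    nt = nc = 0
    st = sc = 0.0
    curve = [0.0]
    for _, w, y in buf:
        if w:
            nt += 1; st += y
        else:
            nc += 1; sc += y
        # cumulative incremental gain at this depth
        curve.append(st - sc * nt / nc if nc else float(st))
    return curve

def qini_area_minus_random(buffer):
    curve = qini_curve(buffer)
    n = len(curve) - 1
    if n < 2:
        return 0.0
    area = sum((a + b) / 2 for a, b in zip(curve, curve[1:])) / n
    random_area = curve[-1] / 2   # straight line from 0 to the final gain
    return area - random_area
```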
Source code in onlinecml/evaluation/metrics.py
score
property
Current normalized Qini coefficient.
Returns 0.0 when fewer than two observations have been seen or
when either arm is empty.
reset()
update(x, w, y, true_cate, cate_hat, model)
Accumulate one observation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `x` | `dict` | Covariate dict (unused by this metric). | required |
| `w` | `int` | Treatment indicator (0 or 1). | required |
| `y` | `float` | Observed outcome. | required |
| `true_cate` | `float` | True CATE (unused by this metric). | required |
| `cate_hat` | `float` | Predicted CATE used to rank units. | required |
| `model` | | The causal estimator (unused by this metric). | required |
Source code in onlinecml/evaluation/metrics.py
onlinecml.evaluation.metrics.CIWidth
Running mean width of the confidence interval on the ATE.
Computes the mean of (upper - lower) for each CI returned by
model.predict_ci(alpha) at each observation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `alpha` | `float` | Significance level for the CI (0.05 gives a 95% CI). | `0.05` |
Examples:
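A minimal sketch of the documented computation, a running mean of `upper - lower` from `model.predict_ci(alpha)`; the simplified class and its reduced `update` signature are illustrative, not the library source:

```python
# CIWidth sketch: running mean width of the ATE confidence interval.
# Simplified signature; the real metric takes the full 6-argument update.
class CIWidthSketch:
    def __init__(self, alpha=0.05):
        self.alpha = alpha
        self.n = 0
        self.mean_width = 0.0

    def update(self, model):
        lo, hi = model.predict_ci(self.alpha)
        self.n += 1
        self.mean_width += ((hi - lo) - self.mean_width) / self.n

    @property
    def score(self):
        return self.mean_width
```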
Source code in onlinecml/evaluation/metrics.py
score
property
Current mean CI width.
Returns 0.0 before any data has been seen.
reset()
update(x, w, y, true_cate, cate_hat, model)
Accumulate one observation.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `x` | `dict` | Covariate dict (unused by this metric). | required |
| `w` | `int` | Treatment indicator (unused by this metric). | required |
| `y` | `float` | Observed outcome (unused by this metric). | required |
| `true_cate` | `float` | True CATE (unused by this metric). | required |
| `cate_hat` | `float` | Predicted CATE (unused by this metric). | required |
| `model` | | The causal estimator. Must implement `predict_ci(alpha)`. | required |
Source code in onlinecml/evaluation/metrics.py
onlinecml.evaluation.metrics.CIcoverage
Empirical coverage of the ATE confidence interval.
At each checkpoint, checks whether the true population ATE (running mean
of true_cate) falls within model.predict_ci(alpha). Returns the
running fraction of checkpoints where the CI covered the true ATE.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `alpha` | `float` | Significance level (0.05 gives a 95% CI; target coverage = 0.95). | `0.05` |
Examples:
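The coverage check described above can be sketched as follows: keep a running mean of `true_cate` as the population ATE estimate and count how often it falls inside `model.predict_ci(alpha)`. The simplified class and reduced `update` signature are illustrative, not the library source:

```python
# CIcoverage sketch: fraction of checkpoints where the CI covers the true ATE.
# Simplified signature; the real metric takes the full 6-argument update.
class CICoverageSketch:
    def __init__(self, alpha=0.05):
        self.alpha = alpha
        self.n = 0
        self.covered = 0
        self.mean_true = 0.0   # running mean of true_cate = population ATE

    def update(self, true_cate, model):
        self.n += 1
        self.mean_true += (true_cate - self.mean_true) / self.n
        lo, hi = model.predict_ci(self.alpha)
        if lo <= self.mean_true <= hi:
            self.covered += 1

    @property
    def score(self):
        return self.covered / self.n if self.n else 0.0
```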
Source code in onlinecml/evaluation/metrics.py
score
property
Fraction of checkpoints where the CI covered the true ATE.
Returns 0.0 before any data has been seen.
reset()
update(x, w, y, true_cate, cate_hat, model)
Accumulate one observation and check CI coverage.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| `x` | `dict` | Covariate dict (unused by this metric). | required |
| `w` | `int` | Treatment indicator (unused by this metric). | required |
| `y` | `float` | Observed outcome (unused by this metric). | required |
| `true_cate` | `float` | True CATE. Used to estimate the population ATE via running mean. | required |
| `cate_hat` | `float` | Predicted CATE (unused by this metric). | required |
| `model` | | The causal estimator. Must implement `predict_ci(alpha)`. | required |