Datasets
Synthetic Streams
onlinecml.datasets.linear_causal.LinearCausalStream
Synthetic streaming dataset with a linear DGP and constant ATE.
Generates a stream of (features, treatment, outcome, true_cate) tuples
from a linear data-generating process with confounding. The true CATE
is constant (equal to true_ate) for all units.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| n | int | Number of observations to generate. Default 1000. | 1000 |
| n_features | int | Number of covariates. Default 5. | 5 |
| true_ate | float | The true constant Average Treatment Effect. Default 2.0. | 2.0 |
| confounding_strength | float | Controls how strongly covariates predict treatment assignment. 0.0 = no confounding (RCT), 1.0 = strong confounding. Default 0.5. | 0.5 |
| noise_std | float | Standard deviation of the outcome noise. Default 1.0. | 1.0 |
| seed | int or None | Random seed for reproducibility. If None, results are random. | None |
Notes
Data-generating process:
- `beta ~ N(0, I_p)`: fixed coefficient vector per stream iteration
- `X_i ~ N(0, I_p)`: independent covariates
- `logit(P(W=1|X)) = confounding_strength * X @ beta / sqrt(p)`
- `W_i ~ Bernoulli(sigmoid(logit_p))`
- `Y_i = X @ beta + W_i * true_ate + eps`, `eps ~ N(0, noise_std^2)`
The coefficient vector beta is re-sampled once per __iter__
call. Re-iterating with the same seed gives the same stream.
Because CATE is constant, true_cate == true_ate for all units.
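The data-generating process in these Notes can be sketched in a few lines of NumPy. This is a minimal illustration, not the library's implementation: the helper name `linear_dgp_sketch`, the feature keys `x0`, `x1`, ... and the order of the random draws are assumptions made for the example.

```python
import numpy as np


def linear_dgp_sketch(n=5, n_features=3, true_ate=2.0,
                      confounding_strength=0.5, noise_std=1.0, seed=42):
    """Yield (x, w, y, tau) tuples following the DGP in the Notes.

    Hypothetical helper for illustration; not the library's code.
    """
    rng = np.random.default_rng(seed)
    p = n_features
    beta = rng.normal(size=p)  # drawn once per pass over the stream
    for _ in range(n):
        x = rng.normal(size=p)  # X_i ~ N(0, I_p)
        logit = confounding_strength * (x @ beta) / np.sqrt(p)
        propensity = 1.0 / (1.0 + np.exp(-logit))  # sigmoid
        w = int(rng.random() < propensity)  # W_i ~ Bernoulli(propensity)
        y = float(x @ beta + w * true_ate + rng.normal(scale=noise_std))
        # Feature keys "x0", "x1", ... are an assumption of this sketch.
        features = {f"x{j}": float(x[j]) for j in range(p)}
        yield features, w, y, float(true_ate)


rows = list(linear_dgp_sketch(n=5, seed=42))
rows_again = list(linear_dgp_sketch(n=5, seed=42))
```

Because the generator is re-seeded on every call, `rows` and `rows_again` are identical, which mirrors the reproducibility behaviour described for re-iterating with the same seed.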
Examples:
>>> stream = LinearCausalStream(n=500, n_features=3, true_ate=2.0, seed=42)
>>> for x, w, y, tau in stream:
... pass # process one observation at a time
>>> len(stream)
500
Source code in onlinecml/datasets/linear_causal.py
__iter__()
Iterate over the stream, yielding one observation at a time.
Yields:
| Name | Type | Description |
|---|---|---|
| x | dict | Feature dictionary with keys |
| treatment | int | Treatment indicator (0 or 1). |
| outcome | float | Observed outcome. |
| true_cate | float | The true CATE for this unit (equals |
Source code in onlinecml/datasets/linear_causal.py
onlinecml.datasets.heterogeneous_causal.HeterogeneousCausalStream
Synthetic streaming dataset with heterogeneous treatment effects.
Generates a stream of (features, treatment, outcome, true_cate) tuples where the CATE varies across units according to a specified functional form. Designed for benchmarking CATE estimators.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| n | int | Number of observations to generate. Default 1000. | 1000 |
| n_features | int | Number of covariates (at least 2 required for nonlinear DGP). Default 5. | 5 |
| true_ate | float | Base treatment effect. The marginal average | 2.0 |
| heterogeneity | str | Functional form of the CATE: Default | 'nonlinear' |
| confounding_strength | float | Controls how strongly covariates predict treatment assignment. Default 0.5. | 0.5 |
| noise_std | float | Standard deviation of the outcome noise. Default 1.0. | 1.0 |
| seed | int or None | Random seed for reproducibility. | None |
Notes
For standard-normal covariates, E[tau(X)] = true_ate for the 'linear' and 'nonlinear' heterogeneity types, but not for 'step':
- 'linear': `E[true_ate * (1 + 0.5*X0)] = true_ate * (1 + 0) = true_ate`
- 'nonlinear': `E[true_ate + X0 + sin(X1)*0.5] = true_ate + 0 + 0 = true_ate`
- 'step': `E[2*true_ate * 1{X0>0} + 0.5*true_ate * 1{X0<=0}] = 1.25 * true_ate` (the step DGP does NOT have a population ATE exactly equal to true_ate)
Use population_ate() to get the theoretical marginal ATE.
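The expectations in these Notes can be checked numerically. The sketch below is a Monte Carlo check under the assumption that the three CATE forms are exactly as written in the Notes; `tau_sketch` is a hypothetical helper, not part of the library.

```python
import numpy as np


def tau_sketch(x0, x1, true_ate, heterogeneity):
    """CATE forms as written in the Notes (illustrative, vectorised)."""
    if heterogeneity == "linear":
        return true_ate * (1.0 + 0.5 * x0)
    if heterogeneity == "nonlinear":
        return true_ate + x0 + 0.5 * np.sin(x1)
    if heterogeneity == "step":
        return np.where(x0 > 0, 2.0 * true_ate, 0.5 * true_ate)
    raise ValueError(f"unknown heterogeneity: {heterogeneity!r}")


rng = np.random.default_rng(0)
x0, x1 = rng.normal(size=(2, 200_000))  # standard-normal covariates
true_ate = 2.0
means = {
    kind: float(np.mean(tau_sketch(x0, x1, true_ate, kind)))
    for kind in ("linear", "nonlinear", "step")
}
```

With `true_ate = 2.0`, the sample means for 'linear' and 'nonlinear' should land close to 2.0, and the one for 'step' close to 2.5, matching the factor of 1.25.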
Examples:
>>> stream = HeterogeneousCausalStream(n=500, true_ate=2.0, seed=42)
>>> x, w, y, tau = next(iter(stream))
>>> isinstance(tau, float)
True
Source code in onlinecml/datasets/heterogeneous_causal.py
__iter__()
Iterate over the stream, yielding one observation at a time.
Yields:
| Name | Type | Description |
|---|---|---|
| x | dict | Feature dictionary with keys |
| treatment | int | Treatment indicator (0 or 1). |
| outcome | float | Observed outcome. |
| true_cate | float | The true individual CATE for this unit. |
Source code in onlinecml/datasets/heterogeneous_causal.py
__len__()
population_ate()
Return the theoretical marginal ATE, E[tau(X)].
For 'linear' and 'nonlinear' DGPs with standard-normal
covariates, this equals true_ate. For 'step', it equals
1.25 * true_ate.
Returns:
| Type | Description |
|---|---|
| float | Theoretical population average treatment effect. |
Source code in onlinecml/datasets/heterogeneous_causal.py
onlinecml.datasets.drifting_causal.DriftingCausalStream
Synthetic streaming dataset where the ATE shifts at a known changepoint.
Generates observations from a linear DGP with confounding. The true ATE
is true_ate for the first changepoint observations, then shifts
to shifted_ate for the remainder. The changepoint is announced via the
changepoint attribute so that downstream drift monitors can be evaluated
against the known ground truth.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| n | int | Total number of observations to generate. Default 2000. | 2000 |
| n_features | int | Number of covariates. Default 5. | 5 |
| true_ate | float | ATE before the changepoint. Default 2.0. | 2.0 |
| shifted_ate | float | ATE after the changepoint. Default -1.0. | -1.0 |
| changepoint | int or None | Index (0-based) at which the ATE shifts. Defaults to | None |
| confounding_strength | float | Strength of confounding. 0.0 = RCT, 1.0 = strong. Default 0.5. | 0.5 |
| noise_std | float | Standard deviation of outcome noise. Default 1.0. | 1.0 |
| seed | int or None | Random seed for reproducibility. | None |
Notes
The true CATE returned per observation reflects the current ATE segment:
true_ate before the changepoint and shifted_ate after.
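The segment rule can be written as a one-line function. This is a sketch assuming a 0-based step index and that the observation at index `changepoint` is the first one to receive `shifted_ate`, which is how "the first changepoint observations" reads; `segment_ate` is a hypothetical name.

```python
def segment_ate(step, changepoint, true_ate=2.0, shifted_ate=-1.0):
    """Return the ATE in force at a 0-based step (illustrative helper)."""
    return true_ate if step < changepoint else shifted_ate


# Hypothetical setup: 1000 steps with the shift at step 500.
taus = [segment_ate(i, changepoint=500) for i in range(1000)]
first_shifted = next(i for i, t in enumerate(taus) if t == -1.0)
```

A drift monitor can then be scored by comparing the step at which it raises an alarm against `first_shifted`.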
Examples:
>>> stream = DriftingCausalStream(n=1000, true_ate=2.0, shifted_ate=-1.0, seed=0)
>>> for x, w, y, tau in stream:
... pass # tau shifts from 2.0 to -1.0 at step 500
Source code in onlinecml/datasets/drifting_causal.py
__iter__()
Iterate over the stream, yielding one observation at a time.
Yields:
| Name | Type | Description |
|---|---|---|
| x | dict | Feature dictionary with keys |
| treatment | int | Treatment indicator (0 or 1). |
| outcome | float | Observed outcome. |
| true_cate | float | True CATE for this unit. Equals |
Source code in onlinecml/datasets/drifting_causal.py
onlinecml.datasets.unbalanced_causal.UnbalancedCausalStream
Synthetic streaming dataset with extreme treatment probabilities.
Generates observations where most units are assigned to one arm, creating near-positivity violations. Designed to stress-test overlap diagnostics and the stability of IPW-based estimators under extreme propensity scores.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| n | int | Number of observations to generate. Default 1000. | 1000 |
| n_features | int | Number of covariates. Default 5. | 5 |
| true_ate | float | True Average Treatment Effect. Default 2.0. | 2.0 |
| treatment_rate | float | Target marginal probability of treatment. Values close to 0 or 1 create the most severe positivity violations. Default 0.1. | 0.1 |
| confounding_strength | float | Controls how strongly X predicts treatment on top of the marginal imbalance. 0.0 = only marginal imbalance, 1.0 = strong additional confounding. Default 1.5. | 1.5 |
| noise_std | float | Standard deviation of outcome noise. Default 1.0. | 1.0 |
| seed | int or None | Random seed for reproducibility. | None |
Notes
The logit for treatment assignment is:
$$
\text{logit}(P(W=1 \mid X)) = \text{logit}(\text{treatment\_rate}) + \text{confounding\_strength} \cdot X\beta / \sqrt{p}
$$
This ensures that the marginal treatment rate is approximately
treatment_rate while adding covariate-driven confounding.
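To see how the intercept and the covariate term interact, the logit formula can be evaluated directly. The sketch below assumes `beta` is drawn from N(0, I_p) as in the linear stream; `propensity_sketch` is a hypothetical helper and not the library's code.

```python
import numpy as np


def propensity_sketch(X, beta, treatment_rate=0.1, confounding_strength=1.5):
    """P(W=1|X) from the logit formula in the Notes (illustrative)."""
    p = X.shape[1]
    base = np.log(treatment_rate / (1.0 - treatment_rate))  # logit(treatment_rate)
    logit = base + confounding_strength * (X @ beta) / np.sqrt(p)
    return 1.0 / (1.0 + np.exp(-logit))  # sigmoid


rng = np.random.default_rng(0)
beta = rng.normal(size=5)         # assumed N(0, I_p), as in the linear stream
X = rng.normal(size=(20_000, 5))  # standard-normal covariates
e = propensity_sketch(X, beta)
mean_rate = float(e.mean())
```

Because the sigmoid is nonlinear, `mean_rate` only approximates `treatment_rate`, and with strong confounding it tends to drift above 0.1. The individual propensities in `e` spread well below and above the target rate, which is the source of the near-positivity violations.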
Examples:
>>> stream = UnbalancedCausalStream(n=500, treatment_rate=0.1, seed=42)
>>> rates = [w for _, w, _, _ in stream]
>>> abs(sum(rates) / len(rates) - 0.1) < 0.1 # roughly 10% treated
True
Source code in onlinecml/datasets/unbalanced_causal.py
__iter__()
Iterate over the stream, yielding one observation at a time.
Yields:
| Name | Type | Description |
|---|---|---|
| x | dict | Feature dictionary with keys |
| treatment | int | Treatment indicator (0 or 1). |
| outcome | float | Observed outcome. |
| true_cate | float | True CATE for this unit (constant, equals |
Source code in onlinecml/datasets/unbalanced_causal.py
onlinecml.datasets.continuous_treatment.ContinuousTreatmentStream
Synthetic streaming dataset with a continuous treatment (dose-response).
Generates observations where the treatment W is a continuous random
variable (uniform or normal) rather than binary. The outcome follows a
dose-response model Y = X @ beta + g(W) + noise, where g is a
known dose-response function.
The fourth element yielded per observation is the marginal causal
effect dE[Y]/dW at the observed dose W:
- 'linear' → `g(W) = true_effect * W`; marginal = `true_effect`
- 'quadratic' → `g(W) = true_effect * W^2`; marginal = `2 * true_effect * W`
- 'threshold' → `g(W) = true_effect * (W > 0.0)`; marginal = `0` (non-differentiable, yields `g(W)`)
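The three dose-response forms and their derivatives can be sketched as follows. This is an illustration only: `dose_response_sketch` is a hypothetical helper, and for 'threshold' it returns a slope of 0 (the derivative wherever it exists), whereas the note above suggests the stream may yield `g(W)` for that case.

```python
def dose_response_sketch(w, true_effect=1.0, kind="linear"):
    """Return (g(w), dg/dw) for the three forms listed above (illustrative)."""
    if kind == "linear":
        return true_effect * w, true_effect
    if kind == "quadratic":
        return true_effect * w ** 2, 2.0 * true_effect * w
    if kind == "threshold":
        # Step at w = 0: the derivative is 0 wherever it exists.
        return true_effect * float(w > 0.0), 0.0
    raise ValueError(f"unknown dose_response: {kind!r}")


g_quad, slope_quad = dose_response_sketch(0.5, true_effect=2.0, kind="quadratic")
g_step, slope_step = dose_response_sketch(0.5, true_effect=2.0, kind="threshold")
```

For the quadratic form the marginal effect depends on the dose, so an estimator evaluated against this stream has to be compared at the observed `W`, not against a single constant.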
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| n | int | Number of observations. Default 1000. | 1000 |
| n_features | int | Number of covariates. Default 5. | 5 |
| true_effect | float | Scaling of the dose-response function. Default 1.0. | 1.0 |
| dose_response | str | One of | 'linear' |
| w_distribution | str | Treatment distribution: | 'uniform' |
| w_min | float | Lower bound for uniform treatment draw. Default -1.0. | -1.0 |
| w_max | float | Upper bound for uniform treatment draw. Default 1.0. | 1.0 |
| w_mean | float | Mean for normal treatment draw. Default 0.0. | 0.0 |
| w_std | float | Standard deviation for normal treatment draw. Default 1.0. | 1.0 |
| confounding_strength | float | How much | 0.3 |
| noise_std | float | Outcome noise standard deviation. Default 1.0. | 1.0 |
| seed | int or None | Random seed for reproducibility. | None |
Examples:
>>> stream = ContinuousTreatmentStream(n=200, dose_response='linear', seed=0)
>>> for x, w, y, marginal in stream:
... assert isinstance(w, float)
... assert isinstance(marginal, float)
Source code in onlinecml/datasets/continuous_treatment.py
__iter__()
Iterate over the stream, yielding one observation at a time.
Yields:
| Name | Type | Description |
|---|---|---|
| x | dict | Feature dictionary |
| w | float | Continuous treatment dose. |
| y | float | Observed outcome. |
| marginal_effect | float | True marginal causal effect |
Source code in onlinecml/datasets/continuous_treatment.py
Real-World Loaders
onlinecml.datasets.real_world
Real-world causal inference benchmark dataset loaders.
All loaders return River-compatible iterators that yield
(x_dict, treatment, outcome, true_cate) tuples, where true_cate
is None for datasets without known individual treatment effects.
Datasets are downloaded on first use and cached in ~/.onlinecml/data/.
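Since `true_cate` may be None, code that consumes these iterators should guard for it. The sketch below shows one way to do that for any iterable of 4-tuples; `mean_true_cate` is a hypothetical helper, and the tuples are made-up stand-ins for loader output rather than real data.

```python
def mean_true_cate(stream):
    """Average the known true_cate values; return None if none are known."""
    total, count = 0.0, 0
    for _x, _treatment, _outcome, true_cate in stream:
        if true_cate is None:  # e.g. loaders without ground-truth effects
            continue
        total += true_cate
        count += 1
    return total / count if count else None


# Made-up tuples standing in for loader output (not real data).
with_truth = [({"x1": 0.2}, 1, 5.1, 4.0), ({"x1": -1.0}, 0, 2.3, 2.0)]
without_truth = [({"age": 30.0}, 1, 9000.0, None)]
```

The same function can be passed the result of any loader on this page, since they all yield tuples of the same shape.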
load_ihdp(split=1, shuffle=False, seed=None)
Load the IHDP (Infant Health Development Program) dataset.
Semi-synthetic benchmark for CATE estimation. True individual treatment effects are available for this dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| split | int | Dataset split (1–10). Default 1. | 1 |
| shuffle | bool | If True, shuffle rows before iterating. | False |
| seed | int or None | Random seed for shuffling. | None |
Yields:
| Name | Type | Description |
|---|---|---|
| x | dict | 25 covariates (x1–x25). |
| treatment | int | 1 = received intensive intervention, 0 = control. |
| outcome | float | Observed outcome (cognitive test score). |
| true_cate | float | True individual treatment effect for this unit. |
References
Hill, J.L. (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1), 217-240.
Source code in onlinecml/datasets/real_world.py
load_lalonde(shuffle=False, seed=None)
Load the LaLonde (1986) National Supported Work dataset.
A classic benchmark for causal inference. The treatment is participation in a job training program (NSW). The outcome is real earnings in 1978.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| shuffle | bool | If True, shuffle the rows before iterating. Default False. | False |
| seed | int or None | Random seed for shuffling. | None |
Yields:
| Name | Type | Description |
|---|---|---|
| x | dict | Covariates: age, education, black, hispanic, married, nodegree, re74 (earnings 1974), re75 (earnings 1975). |
| treatment | int | 1 = participated in job training, 0 = control. |
| outcome | float | Real earnings in 1978. |
| true_cate | None | Individual CATE is not known for observational studies. |
Notes
The dataset is downloaded on first use and cached in
~/.onlinecml/data/.
References
LaLonde, R.J. (1986). Evaluating the econometric evaluations of training programs with experimental data. American Economic Review, 76(4), 604-620.
Source code in onlinecml/datasets/real_world.py
load_news(n=None, seed=None)
Load a synthetic high-dimensional approximation of the News dataset.
Because the original News dataset requires proprietary preprocessing, this loader generates a synthetic version with matching statistical properties: ~3000 features (sparse), binary treatment, continuous outcome, and heterogeneous treatment effects.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| n | int or None | Number of observations. Default 5000. | None |
| seed | int or None | Random seed. | None |
Yields:
| Name | Type | Description |
|---|---|---|
| x | dict | Sparse feature dictionary (100 non-zero features out of 3000). |
| treatment | int | Binary treatment indicator. |
| outcome | float | Continuous outcome. |
| true_cate | float | True individual CATE. |
Notes
This is a synthetic proxy. For the original dataset processing code, see Johansson et al. (2016).
Source code in onlinecml/datasets/real_world.py
load_twins(shuffle=False, seed=None)
Load the Twin births dataset.
Each observation is a pair of twins. The treatment is being the heavier twin (weight ≥ 2000g). The outcome is 1-year mortality. True individual treatment effects are not available.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| shuffle | bool | If True, shuffle rows before iterating. | False |
| seed | int or None | Random seed for shuffling. | None |
Yields:
| Name | Type | Description |
|---|---|---|
| x | dict | 30 covariates (birth characteristics, demographics). |
| treatment | int | 1 = heavier twin, 0 = lighter twin. |
| outcome | float | 1-year mortality (0 = alive, 1 = deceased). |
| true_cate | None | Not available for real-world twin data. |
References
Louizos, C. et al. (2017). Causal effect inference with deep latent-variable models. NeurIPS 2017.