Evaluation

onlinecml.evaluation.progressive.progressive_causal_score(stream, model, metrics, step=100)

Evaluate a causal model progressively over a streaming dataset.

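Implements the predict-before-learn protocol: for each observation, the model's prediction is scored before the model sees that observation's treatment and outcome, and only then is the model updated. Every prediction is therefore out-of-sample, so the recorded scores track online generalisation rather than in-sample fit.

Every step observations, each metric's current score is recorded as a checkpoint. Observations after the last full multiple of step still update the model and the metrics but do not produce a checkpoint.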

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| stream | iterable of (x, treatment, outcome, true_cate) | Any OnlineCML dataset or iterable yielding 4-tuples. | required |
| model | BaseOnlineEstimator | An unfitted (or partially fitted) causal estimator. Must implement learn_one(x, treatment, outcome) and predict_one(x). | required |
| metrics | list | List of metric objects (e.g., [ATEError(), PEHE()]). Each must implement update(x, w, y, true_cate, cate_hat, model) and a score property. | required |
| step | int | Record metric scores every step observations. | 100 |

Returns:

| Name | Type | Description |
| --- | --- | --- |
| results | dict | Dictionary with key "steps" (list of checkpoint indices) and one key per metric class name (list of scores at each checkpoint). |

Examples:

>>> from onlinecml.datasets import LinearCausalStream
>>> from onlinecml.reweighting import OnlineIPW
>>> from onlinecml.evaluation import progressive_causal_score
>>> from onlinecml.evaluation.metrics import ATEError, PEHE
>>>
>>> results = progressive_causal_score(
...     stream  = LinearCausalStream(n=500, seed=0),
...     model   = OnlineIPW(),
...     metrics = [ATEError(), PEHE()],
...     step    = 100,
... )
>>> len(results["steps"])
5
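
The metric interface described in the parameter list (an update(x, w, y, true_cate, cate_hat, model) method plus a score property) can be satisfied by a small user-defined class. The following is a minimal sketch; MeanAbsCATEError is a hypothetical name used for illustration and is not part of OnlineCML.

```python
class MeanAbsCATEError:
    """Hypothetical custom metric: running mean of |cate_hat - true_cate|."""

    def __init__(self) -> None:
        self._n = 0
        self._mean_abs = 0.0

    def update(self, x, w, y, true_cate, cate_hat, model) -> None:
        # Same six-argument signature that progressive_causal_score calls.
        self._n += 1
        self._mean_abs += (abs(cate_hat - true_cate) - self._mean_abs) / self._n

    @property
    def score(self) -> float:
        return self._mean_abs

    def reset(self) -> None:
        self._n = 0
        self._mean_abs = 0.0


metric = MeanAbsCATEError()
# (true_cate, cate_hat) pairs with absolute errors 1.0, 0.5 and 0.0
for true_cate, cate_hat in [(2.0, 1.0), (2.0, 2.5), (2.0, 2.0)]:
    metric.update({}, 1, 0.0, true_cate, cate_hat, model=None)
print(metric.score)  # 0.5
```

Because results are keyed by class name, a metric like this would appear under the key "MeanAbsCATEError" in the returned dictionary.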
Source code in onlinecml/evaluation/progressive.py
def progressive_causal_score(
    stream,  # noqa: ANN001
    model: "BaseOnlineEstimator",
    metrics: list,
    step: int = 100,
) -> dict[str, Any]:
    """Evaluate a causal model progressively over a streaming dataset.

    Implements the predict-before-learn protocol: for each observation, the
    model is scored **before** it sees the label, then updated. This gives an
    unbiased estimate of online generalisation performance.

    At every ``step`` observations, each metric's current ``score`` is recorded.

    Parameters
    ----------
    stream : iterable of (x, treatment, outcome, true_cate)
        Any OnlineCML dataset or iterable yielding 4-tuples.
    model : BaseOnlineEstimator
        An unfitted (or partially fitted) causal estimator. Must implement
        ``learn_one(x, treatment, outcome)`` and ``predict_one(x)``.
    metrics : list
        List of metric objects (e.g., ``[ATEError(), PEHE()]``). Each must
        implement ``update(x, w, y, true_cate, cate_hat, model)`` and a
        ``score`` property.
    step : int
        Record metric scores every ``step`` observations. Default 100.

    Returns
    -------
    results : dict
        Dictionary with key ``"steps"`` (list of checkpoint indices) and one
        key per metric class name (list of scores at each checkpoint).

    Examples
    --------
    >>> from onlinecml.datasets import LinearCausalStream
    >>> from onlinecml.reweighting import OnlineIPW
    >>> from onlinecml.evaluation import progressive_causal_score
    >>> from onlinecml.evaluation.metrics import ATEError, PEHE
    >>>
    >>> results = progressive_causal_score(
    ...     stream  = LinearCausalStream(n=500, seed=0),
    ...     model   = OnlineIPW(),
    ...     metrics = [ATEError(), PEHE()],
    ...     step    = 100,
    ... )
    >>> len(results["steps"])
    5
    """
    history: dict[str, list[float]] = {m.__class__.__name__: [] for m in metrics}
    steps_list: list[int] = []

    for i, (x, w, y, true_cate) in enumerate(stream):
        # Predict-before-learn
        cate_hat = model.predict_one(x)

        # Update each metric with the current prediction
        for m in metrics:
            m.update(x, w, y, true_cate, cate_hat, model)

        # Train the model
        model.learn_one(x, w, y)

        # Record checkpoint
        if (i + 1) % step == 0:
            steps_list.append(i + 1)
            for m in metrics:
                history[m.__class__.__name__].append(m.score)

    return {"steps": steps_list, **history}
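
To see the checkpoint behaviour without installing the package, the loop can be reproduced with stand-ins. In this sketch ConstantModel, CountMetric and run_progressive are hypothetical names; run_progressive is a local copy of the loop in the listing above, not the library function itself.

```python
class ConstantModel:
    """Hypothetical stand-in estimator that always predicts a CATE of 0.0."""

    def __init__(self) -> None:
        self.n_seen = 0

    def predict_one(self, x):
        return 0.0

    def learn_one(self, x, w, y):
        self.n_seen += 1


class CountMetric:
    """Hypothetical metric whose score is the number of updates received."""

    def __init__(self) -> None:
        self._n = 0

    def update(self, x, w, y, true_cate, cate_hat, model):
        self._n += 1

    @property
    def score(self):
        return float(self._n)


def run_progressive(stream, model, metrics, step=100):
    # Local copy of the predict-before-learn loop from the source listing.
    history = {m.__class__.__name__: [] for m in metrics}
    steps = []
    for i, (x, w, y, true_cate) in enumerate(stream):
        cate_hat = model.predict_one(x)  # predict first
        for m in metrics:
            m.update(x, w, y, true_cate, cate_hat, model)
        model.learn_one(x, w, y)  # then learn
        if (i + 1) % step == 0:
            steps.append(i + 1)
            for m in metrics:
                history[m.__class__.__name__].append(m.score)
    return {"steps": steps, **history}


# 250 synthetic 4-tuples: (x, treatment, outcome, true_cate)
stream = [({"f": i}, i % 2, float(i), 1.0) for i in range(250)]
model = ConstantModel()
results = run_progressive(stream, model, [CountMetric()], step=100)
print(results)  # {'steps': [100, 200], 'CountMetric': [100.0, 200.0]}
```

The last 50 observations still reach the model and the metric, but no checkpoint is recorded for them. Since the history dictionary is keyed by class name, two metrics of the same class (for example two instances with different settings) would append to the same list.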

Metrics

onlinecml.evaluation.metrics.ATEError

Running absolute error between the estimated and true ATE.

Accumulates the true CATE via a running mean (so it works for both constant-ATE and heterogeneous streams). On every update it stores the latest value of model.predict_ate(), and score returns |model.predict_ate() - mean(true_cate)| computed from those stored values.

Examples:

>>> m = ATEError()
>>> m.score
0.0
Source code in onlinecml/evaluation/metrics.py
class ATEError:
    """Running absolute error between the estimated and true ATE.

    Accumulates the true CATE via a running mean (so it works for both
    constant-ATE and heterogeneous streams) and computes
    ``|model.predict_ate() - mean(true_cate)|`` at each checkpoint.

    Examples
    --------
    >>> m = ATEError()
    >>> m.score
    0.0
    """

    def __init__(self) -> None:
        self._n: int = 0
        self._sum_true: float = 0.0
        self._last_ate_hat: float = 0.0

    def update(
        self,
        x: dict,
        w: int,
        y: float,
        true_cate: float,
        cate_hat: float,
        model,  # noqa: ANN001
    ) -> None:
        """Accumulate one observation.

        Parameters
        ----------
        x : dict
            Covariate dict (unused by this metric).
        w : int
            Treatment indicator (unused by this metric).
        y : float
            Observed outcome (unused by this metric).
        true_cate : float
            True CATE for this unit. Used to build a running mean of the
            population ATE.
        cate_hat : float
            Predicted CATE (unused by this metric; uses model.predict_ate()).
        model :
            The causal estimator. Must implement ``predict_ate() -> float``.
        """
        self._n += 1
        self._sum_true += true_cate
        self._last_ate_hat = model.predict_ate()

    @property
    def score(self) -> float:
        """Current |ATE_hat - ATE_true|.

        Returns ``0.0`` before any data has been seen.
        """
        if self._n == 0:
            return 0.0
        return abs(self._last_ate_hat - self._sum_true / self._n)

    def reset(self) -> None:
        """Reset all accumulated state."""
        self.__init__()  # type: ignore[misc]
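
The arithmetic behind ATEError can be traced by hand on a short stream. This sketch repeats the same accumulation steps outside the class; FixedATEModel is a hypothetical stub that only provides predict_ate().

```python
class FixedATEModel:
    """Hypothetical estimator stub exposing only predict_ate()."""

    def __init__(self, ate_hat: float) -> None:
        self.ate_hat = ate_hat

    def predict_ate(self) -> float:
        return self.ate_hat


model = FixedATEModel(ate_hat=1.5)
true_cates = [1.0, 2.0, 3.0, 2.0]  # heterogeneous effects with mean 2.0

n, sum_true, last_ate_hat = 0, 0.0, 0.0
for tc in true_cates:
    n += 1
    sum_true += tc                       # running sum of the true CATE
    last_ate_hat = model.predict_ate()   # latest ATE estimate from the model

ate_error = abs(last_ate_hat - sum_true / n)
print(ate_error)  # 0.5
```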

score property

Current |ATE_hat - ATE_true|.

Returns 0.0 before any data has been seen.

reset()

Reset all accumulated state.

update(x, w, y, true_cate, cate_hat, model)

Accumulate one observation.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| x | dict | Covariate dict (unused by this metric). | required |
| w | int | Treatment indicator (unused by this metric). | required |
| y | float | Observed outcome (unused by this metric). | required |
| true_cate | float | True CATE for this unit. Used to build a running mean of the population ATE. | required |
| cate_hat | float | Predicted CATE (unused by this metric; uses model.predict_ate()). | required |
| model | | The causal estimator. Must implement predict_ate() -> float. | required |

onlinecml.evaluation.metrics.PEHE

Running Precision in Estimation of Heterogeneous Effects.

Computes sqrt(mean((cate_hat - cate_true)^2)) incrementally with a running-mean update (the same recurrence Welford's algorithm uses for the mean). The cate_hat passed by progressive_causal_score is the predict-before-learn CATE prediction from model.predict_one(x).

References

Hill, J. (2011). Bayesian nonparametric modeling for causal inference. Journal of Computational and Graphical Statistics, 20(1), 217-240.

Examples:

>>> m = PEHE()
>>> m.score
0.0
Source code in onlinecml/evaluation/metrics.py
class PEHE:
    """Running Precision in Estimation of Heterogeneous Effects.

    Computes ``sqrt(mean((cate_hat - cate_true)^2))`` incrementally with a
    running-mean update (the same recurrence Welford's algorithm uses for the
    mean). The ``cate_hat`` passed by ``progressive_causal_score`` is the
    predict-before-learn CATE prediction from ``model.predict_one(x)``.

    References
    ----------
    Hill, J. (2011). Bayesian nonparametric modeling for causal inference.
    Journal of Computational and Graphical Statistics, 20(1), 217-240.

    Examples
    --------
    >>> m = PEHE()
    >>> m.score
    0.0
    """

    def __init__(self) -> None:
        self._n: int = 0
        self._mean_sq: float = 0.0

    def update(
        self,
        x: dict,
        w: int,
        y: float,
        true_cate: float,
        cate_hat: float,
        model,  # noqa: ANN001
    ) -> None:
        """Accumulate one observation.

        Parameters
        ----------
        x : dict
            Covariate dict (unused by this metric).
        w : int
            Treatment indicator (unused by this metric).
        y : float
            Observed outcome (unused by this metric).
        true_cate : float
            True CATE for this unit.
        cate_hat : float
            Predicted CATE from ``model.predict_one(x)`` before learning.
        model :
            The causal estimator (unused by this metric).
        """
        self._n += 1
        err_sq = (cate_hat - true_cate) ** 2
        self._mean_sq += (err_sq - self._mean_sq) / self._n

    @property
    def score(self) -> float:
        """Current PEHE (sqrt of running mean squared CATE error).

        Returns ``0.0`` before any data has been seen.
        """
        if self._n == 0:
            return 0.0
        return math.sqrt(self._mean_sq)

    def reset(self) -> None:
        """Reset all accumulated state."""
        self.__init__()  # type: ignore[misc]
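
The incremental update used by PEHE gives the same result as computing the root mean squared CATE error over the whole batch. A small check under made-up numbers (the two lists below are illustrative, not library data):

```python
import math

true_cate = [1.0, 2.0, 0.5, 3.0]
cate_hat = [1.5, 1.0, 0.5, 2.0]

# Incremental form: mean_sq_n = mean_sq_(n-1) + (err_sq - mean_sq_(n-1)) / n
n, mean_sq = 0, 0.0
for t, p in zip(true_cate, cate_hat):
    n += 1
    err_sq = (p - t) ** 2
    mean_sq += (err_sq - mean_sq) / n

pehe_running = math.sqrt(mean_sq)

# Batch form: square root of the plain average of squared errors
pehe_batch = math.sqrt(
    sum((p - t) ** 2 for t, p in zip(true_cate, cate_hat)) / len(true_cate)
)

print(pehe_running, pehe_batch)  # both are approximately 0.75
```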

score property

Current PEHE (sqrt of running mean squared CATE error).

Returns 0.0 before any data has been seen.

reset()

Reset all accumulated state.

update(x, w, y, true_cate, cate_hat, model)

Accumulate one observation.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| x | dict | Covariate dict (unused by this metric). | required |
| w | int | Treatment indicator (unused by this metric). | required |
| y | float | Observed outcome (unused by this metric). | required |
| true_cate | float | True CATE for this unit. | required |
| cate_hat | float | Predicted CATE from model.predict_one(x) before learning. | required |
| model | | The causal estimator (unused by this metric). | required |

onlinecml.evaluation.metrics.UpliftAUC

Area under the uplift curve (AUUC).

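Accumulates (cate_hat, treatment, outcome) triples. At each call to score, it sorts the buffer by predicted CATE descending, computes the cumulative uplift at every depth, averages those values over depth (a rectangle-rule approximation of the area), and divides by the observed outcome range. Negative results are clipped to 0.0. The result is not strictly bounded above by 1, because an arm that has no units yet at a given depth contributes a mean of 0.0.

The uplift at depth k is:

uplift(k) = mean_outcome_treated(top_k) - mean_outcome_control(top_k)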

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| max_buffer | int | Maximum number of recent observations to retain. Older observations are dropped to keep memory bounded. | 5000 |

Examples:

>>> m = UpliftAUC()
>>> m.score
0.0
Source code in onlinecml/evaluation/metrics.py
class UpliftAUC:
    """Area under the uplift curve (AUUC).

    Accumulates ``(cate_hat, treatment, outcome)`` triples. At each call to
    ``score``, it sorts the buffer by predicted CATE descending, computes the
    cumulative uplift at every depth, averages those values over depth (a
    rectangle-rule approximation of the area), and divides by the observed
    outcome range. Negative results are clipped to ``0.0``.

    The uplift at depth ``k`` is::

        uplift(k) = mean_outcome_treated(top_k) - mean_outcome_control(top_k)

    Parameters
    ----------
    max_buffer : int
        Maximum number of recent observations to retain. Older observations
        are dropped to keep memory bounded. Default 5000.

    Examples
    --------
    >>> m = UpliftAUC()
    >>> m.score
    0.0
    """

    def __init__(self, max_buffer: int = 5000) -> None:
        self.max_buffer = max_buffer
        self._buffer: list[tuple[float, int, float]] = []

    def update(
        self,
        x: dict,
        w: int,
        y: float,
        true_cate: float,
        cate_hat: float,
        model,  # noqa: ANN001
    ) -> None:
        """Accumulate one observation.

        Parameters
        ----------
        x : dict
            Covariate dict (unused by this metric).
        w : int
            Treatment indicator (0 or 1).
        y : float
            Observed outcome.
        true_cate : float
            True CATE (unused by this metric).
        cate_hat : float
            Predicted CATE used to rank units.
        model :
            The causal estimator (unused by this metric).
        """
        self._buffer.append((cate_hat, w, y))
        if len(self._buffer) > self.max_buffer:
            self._buffer.pop(0)

    @property
    def score(self) -> float:
        """Current AUUC, scaled by the observed outcome range.

        Returns ``0.0`` when fewer than two observations have been seen, when
        all buffered outcomes are identical, or when the scaled area is
        negative.
        """
        if len(self._buffer) < 2:
            return 0.0
        sorted_buf = sorted(self._buffer, key=lambda t: t[0], reverse=True)
        n = len(sorted_buf)

        cum_t_sum, cum_c_sum = 0.0, 0.0
        cum_t_n,   cum_c_n   = 0,   0
        uplift_vals = []

        for _, w, y in sorted_buf:
            if w == 1:
                cum_t_sum += y
                cum_t_n   += 1
            else:
                cum_c_sum += y
                cum_c_n   += 1
            mean_t = cum_t_sum / cum_t_n if cum_t_n > 0 else 0.0
            mean_c = cum_c_sum / cum_c_n if cum_c_n > 0 else 0.0
            uplift_vals.append(mean_t - mean_c)

        if not uplift_vals:
            return 0.0
        # Mean uplift over all depths (rectangle-rule area over depth fractions)
        auc = sum(uplift_vals) / n
        # Normalize by the range of outcomes so score is on a comparable scale
        all_y = [y for _, _, y in sorted_buf]
        y_range = max(all_y) - min(all_y)
        if y_range == 0.0:
            return 0.0
        return max(0.0, auc / y_range)

    def reset(self) -> None:
        """Reset all accumulated state."""
        self._buffer.clear()
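
The score computation can be traced on a tiny buffer. The helper below, uplift_score, is a hypothetical standalone function that mirrors the logic of the score property in the listing above; the buffers are made-up (cate_hat, treatment, outcome) triples.

```python
def uplift_score(buffer):
    """Hypothetical helper mirroring the UpliftAUC.score logic shown above."""
    if len(buffer) < 2:
        return 0.0
    ranked = sorted(buffer, key=lambda t: t[0], reverse=True)
    t_sum = c_sum = 0.0
    t_n = c_n = 0
    uplift_vals = []
    for _, w, y in ranked:
        if w == 1:
            t_sum += y
            t_n += 1
        else:
            c_sum += y
            c_n += 1
        mean_t = t_sum / t_n if t_n else 0.0  # empty arm counts as mean 0.0
        mean_c = c_sum / c_n if c_n else 0.0
        uplift_vals.append(mean_t - mean_c)
    ys = [y for _, _, y in ranked]
    y_range = max(ys) - min(ys)
    if y_range == 0.0:
        return 0.0
    return max(0.0, sum(uplift_vals) / len(ranked) / y_range)


mixed = [(0.9, 1, 1.0), (0.8, 0, 0.0), (0.2, 1, 0.0), (0.1, 0, 0.0)]
single_arm = [(0.9, 1, 1.0), (0.1, 1, 0.0)]
shifted = [(0.9, 1, 5.0), (0.1, 1, 6.0)]

print(uplift_score(mixed))       # 0.75
print(uplift_score(single_arm))  # 0.75
print(uplift_score(shifted))     # 5.25
```

The second and third buffers contain treated units only. With this logic they do not score 0.0, and the third exceeds 1, because the empty control arm is treated as having mean outcome 0.0.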

score property

Current AUUC, scaled by the observed outcome range.

Returns 0.0 when fewer than two observations have been seen, when all buffered outcomes are identical, or when the scaled area is negative.

reset()

Reset all accumulated state.

update(x, w, y, true_cate, cate_hat, model)

Accumulate one observation.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| x | dict | Covariate dict (unused by this metric). | required |
| w | int | Treatment indicator (0 or 1). | required |
| y | float | Observed outcome. | required |
| true_cate | float | True CATE (unused by this metric). | required |
| cate_hat | float | Predicted CATE used to rank units. | required |
| model | | The causal estimator (unused by this metric). | required |

onlinecml.evaluation.metrics.QiniCoefficient

Qini coefficient (area under the Qini curve).

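The Qini curve plots cumulative incremental gains vs. cumulative population fraction when units are ranked by predicted CATE descending. The Qini coefficient is the area under this curve minus the area under the random policy line, normalized by the maximum achievable Qini. In the source shown below, the curve value at each depth is the fraction of treated units ranked so far minus the fraction of control units ranked so far; the buffered outcomes do not enter the computation. The area is divided by 0.5, the largest area this curve can reach, so the score lies in [-1, 1].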

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| max_buffer | int | Maximum number of recent observations to retain. | 5000 |

References

Radcliffe, N.J. (2007). Using control groups to target on predicted lift. Direct Marketing Analytics Journal, 14-21.

Examples:

>>> m = QiniCoefficient()
>>> m.score
0.0
Source code in onlinecml/evaluation/metrics.py
class QiniCoefficient:
    """Qini coefficient (area under the Qini curve).

    The Qini curve plots cumulative incremental gains vs. cumulative population
    fraction when units are ranked by predicted CATE descending. The Qini
    coefficient is the area under this curve minus the area under the random
    policy line, normalized by the maximum achievable Qini.

    Parameters
    ----------
    max_buffer : int
        Maximum number of recent observations to retain. Default 5000.

    References
    ----------
    Radcliffe, N.J. (2007). Using control groups to target on predicted lift.
    Direct Marketing Analytics Journal, 14-21.

    Examples
    --------
    >>> m = QiniCoefficient()
    >>> m.score
    0.0
    """

    def __init__(self, max_buffer: int = 5000) -> None:
        self.max_buffer = max_buffer
        self._buffer: list[tuple[float, int, float]] = []

    def update(
        self,
        x: dict,
        w: int,
        y: float,
        true_cate: float,
        cate_hat: float,
        model,  # noqa: ANN001
    ) -> None:
        """Accumulate one observation.

        Parameters
        ----------
        x : dict
            Covariate dict (unused by this metric).
        w : int
            Treatment indicator (0 or 1).
        y : float
            Observed outcome.
        true_cate : float
            True CATE (unused by this metric).
        cate_hat : float
            Predicted CATE used to rank units.
        model :
            The causal estimator (unused by this metric).
        """
        self._buffer.append((cate_hat, w, y))
        if len(self._buffer) > self.max_buffer:
            self._buffer.pop(0)

    @property
    def score(self) -> float:
        """Current normalized Qini coefficient.

        Returns ``0.0`` when fewer than two observations have been seen or
        when either arm is empty.
        """
        if len(self._buffer) < 2:
            return 0.0
        sorted_buf = sorted(self._buffer, key=lambda t: t[0], reverse=True)
        n = len(sorted_buf)

        n_t_total = sum(1 for _, w, _ in sorted_buf if w == 1)
        n_c_total = n - n_t_total
        if n_t_total == 0 or n_c_total == 0:
            return 0.0

        cum_t, cum_c = 0, 0
        qini_vals = [0.0]  # starts at 0

        for _, w, _ in sorted_buf:
            if w == 1:
                cum_t += 1
            else:
                cum_c += 1
            # Curve value at this depth: fraction of treated units seen so far
            # minus fraction of control units seen so far (outcomes are not used)
            qini_vals.append(cum_t / n_t_total - cum_c / n_c_total)

        # Area under Qini curve (trapezoidal)
        depths = [i / n for i in range(n + 1)]
        auc = sum(
            0.5 * (qini_vals[i] + qini_vals[i + 1]) * (depths[i + 1] - depths[i])
            for i in range(n)
        )
        # Normalize: maximum area is 0.5 (perfect model)
        return auc / 0.5

    def reset(self) -> None:
        """Reset all accumulated state."""
        self._buffer.clear()
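
The following sketch traces the score logic from the listing above on a four-unit buffer. qini_score is a hypothetical standalone helper that mirrors that logic; it is not a library function, and the buffers are made-up (cate_hat, treatment, outcome) triples.

```python
def qini_score(buffer):
    """Hypothetical helper mirroring the QiniCoefficient.score logic above."""
    if len(buffer) < 2:
        return 0.0
    ranked = sorted(buffer, key=lambda t: t[0], reverse=True)
    n = len(ranked)
    n_t = sum(1 for _, w, _ in ranked if w == 1)
    n_c = n - n_t
    if n_t == 0 or n_c == 0:
        return 0.0
    cum_t = cum_c = 0
    curve = [0.0]
    for _, w, _ in ranked:
        if w == 1:
            cum_t += 1
        else:
            cum_c += 1
        curve.append(cum_t / n_t - cum_c / n_c)
    # Trapezoidal rule with equal depth increments of 1/n
    area = sum(0.5 * (curve[i] + curve[i + 1]) / n for i in range(n))
    return area / 0.5


treated_first = [(0.9, 1, 1.0), (0.8, 1, 1.0), (0.2, 0, 0.0), (0.1, 0, 0.0)]
controls_first = [(0.9, 0, 0.0), (0.8, 0, 0.0), (0.2, 1, 1.0), (0.1, 1, 1.0)]
# Same ranking as treated_first, but with different outcomes
other_outcomes = [(0.9, 1, 0.0), (0.8, 1, 0.0), (0.2, 0, 7.0), (0.1, 0, 7.0)]

print(qini_score(treated_first))   # 1.0
print(qini_score(controls_first))  # -1.0
print(qini_score(other_outcomes))  # 1.0
```

The first and third buffers score the same even though their outcomes differ, which shows that with this logic only the ordering of treated and control units affects the result.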

score property

Current normalized Qini coefficient.

Returns 0.0 when fewer than two observations have been seen or when either arm is empty.

reset()

Reset all accumulated state.

update(x, w, y, true_cate, cate_hat, model)

Accumulate one observation.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| x | dict | Covariate dict (unused by this metric). | required |
| w | int | Treatment indicator (0 or 1). | required |
| y | float | Observed outcome. | required |
| true_cate | float | True CATE (unused by this metric). | required |
| cate_hat | float | Predicted CATE used to rank units. | required |
| model | | The causal estimator (unused by this metric). | required |

onlinecml.evaluation.metrics.CIWidth

Running mean width of the confidence interval on the ATE.

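Calls model.predict_ci(alpha) once per observation and keeps a running mean of the interval widths (upper - lower). The score is therefore the average width over the stream so far, not the width of the most recent interval.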

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| alpha | float | Significance level for the CI (0.05 gives a 95% CI). | 0.05 |

Examples:

>>> m = CIWidth()
>>> m.score
0.0
Source code in onlinecml/evaluation/metrics.py
class CIWidth:
    """Running mean width of the confidence interval on the ATE.

    Computes the mean of ``(upper - lower)`` for each CI returned by
    ``model.predict_ci(alpha)`` at each observation.

    Parameters
    ----------
    alpha : float
        Significance level for the CI. Default 0.05 (95% CI).

    Examples
    --------
    >>> m = CIWidth()
    >>> m.score
    0.0
    """

    def __init__(self, alpha: float = 0.05) -> None:
        self.alpha = alpha
        self._n: int = 0
        self._mean_width: float = 0.0

    def update(
        self,
        x: dict,
        w: int,
        y: float,
        true_cate: float,
        cate_hat: float,
        model,  # noqa: ANN001
    ) -> None:
        """Accumulate one observation.

        Parameters
        ----------
        x : dict
            Covariate dict (unused by this metric).
        w : int
            Treatment indicator (unused by this metric).
        y : float
            Observed outcome (unused by this metric).
        true_cate : float
            True CATE (unused by this metric).
        cate_hat : float
            Predicted CATE (unused by this metric).
        model :
            The causal estimator. Must implement ``predict_ci(alpha) -> tuple``.
        """
        lo, hi = model.predict_ci(alpha=self.alpha)
        width = hi - lo
        self._n += 1
        self._mean_width += (width - self._mean_width) / self._n

    @property
    def score(self) -> float:
        """Current mean CI width.

        Returns ``0.0`` before any data has been seen.
        """
        return self._mean_width

    def reset(self) -> None:
        """Reset all accumulated state."""
        self.__init__(alpha=self.alpha)  # type: ignore[misc]
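
Because the score is a mean over every interval seen so far, it lags behind the current interval width when the model's CI narrows over time. The sketch below illustrates this with ShrinkingCIModel, a hypothetical stub whose normal-approximation interval shrinks as 1/sqrt(n); it is an assumption for illustration and does not reflect how any OnlineCML estimator builds its CI.

```python
import math
from statistics import NormalDist


class ShrinkingCIModel:
    """Hypothetical stub: normal-approximation CI around a fixed ATE."""

    def __init__(self, ate: float = 2.0, sd: float = 1.0) -> None:
        self.ate, self.sd, self.n = ate, sd, 0

    def learn_one(self, x, w, y) -> None:
        self.n += 1

    def predict_ci(self, alpha: float = 0.05):
        z = NormalDist().inv_cdf(1 - alpha / 2)
        # max(n, 1) avoids a division by zero before any data is seen
        half = z * self.sd / math.sqrt(max(self.n, 1))
        return self.ate - half, self.ate + half


model = ShrinkingCIModel()
n, mean_width = 0, 0.0
for _ in range(100):
    lo, hi = model.predict_ci(alpha=0.05)  # metric update precedes learning
    n += 1
    mean_width += ((hi - lo) - mean_width) / n
    model.learn_one({}, 1, 0.0)

lo, hi = model.predict_ci(alpha=0.05)
latest_width = hi - lo
print(mean_width > latest_width)  # True
```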

score property

Current mean CI width.

Returns 0.0 before any data has been seen.

reset()

Reset all accumulated state.

update(x, w, y, true_cate, cate_hat, model)

Accumulate one observation.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| x | dict | Covariate dict (unused by this metric). | required |
| w | int | Treatment indicator (unused by this metric). | required |
| y | float | Observed outcome (unused by this metric). | required |
| true_cate | float | True CATE (unused by this metric). | required |
| cate_hat | float | Predicted CATE (unused by this metric). | required |
| model | | The causal estimator. Must implement predict_ci(alpha) -> tuple. | required |

onlinecml.evaluation.metrics.CIcoverage

Empirical coverage of the ATE confidence interval.

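At each observation, checks whether the true population ATE so far (the running mean of true_cate) falls within model.predict_ci(alpha). Returns the running fraction of observations for which the CI covered the true ATE. The check runs on every update call, not only at the step checkpoints used by progressive_causal_score.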

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| alpha | float | Significance level (0.05 gives a 95% CI; target coverage = 0.95). | 0.05 |

Examples:

>>> m = CIcoverage()
>>> m.score
0.0
Source code in onlinecml/evaluation/metrics.py
class CIcoverage:
    """Empirical coverage of the ATE confidence interval.

    At each observation, checks whether the true population ATE so far (the
    running mean of ``true_cate``) falls within ``model.predict_ci(alpha)``.
    Returns the running fraction of observations for which the CI covered the
    true ATE.

    Parameters
    ----------
    alpha : float
        Significance level. Default 0.05 (95% CI; target coverage = 0.95).

    Examples
    --------
    >>> m = CIcoverage()
    >>> m.score
    0.0
    """

    def __init__(self, alpha: float = 0.05) -> None:
        self.alpha = alpha
        self._n_obs: int = 0
        self._sum_true: float = 0.0
        self._n_checks: int = 0
        self._n_covered: int = 0

    def update(
        self,
        x: dict,
        w: int,
        y: float,
        true_cate: float,
        cate_hat: float,
        model,  # noqa: ANN001
    ) -> None:
        """Accumulate one observation and check CI coverage.

        Parameters
        ----------
        x : dict
            Covariate dict (unused by this metric).
        w : int
            Treatment indicator (unused by this metric).
        y : float
            Observed outcome (unused by this metric).
        true_cate : float
            True CATE. Used to estimate the population ATE via running mean.
        cate_hat : float
            Predicted CATE (unused by this metric).
        model :
            The causal estimator. Must implement ``predict_ci(alpha) -> tuple``.
        """
        self._n_obs += 1
        self._sum_true += true_cate
        true_ate = self._sum_true / self._n_obs
        lo, hi = model.predict_ci(alpha=self.alpha)
        self._n_checks += 1
        if lo <= true_ate <= hi:
            self._n_covered += 1

    @property
    def score(self) -> float:
        """Fraction of observations for which the CI covered the true ATE.

        Returns ``0.0`` before any data has been seen.
        """
        if self._n_checks == 0:
            return 0.0
        return self._n_covered / self._n_checks

    def reset(self) -> None:
        """Reset all accumulated state."""
        self.__init__(alpha=self.alpha)  # type: ignore[misc]
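
The coverage bookkeeping can be followed by hand on four observations. This sketch repeats the update steps outside the class; FixedCIModel is a hypothetical stub that always reports the same interval, and the true_cates values are made up.

```python
class FixedCIModel:
    """Hypothetical stub that always reports the interval [1.0, 3.0]."""

    def predict_ci(self, alpha: float = 0.05):
        return (1.0, 3.0)


model = FixedCIModel()
true_cates = [4.0, 2.0, 0.0, 2.0]

n_obs, sum_true, n_covered = 0, 0.0, 0
for tc in true_cates:
    n_obs += 1
    sum_true += tc
    true_ate = sum_true / n_obs  # running means: 4.0, 3.0, 2.0, 2.0
    lo, hi = model.predict_ci(alpha=0.05)
    if lo <= true_ate <= hi:     # both bounds are inclusive
        n_covered += 1

coverage = n_covered / n_obs
print(coverage)  # 0.75
```

Only the first check misses, because the running mean of the true CATE starts at 4.0 before settling inside the interval. Early checks compare the CI against a noisy running mean, which is worth keeping in mind when reading coverage on short streams.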

score property

Fraction of observations for which the CI covered the true ATE.

Returns 0.0 before any data has been seen.

reset()

Reset all accumulated state.

update(x, w, y, true_cate, cate_hat, model)

Accumulate one observation and check CI coverage.

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| x | dict | Covariate dict (unused by this metric). | required |
| w | int | Treatment indicator (unused by this metric). | required |
| y | float | Observed outcome (unused by this metric). | required |
| true_cate | float | True CATE. Used to estimate the population ATE via running mean. | required |
| cate_hat | float | Predicted CATE (unused by this metric). | required |
| model | | The causal estimator. Must implement predict_ci(alpha) -> tuple. | required |