Skip to content

Exploration Policies

onlinecml.policy.epsilon_greedy.EpsilonGreedy

Bases: BasePolicy

Epsilon-greedy treatment policy with exponential epsilon decay.

Randomly explores (assigns random treatment) with probability epsilon, and exploits (assigns the treatment with the highest estimated effect) with probability 1 - epsilon. Epsilon decays exponentially from eps_start toward eps_end over time.

Parameters:

Name Type Description Default
eps_start float

Initial exploration rate. Default 0.5.

0.5
eps_end float

Minimum exploration rate (asymptote). Default 0.05.

0.05
decay int

Decay timescale in steps. Larger values = slower decay. Default 2000.

2000
seed int or None

Random seed for reproducibility. Uses standard library random to avoid numpy serialization issues.

None
Notes

Epsilon at step t is:

.. math::

\epsilon_t = \epsilon_{\text{end}} +
(\epsilon_{\text{start}} - \epsilon_{\text{end}}) \cdot
e^{-t / \text{decay}}

Explore: With probability eps, a random treatment is chosen with propensity 0.5.

Exploit: With probability 1 - eps, the treatment with the higher estimated CATE is chosen. The policy reports a propensity of 1 - eps for this arm; note that this is the probability of exploiting, while the arm's full marginal selection probability is 1 - eps/2, since an exploration draw can also land on the greedy arm.

Examples:

>>> policy = EpsilonGreedy(eps_start=0.5, eps_end=0.05, decay=100, seed=0)
>>> treatment, propensity = policy.choose(cate_score=1.5, step=0)
>>> treatment in (0, 1)
True
>>> 0.0 < propensity <= 1.0
True
Source code in onlinecml/policy/epsilon_greedy.py
class EpsilonGreedy(BasePolicy):
    """Epsilon-greedy treatment policy with exponential epsilon decay.

    Randomly explores (assigns random treatment) with probability epsilon,
    and exploits (assigns the treatment with the highest estimated effect)
    with probability 1 - epsilon. Epsilon decays exponentially from
    ``eps_start`` toward ``eps_end`` over time.

    Parameters
    ----------
    eps_start : float
        Initial exploration rate. Default 0.5.
    eps_end : float
        Minimum exploration rate (asymptote). Default 0.05.
    decay : int
        Decay timescale in steps. Larger values = slower decay. Must be
        non-zero (used as a divisor). Default 2000.
    seed : int or None
        Random seed for reproducibility. Uses standard library ``random``
        to avoid numpy serialization issues.

    Notes
    -----
    Epsilon at step ``t`` is:

    .. math::

        \\epsilon_t = \\epsilon_{\\text{end}} +
        (\\epsilon_{\\text{start}} - \\epsilon_{\\text{end}}) \\cdot
        e^{-t / \\text{decay}}

    **Explore:** With probability ``eps``, a random treatment is chosen
    with propensity 0.5.

    **Exploit:** With probability ``1 - eps``, the treatment with the
    higher estimated CATE is chosen. The reported propensity ``1 - eps``
    is the probability of exploiting; the greedy arm's full marginal
    selection probability is ``1 - eps / 2`` because the exploration
    draw can also land on it.

    Examples
    --------
    >>> policy = EpsilonGreedy(eps_start=0.5, eps_end=0.05, decay=100, seed=0)
    >>> treatment, propensity = policy.choose(cate_score=1.5, step=0)
    >>> treatment in (0, 1)
    True
    >>> 0.0 < propensity <= 1.0
    True
    """

    def __init__(
        self,
        eps_start: float = 0.5,
        eps_end: float = 0.05,
        decay: int = 2000,
        seed: int | None = None,
    ) -> None:
        self.eps_start = eps_start
        self.eps_end = eps_end
        self.decay = decay
        self.seed = seed
        # Standard library random — not numpy (avoids serialization issues)
        self._rng = random.Random(seed)
        # Internal arm reward tracking (used when cate_score is not provided)
        self._arm_sums: list[float] = [0.0, 0.0]
        self._arm_counts: list[int] = [0, 0]
        self._last_arm: int | None = None

    def choose(self, cate_score: float, step: int) -> tuple[int, float]:
        """Choose a treatment assignment.

        Parameters
        ----------
        cate_score : float
            Current CATE estimate. Positive = treatment beneficial.
        step : int
            Current time step (used for epsilon decay).

        Returns
        -------
        treatment : int
            Chosen treatment (0 or 1).
        propensity : float
            Probability of the chosen treatment under this policy.
        """
        # Single source of truth for the decay schedule: delegate to
        # current_epsilon() instead of duplicating the formula here.
        eps = self.current_epsilon(step)

        if self._rng.random() < eps:
            # Explore: random treatment, uniform propensity
            treatment = self._rng.randint(0, 1)
            self._last_arm = treatment
            return treatment, 0.5

        # Exploit: use cate_score if non-zero (causal model available),
        # else fall back to internal arm reward estimates.
        if cate_score == 0.0:
            mu1 = self._arm_sums[1] / self._arm_counts[1] if self._arm_counts[1] > 0 else 0.5
            mu0 = self._arm_sums[0] / self._arm_counts[0] if self._arm_counts[0] > 0 else 0.5
            cate_score = mu1 - mu0
        treatment = 1 if cate_score > 0 else 0
        self._last_arm = treatment
        propensity = 1.0 - eps
        return treatment, propensity

    def update(self, reward: float) -> None:
        """Update internal arm reward estimate after observing a reward.

        Parameters
        ----------
        reward : float
            Observed reward for the arm chosen in the most recent
            ``choose`` call. No-op if ``choose`` has not been called yet.
        """
        if self._last_arm is not None:
            self._arm_sums[self._last_arm] += reward
            self._arm_counts[self._last_arm] += 1

    def current_epsilon(self, step: int) -> float:
        """Return the current epsilon value at a given step.

        Parameters
        ----------
        step : int
            Current time step.

        Returns
        -------
        float
            Epsilon at this step.
        """
        return self.eps_end + (self.eps_start - self.eps_end) * math.exp(-step / self.decay)

choose(cate_score, step)

Choose a treatment assignment.

Parameters:

Name Type Description Default
cate_score float

Current CATE estimate. Positive = treatment beneficial.

required
step int

Current time step (used for epsilon decay).

required

Returns:

Name Type Description
treatment int

Chosen treatment (0 or 1).

propensity float

Probability of the chosen treatment under this policy.

Source code in onlinecml/policy/epsilon_greedy.py
def choose(self, cate_score: float, step: int) -> tuple[int, float]:
    """Choose a treatment assignment.

    Parameters
    ----------
    cate_score : float
        Current CATE estimate. Positive = treatment beneficial.
    step : int
        Current time step (used for epsilon decay).

    Returns
    -------
    treatment : int
        Chosen treatment (0 or 1).
    propensity : float
        Probability of the chosen treatment under this policy.
    """
    # Single source of truth for the decay schedule: delegate to
    # current_epsilon() instead of duplicating the formula here.
    eps = self.current_epsilon(step)

    if self._rng.random() < eps:
        # Explore: random treatment, uniform propensity
        treatment = self._rng.randint(0, 1)
        self._last_arm = treatment
        return treatment, 0.5

    # Exploit: use cate_score if non-zero (causal model available),
    # else fall back to internal arm reward estimates.
    if cate_score == 0.0:
        mu1 = self._arm_sums[1] / self._arm_counts[1] if self._arm_counts[1] > 0 else 0.5
        mu0 = self._arm_sums[0] / self._arm_counts[0] if self._arm_counts[0] > 0 else 0.5
        cate_score = mu1 - mu0
    treatment = 1 if cate_score > 0 else 0
    self._last_arm = treatment
    propensity = 1.0 - eps
    return treatment, propensity

current_epsilon(step)

Return the current epsilon value at a given step.

Parameters:

Name Type Description Default
step int

Current time step.

required

Returns:

Type Description
float

Epsilon at this step.

Source code in onlinecml/policy/epsilon_greedy.py
def current_epsilon(self, step: int) -> float:
    """Compute the exploration rate at a given time step.

    Epsilon decays exponentially from ``eps_start`` toward the
    asymptote ``eps_end`` on the timescale ``decay``.

    Parameters
    ----------
    step : int
        Current time step.

    Returns
    -------
    float
        Epsilon at this step.
    """
    span = self.eps_start - self.eps_end
    return self.eps_end + span * math.exp(-step / self.decay)

update(reward)

Update internal arm reward estimate after observing a reward.

Parameters:

Name Type Description Default
reward float

Observed reward for the arm chosen in the most recent choose call. No-op if choose has not been called yet.

required
Source code in onlinecml/policy/epsilon_greedy.py
def update(self, reward: float) -> None:
    """Update internal arm reward estimate after observing a reward.

    Parameters
    ----------
    reward : float
        Observed reward for the arm chosen in the most recent
        ``choose`` call. No-op if ``choose`` has not been called yet.
    """
    arm = self._last_arm
    if arm is None:
        # No treatment chosen yet — nothing to attribute the reward to.
        return
    self._arm_sums[arm] += reward
    self._arm_counts[arm] += 1

onlinecml.policy.thompson_sampling.ThompsonSampling

Bases: BasePolicy

Thompson Sampling policy for binary outcomes (Beta-Bernoulli).

Maintains a Beta posterior over the success probability for each treatment arm. At each step, samples from each posterior and assigns the treatment with the higher sample.

Parameters:

Name Type Description Default
alpha_prior float

Prior pseudo-count for successes (Beta alpha). Default 1.0 (uniform prior).

1.0
beta_prior float

Prior pseudo-count for failures (Beta beta). Default 1.0 (uniform prior).

1.0
seed int or None

Random seed for reproducibility.

None
Notes

This policy assumes binary outcomes in [0, 1]. For continuous outcomes, use GaussianThompsonSampling.

The true propensity of the chosen arm is the posterior probability that its sample exceeds the other arm's. Computing that probability exactly is expensive, so for implementation simplicity choose always returns a conservative constant propensity of 0.5.

Examples:

>>> policy = ThompsonSampling(seed=0)
>>> treatment, propensity = policy.choose(cate_score=0.0, step=0)
>>> treatment in (0, 1)
True
>>> policy.update(reward=1.0)
Source code in onlinecml/policy/thompson_sampling.py
class ThompsonSampling(BasePolicy):
    """Thompson Sampling policy for binary outcomes (Beta-Bernoulli).

    Maintains a Beta posterior over the success probability for each
    treatment arm. At each step, samples from each posterior and assigns
    the treatment with the higher sample.

    Parameters
    ----------
    alpha_prior : float
        Prior pseudo-count for successes (Beta alpha). Default 1.0
        (uniform prior).
    beta_prior : float
        Prior pseudo-count for failures (Beta beta). Default 1.0
        (uniform prior).
    seed : int or None
        Random seed for reproducibility.

    Notes
    -----
    This policy assumes binary outcomes in [0, 1]. For continuous
    outcomes, use ``GaussianThompsonSampling``.

    The true propensity of the chosen arm is the posterior probability
    that its sample exceeds the other arm's; that quantity has no cheap
    closed form, so ``choose`` always returns a conservative constant
    propensity of 0.5.

    Examples
    --------
    >>> policy = ThompsonSampling(seed=0)
    >>> treatment, propensity = policy.choose(cate_score=0.0, step=0)
    >>> treatment in (0, 1)
    True
    >>> policy.update(reward=1.0)
    """

    def __init__(
        self,
        alpha_prior: float = 1.0,
        beta_prior: float = 1.0,
        seed: int | None = None,
    ) -> None:
        self.alpha_prior = alpha_prior
        self.beta_prior = beta_prior
        self.seed = seed
        # Standard library RNG — seeded for reproducibility.
        self._rng = random.Random(seed)
        # Posterior parameters: [alpha, beta] for each arm (0 = control, 1 = treated)
        self._alpha = [alpha_prior, alpha_prior]
        self._beta = [beta_prior, beta_prior]
        # Arm chosen by the most recent ``choose``. Defaults to 0, so an
        # ``update`` before any ``choose`` credits arm 0.
        self._last_treatment: int = 0

    def _beta_sample(self, alpha: float, beta: float) -> float:
        """Draw one sample from a Beta(alpha, beta) distribution.

        Uses the gamma-ratio method: Beta(a,b) = Gamma(a) / (Gamma(a) + Gamma(b)).

        Parameters
        ----------
        alpha : float
            Beta distribution alpha parameter.
        beta : float
            Beta distribution beta parameter.

        Returns
        -------
        float
            Sample in (0, 1).
        """
        x = self._rng.gammavariate(alpha, 1.0)
        y = self._rng.gammavariate(beta, 1.0)
        total = x + y
        # Degenerate draw (both gamma samples came back 0): fall back to
        # the uninformative midpoint rather than dividing by zero.
        if total <= 0.0:
            return 0.5
        return x / total

    def choose(self, cate_score: float, step: int) -> tuple[int, float]:
        """Choose a treatment by sampling from the Beta posteriors.

        Parameters
        ----------
        cate_score : float
            Not used by Thompson Sampling (posteriors drive the choice).
        step : int
            Not used; included for API compatibility.

        Returns
        -------
        treatment : int
            The arm with the higher posterior sample.
        propensity : float
            Approximate propensity (always 0.5 — a conservative constant
            for the Beta-Bernoulli sampler).
        """
        s0 = self._beta_sample(self._alpha[0], self._beta[0])
        s1 = self._beta_sample(self._alpha[1], self._beta[1])
        # Ties (s1 == s0) resolve to the control arm (0).
        treatment = 1 if s1 > s0 else 0
        self._last_treatment = treatment
        # Conservative propensity estimate: 0.5 (true propensity depends on
        # the full posterior, which is expensive to compute exactly)
        return treatment, 0.5

    def update(self, reward: float) -> None:
        """Update the Beta posterior for the last chosen arm.

        If called before any ``choose``, the update is applied to arm 0
        (the initial value of ``_last_treatment``).

        Parameters
        ----------
        reward : float
            Observed outcome. Values > 0.5 are treated as successes;
            values ≤ 0.5 as failures (for binary reward encoding).
        """
        arm = self._last_treatment
        if reward > 0.5:
            self._alpha[arm] += 1.0
        else:
            self._beta[arm] += 1.0

    def reset(self) -> None:
        """Reset posteriors to prior and reinitialize RNG.

        Re-runs ``__init__`` with the constructor arguments reported by
        ``_get_params()`` (inherited from ``BasePolicy``), which also
        reseeds the RNG from the stored ``seed``.
        """
        self.__init__(**self._get_params())  # type: ignore[misc]

choose(cate_score, step)

Choose a treatment by sampling from the Beta posteriors.

Parameters:

Name Type Description Default
cate_score float

Not used by Thompson Sampling (posteriors drive the choice).

required
step int

Not used; included for API compatibility.

required

Returns:

Name Type Description
treatment int

The arm with the higher posterior sample.

propensity float

Approximate propensity (0.5 as a conservative estimate for the Beta-Bernoulli sampler).

Source code in onlinecml/policy/thompson_sampling.py
def choose(self, cate_score: float, step: int) -> tuple[int, float]:
    """Choose a treatment by sampling from the Beta posteriors.

    Parameters
    ----------
    cate_score : float
        Not used by Thompson Sampling (posteriors drive the choice).
    step : int
        Not used; included for API compatibility.

    Returns
    -------
    treatment : int
        The arm with the higher posterior sample.
    propensity : float
        Approximate propensity (0.5 as a conservative estimate for
        the Beta-Bernoulli sampler).
    """
    draw_control = self._beta_sample(self._alpha[0], self._beta[0])
    draw_treated = self._beta_sample(self._alpha[1], self._beta[1])
    # Ties resolve to the control arm.
    chosen = 1 if draw_treated > draw_control else 0
    self._last_treatment = chosen
    # The exact propensity would require integrating over both posteriors;
    # return the conservative constant 0.5 instead.
    return chosen, 0.5

reset()

Reset posteriors to prior and reinitialize RNG.

Source code in onlinecml/policy/thompson_sampling.py
def reset(self) -> None:
    """Reset posteriors to prior and reinitialize RNG.

    Re-runs ``__init__`` with the constructor arguments reported by
    ``_get_params()`` (inherited from ``BasePolicy``), which also
    reseeds the RNG from the stored ``seed``.
    """
    self.__init__(**self._get_params())  # type: ignore[misc]

update(reward)

Update the Beta posterior for the last chosen arm.

Parameters:

Name Type Description Default
reward float

Observed outcome. Values > 0.5 are treated as successes; values ≤ 0.5 as failures (for binary reward encoding).

required
Source code in onlinecml/policy/thompson_sampling.py
def update(self, reward: float) -> None:
    """Update the Beta posterior for the last chosen arm.

    Parameters
    ----------
    reward : float
        Observed outcome. Values > 0.5 are treated as successes;
        values ≤ 0.5 as failures (for binary reward encoding).
    """
    arm = self._last_treatment
    success = reward > 0.5
    if success:
        self._alpha[arm] += 1.0
    else:
        self._beta[arm] += 1.0

onlinecml.policy.thompson_sampling.GaussianThompsonSampling

Bases: BasePolicy

Thompson Sampling policy for continuous outcomes (Gaussian).

Maintains a Gaussian posterior over the mean reward for each arm using a Normal-Normal conjugate model. Assumes known variance.

Parameters:

Name Type Description Default
prior_mean float

Prior mean for each arm's reward. Default 0.0.

0.0
prior_std float

Prior standard deviation (uncertainty about the mean). Default 1.0.

1.0
noise_std float

Known observation noise standard deviation. Default 1.0.

1.0
seed int or None

Random seed for reproducibility.

None
Notes

The posterior after n observations with sample mean y_bar is:

.. math::

\mu_{post} = \frac{\sigma_0^2 n \bar{y} + \sigma^2 \mu_0}
                   {\sigma_0^2 n + \sigma^2}

\sigma_{post}^2 = \frac{\sigma_0^2 \sigma^2}{\sigma_0^2 n + \sigma^2}

Examples:

>>> policy = GaussianThompsonSampling(seed=42)
>>> treatment, _ = policy.choose(1.5, 10)
>>> treatment in (0, 1)
True
Source code in onlinecml/policy/thompson_sampling.py
class GaussianThompsonSampling(BasePolicy):
    """Thompson Sampling for continuous rewards via a Normal-Normal model.

    Each arm's mean reward carries a Gaussian posterior under a
    known-variance conjugate model. A treatment is selected by drawing
    one sample per arm from the posteriors and taking the larger draw.

    Parameters
    ----------
    prior_mean : float
        Prior mean for each arm's reward. Default 0.0.
    prior_std : float
        Prior standard deviation (uncertainty about the mean). Default 1.0.
    noise_std : float
        Known observation noise standard deviation. Default 1.0.
    seed : int or None
        Random seed for reproducibility.

    Notes
    -----
    After ``n`` observations with sample mean ``y_bar``, the posterior is:

    .. math::

        \\mu_{post} = \\frac{\\sigma_0^2 n \\bar{y} + \\sigma^2 \\mu_0}
                           {\\sigma_0^2 n + \\sigma^2}

        \\sigma_{post}^2 = \\frac{\\sigma_0^2 \\sigma^2}{\\sigma_0^2 n + \\sigma^2}

    Examples
    --------
    >>> policy = GaussianThompsonSampling(seed=42)
    >>> treatment, _ = policy.choose(1.5, 10)
    >>> treatment in (0, 1)
    True
    """

    def __init__(
        self,
        prior_mean: float = 0.0,
        prior_std: float = 1.0,
        noise_std: float = 1.0,
        seed: int | None = None,
    ) -> None:
        self.prior_mean = prior_mean
        self.prior_std = prior_std
        self.noise_std = noise_std
        self.seed = seed
        # Standard library RNG, seeded for reproducibility.
        self._rng = random.Random(seed)
        # Sufficient statistics per arm: running reward sum and pull count.
        self._sum_y = [0.0, 0.0]
        self._n = [0, 0]
        self._last_treatment: int = 0

    def _posterior_params(self, arm: int) -> tuple[float, float]:
        """Return (posterior_mean, posterior_std) for an arm.

        Parameters
        ----------
        arm : int
            Arm index (0 or 1).

        Returns
        -------
        tuple of (float, float)
            Posterior mean and standard deviation.
        """
        count = self._n[arm]
        if count == 0:
            # No data yet: the posterior is just the prior.
            return self.prior_mean, self.prior_std
        prior_var = self.prior_std ** 2
        noise_var = self.noise_std ** 2
        sample_mean = self._sum_y[arm] / count
        denom = prior_var * count + noise_var
        post_mean = (prior_var * count * sample_mean + noise_var * self.prior_mean) / denom
        post_var = (prior_var * noise_var) / denom
        return post_mean, math.sqrt(post_var)

    def _gauss_sample(self, mu: float, sigma: float) -> float:
        """Draw a single sample from N(mu, sigma^2).

        Parameters
        ----------
        mu : float
            Mean of the distribution.
        sigma : float
            Standard deviation of the distribution.

        Returns
        -------
        float
            One Gaussian draw.
        """
        return self._rng.gauss(mu, sigma)

    def choose(self, cate_score: float, step: int) -> tuple[int, float]:
        """Choose a treatment by sampling from the Gaussian posteriors.

        Parameters
        ----------
        cate_score : float
            Not used; included for API compatibility.
        step : int
            Not used; included for API compatibility.

        Returns
        -------
        treatment : int
            Arm with higher posterior sample.
        propensity : float
            Conservative propensity estimate (0.5).
        """
        draws = []
        for arm in (0, 1):
            mu, sd = self._posterior_params(arm)
            draws.append(self._gauss_sample(mu, sd))
        chosen = 1 if draws[1] > draws[0] else 0
        self._last_treatment = chosen
        return chosen, 0.5

    def update(self, reward: float) -> None:
        """Update the Gaussian posterior for the last chosen arm.

        Parameters
        ----------
        reward : float
            Observed continuous outcome.
        """
        chosen = self._last_treatment
        self._sum_y[chosen] += reward
        self._n[chosen] += 1

    def reset(self) -> None:
        """Reset posteriors to prior and reinitialize RNG."""
        params = self._get_params()
        self.__init__(**params)  # type: ignore[misc]

choose(cate_score, step)

Choose a treatment by sampling from the Gaussian posteriors.

Parameters:

Name Type Description Default
cate_score float

Not used; included for API compatibility.

required
step int

Not used; included for API compatibility.

required

Returns:

Name Type Description
treatment int

Arm with higher posterior sample.

propensity float

Conservative propensity estimate (0.5).

Source code in onlinecml/policy/thompson_sampling.py
def choose(self, cate_score: float, step: int) -> tuple[int, float]:
    """Choose a treatment by sampling from the Gaussian posteriors.

    Parameters
    ----------
    cate_score : float
        Not used; included for API compatibility.
    step : int
        Not used; included for API compatibility.

    Returns
    -------
    treatment : int
        Arm with higher posterior sample.
    propensity : float
        Conservative propensity estimate (0.5).
    """
    draws = []
    for arm in (0, 1):
        mu, sd = self._posterior_params(arm)
        draws.append(self._gauss_sample(mu, sd))
    chosen = 1 if draws[1] > draws[0] else 0
    self._last_treatment = chosen
    return chosen, 0.5

reset()

Reset posteriors to prior and reinitialize RNG.

Source code in onlinecml/policy/thompson_sampling.py
def reset(self) -> None:
    """Reset posteriors to prior and reinitialize RNG.

    Re-runs ``__init__`` with the constructor arguments reported by
    ``_get_params()`` (inherited from ``BasePolicy``), which also
    reseeds the RNG from the stored ``seed``.
    """
    self.__init__(**self._get_params())  # type: ignore[misc]

update(reward)

Update the Gaussian posterior for the last chosen arm.

Parameters:

Name Type Description Default
reward float

Observed continuous outcome.

required
Source code in onlinecml/policy/thompson_sampling.py
def update(self, reward: float) -> None:
    """Update the Gaussian posterior for the last chosen arm.

    Parameters
    ----------
    reward : float
        Observed continuous outcome.
    """
    chosen = self._last_treatment
    self._n[chosen] += 1
    self._sum_y[chosen] += reward

onlinecml.policy.ucb.UCB

Bases: BasePolicy

Upper Confidence Bound policy for treatment selection.

Selects the treatment with the highest upper confidence bound on its expected reward. Balances exploration (high uncertainty) and exploitation (high mean reward) via a confidence coefficient.

Parameters:

Name Type Description Default
confidence float

Exploration coefficient. Larger values encourage more exploration. Default 1.0 (standard UCB1).

1.0
min_pulls int

Minimum number of times each arm must be pulled before switching to UCB selection. During warm-up, arms are pulled in round-robin. Default 1.

1
Notes

UCB1 bound for arm a:

.. math::

UCB_a = \hat{\mu}_a + c \sqrt{\frac{\ln(t)}{n_a}}

where t is the total number of pulls, n_a is the number of pulls for arm a, and c is the confidence coefficient.

The propensity returned is a constant 0.5 in both the warm-up and UCB phases; UCB's choice is deterministic given its internal statistics, so this value is a conservative placeholder rather than a true selection probability.

Examples:

>>> policy = UCB(confidence=1.0)
>>> treatment, propensity = policy.choose(cate_score=0.0, step=5)
>>> treatment in (0, 1)
True
Source code in onlinecml/policy/ucb.py
class UCB(BasePolicy):
    """Upper Confidence Bound policy for treatment selection.

    Selects the treatment with the highest upper confidence bound on its
    expected reward. Balances exploration (high uncertainty) and
    exploitation (high mean reward) via a confidence coefficient.

    Parameters
    ----------
    confidence : float
        Exploration coefficient. Larger values encourage more exploration.
        Default 1.0 (standard UCB1).
    min_pulls : int
        Minimum number of times each arm must be pulled before switching
        to UCB selection. During warm-up, arms are pulled in round-robin.
        Default 1.

    Notes
    -----
    UCB1 bound for arm ``a``:

    .. math::

        UCB_a = \\hat{\\mu}_a + c \\sqrt{\\frac{\\ln(t)}{n_a}}

    where ``t`` is the total number of pulls, ``n_a`` is the number of
    pulls for arm ``a``, and ``c`` is the confidence coefficient.
    ``t`` is derived from the per-arm pull counts maintained by
    ``update``, so it reflects every recorded pull (warm-up included).

    The propensity returned is a constant 0.5 in both the warm-up and
    UCB phases; UCB itself is deterministic given its statistics, so
    this is a conservative placeholder, not a true selection probability.

    Examples
    --------
    >>> policy = UCB(confidence=1.0)
    >>> treatment, propensity = policy.choose(cate_score=0.0, step=5)
    >>> treatment in (0, 1)
    True
    """

    def __init__(self, confidence: float = 1.0, min_pulls: int = 1) -> None:
        self.confidence = confidence
        self.min_pulls = min_pulls
        self._n_pulls = [0, 0]       # pulls per arm (incremented in ``update``)
        self._sum_reward = [0.0, 0.0]
        # Retained for backward compatibility; counts UCB-phase choose() calls.
        self._total_pulls: int = 0
        self._last_treatment: int = 0

    def choose(self, cate_score: float, step: int) -> tuple[int, float]:
        """Choose a treatment using the UCB rule.

        Parameters
        ----------
        cate_score : float
            Not used directly; the UCB rule uses observed rewards.
        step : int
            Not used directly; the class tracks pulls internally.

        Returns
        -------
        treatment : int
            Arm with the highest UCB score.
        propensity : float
            Constant 0.5 (conservative placeholder).
        """
        # Warm-up: ensure each arm is pulled min_pulls times (round-robin)
        for arm in range(2):
            if self._n_pulls[arm] < self.min_pulls:
                self._last_treatment = arm
                return arm, 0.5

        # UCB selection. Bug fix: derive t from the actual per-arm pull
        # counts (maintained by ``update``); the old _total_pulls counter
        # was only incremented for UCB-phase choose() calls — never during
        # warm-up or in update() — and therefore underestimated ln(t).
        t = max(1, self._n_pulls[0] + self._n_pulls[1])
        ucb_scores = []
        for arm in range(2):
            mean = self._sum_reward[arm] / self._n_pulls[arm]
            bonus = self.confidence * math.sqrt(math.log(t) / max(1, self._n_pulls[arm]))
            ucb_scores.append(mean + bonus)

        # Ties favor the treated arm (index 1).
        treatment = 1 if ucb_scores[1] >= ucb_scores[0] else 0
        self._last_treatment = treatment
        self._total_pulls += 1
        return treatment, 0.5

    def update(self, reward: float) -> None:
        """Update the reward estimate for the last chosen arm.

        Parameters
        ----------
        reward : float
            Observed outcome after applying the last chosen treatment.
        """
        arm = self._last_treatment
        self._n_pulls[arm] += 1
        self._sum_reward[arm] += reward

    def reset(self) -> None:
        """Reset all arm statistics."""
        self.__init__(**self._get_params())  # type: ignore[misc]

choose(cate_score, step)

Choose a treatment using the UCB rule.

Parameters:

Name Type Description Default
cate_score float

Not used directly; the UCB rule uses observed rewards.

required
step int

Not used directly; the class tracks pulls internally.

required

Returns:

Name Type Description
treatment int

Arm with the highest UCB score.

propensity float

0.5 during warm-up; approximate propensity during UCB phase.

Source code in onlinecml/policy/ucb.py
def choose(self, cate_score: float, step: int) -> tuple[int, float]:
    """Choose a treatment using the UCB rule.

    Parameters
    ----------
    cate_score : float
        Not used directly; the UCB rule uses observed rewards.
    step : int
        Not used directly; the class tracks pulls internally.

    Returns
    -------
    treatment : int
        Arm with the highest UCB score.
    propensity : float
        Constant 0.5 (conservative placeholder; UCB is deterministic
        given its internal statistics).
    """
    # Warm-up: ensure each arm is pulled min_pulls times (round-robin)
    for arm in range(2):
        if self._n_pulls[arm] < self.min_pulls:
            self._last_treatment = arm
            return arm, 0.5

    # UCB selection. Bug fix: derive t from the actual per-arm pull
    # counts (maintained by ``update``); the old _total_pulls counter
    # was only incremented for UCB-phase choose() calls — never during
    # warm-up or in update() — and therefore underestimated ln(t).
    t = max(1, self._n_pulls[0] + self._n_pulls[1])
    ucb_scores = []
    for arm in range(2):
        mean = self._sum_reward[arm] / self._n_pulls[arm]
        bonus = self.confidence * math.sqrt(math.log(t) / max(1, self._n_pulls[arm]))
        ucb_scores.append(mean + bonus)

    # Ties favor the treated arm (index 1).
    treatment = 1 if ucb_scores[1] >= ucb_scores[0] else 0
    self._last_treatment = treatment
    self._total_pulls += 1  # retained for backward compatibility
    return treatment, 0.5

reset()

Reset all arm statistics.

Source code in onlinecml/policy/ucb.py
def reset(self) -> None:
    """Reset all arm statistics.

    Re-runs ``__init__`` with the constructor arguments reported by
    ``_get_params()`` (inherited from ``BasePolicy``), zeroing the pull
    counts and reward sums for both arms.
    """
    self.__init__(**self._get_params())  # type: ignore[misc]

update(reward)

Update the reward estimate for the last chosen arm.

Parameters:

Name Type Description Default
reward float

Observed outcome after applying the last chosen treatment.

required
Source code in onlinecml/policy/ucb.py
def update(self, reward: float) -> None:
    """Update the reward estimate for the last chosen arm.

    Parameters
    ----------
    reward : float
        Observed outcome after applying the last chosen treatment.
    """
    arm = self._last_treatment
    self._sum_reward[arm] += reward
    self._n_pulls[arm] += 1