evaluatio.inference.hypothesis

Hypothesis testing functions for paired samples.

This module provides paired statistical significance tests for comparing two systems or models on matched observations. Both tests are non-parametric and make no distributional assumptions, relying instead on resampling to construct a null distribution.

The intended use case is evaluating whether the difference in performance between two systems is statistically significant: for example, two ASR models evaluated on the same utterances, or two MT systems evaluated on the same source sentences.

All tests in this module are paired: every element x1[i] must correspond to the same observation as x2[i]. The ordering of pairs is assumed to be meaningful and must be consistent between x1 and x2.
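Concretely, the pairing requirement means both score lists must be built from the same ordered set of observations. A minimal sketch (the utterance IDs and score dictionaries are hypothetical, purely for illustration):

```python
# Hypothetical per-utterance scores for two systems on the same test set.
# Pairing holds because both lists are indexed by the same utterance order.
utterances = ["utt-001", "utt-002", "utt-003"]
scores_a = {"utt-001": 0.85, "utt-002": 0.90, "utt-003": 0.78}
scores_b = {"utt-001": 0.80, "utt-002": 0.85, "utt-003": 0.75}

x1 = [scores_a[u] for u in utterances]
x2 = [scores_b[u] for u in utterances]

# The paired differences are only meaningful under this shared ordering;
# sorting or shuffling either list independently would break the pairing.
diffs = [round(a - b, 2) for a, b in zip(x1, x2)]
print(diffs)  # [0.05, 0.05, 0.03]
```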

Note

For general guidance on which test to use: both tests converge to equivalent conclusions on large samples. On small samples the permutation test has a slight power advantage, because its null distribution is derived directly from the sharp null of per-observation exchangeability.

Functions

paired_bootstrap_test

paired_bootstrap_test(x1: Iterable[float], x2: Iterable[float], iterations: int) -> float

Perform a paired bootstrap significance test on the mean difference.

Resamples pairs with replacement to construct a null distribution of the mean difference under the hypothesis of no effect, and returns a two-sided p-value. The p-value is estimated as the proportion of bootstrap iterations in which the resampled mean difference is at least as extreme as the observed mean difference, using the (count + 1) / (iterations + 1) correction to ensure the p-value is never exactly zero.

Parameters

- x1 (Iterable[float]): Scores for the first system, one per observation.
- x2 (Iterable[float]): Scores for the second system, paired with x1 by index.
- iterations (int): Number of bootstrap resamples used to estimate the p-value.

Returns

- float: Two-sided p-value for the null hypothesis of no difference in means.

Raises

Note

The minimum possible p-value is 1 / (iterations + 1). With iterations=9999 this is 0.0001. Reporting p < 0.0001 without increasing iterations accordingly is not meaningful.

This test resamples pairs with replacement, which models variability in the observed mean difference as if a different test set of the same size had been drawn. It does not account for variability across training runs or random seeds.
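The resampling scheme described above can be sketched as follows. This is a minimal illustration assuming one common construction (centering the paired differences at zero to model the null); the module's actual implementation may differ in details:

```python
import random

def bootstrap_mean_diff_sketch(x1, x2, iterations=9999, seed=0):
    """Illustrative paired bootstrap on the mean difference (not the library code).

    Centers the paired differences at zero to model "no effect", then
    resamples pairs with replacement and counts resampled means at least
    as extreme as the observed mean difference.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(x1, x2)]
    observed = sum(diffs) / len(diffs)
    centered = [d - observed for d in diffs]  # mean zero under the null
    count = 0
    for _ in range(iterations):
        sample = rng.choices(centered, k=len(centered))  # resample pairs with replacement
        if abs(sum(sample) / len(sample)) >= abs(observed):
            count += 1
    # (count + 1) / (iterations + 1): the estimate is never exactly zero
    return (count + 1) / (iterations + 1)
```

With identical inputs every centered difference is zero, every resampled mean ties the observed zero difference, and the sketch returns 1.0, as expected for indistinguishable systems.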

Examples

>>> x1 = [0.85, 0.90, 0.78, 0.92, 0.88]
>>> x2 = [0.80, 0.85, 0.75, 0.88, 0.82]
>>> p = paired_bootstrap_test(x1, x2, iterations=9999)
>>> print(f"p = {p:.4f}")  # estimate varies between runs
p = 0.0312

paired_permutation_test

paired_permutation_test(x1: Iterable[float], x2: Iterable[float], iterations: int, two_tailed: bool=True) -> float

Perform a paired permutation significance test on the mean difference.

Constructs a null distribution by randomly flipping the sign of each paired difference, under the sharp null hypothesis that the two systems are exchangeable on every observation. Returns a p-value estimated as the proportion of permutations producing a test statistic at least as extreme as the observed statistic, using the (count + 1) / (iterations + 1) correction.

Parameters

- x1 (Iterable[float]): Scores for the first system, one per observation.
- x2 (Iterable[float]): Scores for the second system, paired with x1 by index.
- iterations (int): Number of random sign-flip permutations used to estimate the p-value.
- two_tailed (bool): If True (the default), return a two-tailed p-value; if False, a one-tailed p-value.

Returns

- float: p-value for the null hypothesis that the two systems are exchangeable.

Raises

Note

The permutation test operates under a stricter null hypothesis than the bootstrap test: it assumes not merely that the mean difference is zero, but that the treatment assignment is completely arbitrary for every individual observation. This makes it more powerful than the bootstrap test on small samples, but the two converge on large samples.

For a two-tailed test the sign-flip procedure is symmetric, so swapping x1 and x2 produces an identical p-value.

The minimum possible p-value is 1 / (iterations + 1).
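The sign-flip procedure can be sketched as follows. This is a minimal illustration, not the module's actual implementation; in particular, the one-tailed direction here (x1 better than x2) is an assumption:

```python
import random

def permutation_mean_diff_sketch(x1, x2, iterations=9999, two_tailed=True, seed=0):
    """Illustrative sign-flip permutation test (not the library code).

    Under the sharp null the sign of each paired difference is arbitrary,
    so each iteration flips the sign of each difference independently at
    random and recomputes the mean difference.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(x1, x2)]
    observed = sum(diffs) / len(diffs)
    count = 0
    for _ in range(iterations):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        stat = sum(flipped) / len(flipped)
        extreme = abs(stat) >= abs(observed) if two_tailed else stat >= observed
        if extreme:
            count += 1
    return (count + 1) / (iterations + 1)  # never exactly zero
```

Because sign flipping treats the two systems symmetrically, swapping x1 and x2 under the same random seed yields an identical two-tailed p-value.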

Examples

Two-tailed test (default). With these five pairs every difference is positive, so only 2 of the 32 possible sign assignments produce a mean difference at least as extreme as the observed one, and the estimate concentrates near 2/32 = 0.0625:

>>> x1 = [0.85, 0.90, 0.78, 0.92, 0.88]
>>> x2 = [0.80, 0.85, 0.75, 0.88, 0.82]
>>> p = paired_permutation_test(x1, x2, iterations=9999)
>>> print(f"p = {p:.4f}")  # estimate varies between runs
p = 0.0625

One-tailed test. Only the all-positive sign assignment is at least as extreme in the observed direction, so the estimate concentrates near 1/32 ≈ 0.0312 (assuming the one-tailed alternative follows the direction of the observed difference):

>>> p = paired_permutation_test(x1, x2, iterations=9999, two_tailed=False)
>>> print(f"p = {p:.4f}")  # estimate varies between runs
p = 0.0312