evaluatio.inference.hypothesis
Hypothesis testing functions for paired samples.
This module provides paired statistical significance tests for comparing two systems or models on matched observations. Both tests are non-parametric and make no distributional assumptions, relying instead on resampling to construct a null distribution.
The intended use case is evaluating whether the difference in performance between two systems is statistically significant, for example two ASR models evaluated on the same utterances, or two MT systems evaluated on the same source sentences.
All tests in this module are paired: every element x1[i] must correspond
to the same observation as x2[i]. The ordering of pairs is assumed to be
meaningful and must be consistent between x1 and x2.
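One way to guarantee that pairing is to key each system's scores by an observation ID and build both sequences from the shared IDs in a single pass. The sketch below assumes scores are held in dictionaries; `align_scores` and the utterance IDs are hypothetical names used only for illustration and are not part of this module.

```python
def align_scores(scores_a, scores_b):
    """Return (ids, x1, x2) so that x1[i] and x2[i] share the same observation ID."""
    # Keep only observations scored by both systems, in a deterministic order.
    shared_ids = sorted(scores_a.keys() & scores_b.keys())
    x1 = [scores_a[obs_id] for obs_id in shared_ids]
    x2 = [scores_b[obs_id] for obs_id in shared_ids]
    return shared_ids, x1, x2


# Hypothetical per-utterance scores, stored in different orders.
system_a = {"utt3": 0.78, "utt1": 0.85, "utt2": 0.90}
system_b = {"utt2": 0.85, "utt1": 0.80, "utt3": 0.75}

ids, x1, x2 = align_scores(system_a, system_b)
for obs_id, a, b in zip(ids, x1, x2):
    print(obs_id, a, b)
```

The resulting `x1` and `x2` can then be passed to either test in this module.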
Note
For general guidance on which test to use:
Use paired_bootstrap_test when you want a p-value alongside a separately computed confidence interval, or when the bootstrap CI is your primary reported result.
Use paired_permutation_test when you want an exact significance test under the sharp null hypothesis of no effect on any unit.
Both tests converge to equivalent conclusions on large samples. On small samples the permutation test has a slight power advantage due to its exactness.
Functions
paired_bootstrap_test
paired_bootstrap_test(x1: Iterable[float], x2: Iterable[float], iterations: int) -> float
Perform a paired bootstrap significance test on the mean difference.
Resamples pairs with replacement to construct a null distribution of the
mean difference under the hypothesis of no effect, and returns a two-sided
p-value. The p-value is estimated as the proportion of bootstrap iterations
in which the resampled mean difference is at least as extreme as the
observed mean difference, using the (count + 1) / (iterations + 1)
correction to ensure the p-value is never exactly zero.
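The procedure described above can be sketched in plain Python. This is an illustration only, not this module's implementation: the function name `paired_bootstrap_sketch` and the `seed` argument are hypothetical, and the sketch assumes the null distribution is obtained by centering the paired differences on their observed mean before resampling, which is one common way to impose "no effect".

```python
import random


def paired_bootstrap_sketch(x1, x2, iterations, seed=0):
    """Two-sided paired bootstrap p-value on the mean difference (illustrative)."""
    x1, x2 = list(x1), list(x2)
    if len(x1) != len(x2) or not x1:
        raise ValueError("x1 and x2 must be non-empty and of equal length")

    rng = random.Random(seed)
    n = len(x1)
    diffs = [a - b for a, b in zip(x1, x2)]
    observed = sum(diffs) / n

    # Assumption: impose the null by shifting the differences to have mean zero.
    centered = [d - observed for d in diffs]

    count = 0
    for _ in range(iterations):
        # Resample pairs (here: their centered differences) with replacement.
        resample = rng.choices(centered, k=n)
        if abs(sum(resample) / n) >= abs(observed):
            count += 1

    # The +1 correction keeps the estimate strictly above zero.
    return (count + 1) / (iterations + 1)


scores = [0.85, 0.90, 0.78, 0.92, 0.88]
# Identical systems: every resample is as extreme as the observed difference of 0.
print(paired_bootstrap_sketch(scores, scores, iterations=999))  # 1.0
```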
Parameters
x1 : iterable of float
Per-observation scores for the first system. Must be the same length as x2, with x1[i] and x2[i] corresponding to the same observation.
x2 : iterable of float
Per-observation scores for the second system. Must be the same length as x1.
iterations : int
Number of bootstrap resamples. Values of 5000 to 10000 give stable p-value estimates for most purposes. Larger values reduce Monte Carlo variance in the p-value but increase runtime linearly.
Returns
float
Two-sided p-value in the range (0, 1]. A value below 0.05 indicates that the observed mean difference is unlikely under the null hypothesis of no effect.
Raises
ValueError
If x1 and x2 have different lengths, or if either is empty.
Note
The minimum possible p-value is 1 / (iterations + 1). With
iterations=9999 this is 0.0001. Reporting p < 0.0001 without
increasing iterations accordingly is not meaningful.
This test resamples pairs with replacement, which models variability in the observed mean difference as if a different test set of the same size had been drawn. It does not account for variability across training runs or random seeds.
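The floor on the p-value mentioned in this note follows directly from the (count + 1) / (iterations + 1) correction with count = 0. A small sketch of the arithmetic, using a hypothetical helper name `min_p_value`:

```python
def min_p_value(iterations):
    """Smallest p-value the (count + 1) / (iterations + 1) estimator can return."""
    return (0 + 1) / (iterations + 1)


for iterations in (999, 9999, 99999):
    print(f"iterations={iterations}: minimum p = {min_p_value(iterations):.5f}")
```

Resolving p-values below a given threshold therefore requires roughly one over that threshold iterations or more.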
Examples
>>> x1 = [0.85, 0.90, 0.78, 0.92, 0.88]
>>> x2 = [0.80, 0.85, 0.75, 0.88, 0.82]
>>> p = paired_bootstrap_test(x1, x2, iterations=9999)
>>> print(f"p = {p:.4f}")
p = 0.0312
paired_permutation_test
paired_permutation_test(x1: Iterable[float], x2: Iterable[float], iterations: int, two_tailed: bool=True) -> float
Perform a paired permutation significance test on the mean difference.
Constructs a null distribution by randomly flipping the sign of each
paired difference, under the sharp null hypothesis that the two systems
are exchangeable on every observation. Returns a p-value estimated as the
proportion of permutations producing a test statistic at least as extreme
as the observed statistic, using the (count + 1) / (iterations + 1)
correction.
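The sign-flip procedure described above can be sketched in plain Python. This is an illustration under stated assumptions, not this module's implementation: the name `paired_permutation_sketch` and the `seed` argument are hypothetical, the test statistic is taken to be the mean of the paired differences, and each sign is flipped independently with probability 0.5.

```python
import random


def paired_permutation_sketch(x1, x2, iterations, two_tailed=True, seed=0):
    """Monte Carlo paired sign-flip test on the mean difference (illustrative)."""
    x1, x2 = list(x1), list(x2)
    if len(x1) != len(x2) or not x1 or iterations < 1:
        raise ValueError("inputs must be non-empty, equal length, iterations >= 1")

    rng = random.Random(seed)
    n = len(x1)
    diffs = [a - b for a, b in zip(x1, x2)]
    observed = sum(diffs) / n

    count = 0
    for _ in range(iterations):
        # Under the sharp null, swapping the two systems on any pair is
        # equally likely, which flips the sign of that pair's difference.
        stat = sum(d if rng.random() < 0.5 else -d for d in diffs) / n
        if two_tailed:
            extreme = abs(stat) >= abs(observed)
        else:
            extreme = stat >= observed  # direction: x1 exceeds x2
        count += extreme

    return (count + 1) / (iterations + 1)


scores = [0.85, 0.90, 0.78, 0.92, 0.88]
# Identical systems: all differences are 0, so every permutation ties the observed value.
print(paired_permutation_sketch(scores, scores, iterations=999))  # 1.0
```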
Parameters
x1 : iterable of float
Per-observation scores for the first system. Must be the same length as x2, with x1[i] and x2[i] corresponding to the same observation.
x2 : iterable of float
Per-observation scores for the second system. Must be the same length as x1.
iterations : int
Number of random permutations to sample. Values of 5000 to 10000 give stable p-value estimates for most purposes. The total number of distinct permutations for n pairs is 2^n, so exhaustive enumeration is only feasible for very small n.
two_tailed : bool
If True (default), the test is two-sided: both directions of difference contribute to the p-value. If False, the test is one-sided in the direction where x1 exceeds x2.
Returns
float
P-value in the range (0, 1]. A value below 0.05 indicates that the observed mean difference is unlikely under the sharp null hypothesis of exchangeability.
Raises
ValueError
If x1 and x2 have different lengths, if either is empty, or if the number of iterations is < 1.
Note
The permutation test operates under a stricter null hypothesis than the bootstrap test: it assumes not merely that the mean difference is zero, but that the treatment assignment is completely arbitrary for every individual observation. This makes it more powerful than the bootstrap test on small samples, but the two converge on large samples.
For a two-tailed test the sign-flip procedure is symmetric, so swapping
x1 and x2 produces an identical p-value.
The minimum possible p-value is 1 / (iterations + 1).
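For very small n, the 2^n sign assignments can be enumerated exhaustively, which makes both the exact p-value and the two-tailed symmetry easy to check by hand. The sketch below is not part of this module: `exact_paired_permutation` is a hypothetical helper that returns the exact proportion of sign assignments at least as extreme as the observed mean difference, without the +1 correction that the sampled estimate uses.

```python
from itertools import product


def exact_paired_permutation(x1, x2, two_tailed=True):
    """Exact paired sign-flip p-value by enumerating all 2^n sign assignments."""
    diffs = [a - b for a, b in zip(x1, x2)]
    n = len(diffs)
    observed = sum(diffs) / n

    count = 0
    for signs in product((1, -1), repeat=n):
        stat = sum(s * d for s, d in zip(signs, diffs)) / n
        if two_tailed:
            count += abs(stat) >= abs(observed)
        else:
            count += stat >= observed
    return count / 2 ** n


x1 = [0.85, 0.90, 0.78, 0.92, 0.88]
x2 = [0.80, 0.85, 0.75, 0.88, 0.82]

# All five differences are positive, so only the all-positive and all-negative
# assignments are as extreme as the observed mean: 2 of 32.
print(exact_paired_permutation(x1, x2))                    # 0.0625
print(exact_paired_permutation(x2, x1))                    # 0.0625 (symmetric)
print(exact_paired_permutation(x1, x2, two_tailed=False))  # 0.03125
```

With five pairs, no two-tailed sign-flip test can yield an exact p-value below 2/32, however many iterations are sampled.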
Examples
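Two-tailed test (default). Permutations are sampled at random, so the printed value varies slightly between runs. All five differences below are positive, so only 2 of the 2^5 = 32 sign assignments (all positive and all negative) are at least as extreme as the observed mean difference, and the estimate should stay close to 2/32 = 0.0625:
>>> x1 = [0.85, 0.90, 0.78, 0.92, 0.88]
>>> x2 = [0.80, 0.85, 0.75, 0.88, 0.82]
>>> p = paired_permutation_test(x1, x2, iterations=9999)
>>> print(f"p = {p:.4f}")
p = 0.0625
One-tailed test, where only the all-positive assignment counts, so the estimate should stay close to 1/32 (about 0.031):
>>> p = paired_permutation_test(x1, x2, iterations=9999, two_tailed=False)
>>> print(f"p = {p:.4f}")
p = 0.0312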