
evaluatio.metrics.bleu

BLEU metrics

Evaluatio does not implement BLEU natively; instead it relies on sacrebleu [3]. Evaluatio complements sacrebleu by providing statistical comparison tools, which sacrebleu itself does not include. This module contains those functions.

This module provides a paired bootstrap significance test for comparing two machine translation systems using the BLEU metric. It follows the method introduced by Koehn (2004), in which corpus-level BLEU is recomputed on each bootstrap resample rather than aggregating per-sentence scores directly.

Sufficient statistics (clipped n-gram counts, total n-gram counts, and lengths required for the brevity penalty) are precomputed per sentence using sacrebleu, ensuring that tokenisation and scoring are fully compatible with the sacrebleu reference implementation. The resampling itself is performed in Rust for efficiency.
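To make the sufficient-statistics approach concrete, here is a minimal pure-Python sketch. This is not Evaluatio's implementation (which delegates statistics extraction to sacrebleu and resampling to Rust): it uses naive whitespace tokenisation rather than sacrebleu's 13a tokeniser and omits smoothing and effective_order handling, and the helper names `sentence_stats` and `corpus_bleu` are illustrative.

```python
import math
from collections import Counter

MAX_ORDER = 4  # standard BLEU uses n-grams up to order 4

def ngrams(tokens, n):
    """All n-grams of a token list, as a Counter."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_stats(reference, hypothesis):
    """Per-sentence sufficient statistics: clipped match counts and total
    hypothesis n-gram counts per order, plus both lengths (for the brevity
    penalty). Whitespace tokenisation only, for illustration."""
    ref, hyp = reference.split(), hypothesis.split()
    matches, totals = [], []
    for n in range(1, MAX_ORDER + 1):
        hyp_ngrams, ref_ngrams = ngrams(hyp, n), ngrams(ref, n)
        matches.append(sum((hyp_ngrams & ref_ngrams).values()))  # clipped counts
        totals.append(sum(hyp_ngrams.values()))
    return matches, totals, len(hyp), len(ref)

def corpus_bleu(all_stats):
    """Corpus-level BLEU recomputed from accumulated sufficient statistics.
    No smoothing: any zero n-gram match yields 0.0."""
    matches = [sum(s[0][n] for s in all_stats) for n in range(MAX_ORDER)]
    totals = [sum(s[1][n] for s in all_stats) for n in range(MAX_ORDER)]
    hyp_len = sum(s[2] for s in all_stats)
    ref_len = sum(s[3] for s in all_stats)
    if any(m == 0 for m in matches):
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / MAX_ORDER
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return 100.0 * bp * math.exp(log_precision)
```

Because each resample only sums precomputed per-sentence statistics, no sentence is retokenised or rescored during the bootstrap.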

The confidence interval function works in the same way as the paired bootstrap test, with the required sufficient statistics likewise precomputed using sacrebleu.

Note

BLEU is a corpus-level metric and does not decompose meaningfully at the sentence level. The sufficient statistics approach used here preserves corpus-level correctness while avoiding redundant tokenisation on each resample.

Tokenisation is handled by sacrebleu using the 13a tokeniser by default, consistent with WMT evaluation practice. Scores produced by this module are directly comparable to sacrebleu corpus-level BLEU scores computed with the same settings.

For sentence-decomposable metrics such as chrF or COMET, use evaluatio.hypothesis.paired_bootstrap_test directly with per-sentence scores.

References

  1. Papineni, K., et al. (2002). BLEU: a method for automatic evaluation of machine translation. ACL.

  2. Koehn, P. (2004). Statistical significance tests for machine translation evaluation. Proceedings of EMNLP 2004, 388-395.

  3. Post, M. (2018). A call for clarity in reporting BLEU scores. Proceedings of the Third Conference on Machine Translation, 186-191.

Functions

bleu_bootstrap_test

bleu_bootstrap_test(references: Iterable[Iterable[str]], hyp1: Iterable[str], hyp2: Iterable[str], iterations: int, effective_order: bool=True) -> float

Perform a paired bootstrap significance test comparing two MT systems using corpus-level BLEU.

Sufficient statistics are precomputed per sentence using sacrebleu, then passed to the Rust resampling backend. On each bootstrap iteration a pseudo-test-set is drawn by sampling sentences with replacement, and corpus-level BLEU is recomputed for both systems from the accumulated sufficient statistics. The p-value is the proportion of iterations in which the worse system appears to outperform the better, using the (count + 1) / (iterations + 1) correction.
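The resampling loop described above can be sketched as follows, assuming per-sentence sufficient statistics have already been extracted and a `corpus_bleu(stats)` helper that scores an accumulated list of them (both hypothetical names here; the real loop runs in the Rust backend):

```python
import random

def paired_bootstrap_p(stats1, stats2, corpus_bleu, iterations=9999, seed=0):
    """Paired bootstrap (Koehn, 2004): resample sentence indices with
    replacement, recompute corpus BLEU for both systems on the SAME
    resample, and count how often the overall-worse system wins."""
    rng = random.Random(seed)
    n = len(stats1)
    # Identify the better system on the full test set.
    if corpus_bleu(stats1) >= corpus_bleu(stats2):
        better, worse = stats1, stats2
    else:
        better, worse = stats2, stats1
    wins_for_worse = 0
    for _ in range(iterations):
        idx = [rng.randrange(n) for _ in range(n)]  # one pseudo-test-set
        if corpus_bleu([worse[i] for i in idx]) > corpus_bleu([better[i] for i in idx]):
            wins_for_worse += 1
    # Add-one correction keeps the p-value strictly positive.
    return (wins_for_worse + 1) / (iterations + 1)
```

Drawing the same indices for both systems is what makes the test paired: each pseudo-test-set compares the two systems on identical sentences.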

Parameters

references (Iterable[Iterable[str]]): reference translations, one iterable of reference strings per source sentence.

hyp1 (Iterable[str]): hypothesis sentences produced by the first system.

hyp2 (Iterable[str]): hypothesis sentences produced by the second system.

iterations (int): number of bootstrap resamples to draw.

effective_order (bool): if True (default), use effective n-gram order for short segments, as in sacrebleu.

Returns

float: the bootstrap p-value for the observed BLEU difference between the two systems.

Raises

Note

The minimum possible p-value is 1 / (iterations + 1). With iterations=9999 this is 0.0001.

Tokenisation uses sacrebleu’s 13a tokeniser by default, consistent with WMT evaluation practice. BLEU scores computed internally are directly comparable to sacrebleu corpus-level scores produced with the same tokeniser and effective_order setting.

Examples

>>> references = [["the cat sat on the mat"], ["the dog ate the bone"]]
>>> hyp1 = ["the cat sat on the mat", "the dog ate the bone"]
>>> hyp2 = ["a cat sat on a mat", "a dog ate a bone"]
>>> p = bleu_bootstrap_test(references, hyp1, hyp2, iterations=9999)
>>> print(f"p = {p:.4f}")
p = 0.0231

bleu_ci

bleu_ci(references: Iterable[Iterable[str]], hypotheses: Iterable[str], iterations: int, alpha: float, effective_order: bool=True) -> ConfidenceInterval

Estimate a confidence interval for corpus-level BLEU using bootstrap resampling.

This function computes a percentile bootstrap confidence interval for the BLEU score by repeatedly resampling sentence-level sufficient statistics with replacement and recomputing corpus-level BLEU for each resample.

The returned interval reflects uncertainty due to sampling variation in the evaluation dataset.
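The percentile bootstrap can be sketched as follows, again assuming precomputed per-sentence statistics and a hypothetical `corpus_bleu` scorer (the real implementation resamples in Rust):

```python
import random

def percentile_ci(stats, corpus_bleu, iterations=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI: resample sentences with replacement,
    recompute corpus BLEU for each resample, and take the alpha/2 and
    1 - alpha/2 empirical quantiles of the resulting scores."""
    rng = random.Random(seed)
    n = len(stats)
    scores = sorted(
        corpus_bleu([stats[rng.randrange(n)] for _ in range(n)])
        for _ in range(iterations)
    )
    lo = scores[int((alpha / 2) * iterations)]
    hi = scores[min(int((1 - alpha / 2) * iterations), iterations - 1)]
    return lo, hi
```

With alpha=0.05 this yields a 95% interval; widening it requires smaller alpha, and more iterations give smoother quantile estimates.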

Parameters

references (Iterable[Iterable[str]]): reference translations, one iterable of reference strings per source sentence.

hypotheses (Iterable[str]): hypothesis sentences to score.

iterations (int): number of bootstrap resamples to draw.

alpha (float): significance level; the returned interval has nominal coverage 1 - alpha.

effective_order (bool): if True (default), use effective n-gram order for short segments, as in sacrebleu.

Returns

ConfidenceInterval: the lower and upper bounds of the bootstrap interval for corpus-level BLEU.


References

  1. Papineni, K., et al. (2002). BLEU: a method for automatic evaluation of machine translation. ACL.