valve. Doc A contains the word valve 50 times — does that automatically make Doc A the most relevant result? Not necessarily. Document length and term density both matter.
Document lengths and occurrences:
| Document | Length (words) | valve occurrences | Density (occurrences / length) |
|---|---|---|---|
| Doc A | 1,100 | 50 | 0.045 |
| Doc B | 20 | 2 | 0.20 |
| Doc C | 18 | 0 | 0.00 |

Keyword stuffing — repeating a keyword excessively to inflate relevance — can cause TF-IDF-based systems to return spammy or low-quality results. Modern ranking functions like BM25 were designed to reduce that risk.

- Diminishing returns for repeated occurrences of the same term (saturation).
- Length normalization so shorter, more focused documents can compete against longer ones.

- It counts how often a query term appears in a document (
term frequency), but applies a saturation function so each additional occurrence contributes less. - It normalizes scores by document length so long documents do not automatically win.
- It multiplies by an inverse document frequency (
IDF) term to give more weight to rare terms across the corpus.


f— term frequency in the document (how many times the query term appears).L— document length (typically number of words).AVGL— average document length across the collection.k1— term frequency saturation parameter (commonly 1.2–2.0). Largerk1reduces saturation (gives more weight to additional occurrences).b— length normalization parameter in [0, 1];b = 0disables length normalization,b = 1enables full normalization (common default ~0.75).IDF— inverse document frequency that boosts rare terms and downweights common terms.

- Use
k1to control how quickly additional term occurrences saturate; increasek1if repeated terms should count more. - Use
bto control how much length affects scores; lowerbif document length should matter less. - Defaults (
k1 ≈ 1.2–1.5,b ≈ 0.75) work well for many text collections, but always validate on your dataset.
- Compute term frequency
ffor each query term in each document. - Normalize
fby document length usingL,AVGL, andb. - Apply saturation using
k1so the contribution fromflevels off. - Multiply by
IDFto emphasize rare, informative terms.
- BM25 (Wikipedia)
- Elasticsearch relevance documentation
- Robertson, S., Walker, S. — “Okapi at TREC” (original BM25 research)