BM25.md (1546B)
1 # BM-25 2 3 **Source:** https://en.wikipedia.org/wiki/Okapi_BM25 4 5 **Definition:** BM25 is a ranking function that attempts to concretize the notion that going from 1 -> 2 occurrences of a word is more important than going from 2000 -> 2001, and that simply repeating the same words over and over does not improve document relevance (keyword stuffing). 6 7 ## Equation 8 9 $score(D,Q) = \Sigma_{i=1}^n IDF(q_i) \cdot\frac{f(q_i,D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot (1 - b + b \cdot \frac{|D|}{avgdl})}$ 10 11 Where the IDF of $q_i$ is $\text{log(}\frac{\text{\# of documents}}{\text{\# of documents conaining containing }q_i})$ and both $k_1$ and $b$ are free variables. 12 13 NOTE: There are different variants of IDF. 14 15 Often, $k_1 \in [1.2,2.0]$ and $b = 0.75$. 16 17 ## Intuition 18 19 The score of a given document D wrt the query Q is the sum of the scores for each query term. 20 21 The second part of the equation for a given term is representative of the notion that we want documents to be shorter (dictated by the $b$ term), and that going from 1 -> 2 instances of a term is more important than going from 200 -> 201, dictated by the relative term frequency appearing in both the numerator and the denominator. 22 23 ## History 24 25 This equation is rooted in empiricism and statistics. BM25 is called best matching 25 because it was the 25th iteration of the best matching formula. As such, there were (probably) 24 prior iterations of the equation, supporting the conclusion that while it does have some statistical significance, it is not grounded entirely in statistics.