Same corpus, the other score is 0 #43

rcy1122 · 2024-06-11T02:09:27Z

from rank_bm25 import BM25Okapi

corpus = [
    "Hello there good man!",
    "It is quite windy in London"
    "How is the weather today?"
]

tokenized_corpus = [doc.split(" ") for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)

query = "windy London"
tokenized_query = query.split(" ")

doc_scores = bm25.get_scores(tokenized_query)

print(doc_scores)

[0. 0.93729472 0. ]

But

from rank_bm25 import BM25Okapi

corpus = [
    "Hello there good man!",
    "It is quite windy in London",
    # "How is the weather today?"
]

tokenized_corpus = [doc.split(" ") for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)

query = "windy London"
tokenized_query = query.split(" ")

doc_scores = bm25.get_scores(tokenized_query)

print(doc_scores)

[0. 0.]

The difference lies in the number of corpus elements. It should be incorrect, but I don't know why？

rcy1122 · 2024-06-11T02:53:06Z

    def _calc_idf(self, nd):
        """
        Calculates frequencies of terms in documents and in corpus.
        This algorithm sets a floor on the idf values to eps * average_idf
        """
        # collect idf sum to calculate an average idf for epsilon value
        idf_sum = 0
        # collect words with negative idf to set them a special epsilon value.
        # idf can be negative if word is contained in more than half of documents
        negative_idfs = []
        for word, freq in nd.items():
            --> idf = math.log(self.corpus_size - freq + 0.5) - math.log(freq + 0.5)
            self.idf[word] = idf
            idf_sum += idf
            if idf < 0:
                negative_idfs.append(word)
        self.average_idf = idf_sum / len(self.idf)

        eps = self.epsilon * self.average_idf
        for word in negative_idfs:
            self.idf[word] = eps

Here
idf = math.log(self.corpus_size - freq + 0.5) - math.log(freq + 0.5)

If self.corpus_size = 2, freq=1, then idf will be equal to 0.

When using get_scores to calculate scores:

    def get_scores(self, query):
        """
        The ATIRE BM25 variant uses an idf function which uses a log(idf) score. To prevent negative idf scores,
        this algorithm also adds a floor to the idf value of epsilon.
        See [Trotman, A., X. Jia, M. Crane, Towards an Efficient and Effective Search Engine] for more info
        :param query:
        :return:
        """
        score = np.zeros(self.corpus_size)
        doc_len = np.array(self.doc_len)
        for q in query:
            q_freq = np.array([(doc.get(q) or 0) for doc in self.doc_freqs])
            print("self.idf.get(q) or 0 = ", self.idf.get(q) or 0)

           --> score += (self.idf.get(q) or 0) * (q_freq * (self.k1 + 1) /
                                               (q_freq + self.k1 * (1 - self.b + self.b * doc_len / self.avgdl)))
        return score

==> When calculating idf (self.idf.get(q) or 0) all will be equal to 0.

Is this correct?

jameswu1991 · 2024-10-31T20:07:29Z

it seems like the trigger condition is when freq * 2 = self.corpus_size. we're missing the +1 term in calculating idf, which would also prevent negative idf values

there seems to be an attempt to fix this missing +1 term here: #40
but the proposed code change mysteriously removes the freq term from calculation

shouldn't

- idf = math.log(self.corpus_size - freq + 0.5) - math.log(freq + 0.5)
+ idf = math.log((self.corpus_size - freq + 0.5) / freq + 0.5) + 1)

be the right expression for calcuating idf?

rcy1122 · 2024-11-04T04:00:26Z

@jameswu1991 Hi, Do you want to edit this issue? #40 has not been merged yet.
I don't understand bm25 very well, but is it possible to calculate it using :
idf = math.log((self.corpus_size + 0.5) / (freq + 0.5)) ?

Thanks.

xhluca · 2025-02-13T21:09:13Z

@rcy1122 Kamphuis et al shows multiple ways to compute BM25 (the idf term is on the left side)

It appears the idf term in rank-bm25 follows the idf of robertson (the original bm25), which can be undefined when the term becomes 0 (for example, if the N - df + 0.5 == 0). This is why other methods came up with smoothing, e.g. by adding a +1 (in the case of lucene, as mentioned by @jameswu1991) or by changing the idf to log((N+1)/df + 0.5), which is what BM25L does, and what #40 is trying to achieve.

As for the original robertson, the edge case undefined doesn't have a standard way to be handled, so it becomes the responsibility of the implementation to handle it. In bm25s, I decided to round up the inner ratio (N - df + 0.5) / (df + 0.5) to 1 when it is between 0 and 1, thus forcing the idf to be non-negative, since a negative idf would not make sense (it would mean the term has a negative relevance simply because it appears in too many documents).

dorianbrown added the bug Something isn't working label Oct 8, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Same corpus, the other score is 0 #43

Same corpus, the other score is 0 #43

rcy1122 commented Jun 11, 2024 •

edited

Loading

rcy1122 commented Jun 11, 2024

jameswu1991 commented Oct 31, 2024 •

edited

Loading

rcy1122 commented Nov 4, 2024

xhluca commented Feb 13, 2025 •

edited

Loading

Same corpus, the other score is 0 #43

Same corpus, the other score is 0 #43

Comments

rcy1122 commented Jun 11, 2024 • edited Loading

rcy1122 commented Jun 11, 2024

jameswu1991 commented Oct 31, 2024 • edited Loading

rcy1122 commented Nov 4, 2024

xhluca commented Feb 13, 2025 • edited Loading

rcy1122 commented Jun 11, 2024 •

edited

Loading

jameswu1991 commented Oct 31, 2024 •

edited

Loading

xhluca commented Feb 13, 2025 •

edited

Loading