Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Same corpus, the other score is 0 #43

Open
rcy1122 opened this issue Jun 11, 2024 · 4 comments
Open

Same corpus, the other score is 0 #43

rcy1122 opened this issue Jun 11, 2024 · 4 comments
Labels
bug Something isn't working

Comments

@rcy1122
Copy link

rcy1122 commented Jun 11, 2024

from rank_bm25 import BM25Okapi

corpus = [
    "Hello there good man!",
    "It is quite windy in London"
    "How is the weather today?"
]

tokenized_corpus = [doc.split(" ") for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)

query = "windy London"
tokenized_query = query.split(" ")

doc_scores = bm25.get_scores(tokenized_query)

print(doc_scores)

[0. 0.93729472 0. ]

But

from rank_bm25 import BM25Okapi

corpus = [
    "Hello there good man!",
    "It is quite windy in London",
    # "How is the weather today?"
]

tokenized_corpus = [doc.split(" ") for doc in corpus]

bm25 = BM25Okapi(tokenized_corpus)

query = "windy London"
tokenized_query = query.split(" ")

doc_scores = bm25.get_scores(tokenized_query)

print(doc_scores)

[0. 0.]

The difference lies in the number of corpus elements. It should be incorrect, but I don't know why?

@rcy1122
Copy link
Author

rcy1122 commented Jun 11, 2024

    def _calc_idf(self, nd):
        """
        Calculates frequencies of terms in documents and in corpus.
        This algorithm sets a floor on the idf values to eps * average_idf
        """
        # collect idf sum to calculate an average idf for epsilon value
        idf_sum = 0
        # collect words with negative idf to set them a special epsilon value.
        # idf can be negative if word is contained in more than half of documents
        negative_idfs = []
        for word, freq in nd.items():
            --> idf = math.log(self.corpus_size - freq + 0.5) - math.log(freq + 0.5)
            self.idf[word] = idf
            idf_sum += idf
            if idf < 0:
                negative_idfs.append(word)
        self.average_idf = idf_sum / len(self.idf)

        eps = self.epsilon * self.average_idf
        for word in negative_idfs:
            self.idf[word] = eps
Here
idf = math.log(self.corpus_size - freq + 0.5) - math.log(freq + 0.5)

If self.corpus_size = 2, freq=1, then idf will be equal to 0.

When using get_scores to calculate scores:

    def get_scores(self, query):
        """
        The ATIRE BM25 variant uses an idf function which uses a log(idf) score. To prevent negative idf scores,
        this algorithm also adds a floor to the idf value of epsilon.
        See [Trotman, A., X. Jia, M. Crane, Towards an Efficient and Effective Search Engine] for more info
        :param query:
        :return:
        """
        score = np.zeros(self.corpus_size)
        doc_len = np.array(self.doc_len)
        for q in query:
            q_freq = np.array([(doc.get(q) or 0) for doc in self.doc_freqs])
            print("self.idf.get(q) or 0 = ", self.idf.get(q) or 0)

           --> score += (self.idf.get(q) or 0) * (q_freq * (self.k1 + 1) /
                                               (q_freq + self.k1 * (1 - self.b + self.b * doc_len / self.avgdl)))
        return score

==> When calculating idf (self.idf.get(q) or 0) all will be equal to 0.

Is this correct?

@dorianbrown dorianbrown added the bug Something isn't working label Oct 8, 2024
@jameswu1991
Copy link

jameswu1991 commented Oct 31, 2024

it seems like the trigger condition is when freq * 2 = self.corpus_size. we're missing the +1 term in calculating idf, which would also prevent negative idf values

there seems to be an attempt to fix this missing +1 term here: #40
but the proposed code change mysteriously removes the freq term from calculation

shouldn't

- idf = math.log(self.corpus_size - freq + 0.5) - math.log(freq + 0.5)
+ idf = math.log((self.corpus_size - freq + 0.5) / freq + 0.5) + 1)

be the right expression for calcuating idf?

@rcy1122
Copy link
Author

rcy1122 commented Nov 4, 2024

@jameswu1991 Hi, Do you want to edit this issue? #40 has not been merged yet.
I don't understand bm25 very well, but is it possible to calculate it using :
idf = math.log((self.corpus_size + 0.5) / (freq + 0.5)) ?

Thanks.

@xhluca
Copy link

xhluca commented Feb 13, 2025

@rcy1122 Kamphuis et al shows multiple ways to compute BM25 (the idf term is on the left side)

Image

It appears the idf term in rank-bm25 follows the idf of robertson (the original bm25), which can be undefined when the term becomes 0 (for example, if the N - df + 0.5 == 0). This is why other methods came up with smoothing, e.g. by adding a +1 (in the case of lucene, as mentioned by @jameswu1991) or by changing the idf to log((N+1)/df + 0.5), which is what BM25L does, and what #40 is trying to achieve.

As for the original robertson, the edge case undefined doesn't have a standard way to be handled, so it becomes the responsibility of the implementation to handle it. In bm25s, I decided to round up the inner ratio (N - df + 0.5) / (df + 0.5) to 1 when it is between 0 and 1, thus forcing the idf to be non-negative, since a negative idf would not make sense (it would mean the term has a negative relevance simply because it appears in too many documents).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

4 participants