Why do we need to check corpus_id != query_id in DenseRetrievalExactSearch.search()? #169

mengyao00 · 2024-04-11T23:39:41Z

Why do we need the line that checks corpus_id != query_id?

For a query with ID id_q, a corpus document that has the same ID id_q is not necessarily a positive document for that query. So why do we need to skip results where corpus_id == query_id?

            for query_itr in range(len(query_embeddings)):
                query_id = query_ids[query_itr]                  
                for sub_corpus_id, score in zip(cos_scores_top_k_idx[query_itr], cos_scores_top_k_values[query_itr]):
                    corpus_id = corpus_ids[corpus_start_idx+sub_corpus_id]
                    if corpus_id != query_id:
                        if len(result_heaps[query_id]) < top_k:
                            # Push item on the heap
                            heapq.heappush(result_heaps[query_id], (score, corpus_id))
                        else:
                            # If item is larger than the smallest in the heap, push it on the heap then pop the smallest element
                            heapq.heappushpop(result_heaps[query_id], (score, corpus_id))

        for qid in result_heaps:
            for score, corpus_id in result_heaps[qid]:
                self.results[qid][corpus_id] = score
        
        return self.results


thakur-nandan · 2024-04-12T18:04:57Z

Hi @mengyao00, thanks for asking the question.

We require this line for two datasets, ArguAna and Quora, where corpus_ids and query_ids overlap, i.e., the query text is also present within the corpus under the same ID.

The line avoids the edge case of self-retrieval, where the query retrieves its own copy at the top-1 position. That copy takes up the first rank while the relevant documents are pushed down by one, which reduces the nDCG@10 score for ArguAna and Quora.
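To make the effect concrete, here is a minimal sketch on made-up data. It is not the BEIR implementation: the helper names `top_k_with_heap` and `ndcg_at_k`, the scores, and the relevance labels are all hypothetical, and the nDCG here is a simplified binary-relevance version rather than the library's evaluation code. It assumes the query's own copy in the corpus is not judged relevant.

```python
import heapq
import math


def top_k_with_heap(scores, query_id, top_k, skip_self=True):
    """Keep the top_k highest-scoring corpus ids using a min-heap (hypothetical helper)."""
    heap = []
    for corpus_id, score in scores.items():
        if skip_self and corpus_id == query_id:
            continue  # ignore the query's own copy in the corpus
        if len(heap) < top_k:
            heapq.heappush(heap, (score, corpus_id))
        else:
            # Push the new item, then pop the smallest one, so the heap stays at top_k items
            heapq.heappushpop(heap, (score, corpus_id))
    # Return ids ordered from highest to lowest score
    return [cid for _, cid in sorted(heap, reverse=True)]


def ndcg_at_k(ranked_ids, relevant_ids, k):
    """Simplified binary-relevance nDCG@k (assumption: not the library's evaluator)."""
    dcg = sum(
        1.0 / math.log2(rank + 2)
        for rank, cid in enumerate(ranked_ids[:k])
        if cid in relevant_ids
    )
    ideal = sum(1.0 / math.log2(rank + 2) for rank in range(min(k, len(relevant_ids))))
    return dcg / ideal if ideal else 0.0


# Made-up similarity scores: the query "q1" also exists in the corpus under the same id
scores = {"q1": 1.00, "d1": 0.80, "d2": 0.60, "d3": 0.40}
relevant = {"d1"}  # assumption: the query's own copy is not a relevant document

with_filter = top_k_with_heap(scores, "q1", top_k=3, skip_self=True)
without_filter = top_k_with_heap(scores, "q1", top_k=3, skip_self=False)

print(with_filter, ndcg_at_k(with_filter, relevant, 3))
print(without_filter, ndcg_at_k(without_filter, relevant, 3))
```

In this toy setup, the unfiltered ranking starts with `q1` itself, so the only relevant document `d1` drops from rank 1 to rank 2 and the score falls accordingly.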

Hope it helps!

Regards,
Nandan Thakur
