This document explains various evaluation metrics used to assess the performance of retrieval systems, especially in the context of answering financial questions based on a 10-K question-answer dataset.
This material is based on data from DataTalksClub LLM Zoomcamp.
In a Retrieval-Augmented Generation (RAG) system, a retriever finds relevant documents from a collection (such as 10-K filings), and a language model generates answers using this information. Evaluating the retriever is crucial because the quality of the retrieved documents directly affects the accuracy of the generated answers. The following metrics are covered:
- Precision at k (P@k)
- Recall
- Mean Average Precision (MAP)
- Normalized Discounted Cumulative Gain (NDCG)
- Mean Reciprocal Rank (MRR)
- F1 Score
- Area Under the ROC Curve (AUC-ROC)
- Mean Rank (MR)
- Hit Rate (HR) or Recall at k
- Expected Reciprocal Rank (ERR)
What it measures: Out of the top k results returned by the retriever, how many are actually relevant?
Formula:
$$ P@k = \frac{\text{number of relevant documents in the top } k}{k} $$
Example:
- Scenario: You're retrieving documents to answer "What are the company's main revenue sources?"
- Top 5 retrieved documents (k=5):
- Section on "Revenue Streams" (relevant)
- "Executive Profiles" section (not relevant)
- "Market Analysis" section (not relevant)
- "Revenue Recognition Policies" (relevant)
- "Financial Statements" (relevant)
- Calculation: 3 relevant documents out of 5.
- P@5 = 3/5 = 0.6
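To make this concrete, below is a minimal Python sketch of P@k applied to the example above; the `precision_at_k` helper and the section titles are purely illustrative, not part of any particular library.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved documents that are in the relevant set."""
    top_k = retrieved[:k]
    return sum(1 for doc in top_k if doc in relevant) / k

# Top-5 results from the example: 3 of the 5 sections are relevant.
retrieved = ["Revenue Streams", "Executive Profiles", "Market Analysis",
             "Revenue Recognition Policies", "Financial Statements"]
relevant = {"Revenue Streams", "Revenue Recognition Policies", "Financial Statements"}
print(precision_at_k(retrieved, relevant, k=5))  # 0.6
```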
What it measures: Out of all the relevant documents available, how many did the retriever find?
Formula:
$$ \text{Recall} = \frac{\text{relevant documents retrieved}}{\text{total relevant documents}} $$
Example:
- Total relevant documents on "Revenue Sources": 10
- Retriever found: 6 relevant documents.
- Calculation: 6 out of 10 relevant documents.
- Recall = 6/10 = 0.6
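A similarly small sketch of recall, assuming retrieved and relevant documents can be compared by id; the `doc_0`–`doc_9` ids below are hypothetical placeholders.

```python
def recall(retrieved, relevant):
    """Fraction of all relevant documents that the retriever returned."""
    found = sum(1 for doc in relevant if doc in retrieved)
    return found / len(relevant)

# The example: 10 relevant documents exist and the retriever surfaced 6 of them.
relevant = {f"doc_{i}" for i in range(10)}               # 10 relevant document ids
retrieved = [f"doc_{i}" for i in range(6)] + ["other"]   # 6 of them retrieved, plus noise
print(recall(retrieved, relevant))  # 0.6
```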
What it measures: The mean of the per-query Average Precision (AP) scores, where a query's AP is the average of the precision values at the ranks where relevant documents appear.
Formula:
$$ \text{MAP} = \frac{1}{Q} \sum_{q=1}^{Q} \text{AP}_q $$
Example:
- Queries and their average precisions:
- Query 1: 0.7
- Query 2: 0.8
- Query 3: 0.6
- Calculation: (0.7 + 0.8 + 0.6) / 3
- MAP = 0.7
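A possible implementation sketch: `average_precision` computes one query's AP from its ranked 0/1 relevance flags, and `mean_average_precision` averages those per-query AP values, exactly as the example does with 0.7, 0.8, and 0.6. The three ranked lists at the bottom are hypothetical and are not meant to reproduce the example's numbers.

```python
def average_precision(relevance_flags):
    """relevance_flags: 1/0 per ranked result for one query (1 = relevant)."""
    hits, precisions = 0, []
    for rank, rel in enumerate(relevance_flags, start=1):
        if rel:
            hits += 1
            precisions.append(hits / rank)  # precision at each relevant rank
    return sum(precisions) / len(precisions) if precisions else 0.0

def mean_average_precision(per_query_flags):
    """Mean of the per-query AP values."""
    return sum(average_precision(flags) for flags in per_query_flags) / len(per_query_flags)

# Hypothetical rankings for three queries.
rankings = [[1, 0, 1], [1, 1, 0], [0, 1, 1]]
print(round(mean_average_precision(rankings), 3))  # ≈ 0.806
```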
What it measures: How well the retriever ranks the documents, giving more importance to higher-ranked positions.
Formula:
$$ \text{NDCG@}k = \frac{\text{DCG@}k}{\text{IDCG@}k}, \qquad \text{DCG@}k = \sum_{i=1}^{k} \frac{\text{rel}_i}{\log_2(i + 1)} $$
Where:
- $$ \text{rel}_i $$ is the relevance score at position $$ i $$.
- $$ \text{IDCG} $$ is the ideal DCG with perfect ranking.
Example:
- Relevance scores for top 3 documents:
- Highly relevant (score 3)
- Not relevant (score 0)
- Moderately relevant (score 2)
- Calculation: DCG = $$ \frac{3}{\log_2 2} + \frac{0}{\log_2 3} + \frac{2}{\log_2 4} = 4 $$; IDCG (ideal order 3, 2, 0) $$ \approx 4.26 $$; NDCG $$ \approx 4 / 4.26 \approx 0.94 $$.
- Interpretation: NDCG rewards rankings that place the most relevant documents at the top; because the "not relevant" document is ranked above the "moderately relevant" one, this ranking scores below the ideal value of 1.
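Here is a short sketch of NDCG under the $$ \log_2(i + 1) $$ discount used in the formula above; the function names are illustrative.

```python
import math

def dcg(relevances):
    """Discounted Cumulative Gain: graded relevance discounted by log2 of the rank."""
    return sum(rel / math.log2(rank + 1) for rank, rel in enumerate(relevances, start=1))

def ndcg(relevances):
    """Normalize DCG by the DCG of the ideal (descending-relevance) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0

# Relevance scores from the example: highly relevant, not relevant, moderately relevant.
print(round(ndcg([3, 0, 2]), 3))  # ≈ 0.939
```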
What it measures: The average of the reciprocal ranks of the first relevant document for each query.
Formula:
$$ \text{MRR} = \frac{1}{Q} \sum_{q=1}^{Q} \frac{1}{\text{rank}_q} $$
Where:
- $$ \text{rank}_q $$ is the position of the first relevant document for query $$ q $$, and $$ Q $$ is the number of queries.
Example:
- First relevant document positions:
- Query 1: Position 2
- Query 2: Position 1
- Query 3: Position 4
- Calculation: (1/2 + 1/1 + 1/4) / 3
- MRR ≈ 0.58
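A minimal sketch of MRR, assuming we already know the 1-based rank of the first relevant document for each query (as in the example).

```python
def mean_reciprocal_rank(first_relevant_ranks):
    """first_relevant_ranks: 1-based rank of the first relevant document per query."""
    return sum(1 / rank for rank in first_relevant_ranks) / len(first_relevant_ranks)

# Positions from the example: 2, 1, and 4.
print(round(mean_reciprocal_rank([2, 1, 4]), 2))  # 0.58
```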
What it measures: The balance between precision and recall.
Formula:
$$ F1 = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} $$
Example:
- Precision: 0.7
- Recall: 0.5
- Calculation: $$ 2 \times \frac{0.7 \times 0.5}{0.7 + 0.5} $$
- F1 Score ≈ 0.58
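The harmonic mean is easy to verify in a couple of lines; `f1_score` here is a local helper, not the scikit-learn function of the same name.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(0.7, 0.5), 2))  # 0.58
```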
What it measures: The ability of the retriever to distinguish between relevant and irrelevant documents.
Interpretation:
- AUC of 0.5: No discrimination (random chance).
- AUC of 1.0: Perfect discrimination.
Example:
- AUC Score: 0.85
- Meaning: There is an 85% chance that a randomly chosen relevant document is ranked higher than a randomly chosen non-relevant one.
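Because AUC can be read as the probability that a relevant document outscores an irrelevant one, a small pairwise sketch captures the idea; the retriever scores below are made up for illustration (in practice a library routine such as scikit-learn's `roc_auc_score` would usually be used).

```python
def pairwise_auc(relevant_scores, irrelevant_scores):
    """Probability that a randomly chosen relevant document outscores an irrelevant one."""
    wins = ties = 0
    for r in relevant_scores:
        for n in irrelevant_scores:
            if r > n:
                wins += 1
            elif r == n:
                ties += 1
    return (wins + 0.5 * ties) / (len(relevant_scores) * len(irrelevant_scores))

# Hypothetical retriever scores for relevant vs. irrelevant documents.
print(round(pairwise_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]), 2))  # ≈ 0.89
```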
What it measures: The average position of the first relevant document across all queries.
Formula:
$$ \text{MR} = \frac{1}{Q} \sum_{q=1}^{Q} \text{rank}_q $$
Example:
- First relevant document positions:
- Query 1: Position 3
- Query 2: Position 2
- Query 3: Position 5
- Calculation: (3 + 2 + 5) / 3
- Mean Rank ≈ 3.33
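Mean Rank is a plain average of the first-relevant positions; a sketch for completeness:

```python
def mean_rank(first_relevant_ranks):
    """Average 1-based rank of the first relevant document across queries."""
    return sum(first_relevant_ranks) / len(first_relevant_ranks)

# Positions from the example: 3, 2, and 5.
print(round(mean_rank([3, 2, 5]), 2))  # ≈ 3.33
```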
What it measures: The proportion of queries where at least one relevant document is retrieved in the top k results.
Formula:
$$ \text{HR@}k = \frac{\text{queries with at least one relevant document in the top } k}{\text{total queries}} $$
Example:
- Total queries: 20
- Queries with relevant document in top 5: 15
- Calculation: 15 / 20
- HR@5 = 0.75
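A small sketch of HR@k, assuming one list of 1/0 relevance flags per query in retrieval order; the four toy queries below have the same 0.75 hit ratio as the example's 15 out of 20.

```python
def hit_rate_at_k(per_query_flags, k):
    """Share of queries with at least one relevant document in the top k."""
    hits = sum(1 for flags in per_query_flags if any(flags[:k]))
    return hits / len(per_query_flags)

# Toy data: 3 of 4 queries have a relevant document in the top 3.
queries = [[0, 1, 0], [1, 0, 0], [0, 0, 0], [0, 0, 1]]
print(hit_rate_at_k(queries, k=3))  # 0.75
```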
What it measures: The expected reciprocal rank at which a user's information need is satisfied, under a cascade model where the user scans results from the top and stops as soon as a document satisfies them.
Formula:
$$ \text{ERR} = \sum_{i=1}^{n} \frac{1}{i} \, r_i \prod_{j=1}^{i-1} (1 - r_j) $$
Where:
- $$ r_i $$ is the relevance probability at position $$ i $$.
Example:
- Relevance probabilities:
- Position 1: 0.2
- Position 2: 0.5
- Position 3: 0.8
- Calculation: $$ \frac{0.2}{1} + (1 - 0.2) \times \frac{0.5}{2} + (1 - 0.2)(1 - 0.5) \times \frac{0.8}{3} \approx 0.51 $$
- Interpretation: Higher ERR means users are likely to find relevant information sooner.
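A sketch of ERR under the cascade model described above, taking the per-position relevance probabilities as given; the running `p_still_looking` term is the chance the user has not yet stopped.

```python
def expected_reciprocal_rank(relevance_probs):
    """relevance_probs: probability that the result at each 1-based position satisfies the user."""
    err, p_still_looking = 0.0, 1.0
    for rank, prob in enumerate(relevance_probs, start=1):
        err += p_still_looking * prob / rank  # user is satisfied (and stops) at this rank
        p_still_looking *= (1 - prob)         # otherwise keeps scanning down the list
    return err

# Probabilities from the example: 0.2, 0.5, 0.8.
print(round(expected_reciprocal_rank([0.2, 0.5, 0.8]), 2))  # ≈ 0.51
```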
Understanding these metrics helps evaluate the retriever's performance in a RAG system. Tracking and optimizing them, especially when answering financial questions from 10-K filings, improves the accuracy and reliability of the overall system.