Commit a812289

RAG example using financebench (#738)
* Upload dataset script
* Uses latest kolena package
* Added upload results and allow uploading dataset without GT
* Compute metrics in upload_results script
* remove relative dependency
* Added citation + updated dependency list
* Updated README.md with more information + updated copyright year
* Pinning pandera version on problematic integration test
* Pinned multimethod on workflow/speaker_diarization example
1 parent b4ba296 commit a812289

9 files changed: 386 additions, 0 deletions

Lines changed: 110 additions & 0 deletions
@@ -0,0 +1,110 @@
# Example Integration: Retrieval Augmented Generation (RAG)

This example integration uses the [Financebench](https://github.com/patronus-ai/financebench) dataset to
demonstrate testing a RAG system on Kolena.

## Setup

This project uses [uv](https://docs.astral.sh/uv/) for packaging and Python dependency management. To get started,
install project dependencies from [`pyproject.toml`](./pyproject.toml) by running:

```shell
uv sync
```

## Usage

The data for this example integration lives in the publicly accessible S3 bucket `s3://kolena-public-examples`.

First, ensure that the `KOLENA_TOKEN` environment variable is populated in your environment. See our
[initialization documentation](https://docs.kolena.com/installing-kolena/#initialization) for details.

This project defines two scripts, which are used in the following steps:

1. [`upload_dataset.py`](retrieval_augmented_generation/upload_dataset.py) creates the Financebench dataset on Kolena.

    To run it without ground truth, use the `s3://kolena-public-examples/financebench/raw/financebench_without_gt.jsonl`
    dataset JSONL file instead of the default:

    ```shell
    uv run python3 retrieval_augmented_generation/upload_dataset.py --dataset-jsonl s3://kolena-public-examples/financebench/raw/financebench_without_gt.jsonl
    ```

2. [`upload_results.py`](retrieval_augmented_generation/upload_results.py) uploads a RAG system's raw inferences
    on the Financebench dataset.

    There are three example RAG systems (`baseline`, `qme`, and `query_decomp`) from which we have collected inferences.
    The inferences are stored in JSONL format and uploaded to the S3 bucket.
    [Here](https://kolena-public-examples.s3.us-west-2.amazonaws.com/financebench/results/raw/gpt-4o-baseline.jsonl) is a
    link to download the `baseline` system's inference JSONL file as an example.
    Each inference for a question is a JSON object in the following format:
    ```json
    {
        "retrieved_contents":[
            {
                "content":"...",
                "doc_name":"3M_2017_10K",
                "page_number":48
            },
            {
                "content":"...",
                "doc_name":"3M_2018_10K",
                "page_number":47
            }
        ],
        "answer":"Answer from RAG goes here",
        "query_time":8.1,
        "financebench_id":"financebench_id_03029"
    }
    ```

    The `upload_results.py` script defines command line arguments to select which model to evaluate. Run it
    with the `--help` flag for more information:

    ```shell
    $ uv run python3 retrieval_augmented_generation/upload_results.py --help
    usage: upload_results.py [-h] [--dataset-name DATASET_NAME] [--evaluate] [{baseline,qme,query_decomp}]

    positional arguments:
      {baseline,qme,query_decomp}
                            Name of the model to test.

    optional arguments:
      -h, --help            show this help message and exit
      --dataset-name DATASET_NAME
                            Optionally specify a custom dataset name to test.
      --evaluate            Computes metrics on the model results. Requires dataset with ground truth.
    ```

3. Label your dataset on [Kolena](https://app.kolena.com/redirect/).

4. Run evaluation by passing the `--evaluate` option to the `upload_results.py` script. It will compute metrics on the
    model results and upload the model results, including the metrics, to Kolena.

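As an illustration of the inference record format described in step 2, the following sketch parses a single JSONL line and checks that the expected fields are present. The helper `parse_inference` and the two key sets are hypothetical and not part of this project; only the field names are taken from the example record.

```python
import json

# Field names follow the example inference record; the helper itself is an
# illustrative assumption, not something this project ships.
REQUIRED_KEYS = {"retrieved_contents", "answer", "query_time", "financebench_id"}
CONTENT_KEYS = {"content", "doc_name", "page_number"}


def parse_inference(line: str) -> dict:
    """Parse one JSONL line and verify it carries the expected fields."""
    record = json.loads(line)
    missing = REQUIRED_KEYS - record.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    for content in record["retrieved_contents"]:
        missing_content = CONTENT_KEYS - content.keys()
        if missing_content:
            raise ValueError(f"retrieved content is missing keys: {sorted(missing_content)}")
    return record


# A one-record sample shaped like the example inference.
sample = json.dumps(
    {
        "retrieved_contents": [
            {"content": "...", "doc_name": "3M_2017_10K", "page_number": 48},
        ],
        "answer": "Answer from RAG goes here",
        "query_time": 8.1,
        "financebench_id": "financebench_id_03029",
    }
)

record = parse_inference(sample)
print(record["financebench_id"], len(record["retrieved_contents"]))
```

A check like this can be run over a custom results file before uploading it, so that a malformed record fails early rather than during upload.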
## Quality Standards Guide

Once the dataset and results have been uploaded to Kolena, visit [Kolena](https://app.kolena.com/redirect/) to
test the RAG systems. See our [QuickStart](https://docs.kolena.com/dataset/quickstart/) guide
for details.

Here are our [Quality Standards](https://docs.kolena.com/dataset/core-concepts/quality-standard/) recommendations for
this workflow:

### Metrics

1. rate(`result.is_page_retrieved`=true): page-level retrieval rate
2. rate(`result.is_doc_retrieved`=true): doc-level retrieval rate
3. `is_correct` using [LLM prompt](https://docs.kolena.com/dataset/advanced-usage/llm-prompt-extraction/)

## Citation

```
@misc{islam2023financebench,
      title={FinanceBench: A New Benchmark for Financial Question Answering},
      author={Pranab Islam and Anand Kannappan and Douwe Kiela and Rebecca Qian and Nino Scherrer and Bertie Vidgen},
      year={2023},
      eprint={2311.11944},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
```
Lines changed: 25 additions & 0 deletions
@@ -0,0 +1,25 @@
[project]
name = "retrieval_augmented_generation"
version = "0.1.0"
description = "Kolena Datasets Example integration for RAG"
authors = [
    { name = "Kolena Engineering", email = "eng@kolena.com" }
]
license = "Apache-2.0"
requires-python = ">=3.9,<3.12"  # the scripts annotate with builtin generics (list[dict[...]]), which need Python 3.9+

dependencies = [
    "kolena>=1.50.0,<2",
    "s3fs>=2024.10.0",
]

[tool.uv]
dev-dependencies = [
    "pre-commit>=2.17,<3",
    "pytest>=7,<8",
    "pytest-depends>=1.0.1,<8",
]

[build-system]
requires = ["hatchling"]
build-backend = "hatchling.build"
Lines changed: 13 additions & 0 deletions
@@ -0,0 +1,13 @@
# Copyright 2021-2025 Kolena Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
Lines changed: 22 additions & 0 deletions
@@ -0,0 +1,22 @@
# Copyright 2021-2025 Kolena Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
S3_BUCKET = "s3://kolena-public-examples"
DATASET = "financebench"
TASK = "retrieval-augmented_generation"
ID_FIELDS = ["financebench_id"]
MODEL_NAME = {
    "baseline": "gpt-4o-baseline",
    "qme": "gpt-4o-qme",
    "query_decomp": "gpt-4o-qme-query-decomp",
}
Lines changed: 46 additions & 0 deletions
@@ -0,0 +1,46 @@
# Copyright 2021-2025 Kolena Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import pandas as pd
from retrieval_augmented_generation.constants import ID_FIELDS


def is_doc_retrieved(retrieved_contents: list, doc_name: str) -> bool:
    return any(doc_name in content.locator for content in retrieved_contents)


def is_page_retrieved(retrieved_contents: list, doc_name: str, relevant_pages: list) -> bool:
    retrieved_pages = [content.page_number for content in retrieved_contents if doc_name in content.locator]

    # NOTE: all relevant pages must be retrieved to be considered correct.
    return set(relevant_pages).issubset(retrieved_pages)


def compute_metrics(df_dataset: pd.DataFrame, df_results: pd.DataFrame) -> pd.DataFrame:
    ground_truth_columns = ["doc_name", "relevant_pages", "financebench_id"]
    assert set(ground_truth_columns).issubset(
        df_dataset.columns,
    ), f"ground truth columns {ground_truth_columns} cannot be found in dataset dataframe."

    df = df_results.merge(df_dataset[ground_truth_columns], on=ID_FIELDS, how="left")

    metrics = []
    for record in df.itertuples():
        metrics.append(
            dict(
                is_doc_retrieved=is_doc_retrieved(record.retrieved_contents, record.doc_name),
                is_page_retrieved=is_page_retrieved(record.retrieved_contents, record.doc_name, record.relevant_pages),
            ),
        )

    return pd.DataFrame(metrics)
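
To make the two retrieval checks concrete, here is a minimal worked example. It restates `is_doc_retrieved` and `is_page_retrieved` so that it runs on its own, and it uses a hypothetical `Retrieved` dataclass as a stand-in for the retrieved `DocumentAsset` objects (only the `locator` and `page_number` attributes are read by the metrics). The sample locators and page numbers are illustrative, not real results.

```python
from dataclasses import dataclass


@dataclass
class Retrieved:
    """Hypothetical stand-in exposing the two attributes the metrics read."""

    locator: str
    page_number: int


def is_doc_retrieved(retrieved_contents: list, doc_name: str) -> bool:
    # True if any retrieved item's locator contains the expected document name.
    return any(doc_name in content.locator for content in retrieved_contents)


def is_page_retrieved(retrieved_contents: list, doc_name: str, relevant_pages: list) -> bool:
    # Only pages retrieved from the expected document count.
    retrieved_pages = [c.page_number for c in retrieved_contents if doc_name in c.locator]
    # All relevant pages must be retrieved to be considered correct.
    return set(relevant_pages).issubset(retrieved_pages)


retrieved = [
    Retrieved("s3://kolena-public-examples/financebench/data/3M_2017_10K.pdf", 48),
    Retrieved("s3://kolena-public-examples/financebench/data/3M_2018_10K.pdf", 47),
]

doc_hit = is_doc_retrieved(retrieved, "3M_2018_10K")  # True: the 2018 filing was retrieved
page_hit = is_page_retrieved(retrieved, "3M_2018_10K", [47])  # True: page 47 was retrieved
partial = is_page_retrieved(retrieved, "3M_2018_10K", [47, 60])  # False: page 60 is missing
wrong_doc = is_page_retrieved(retrieved, "3M_2018_10K", [48])  # False: page 48 came from the 2017 filing

# A retrieval rate is the fraction of records for which the flag is true.
flags = [page_hit, partial, wrong_doc]
page_rate = sum(flags) / len(flags)
print(doc_hit, page_hit, partial, wrong_doc, round(page_rate, 3))
```

Note that the page-level check is strict: retrieving only some of the relevant pages counts as a miss, and a page number matches only when it comes from the expected document.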
Lines changed: 67 additions & 0 deletions
@@ -0,0 +1,67 @@
# Copyright 2021-2025 Kolena Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from argparse import ArgumentParser
from argparse import Namespace
from typing import Any
from typing import Optional

import pandas as pd
from retrieval_augmented_generation.constants import DATASET
from retrieval_augmented_generation.constants import ID_FIELDS
from retrieval_augmented_generation.constants import S3_BUCKET
from retrieval_augmented_generation.utils import to_locator

from kolena.asset import DocumentAsset
from kolena.dataset import upload_dataset


def to_document(evidence: list[dict[str, Any]]) -> Optional[DocumentAsset]:
    if len(evidence) > 0:
        return DocumentAsset(to_locator(str(evidence[0]["doc_name"])))

    return None


def get_pages(evidence: list[dict[str, Any]]) -> list[int]:
    pages = [e["evidence_page_num"] + 1 for e in evidence]  # financebench page number starts from 0 index
    return pages


def run(args: Namespace) -> None:
    df_dataset = pd.read_json(args.dataset_jsonl, lines=True)
    if "evidence" in df_dataset.columns:
        df_dataset["document"] = df_dataset["evidence"].apply(to_document)
        df_dataset["relevant_pages"] = df_dataset["evidence"].apply(get_pages)
    upload_dataset(args.dataset_name, df_dataset, id_fields=ID_FIELDS)


def main() -> None:
    ap = ArgumentParser()
    ap.add_argument(
        "--dataset-jsonl",
        type=str,
        default=f"{S3_BUCKET}/{DATASET}/raw/financebench_open_source.jsonl",
        help="JSONL file specifying dataset. See default JSONL for details",
    )
    ap.add_argument(
        "--dataset-name",
        type=str,
        default=DATASET,
        help="Optionally specify a name of the dataset",
    )
    run(ap.parse_args())


if __name__ == "__main__":
    main()
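
The dataset upload script converts each question's `evidence` list into a document locator and a list of 1-indexed relevant pages. The sketch below reproduces just that conversion without the Kolena or pandas dependencies, so it can be run in isolation. The `evidence` values are made up for illustration; only the key names (`doc_name`, `evidence_page_num`) and the locator layout come from the code in this commit.

```python
# Constants and helpers restated from the project so the example is standalone.
S3_BUCKET = "s3://kolena-public-examples"
DATASET = "financebench"


def to_locator(filename: str) -> str:
    # Documents are stored as PDFs under the dataset's data/ prefix.
    return f"{S3_BUCKET}/{DATASET}/data/{filename}.pdf"


def get_pages(evidence: list) -> list:
    # Financebench page numbers are 0-indexed; shift them to 1-indexed pages.
    return [e["evidence_page_num"] + 1 for e in evidence]


# Illustrative evidence entries (values are assumptions, not real dataset rows).
evidence = [
    {"doc_name": "3M_2018_10K", "evidence_page_num": 46},
    {"doc_name": "3M_2018_10K", "evidence_page_num": 59},
]

pages = get_pages(evidence)  # [47, 60]
locator = to_locator(evidence[0]["doc_name"])
print(pages, locator)
```

Because the upload step shifts ground-truth pages to start at 1, the `page_number` values in a RAG system's results need to follow the same 1-indexed convention for the page-level retrieval metric to match.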
Lines changed: 83 additions & 0 deletions
@@ -0,0 +1,83 @@
# Copyright 2021-2025 Kolena Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from argparse import ArgumentParser
from argparse import Namespace
from typing import Any

import pandas as pd
from retrieval_augmented_generation.constants import DATASET
from retrieval_augmented_generation.constants import MODEL_NAME
from retrieval_augmented_generation.constants import S3_BUCKET
from retrieval_augmented_generation.metrics import compute_metrics
from retrieval_augmented_generation.utils import to_locator

from kolena.asset import DocumentAsset
from kolena.dataset import download_dataset
from kolena.dataset import upload_results


def to_documents(retrieved_contents: list[dict[str, Any]]) -> list:
    if not retrieved_contents:
        return []

    documents = []
    for content in retrieved_contents:
        documents.append(
            DocumentAsset(
                locator=to_locator(content["doc_name"]),
                content=content["content"],  # type: ignore[call-arg]
                page_number=content["page_number"],  # type: ignore[call-arg]
            ),
        )

    return documents


def run(args: Namespace) -> None:
    model_name = MODEL_NAME[args.model]
    df_results = pd.read_json(f"{S3_BUCKET}/{DATASET}/results/raw/{model_name}.jsonl", lines=True)
    df_results["retrieved_contents"] = df_results["retrieved_contents"].apply(to_documents)
    if args.evaluate:
        df_dataset = download_dataset(args.dataset_name)
        df_metrics = compute_metrics(df_dataset, df_results)
        df_results = pd.concat([df_results, df_metrics], axis=1)
    upload_results(args.dataset_name, model_name, df_results)


def main() -> None:
    ap = ArgumentParser()
    ap.add_argument(
        "model",
        type=str,
        default="baseline",
        nargs="?",
        choices=list(MODEL_NAME.keys()),
        help="Name of the model to test.",
    )
    ap.add_argument(
        "--dataset-name",
        type=str,
        default=DATASET,
        help="Optionally specify a custom dataset name to test.",
    )
    ap.add_argument(
        "--evaluate",
        action="store_true",
        help="Computes metrics on the model results. Requires dataset with ground truth.",
    )
    run(ap.parse_args())


if __name__ == "__main__":
    main()
Lines changed: 19 additions & 0 deletions
@@ -0,0 +1,19 @@
# Copyright 2021-2025 Kolena Inc.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#      http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from retrieval_augmented_generation.constants import DATASET
from retrieval_augmented_generation.constants import S3_BUCKET


def to_locator(filename: str) -> str:
    return f"{S3_BUCKET}/{DATASET}/data/{filename}.pdf"
