Merge pull request #90 from Snowflake-Labs/release-12-11-2024
Release 12-11-2024
sfc-gh-bklein authored Dec 11, 2024
2 parents 25c9c1f + 1d345bf commit 689fe2b
Showing 4 changed files with 52 additions and 11 deletions.
10 changes: 10 additions & 0 deletions framework-evalanche/README.md
@@ -26,6 +26,13 @@ Please see TAGGING.md for details on object comments.
# Overview
Evalanche is a Streamlit in Snowflake (SiS) application that provides a single place to evaluate and compare generative AI use case outputs in a streamlined, on-demand, and automated fashion. Whether your goal is to measure the quality of RAG-based LLM solutions or the accuracy of SQL generation, Evalanche provides a scalable, customizable, and trackable way to do it.

> **Note:** Snowflake provides a few tools/frameworks for conducting LLM evaluations.
> This solution, Evalanche, serves as a generalized application that makes it easy to create and automate LLM use case evaluations; most LLM use cases can be evaluated with Evalanche through out-of-the-box or custom metrics.
> Alternatively, Snowflake's AI Observability (Private Preview) is powered by the open source [TruLens](https://www.trulens.org/), which provides extensible evaluations and tracing for LLM apps, including RAGs and LLM agents.
> Lastly, [Cortex Search Evaluation and Tuning Studio](https://github.com/Snowflake-Labs/cortex-search/tree/main/examples/streamlit-evaluation) (Private Preview) offers systematic evaluation and search quality improvements for specific search-based use cases.
> Please contact your account representative to learn more about any of these other offerings.

# How it Works
Evalanche's primary structure is based on two components: 1) Metrics and 2) Data Sources. A Metric and a Data Source are combined to make an Evaluation, as sketched below.
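As a rough mental model of that structure (illustrative only: `SQLResultsAccuracy` appears in `src/metrics.py` later in this diff, while the import path, table name, and pairing shown here are assumptions, not Evalanche's actual API):

```python
# Conceptual sketch, not Evalanche's real wiring (which happens in the Streamlit UI).
from src.metrics import SQLResultsAccuracy  # a Metric: defines how each row is judged

metric = SQLResultsAccuracy()
data_source = "MY_DB.EVAL.SQL_GENERATION_OUTPUTS"  # a Data Source: rows to evaluate (assumed name)

# An Evaluation pairs the two: every row from the Data Source is scored by the Metric,
# and the scores are stored so runs can be tracked and compared over time.
```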

@@ -138,5 +145,8 @@ $$;
### Using the Cortex Analyst Runner
To run a gold or reference set of questions through Cortex Analyst, select the target semantic model and the table containing the reference questions. The SQL results will be written to a table for further evaluation with the Cortex Analyst-suggested metric.
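A hedged sketch of preparing such a reference-question table is shown below; the table and column names are assumptions for illustration, not a schema required by Evalanche:

```python
# Runs inside Streamlit in Snowflake, where an active Snowpark session is available.
# Table/column names are illustrative assumptions only.
from snowflake.snowpark.context import get_active_session

session = get_active_session()
session.sql("""
    CREATE TABLE IF NOT EXISTS EVAL_DB.INPUTS.ANALYST_GOLD_QUESTIONS (
        QUESTION STRING,
        GOLD_SQL STRING
    )
""").collect()
session.sql("""
    INSERT INTO EVAL_DB.INPUTS.ANALYST_GOLD_QUESTIONS
    VALUES ('What was total revenue in 2023?',
            'SELECT SUM(REVENUE) FROM SALES WHERE YEAR = 2023')
""").collect()
```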

## Model Options and Selection
Out-of-the-box Metrics have default LLMs. These defaults are selected to balance availability, performance, and cost. However, depending on your region, the default LLM may not be available. If that is the case, please select an alternative LLM. See LLM availability [here](https://docs.snowflake.com/en/user-guide/snowflake-cortex/llm-functions?utm_cta=website-homepage-live-demo#availability).
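For example, a minimal sketch assuming the metric is constructed directly in code rather than through the UI: a metric's default model can be overridden via the `model` parameter shown in `src/metrics.py` below (the alternative model name is only an illustration and should be one available in your region):

```python
from src.metrics import SQLResultsAccuracy

# Uses the metric's default LLM.
default_metric = SQLResultsAccuracy()

# Override the default with a model available in your region
# (the model name here is only an example).
regional_metric = SQLResultsAccuracy(model="llama3.1-70b")
```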

# Feedback
Please add issues to GitHub or email Jason Summer (jason.summer@snowflake.com).
29 changes: 19 additions & 10 deletions framework-evalanche/pages/results.py
@@ -448,18 +448,27 @@ def review_record() -> None:
            on_change=set_score,
            args=(selected_record,)
        )
    if selected_metric_name is not None:
        matching_metric = next(
            (
                metric
                for metric in st.session_state["metrics"]
                if metric.get_column() == selected_metric_name.upper()
            ),
            None,
        )
    with model_col:
        select_model('review', default = "llama3.2-3b")
        # Use the default model of the metric class if available
        if matching_metric is not None:
            if matching_metric.model is not None:
                model_default = matching_metric.model
            else:
                model_default = "llama3.2-3b"
        else:
            model_default = "llama3.2-3b"
        select_model('review', default = model_default)

    if selected_metric_name is not None:
        matching_metric = next(
            (
                metric
                for metric in st.session_state["metrics"]
                if metric.get_column() == selected_metric_name.upper()
            ),
            None,
        )
        if matching_metric is not None:
            # Re-add session attribute to metric object
            matching_metric.session = st.session_state["session"]

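The intent of the change above, shown in isolation (a simplified, runnable sketch with a stand-in class and sample data, not the real `Metric` implementation): prefer the selected metric's own default model, and fall back to `llama3.2-3b` only when no metric matches or the metric defines no model.

```python
# Simplified illustration of the fallback logic in the hunk above.
# MetricStub stands in for Evalanche's Metric class; it is not the real implementation.
from dataclasses import dataclass
from typing import Optional

@dataclass
class MetricStub:
    column: str
    model: Optional[str] = None

metrics = [MetricStub("SQL_RESULTS_ACCURACY", "mistral-large2"), MetricStub("CORRECTNESS")]

def default_model_for(selected_metric_name: str) -> str:
    # Find the metric whose column matches the selected name, if any.
    matching = next(
        (m for m in metrics if m.column == selected_metric_name.upper()),
        None,
    )
    # Fall back to a small general-purpose model when nothing better is known.
    return matching.model if matching and matching.model else "llama3.2-3b"

print(default_model_for("sql_results_accuracy"))  # mistral-large2
print(default_model_for("correctness"))           # llama3.2-3b
```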
2 changes: 1 addition & 1 deletion framework-evalanche/src/metrics.py
@@ -60,7 +60,7 @@ def get_column(self):
class SQLResultsAccuracy(Metric):
    def __init__(
        self,
        model: str = "reka-flash"
        model: str = "mistral-large2"
    ):
        super().__init__(
            name="SQL Results Accuracy",
22 changes: 22 additions & 0 deletions framework-evalanche/src/prompts.py
@@ -18,6 +18,28 @@
[The End of the Ground Truth Data]
"""

## Benchmark prompt used in internal Cortex Analyst benchmarking - under team review
# SQLAccuracy_prompt = """\
# [INST] Your task is to determine whether the two given JSON datasets are
# equivalent semantically in the context of a question. You should attempt to
# answer the given question by using the data in each JSON dataset. If the two
# answers are equivalent, those two JSON datasets are considered equivalent.
# Otherwise, they are not equivalent.
# If they are equivalent, output "ANSWER: true". If they are
# not equivalent, output "ANSWER: false".

# ### QUESTION: {question}

# * JSON DATASET 1:
# {inference_data}

# * DATAFRAME 2:
# {expected_data}

# Are the two dataframes equivalent?
# OUTPUT:
# [/INST] """
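If the commented-out benchmark prompt above were re-enabled, it would presumably be filled via `str.format` using the placeholders it defines. The snippet below is an assumption-laden sketch: the sample question and data are invented, and calling `snowflake.cortex.Complete` is just one way to invoke a Cortex LLM, not necessarily how Evalanche dispatches the judge prompt.

```python
# Illustrative only: fills the {question}/{inference_data}/{expected_data} placeholders
# from the prompt above and sends it to a Cortex model via snowflake-ml-python.
from snowflake.cortex import Complete

prompt = SQLAccuracy_prompt.format(
    question="What was total revenue in 2023?",
    inference_data='[{"TOTAL_REVENUE": 1200000}]',
    expected_data='[{"TOTAL_REVENUE": 1200000}]',
)
response = Complete("mistral-large2", prompt)  # expect something like "ANSWER: true"
```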

Correctness_prompt = """Please act as an impartial judge and evaluate the quality of the response provided by the AI Assistant to the user question displayed below.
Your evaluation should consider CORRECTNESS. You will be given a reference answer and the AI Assistant's answer.
Your job is to rate the assistant's answer from 1 to 5, where 5 indicates you strongly agree that the response is
