Merge pull request #90 from Snowflake-Labs/release-12-11-2024
Release 12-11-2024
sfc-gh-bklein authored Dec 11, 2024
2 parents 25c9c1f + 1d345bf commit 689fe2b
Showing 4 changed files with 52 additions and 11 deletions.
10 changes: 10 additions & 0 deletions framework-evalanche/README.md
@@ -26,6 +26,13 @@ Please see TAGGING.md for details on object comments.
# Overview
Evalanche is a Streamlit in Snowflake (SiS) application that provides a single place to evaluate and compare generative AI use case outputs in a streamlined, on-demand, and automated fashion. Whether your goal is to measure the quality of RAG-based LLM solutions or the accuracy of SQL generation, Evalanche provides a scalable, customizable, and trackable way to do it.

> **Note:** Snowflake provides a few tools/frameworks for conducting LLM evaluations.
> This solution, Evalanche, serves as a generalized application that makes it easy to create and automate LLM use case evaluations; most LLM use cases can be evaluated with Evalanche through out-of-the-box or custom metrics.
> Alternatively, Snowflake's AI Observability (Private Preview) is powered by the open source [TruLens](https://www.trulens.org/), which provides extensible evaluations and tracing for LLM apps, including RAGs and LLM agents.
> Lastly, [Cortex Search Evaluation and Tuning Studio](https://github.com/Snowflake-Labs/cortex-search/tree/main/examples/streamlit-evaluation) (Private Preview) offers systematic evaluation and search quality improvements for specific search-based use cases.
> Please contact your account representative to learn more about any of these other offerings.

# How it Works
Evalanche's primary structure is based on two components: 1) Metrics and 2) Data Sources. A Metric and a Data Source are combined to make an Evaluation, as sketched below.
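As a rough mental model of that structure (illustrative only: `SQLResultsAccuracy` appears in `src/metrics.py` later in this diff, while the import path, table name, and pairing shown here are assumptions, not Evalanche's actual API):

```python
# Conceptual sketch, not Evalanche's real wiring (which happens in the Streamlit UI).
from src.metrics import SQLResultsAccuracy  # a Metric: defines how each row is judged

metric = SQLResultsAccuracy()
data_source = "MY_DB.EVAL.SQL_GENERATION_OUTPUTS"  # a Data Source: rows to evaluate (assumed name)

# An Evaluation pairs the two: every row from the Data Source is scored by the Metric,
# and the scores are stored so runs can be tracked and compared over time.
```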

@@ -138,5 +145,8 @@ $$;
### Using the Cortex Analyst Runner
To run a gold or reference set of questions through Cortex Analyst, select the target semantic model and the table containing the reference questions. The SQL results will be written to a table for further evaluation with the Cortex Analyst-suggested metric.
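A hedged sketch of preparing such a reference-question table is shown below; the table and column names are assumptions for illustration, not a schema required by Evalanche:

```python
# Runs inside Streamlit in Snowflake, where an active Snowpark session is available.
# Table/column names are illustrative assumptions only.
from snowflake.snowpark.context import get_active_session

session = get_active_session()
session.sql("""
    CREATE TABLE IF NOT EXISTS EVAL_DB.INPUTS.ANALYST_GOLD_QUESTIONS (
        QUESTION STRING,
        GOLD_SQL STRING
    )
""").collect()
session.sql("""
    INSERT INTO EVAL_DB.INPUTS.ANALYST_GOLD_QUESTIONS
    VALUES ('What was total revenue in 2023?',
            'SELECT SUM(REVENUE) FROM SALES WHERE YEAR = 2023')
""").collect()
```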

## Model Options and Selection
Out-of-the-box Metrics have default LLMs. These defaults are selected to balance availability, performance, and cost. However, depending on your region, the default LLM may not be available. If that is the case, please select an alternative LLM. See LLM availability [here](https://docs.snowflake.com/en/user-guide/snowflake-cortex/llm-functions?utm_cta=website-homepage-live-demo#availability).
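For example, a minimal sketch assuming the metric is constructed directly in code rather than through the UI: a metric's default model can be overridden via the `model` parameter shown in `src/metrics.py` below (the alternative model name is only an illustration and should be one available in your region):

```python
from src.metrics import SQLResultsAccuracy

# Uses the metric's default LLM.
default_metric = SQLResultsAccuracy()

# Override the default with a model available in your region
# (the model name here is only an example).
regional_metric = SQLResultsAccuracy(model="llama3.1-70b")
```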

# Feedback
Please add issues to GitHub or email Jason Summer (jason.summer@snowflake.com).
29 changes: 19 additions & 10 deletions framework-evalanche/pages/results.py
@@ -448,18 +448,27 @@ def review_record() -> None:
            on_change=set_score,
            args=(selected_record,)
        )
    if selected_metric_name is not None:
        matching_metric = next(
            (
                metric
                for metric in st.session_state["metrics"]
                if metric.get_column() == selected_metric_name.upper()
            ),
            None,
        )
    with model_col:
        select_model('review', default = "llama3.2-3b")
        # Use the default model of the metric class if available
        if matching_metric is not None:
            if matching_metric.model is not None:
                model_default = matching_metric.model
            else:
                model_default = "llama3.2-3b"
        else:
            model_default = "llama3.2-3b"
        select_model('review', default = model_default)

    if selected_metric_name is not None:
        matching_metric = next(
            (
                metric
                for metric in st.session_state["metrics"]
                if metric.get_column() == selected_metric_name.upper()
            ),
            None,
        )
        if matching_metric is not None:
            # Re-add session attribute to metric object
            matching_metric.session = st.session_state["session"]

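The intent of the change above, shown in isolation (a simplified, runnable sketch with a stand-in class and sample data, not the real `Metric` implementation): prefer the selected metric's own default model, and fall back to `llama3.2-3b` only when no metric matches or the metric defines no model.

```python
# Simplified illustration of the fallback logic in the hunk above.
# MetricStub stands in for Evalanche's Metric class; it is not the real implementation.
from dataclasses import dataclass
from typing import Optional

@dataclass
class MetricStub:
    column: str
    model: Optional[str] = None

metrics = [MetricStub("SQL_RESULTS_ACCURACY", "mistral-large2"), MetricStub("CORRECTNESS")]

def default_model_for(selected_metric_name: str) -> str:
    # Find the metric whose column matches the selected name, if any.
    matching = next(
        (m for m in metrics if m.column == selected_metric_name.upper()),
        None,
    )
    # Fall back to a small general-purpose model when nothing better is known.
    return matching.model if matching and matching.model else "llama3.2-3b"

print(default_model_for("sql_results_accuracy"))  # mistral-large2
print(default_model_for("correctness"))           # llama3.2-3b
```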
2 changes: 1 addition & 1 deletion framework-evalanche/src/metrics.py
@@ -60,7 +60,7 @@ def get_column(self):
class SQLResultsAccuracy(Metric):
    def __init__(
        self,
        model: str = "reka-flash"
        model: str = "mistral-large2"
    ):
        super().__init__(
            name="SQL Results Accuracy",
22 changes: 22 additions & 0 deletions framework-evalanche/src/prompts.py
@@ -18,6 +18,28 @@
[The End of the Ground Truth Data]
"""

## Benchmark prompt used in internal Cortex Analyst benchmarking - under team review
# SQLAccuracy_prompt = """\
# [INST] Your task is to determine whether the two given JSON datasets are
# equivalent semantically in the context of a question. You should attempt to
# answer the given question by using the data in each JSON dataset. If the two
# answers are equivalent, those two JSON datasets are considered equivalent.
# Otherwise, they are not equivalent.
# If they are equivalent, output "ANSWER: true". If they are
# not equivalent, output "ANSWER: false".

# ### QUESTION: {question}

# * JSON DATASET 1:
# {inference_data}

# * DATAFRAME 2:
# {expected_data}

# Are the two dataframes equivalent?
# OUTPUT:
# [/INST] """
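If the commented-out benchmark prompt above were re-enabled, it would presumably be filled via `str.format` using the placeholders it defines. The snippet below is an assumption-laden sketch: the sample question and data are invented, and calling `snowflake.cortex.Complete` is just one way to invoke a Cortex LLM, not necessarily how Evalanche dispatches the judge prompt.

```python
# Illustrative only: fills the {question}/{inference_data}/{expected_data} placeholders
# from the prompt above and sends it to a Cortex model via snowflake-ml-python.
from snowflake.cortex import Complete

prompt = SQLAccuracy_prompt.format(
    question="What was total revenue in 2023?",
    inference_data='[{"TOTAL_REVENUE": 1200000}]',
    expected_data='[{"TOTAL_REVENUE": 1200000}]',
)
response = Complete("mistral-large2", prompt)  # expect something like "ANSWER: true"
```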

Correctness_prompt = """Please act as an impartial judge and evaluate the quality of the response provided by the AI Assistant to the user question displayed below.
Your evaluation should consider CORRECTNESS. You will be given a reference answer and the AI Assistant's answer.
Your job is to rate the assistant's answer from 1 to 5, where 5 indicates you strongly agree that the response is
