
Commit

Merge branch 'main' into isaac/threadhowtoupdate
isahers1 authored Feb 5, 2025
2 parents 200a75c + 1d1082f commit 8c7715c
Showing 55 changed files with 1,022 additions and 300 deletions.
33 changes: 33 additions & 0 deletions docs/evaluation/concepts/index.mdx
@@ -397,3 +397,36 @@ If ground truth reference labels are provided, then it's common to simply define
| Precision | Standard definition | Yes | No | No |
| Recall | Standard definition | Yes | No | No |


## Experiment configuration

LangSmith supports a number of experiment configurations that make it easier to run your evals the way you want.

### Repetitions

By passing the `num_repetitions` argument to `evaluate` / `aevaluate`, you can specify how many times to repeat the experiment on your data.
Repeating the experiment involves rerunning both the target function and the evaluators. Running an experiment multiple times is helpful
because LLM outputs are not deterministic and can differ from one repetition to the next. Running multiple repetitions gives you
a more accurate estimate of the performance of your system.
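For example, a minimal sketch with the Python SDK (the dataset name, target function, and evaluator below are placeholders):

```python
from langsmith import evaluate

def target(inputs: dict) -> dict:
    # Placeholder target: call your application here instead of echoing the input.
    return {"answer": inputs["question"]}

def correctness(outputs: dict, reference_outputs: dict) -> bool:
    # Placeholder evaluator: exact match against the reference answer.
    return outputs["answer"] == reference_outputs.get("answer")

# Run the target and the evaluators 3 times over every example in the dataset.
evaluate(
    target,
    data="my-dataset",  # placeholder dataset name
    evaluators=[correctness],
    num_repetitions=3,
)
```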

### Concurrency

By passing the `max_concurrency` argument to `evaluate` / `aevaluate`, you can specify the concurrency of your experiment. The
`max_concurrency` argument has slightly different semantics depending on whether you are using `evaluate` or `aevaluate`.

#### `evaluate`

The `max_concurrency` argument to `evaluate` specifies the maximum number of concurrent threads to use when running the experiment.
This cap applies both to running your target function and to running your evaluators.
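For example, a sketch reusing the placeholder dataset, target, and evaluator from the repetitions sketch above:

```python
from langsmith import evaluate

# Use up to 8 worker threads; the cap applies to target calls and evaluator calls alike.
evaluate(
    target,
    data="my-dataset",  # placeholder dataset name
    evaluators=[correctness],
    max_concurrency=8,
)
```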

#### `aevaluate`

The `max_concurrency` argument to `aevaluate` is fairly similar to `evaluate`, but it instead uses a semaphore to limit the number of
concurrent tasks that can run at once. `aevaluate` works by creating a task for each example in the dataset. Each task consists of running the target function
as well as all of the evaluators on that specific example. The `max_concurrency` argument specifies the maximum number of concurrent tasks (in other words, examples)
to run at once.
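A minimal sketch of the async variant, again with placeholder names (and the placeholder `correctness` evaluator from the repetitions sketch):

```python
import asyncio

from langsmith import aevaluate

async def target(inputs: dict) -> dict:
    # Placeholder target: call your application here instead of echoing the input.
    return {"answer": inputs["question"]}

async def main() -> None:
    # At most 4 tasks run at once; each task handles one example
    # (its target call plus all of its evaluators).
    await aevaluate(
        target,
        data="my-dataset",  # placeholder dataset name
        evaluators=[correctness],
        max_concurrency=4,
    )

asyncio.run(main())
```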

### Caching

Lastly, you can cache the API calls made in your experiment by setting `LANGSMITH_CACHE_PATH` to a folder on your device that you have write access to.
The API calls made in your experiment are then cached to disk, so future experiments that make the same API calls are significantly faster.
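For example, one way to enable this from Python (a sketch: it assumes `tests/cassettes` exists and is writable, reuses the placeholder names from the sketches above, and assumes any caching extras described in the how-to guides are installed):

```python
import os

from langsmith import evaluate

# Point LangSmith at a writable cache directory before running the experiment.
# Assumption: tests/cassettes exists and the process can write to it.
os.environ["LANGSMITH_CACHE_PATH"] = "tests/cassettes"

# Re-running an experiment that makes the same API calls now reads them from
# the on-disk cache instead of hitting the APIs again.
evaluate(target, data="my-dataset", evaluators=[correctness])
```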
67 changes: 67 additions & 0 deletions docs/evaluation/how_to_guides/analyze_single_experiment.mdx
@@ -0,0 +1,67 @@
---
sidebar_position: 1
---

# Analyze a single experiment
After running an experiment, you can use LangSmith's experiment view to analyze the results and draw insights about how your experiment performed.

This guide walks you through viewing the results of an experiment and highlights the features available in the experiments view.

## Open the experiment view
To open the experiment view, select the relevant Dataset from the Dataset & Experiments page and then select the experiment you want to view.

![Open experiment view](./static/select_experiment.png)

## View experiment results
This table displays your experiment results, including the input, output, and reference output for each [example](/evaluation/concepts#examples) in the dataset. It also shows each configured feedback key in a separate column, alongside the corresponding feedback score.

Out-of-the-box metrics (latency, status, cost, and token count) are also displayed in individual columns.

In the columns dropdown, you can choose which columns to hide and which to show.

![Experiment view](./static/experiment_view.png)

## Heatmap view
The experiment view defaults to a heatmap view, where feedback scores for each run are highlighted in a color.
Red indicates a lower score, while green indicates a higher score.
The heatmap visualization makes it easy to identify patterns, spot outliers, and understand score distributions across your dataset at a glance.

![Heatmap view](./static/heatmap.png)

## Sort and filter
To sort or filter feedback scores, you can use the actions in the column headers.

![Sort and filter](./static/sort_filter.png)

## Table views
Depending on the view most useful for your analysis, you can change the formatting of the table by toggling between a compact view, a full view, and a diff view.
- The `Compact` view shows each run as a one-line row, for ease of comparing scores at a glance.
- The `Full` view shows the full output for each run for digging into the details of individual runs.
- The `Diff` view shows the text difference between the reference output and the output for each run.

![Diff view](./static/diff_mode.png)

## View the traces
Hover over any of the output cells, and click on the trace icon to view the trace for that run. This will open up a trace in the side panel.

To view the entire tracing project, click on the "View Project" button in the top right of the header.

![View trace](./static/view_trace.png)

## View evaluator runs
For evaluator scores, you can view the source run by hovering over the evaluator score cell and clicking on the arrow icon. This will open up a trace in the side panel. If you're running an LLM-as-a-judge evaluator, you can view the prompt used for the evaluator in this run.
If your experiment has [repetitions](/evaluation/concepts#repetitions), you can click on the aggregate average score to find links to all of the individual runs.

![View evaluator runs](./static/evaluator_run.png)

## Repetitions
If you've run your experiment with [repetitions](/evaluation/concepts#repetitions), there will be arrows in the output results column so you can page through the outputs in the table. To view each run from a repetition, hover over the output cell and click the expand icon.

When you run an experiment with repetitions, LangSmith displays the average for each feedback score in the table. Click on the feedback score to view the feedback scores from individual runs, or to view the standard deviation across repetitions.

![Repetitions](./static/repetitions.png)

## Compare to another experiment
In the top right of the experiment view, you can select another experiment to compare to. This will open up a comparison view, where you can see how the two experiments compare.
To learn more about the comparison view, see [how to compare experiment results](./compare_experiment_results).

![Compare](./static/compare_to_another.png)
51 changes: 21 additions & 30 deletions docs/evaluation/how_to_guides/compare_experiment_results.mdx
@@ -8,21 +8,23 @@ Oftentimes, when you are iterating on your LLM application (such as changing the

LangSmith supports a powerful comparison view that lets you hone in on key differences, regressions, and improvements between different experiments.

![](./static/regression_test.gif)
![](./static/compare.gif)

## Open the comparison view

To open the comparison view, select two or more experiments from the "Experiments" tab from a given dataset page. Then, click on the "Compare" button at the bottom of the page.
To open the experiment comparison view, navigate to the **Dataset & Experiments** page, select the relevant Dataset, select two or more experiments on the Experiments tab, and click **Compare**.

![](./static/open_comparison_view.png)
![](./static/compare_select.png)

## Toggle different views
## Adjust the table display

You can toggle between different views by clicking on the "Display" dropdown at the top right of the page. You can toggle different views to be displayed.
You can toggle between different views by clicking "Full" or "Compact" at the top of the page.

Toggling Full Text will show the full text of the input, output, and reference output for each run. If the reference output is too long to display in the table, you can click expand to view the full content.

![](./static/toggle_views.png)
You can also select and hide individual feedback keys or individual metrics in the display settings dropdown to isolate the information you want to see.

![](./static/toggle_views.gif)

## View regressions and improvements

@@ -37,50 +37,39 @@ Click on the regressions or improvements buttons on the top of each column to fi

![Regressions Filter](./static/filter_to_regressions.png)

## Update baseline experiment

In order to track regressions, you need a baseline experiment against which to compare. This will be automatically assigned as the first experiment in your comparison, but you can
change it from the dropdown at the top of the page.
## Update baseline experiment and metric

![Baseline](./static/select_baseline.png)
In order to track regressions, you need to:
1. Select a baseline experiment against which to compare. By default, the newest experiment is selected as the baseline.
2. Select the feedback key (evaluation metric) you want to focus on. One will be assigned by default, but you can adjust it as needed.
3. Configure whether a higher score is better for the selected feedback key. This preference will be stored.

## Select feedback key

You will also want to select the feedback key (evaluation metric) on which you would like focus on. This can be selected via another dropdown at the top. Again, one will be assigned by
default, but you can adjust as needed.

![Feedback](./static/select_feedback.png)
![Baseline](./static/select_baseline.png)

## Open a trace

If tracing is enabled for the evaluation run, you can click on the trace icon in the hover state of any experiment cell to open the trace view for that run. This will open up a trace in the side panel.
If the example you're evaluating is from an ingested [run](/observability/concepts#runs), you can hover over the output cell and click on the trace icon to open the trace for that run in the side panel.

![](./static/open_trace_comparison.png)
![](./static/open_source_trace.png)

## Expand detailed view

From any cell, you can click on the expand icon in the hover state to open up a detailed view of all experiment results on that particular example input, along with feedback keys and scores.

![](./static/expanded_view.png)

## Update display settings
## View summary charts

You can adjust the display settings for comparison view by clicking on "Display" in the top right corner.
You can also view summary charts by clicking on the "Charts" tab at the top of the page.

Here, you'll be able to toggle feedback, metrics, summary charts, and expand full text.

![](./static/update_display.png)
![](./static/charts_tab.png)

## Use experiment metadata as chart labels

With the summary charts enabled, you can configure the x-axis labels based on [experiment metadata](./filter_experiments_ui#background-add-metadata-to-your-experiments). First, click the three dots in the top right of the charts (note that you will only see them if your experiments have metadata attached).

![](./static/three_dots_charts.png)

Next, select a metadata key - note that this key must contain string values in order to render in the charts.

![](./static/select_metadata_key.png)
You can configure the x-axis labels for the charts based on [experiment metadata](./filter_experiments_ui#background-add-metadata-to-your-experiments).

You will now see your metadata in the x-axis of the charts:
Select a metadata key to change the x-axis labels of the charts.

![](./static/metadata_in_charts.png)
4 changes: 3 additions & 1 deletion docs/evaluation/how_to_guides/index.md
@@ -56,7 +56,8 @@ Run evals using your favorite testing tools:

Evaluate and monitor your system's live performance on production data.

- [Set up an online evaluator](../../observability/how_to_guides/monitoring/online_evaluations)
- [Set up an LLM-as-judge online evaluator](../../observability/how_to_guides/monitoring/online_evaluations#configure-llm-as-judge-evaluators)
- [Set up a custom code online evaluator](../../observability/how_to_guides/monitoring/online_evaluations#configure-custom-code-evaluators)
- [Create a few-shot evaluator](./how_to_guides/create_few_shot_evaluators)

## Automatic evaluation
@@ -70,6 +71,7 @@ Set up evaluators that automatically run for all experiments against a dataset.

Use the UI & API to understand your experiment results.

- [Analyze a single experiment](./how_to_guides/analyze_single_experiment)
- [Compare experiments with the comparison view](./how_to_guides/compare_experiment_results)
- [Filter experiments](./how_to_guides/filter_experiments_ui)
- [View pairwise experiments](./how_to_guides/evaluate_pairwise#view-pairwise-experiments)
8 changes: 5 additions & 3 deletions docs/evaluation/how_to_guides/pytest.mdx
@@ -25,7 +25,7 @@ The JS/TS SDK has an analogous [Vitest/Jest integration](./vitest_jest).

## Installation

This functionality requires Python SDK version `langsmith>=0.3.1`.
This functionality requires Python SDK version `langsmith>=0.3.4`.

For extra features like [rich terminal outputs](./pytest#rich-outputs) and [test caching](./pytest#caching), install:
```bash
@@ -296,12 +296,14 @@ LANGSMITH_TEST_CACHE=tests/cassettes ptw tests/my_llm_tests

## Rich outputs

If you'd like to see a rich display of the LangSmith results of your test run you can specify `--output='ls'`:
If you'd like to see a rich display of the LangSmith results of your test run you can specify `--langsmith-output`:

```bash
pytest --output='ls' tests
pytest --langsmith-output tests
```

**Note:** This flag used to be `--output=langsmith` in `langsmith<=0.3.3` but was updated to avoid collisions with other pytest plugins.

You'll get a nice table per test suite that updates live as the results are uploaded to LangSmith:

![Rich pytest outputs](./static/rich-pytest-outputs.png)
Binary file added docs/evaluation/how_to_guides/static/compare.gif
Binary file modified docs/evaluation/how_to_guides/static/expanded_view.png
Binary file added docs/evaluation/how_to_guides/static/heatmap.png
Binary file modified docs/evaluation/how_to_guides/static/metadata_in_charts.png
Binary file modified docs/evaluation/how_to_guides/static/regression_view.png
Binary file modified docs/evaluation/how_to_guides/static/select_baseline.png
12 changes: 6 additions & 6 deletions docs/evaluation/index.mdx
@@ -12,7 +12,7 @@ import {
} from "@site/src/components/InstructionsWithCode";
import { RegionalUrl } from "@site/src/components/RegionalUrls";

# Evaluation quick start
# Evaluation Quick Start

This quick start will get you up and running with our evaluation SDK and Experiments UI.

@@ -52,7 +52,7 @@ export OPENAI_API_KEY="<your-openai-api-key>"`),
groupId="client-language"
/>

## 3. Import dependencies
## 4. Import dependencies

<CodeTabs
tabs={[
@@ -85,7 +85,7 @@ const openai = new OpenAI();`,
groupId="client-language"
/>

## 4. Create a dataset
## 5. Create a dataset

<CodeTabs
tabs={[
@@ -164,7 +164,7 @@ await client.createExamples({
groupId="client-language"
/>

## 5. Define what you're evaluating
## 6. Define what you're evaluating

<CodeTabs
tabs={[
Expand Down Expand Up @@ -204,7 +204,7 @@ async function target(inputs: string): Promise<{ response: string }> {
groupId="client-language"
/>

## 6. Define evaluator
## 7. Define evaluator

<CodeTabs
tabs={[
@@ -289,7 +289,7 @@ async function accuracy({
groupId="client-language"
/>

## 7. Run and view results
## 8. Run and view results

<CodeTabs tabs={[

18 changes: 9 additions & 9 deletions docs/evaluation/tutorials/testing.mdx
@@ -630,7 +630,7 @@ class Grade(TypedDict):
score: Annotated[
bool,
...,
"Return True if the answer is fully frounded in the source documents, otherwise False.",
"Return True if the answer is fully grounded in the source documents, otherwise False.",
]
judge_llm = init_chat_model("gpt-4o").with_structured_output(Grade)
@@ -644,7 +644,7 @@ def test_grounded_in_source_info() -> None:
result = agent.invoke({"messages": [{"role": "user", "content": query}]})
# Grab all the search calls made by the LLM
search_results = "\n\n".join(
search_results = "\\n\\n".join(
msg.content
for msg in result["messages"]
if msg.type == "tool" and msg.name == search_tool.name
@@ -666,8 +666,8 @@ def test_grounded_in_source_info() -> None:
"Return False if the ANSWER is not grounded in the DOCUMENTS."
)
answer_and_docs = (
f"ANSWER: {result['structured_response'].get('text_answer', '')}\n"
f"DOCUMENTS:\n{search_results}"
f"ANSWER: {result['structured_response'].get('text_answer', '')}\\n"
f"DOCUMENTS:\\n{search_results}"
)
# Run the judge LLM
@@ -836,7 +836,7 @@ module.exports = {
value: "python",
label: "Pytest",
language: "bash",
content: `pytest --output=ls tests`,
content: `pytest --langsmith-output tests`,
},
{
value: "vitest",
@@ -1100,7 +1100,7 @@ class Grade(TypedDict):
score: Annotated[
bool,
...,
"Return True if the answer is fully frounded in the source documents, otherwise False.",
"Return True if the answer is fully grounded in the source documents, otherwise False.",
]
judge_llm = init_chat_model("gpt-4o").with_structured_output(Grade)
@@ -1114,7 +1114,7 @@ def test_grounded_in_source_info() -> None:
result = agent.invoke({"messages": [{"role": "user", "content": query}]})
# Grab all the search calls made by the LLM
search_results = "\n\n".join(
search_results = "\\n\\n".join(
msg.content
for msg in result["messages"]
if msg.type == "tool" and msg.name == search_tool.name
@@ -1136,8 +1136,8 @@ def test_grounded_in_source_info() -> None:
"Return False if the ANSWER is not grounded in the DOCUMENTS."
)
answer_and_docs = (
f"ANSWER: {result['structured_response'].get('text_answer', '')}\n"
f"DOCUMENTS:\n{search_results}"
f"ANSWER: {result['structured_response'].get('text_answer', '')}\\n"
f"DOCUMENTS:\\n{search_results}"
)
# Run the judge LLM