diff --git a/docs/howtos/integrations/langfuse.ipynb b/docs/howtos/integrations/langfuse.ipynb
index e230f1650..746bf6326 100644
--- a/docs/howtos/integrations/langfuse.ipynb
+++ b/docs/howtos/integrations/langfuse.ipynb
@@ -27,8 +27,8 @@
  "import os\n",
  "# TODO REMOVE ENVIRONMENT VARIABLES!!!\n",
  "# get keys for your project from https://cloud.langfuse.com\n",
- "os.environ[\"LANGFUSE_PUBLIC_KEY\"] = \"pk-lf-83fe3fe9-b4c5-4269-b387-86257297cc3a\"\n",
- "os.environ[\"LANGFUSE_SECRET_KEY\"] = \"sk-lf-9ef3e2b8-c5ec-4d51-b530-277e5bb98b26\"\n",
+ "os.environ[\"LANGFUSE_PUBLIC_KEY\"] = \"\"\n",
+ "os.environ[\"LANGFUSE_SECRET_KEY\"] = \"\"\n",
  " \n",
  "# your openai key\n",
  "#os.environ[\"OPENAI_API_KEY\"] = \"\""
@@ -99,7 +99,7 @@
  "For going to measure the following aspects of an RAG system. These metrics and from the Ragas library.\n",
  "\n",
  "1. [faithfulness](https://docs.ragas.io/en/latest/concepts/metrics/faithfulness.html): This measures the factual consistency of the generated answer against the given context.\n",
- "2. [answer_relevency](https://docs.ragas.io/en/latest/concepts/metrics/answer_relevance.html): Answer Relevancy, focuses on assessing how to-the-point and relevant the generated answer is to the given prompt.\n",
+ "2. [answer_relevancy](https://docs.ragas.io/en/latest/concepts/metrics/answer_relevance.html): Answer Relevancy, focuses on assessing how to-the-point and relevant the generated answer is to the given prompt.\n",
  "3. [context precision](https://docs.ragas.io/en/latest/concepts/metrics/context_precision.html): Context Precision is a metric that evaluates whether all of the ground-truth relevant items present in the contexts are ranked higher or not. Ideally, all the relevant chunks must appear at the top ranks. This metric is computed using the question and the contexts, with values ranging between 0 and 1, where higher scores indicate better precision.\n",
  "4. [aspect_critique](https://docs.ragas.io/en/latest/concepts/metrics/critique.html): This is designed to assess submissions based on predefined aspects such as harmlessness and correctness. Additionally, users have the flexibility to define their own aspects for evaluating submissions according to their specific criteria.\n",
  "\n",
@@ -144,7 +144,7 @@
  "source": [
  "## The Setup\n",
  "You can use model-based evaluation with Ragas in 2 ways\n",
- "1. Score each Trace: This means you will run the evalalutions for each trace item. This gives you much better idea since of how each call to your RAG pipelines is performing but can be expensive\n",
+ "1. Score each Trace: This means you will run the evaluations for each trace item. This gives you much better idea since of how each call to your RAG pipelines is performing but can be expensive\n",
  "2. Score as Batch: In this method we will take a random sample of traces on a periodic basis and score them. This brings down cost and gives you a rough estimate the performance of your app but can miss out on important samples.\n",
  "\n",
  "In this cookbook, we'll show you how to setup both."
@@ -331,7 +331,7 @@
  "id": "4fd68b13-9743-424f-830a-c6d32e3d09c6",
  "metadata": {},
  "source": [
- "Note that the scoring is blocking so make sure that you sent the generated answer before waiting for the scores to get computed. Alternatively you can run `score_with_ragas()` in a seperate thread and pass in the trace_id to log the scores.\n",
+ "Note that the scoring is blocking so make sure that you sent the generated answer before waiting for the scores to get computed. Alternatively you can run `score_with_ragas()` in a separate thread and pass in the trace_id to log the scores.\n",
  "\n",
  "Or you can consider"
  ]
@@ -343,7 +343,7 @@
  "source": [
  "## Scoring as batch\n",
  "\n",
- "Scoring each production trace can be time-consuming and costly depending on your application architecture and traffic. In that case, it's better to start off with a batch scoring method. Decide a timespan you want to run the batch process and the number of traces you want to _sample_ from that time slice. Create a dataset and call `ragas.evaluate` to analyse the result.\n",
+ "Scoring each production trace can be time-consuming and costly depending on your application architecture and traffic. In that case, it's better to start off with a batch scoring method. Decide a timespan you want to run the batch process and the number of traces you want to _sample_ from that time slice. Create a dataset and call `ragas.evaluate` to analyze the result.\n",
  "\n",
  "You can run this periodically to keep track of how the scores are changing across timeslices and figure out if there are any discrepancies. \n",
  "\n",
@@ -372,7 +372,7 @@
  " output={'answer': answer}\n",
  " ))\n",
  "\n",
- "# await that Langfuse SDK has processed all events before rying to retrieve it in the next step\n",
+ "# await that Langfuse SDK has processed all events before trying to retrieve it in the next step\n",
  "langfuse.flush()"
  ]
  },
@@ -441,7 +441,7 @@
  "id": "d37337dd-fe8a-4a4e-b2fc-f7435cf0eb51",
  "metadata": {},
  "source": [
- "Now lets make a batch and score it. Ragas uses huggingface dataset object to build the dataset and run the evaluation. If you run this on your own production data, use the right keys to extract the question, contrexts and answer from the trace"
+ "Now lets make a batch and score it. Ragas uses huggingface dataset object to build the dataset and run the evaluation. If you run this on your own production data, use the right keys to extract the question, contexts and answer from the trace"
  ]
  },
  {
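
Note on the last hunk: the batch step it describes (sample some traces, then pull out the question, contexts and answer under the right keys) could look roughly like the sketch below. This is an illustration under stated assumptions, not the notebook's actual code: the trace dictionaries, their `input`/`output` layout, and the helper name `build_ragas_batch` are hypothetical, and the key lookups would need adapting to your own trace schema.

```python
import random


def build_ragas_batch(traces, sample_size, seed=None):
    """Hypothetical helper: sample traces and return column-oriented data."""
    rng = random.Random(seed)
    # Never ask for more traces than are available in the time slice.
    sampled = rng.sample(traces, min(sample_size, len(traces)))

    batch = {"question": [], "contexts": [], "answer": []}
    for trace in sampled:
        # Assumed trace layout (not a Langfuse API guarantee):
        # {"input": {"question": ..., "contexts": [...]}, "output": {"answer": ...}}
        batch["question"].append(trace["input"]["question"])
        batch["contexts"].append(trace["input"]["contexts"])
        batch["answer"].append(trace["output"]["answer"])
    return batch


# Made-up traces standing in for production data.
example_traces = [
    {
        "input": {"question": f"q{i}", "contexts": [f"c{i}"]},
        "output": {"answer": f"a{i}"},
    }
    for i in range(10)
]
batch = build_ragas_batch(example_traces, sample_size=3, seed=0)
```

A dictionary shaped like `batch` can be turned into a Hugging Face dataset with `datasets.Dataset.from_dict(batch)` before it is handed to `ragas.evaluate`. Keeping the sampling in a small pure function makes it easy to rerun per time slice with a different `sample_size`.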