This repository is an exploration of the HumanLoop library for evaluating AI models, focusing on running evaluations and comparing prompts against models hosted on Microsoft Azure AI Foundry instances. The repository includes two main files: `eval.py` and `legalbench_abercrombie.ipynb`.
- Introduction
- Prerequisites
- Setting Up the Environment
- Running the Sample Code
- Comparing Prompts using gpt-4o-mini and Llama-3.3-70B
- Additional Information
This repository demonstrates the use of HumanLoop for managing AI model evaluations, extracting specific information, and comparing model performance on different tasks. The evaluations are designed to provide insight into how well the models understand and generate content based on specific templates and datasets. Please note that this repository uses only the HumanLoop Free Tier, so custom evaluators could not be run on the platform; however, an example that slightly modifies the demo evaluator has been added to the "evaluators" folder.
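For orientation, a code evaluator conceptually boils down to a small function that scores a single log. The sketch below illustrates that idea only; the signature HumanLoop expects for uploaded code evaluators and the field names on `log` are assumptions here, so adapt it to the demo evaluator in the "evaluators" folder.

```python
def exact_match(log: dict) -> bool:
    """Score one log: True when the model output equals the expected answer.

    Assumption: the log exposes the model response under "output" and the
    reference answer under "target" -- adjust the keys to whatever the
    evaluator actually receives.
    """
    output = (log.get("output") or "").strip().lower()
    target = (log.get("target") or "").strip().lower()
    return output == target


# Quick local check of the scoring logic.
if __name__ == "__main__":
    print(exact_match({"output": "Ada", "target": "ada"}))  # True
```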
Before you begin, make sure you have the following:
- Python 3.11 or later
- An Azure account with access to Azure AI Foundry
- API keys for HumanLoop and Azure AI, stored in a `.env` file
- Basic knowledge of Python and Jupyter notebooks
- Clone the Repository: `git clone <repository-url>` and then `cd <repository-folder>`
- Install Required Packages: `pip install -r requirements.txt`
- Configure Environment Variables: Create a `.env` file in the root directory of your project with the following variables (a loading sketch follows this list):
  - `HL_API_KEY=<your-humanloop-api-key>`
  - `AZURE_API_KEY=<your-azure-api-key>`
  - `LLAMA_ENDPOINT=<llama-endpoint>`
  - `AZURE_API_KEY_GPT=<your-azure-gpt-key>`
  - `AZURE_ENDPOINT_GPT=<gpt-endpoint>`
  - `GPT_4O_MINI_MODEL=<gpt-4o-mini-model>`
  - `AZURE_API_KEY_LLAMA=<your-azure-llama-key>`
  - `AZURE_ENDPOINT_LLAMA=<llama-endpoint>`
  - `LLAMA_3_3_70B_MODEL=<llama-3.3-70b-model>`
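As a reference, here is a minimal sketch of loading these variables in Python with `python-dotenv`. The variable names match the list above, but the actual loading code in `eval.py` may be structured differently.

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

# Read the .env file in the project root into the process environment.
load_dotenv()

REQUIRED_VARS = [
    "HL_API_KEY",
    "AZURE_API_KEY_GPT",
    "AZURE_ENDPOINT_GPT",
    "GPT_4O_MINI_MODEL",
]

# Fail fast with a clear message if any required variable is missing.
missing = [name for name in REQUIRED_VARS if not os.getenv(name)]
if missing:
    raise RuntimeError(f"Missing environment variables: {', '.join(missing)}")
```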
The `eval.py` script demonstrates how to run a basic HumanLoop prompt evaluation for extracting first names from full names.
- Load Environment Variables: Ensure that your environment variables are correctly loaded.
- Initialize HumanLoop and Azure Clients: The script initializes these clients using the provided API keys (a minimal initialization sketch follows this list).
- Execute the Evaluation: Run the script to evaluate the model's ability to extract first names. The evaluation utilizes a set of predefined evaluators like Exact Match and Levenshtein Distance: `python eval.py`
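For reference, the sketch below shows one plausible way the two clients could be initialized, assuming the `humanloop` and `openai` Python packages. The exact setup in `eval.py` may differ, and the `api_version` value and the example request are assumptions.

```python
import os

from dotenv import load_dotenv
from humanloop import Humanloop   # pip install humanloop
from openai import AzureOpenAI    # pip install openai

load_dotenv()

# HumanLoop client: manages prompts, datasets, and evaluation runs.
hl = Humanloop(api_key=os.getenv("HL_API_KEY"))

# Azure OpenAI client pointed at the gpt-4o-mini deployment in AI Foundry.
gpt_client = AzureOpenAI(
    api_key=os.getenv("AZURE_API_KEY_GPT"),
    azure_endpoint=os.getenv("AZURE_ENDPOINT_GPT"),
    api_version="2024-06-01",  # assumption: use a version your deployment supports
)

# Example call against the deployment named in GPT_4O_MINI_MODEL.
response = gpt_client.chat.completions.create(
    model=os.getenv("GPT_4O_MINI_MODEL"),
    messages=[{"role": "user", "content": "Extract the first name from: Ada Lovelace"}],
)
print(response.choices[0].message.content)
```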
- Prompt Development: Modify and develop new prompt templates as needed to test different model capabilities.
- Dataset Expansion: Add more data points to the dataset to test the prompt with varied inputs and improve evaluation breadth.
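As a starting point for expansion, additional datapoints for the first-name task might look like the sketch below. The field names (`inputs`, `target`) and the input key are assumptions; match them to the schema used when the dataset was created in HumanLoop.

```python
# Hypothetical additional datapoints for the first-name extraction task.
# Adjust the keys to mirror the existing dataset's schema.
extra_datapoints = [
    {"inputs": {"full_name": "Ada Lovelace"}, "target": "Ada"},
    {"inputs": {"full_name": "Jean-Luc Picard"}, "target": "Jean-Luc"},
    {"inputs": {"full_name": "Dr. Grace Hopper"}, "target": "Grace"},
]
```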
The `legalbench_abercrombie.ipynb` notebook provides a framework for comparing two different language models on a separate task: classification on the Abercrombie distinctiveness scale.
- Choose a Model: Configure the notebook to use either gpt-4o-mini or Llama-3.3-70B by modifying the `selected_model` variable.
- Prepare the Template: The notebook uses a specialized template asking the model to classify text according to the Abercrombie legal distinctiveness scale (a sketch of such a template appears after this list).
- Run the Evaluation: Execute the notebook to run evaluations on the dataset from HumanLoop: `jupyter notebook legalbench_abercrombie.ipynb`
- Analyze Results: The notebook presents the evaluation results so you can compare the performance of the two models.
- Results Output: The results comparing the two models on the same prompt have been exported and stored in the "results" folder for this experiment.
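For illustration, the model switch and classification template could look like the sketch below. The variable values, template wording, and `{text}` placeholder are hypothetical; the notebook's actual `selected_model` options and prompt text may differ. The five categories are the standard Abercrombie spectrum.

```python
# Hypothetical model switch; the notebook's accepted values may differ.
selected_model = "gpt-4o-mini"  # or "Llama-3.3-70B"

# Hypothetical prompt template for the Abercrombie distinctiveness task.
ABERCROMBIE_TEMPLATE = """You are assisting with a trademark-law classification task.
Classify the following mark on the Abercrombie distinctiveness spectrum.
Answer with exactly one of: generic, descriptive, suggestive, arbitrary, fanciful.

Mark: {text}
Classification:"""

# Fill the placeholder with one example mark before sending it to the model.
prompt = ABERCROMBIE_TEMPLATE.format(text="Apple (for computers)")
print(prompt)
```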
- HumanLoop Documentation: Refer to HumanLoop's official documentation for in-depth understanding and API usage.
- Azure AI Documentation: Access Azure AI documentation for more details on AI Foundry Models.
By following these instructions, you will be able to run and understand evaluations of AI models using HumanLoop and Azure AI, and build on them for further exploration and development.