Pacific AI Releases LangTest 2.7.0: AMEGA Clinical Guideline Benchmark, MedFuzz Robustness Testing, Randomized QA Options, Expanded Clinical Summarization Support, MentalChat16K Integration, and Enhanced Security #1224
chakravarthik27 announced in Announcements
📢 Highlights
We’re thrilled to announce the latest LangTest release, which brings advanced benchmarks, new robustness tests, and an improved developer experience to your model evaluation workflows.
🩺 Autonomous Medical Evaluation for Guideline Adherence (AMEGA):
We’ve integrated AMEGA, a comprehensive benchmark for assessing LLM adherence to clinical guidelines. It covers 20 diagnostic scenarios across 13 specialties and includes 135 questions and 1,337 weighted scoring elements, providing one of the most rigorous frameworks for evaluating medical knowledge in real-world clinical settings.
🧪 MedFuzz Robustness Testing:
To better reflect real-world clinical complexity, we're introducing MedFuzz, a healthcare-specific robustness approach that probes LLMs beyond standard benchmarks.
🎲 Randomized Options in QA Tasks:
LangTest now supports a randomized option ordering test type in the robustness category, which helps mitigate positional bias in multiple-choice evaluations.
📝 ACI-Bench: Ambient Clinical Intelligence Benchmark:
LangTest now supports evaluation with ACI-Bench, a novel benchmark for automatic visit note generation in clinical contexts.
💬 MTS-Dialog: Clinical Summary Evaluation:
We’ve added support for the MTS-Dialog dataset to evaluate models on dialogue-to-summary generation, including sectioned summaries (headers + contents) for more structured evaluation.
🧠 MentalChat16K Clinical Evaluation Support:
LangTest now supports the MentalChat16K dataset, enabling evaluation of LLMs in mental health–focused conversational contexts.
🔒 Security Enhancements:
Critical vulnerabilities and security issues have been addressed, reinforcing LangTest's overall stability and safety.
🔥 Key Enhancements
🩺 Autonomous Medical Evaluation for Guideline Adherence (AMEGA)
We’ve integrated AMEGA, a rigorous benchmark for assessing LLM adherence to clinical guidelines. This benchmark spans 20 diagnostic scenarios across 13 specialties, comprising 135 questions and 1,337 weighted scoring elements.
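To make "weighted scoring elements" concrete, here is a minimal sketch of how a response could be scored against weighted criteria. It is an illustration only, not LangTest's or AMEGA's actual implementation: the `Criterion` class, the `weighted_score` function, and the example criteria and weights are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Criterion:
    """One scoring element (hypothetical structure, for illustration)."""
    description: str
    weight: float
    met: bool  # whether an evaluator judged the response to satisfy it


def weighted_score(criteria: List[Criterion]) -> float:
    """Return the fraction of total weight earned by the met criteria."""
    total = sum(c.weight for c in criteria)
    if total == 0:
        return 0.0
    earned = sum(c.weight for c in criteria if c.met)
    return earned / total


# Made-up criteria and weights; real AMEGA elements are defined by the benchmark.
criteria = [
    Criterion("Orders an ECG", 3.0, True),
    Criterion("Asks about medication history", 1.0, False),
    Criterion("Recommends follow-up", 1.0, True),
]
score = weighted_score(criteria)
print(f"Weighted adherence score: {score:.2f}")
```

Weighting lets a missed high-priority item cost more than a missed minor one, which a plain count of satisfied criteria cannot express.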
Key Features:
How it works:
Harness setup:
Execution:
🧪 MedFuzz Robustness Testing
Introducing MedFuzz, a healthcare-specific fuzz testing approach built to stress-test LLMs against clinical complexity beyond conventional benchmarks.
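The general idea of fuzzing a clinical question can be sketched as a loop that adds clinically irrelevant details and checks whether the model's answer changes. This is a simplified sketch under stated assumptions, not LangTest's MedFuzz implementation: `fuzz_question`, `stub_model`, and the fixed list of distractors are hypothetical, and a fixed list stands in here for whatever component generates the perturbations in practice.

```python
from typing import Callable, Iterable, Optional


def fuzz_question(
    question: str,
    correct_answer: str,
    target_model: Callable[[str], str],
    distractors: Iterable[str],
) -> Optional[str]:
    """Append distractors one at a time.

    Returns the first modified prompt for which the model's answer no longer
    matches the correct answer, or None if the model stays correct throughout.
    """
    modified = question
    for detail in distractors:
        modified = f"{modified} {detail}"
        if target_model(modified) != correct_answer:
            return modified
    return None


def stub_model(prompt: str) -> str:
    """Toy stand-in for an LLM: it is thrown off by any mention of family history."""
    return "B" if "family history" in prompt.lower() else "A"


question = (
    "A 54-year-old presents with crushing chest pain. "
    "Best next step? A. ECG  B. Reassurance"
)
distractors = [
    "The patient recently returned from a holiday.",
    "There is no family history of heart disease.",
]

failing_prompt = fuzz_question(question, "A", stub_model, distractors)
print(failing_prompt)
```

The added details should never change the clinically correct answer, so any flip signals a robustness failure rather than a legitimate change in the case.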
Key Features:
How it works:
Harness setup:
Prompt:
Harness Config:
Execution:
Example:

🎲 Randomized Options in QA Tasks
LangTest now supports randomized option ordering in multiple-choice evaluations to mitigate positional bias.
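As a rough illustration of the technique, the sketch below shuffles the options of a multiple-choice question and tracks where the correct answer ends up. It is not LangTest's implementation: the function names `shuffle_options` and `format_question` and the sample question are assumptions made for this example.

```python
import random
import string
from typing import List, Optional, Tuple


def shuffle_options(
    options: List[str], answer_index: int, seed: Optional[int] = None
) -> Tuple[List[str], int]:
    """Shuffle the options and return them with the new index of the correct answer."""
    rng = random.Random(seed)
    order = list(range(len(options)))
    rng.shuffle(order)
    shuffled = [options[i] for i in order]
    # The correct answer moved to wherever its original index landed in `order`.
    new_answer_index = order.index(answer_index)
    return shuffled, new_answer_index


def format_question(question: str, options: List[str]) -> str:
    """Render the question with lettered options (A, B, C, ...)."""
    labels = string.ascii_uppercase
    lines = [question] + [f"{labels[i]}. {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines)


options = ["Aspirin", "Metformin", "Lisinopril", "Warfarin"]
shuffled, new_idx = shuffle_options(options, answer_index=1, seed=7)

print(format_question("Which drug is a first-line therapy for type 2 diabetes?", shuffled))
print("Correct label:", string.ascii_uppercase[new_idx])
```

A model that truly knows the answer should pick the same option text regardless of its letter, so comparing accuracy before and after shuffling exposes positional bias.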
Key Features:
How it works:
Prompt:
Harness setup:
Config:
Execution:
Results:

📝 ACI-Bench: Ambient Clinical Intelligence Benchmark
We’ve added support for ACI-Bench, a new benchmark focused on automatic visit note generation in clinical contexts.
Key Features:
How it Works:
Harness Config:
Execution:
💬 MTS-Dialog: Clinical Summary Evaluation
LangTest now supports the MTS-Dialog dataset for dialogue-to-summary generation, including structured (sectioned) summaries.
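To show what evaluating a sectioned summary (headers + contents) might involve, the sketch below splits a summary into sections and checks which expected headers are missing. It is an assumption-laden illustration, not LangTest's code: the `parse_sections` helper, the "UPPERCASE HEADER:" line format, and the sample texts are all hypothetical.

```python
from typing import Dict, List


def parse_sections(summary: str) -> Dict[str, str]:
    """Split a summary into {header: content}.

    Assumes (hypothetically) that each section starts with an uppercase
    header line ending in a colon, followed by one or more content lines.
    """
    sections: Dict[str, str] = {}
    current = None
    for raw_line in summary.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        if line.endswith(":") and line[:-1].isupper():
            current = line[:-1]
            sections[current] = ""
        elif current is not None:
            sections[current] = (sections[current] + " " + line).strip()
    return sections


def missing_headers(reference: Dict[str, str], generated: Dict[str, str]) -> List[str]:
    """List reference headers that the generated summary does not contain."""
    return [header for header in reference if header not in generated]


reference_text = """CHIEF COMPLAINT:
Cough for three days.
ASSESSMENT:
Likely viral upper respiratory infection.
PLAN:
Rest and fluids; return if symptoms worsen.
"""

generated_text = """CHIEF COMPLAINT:
Cough for three days.
ASSESSMENT:
Likely viral upper respiratory infection.
"""

reference = parse_sections(reference_text)
generated = parse_sections(generated_text)
print("Missing sections:", missing_headers(reference, generated))
```

Splitting by header first means each section's content can then be scored on its own, rather than comparing two long summaries as undifferentiated text.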
Key Features:
How it Works:
Harness Config:
Execution:
🧠 MentalChat16K Clinical Evaluation Support
We’ve added support for the MentalChat16K dataset, a specialized benchmark for evaluating LLMs in mental health–related dialogues. This dataset focuses on empathy, coherence, and safety, making it particularly valuable for sensitive clinical evaluation tasks.
Key Features:
How it works:
Harness Setup:
What's Changed
Full Changelog: 2.6.0...2.7.0
This discussion was created from the release Pacific AI Releases LangTest 2.7.0: AMEGA Clinical Guideline Benchmark, MedFuzz Robustness Testing, Randomized QA Options, Expanded Clinical Summarization Support, MentalChat16K Integration, and Enhanced Security.