68 changes: 68 additions & 0 deletions docs/content/changelog.mdx
@@ -8,6 +8,74 @@ This is the aggregated changelog for the entire Rhesis repository. For detailed
- [Frontend Changelog](https://github.com/rhesis-ai/rhesis/blob/main/apps/frontend/CHANGELOG.md)
- [Polyphemus Changelog](https://github.com/rhesis-ai/rhesis/blob/main/apps/polyphemus/CHANGELOG.md)

## [0.6.10] - 2026-03-23

### Platform Release

This release includes the following component versions:
- **Backend 0.6.9**
- **Frontend 0.6.10**
- **SDK 0.6.10**
- **Polyphemus 0.2.8**

### Summary

This release introduces **adaptive testing overwrite controls and suggestion/evaluation workflows**, **typed SDK
statistics collection methods**, and **NIST-aligned password hardening** across authentication flows.

### Featured Capabilities

**Adaptive Testing Iteration Loop**

Adaptive testing now supports a tighter edit-run-evaluate workflow:

- **Delete Adaptive Test Sets**: Adaptive testing sets can now be deleted directly from the adaptive testing flow
- **Overwrite Controls**: Output generation and evaluation support `overwrite` behavior so teams can explicitly
re-run existing tests instead of only processing missing outputs/results
- **Suggestion Pipeline**: Added suggestion generation, suggestion output generation, and suggestion evaluation
endpoints to iterate on test quality before persisting new tests
- **Bulk Test Deletion**: Adaptive testing UI supports bulk deletion for faster curation

**SDK Statistics API Enhancements**

Stats access in the SDK now has clearer, typed entry points:

- **Collection Methods**: `TestRuns.stats()` and `TestResults.stats()` expose typed response models
- **Per-Run Shortcut**: `TestRun.stats()` delegates to run-scoped stats for convenience
- **DataFrame Conversion**: Stats responses support `to_dataframe()` for optional pandas workflows
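To illustrate what `to_dataframe()`-style conversion enables, here is a hedged sketch using pandas directly; the payload fields (`total`, `by_status`) are assumptions for illustration, not the SDK's actual response schema:

```python
import pandas as pd

# Hypothetical stats payload; field names are illustrative, not the SDK schema.
stats_payload = {
    "total": 42,
    "by_status": {"passed": 30, "failed": 8, "skipped": 4},
}

# A DataFrame view turns per-status counts into rows for filtering/plotting.
df = pd.DataFrame(
    [{"status": k, "count": v} for k, v in stats_payload["by_status"].items()]
)

print(df["count"].sum())  # 42
```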

**Authentication Hardening**

Password validation now aligns with NIST-oriented best practices:

- **Minimum Length**: Default minimum password length is now 12 characters
- **Strength Scoring**: zxcvbn-based strength checks are enforced with configurable minimum score
- **Context Blocking**: Passwords are rejected if they contain user/service context words
- **Breach Screening**: Optional HaveIBeenPwned k-anonymity checks reject known compromised passwords

### Backend Highlights
- Added adaptive testing evaluate and suggestion endpoints, including overwrite support
- Added deletion endpoint for adaptive testing test sets
- Exposed password policy settings through auth provider metadata
- Hardened password validation with zxcvbn and breach checks
- Continued metric evaluation refactor toward strategy-based execution

### Frontend Highlights
- Added adaptive testing delete test set action and bulk test deletion support
- Added overwrite toggles for adaptive testing output generation and evaluation
- Added suggestion generation/evaluation UI flow in adaptive testing detail
- Updated auth screens to enforce the new password policy and improve rate-limit and registration error messages
- Added attachments count column in tests grid

### SDK Highlights
- Added typed stats collection methods for test runs and test results
- Added stable transcript formatting support via `ConversationHistory.format_conversation()`
- Kept async-first metric/model improvements and strategy-based evaluation integration

### Polyphemus Highlights
- No component version change in this platform release (`0.2.8`)
- Existing `POST /generate_batch` capability remains available for batched generation workloads

## [0.6.9] - 2026-03-12

### Platform Release
36 changes: 36 additions & 0 deletions docs/content/contribute/backend/authentication.mdx
@@ -107,6 +107,42 @@ AUTH_REGISTRATION_ENABLED=true
FRONTEND_URL=http://localhost:3000`}
</CodeBlock>

## Password Policy (NIST-aligned)

Password validation is enforced server-side during:

- `POST /auth/register`
- `POST /auth/reset-password`

Both flows call `validate_password(...)` with user context (email and name), so context words are checked in
addition to length and strength.
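The combined checks can be pictured with a simplified, hypothetical re-implementation. The real `validate_password` lives in the backend; the function name, return shape, and error labels below are illustrative only, and the thresholds mirror the documented defaults:

```python
MIN_LENGTH = 12   # mirrors the PASSWORD_MIN_LENGTH default
MAX_LENGTH = 128  # mirrors the PASSWORD_MAX_LENGTH default

def validate_password_sketch(password: str, context_words: list[str]) -> list[str]:
    """Simplified illustration of the server-side checks; not the backend code."""
    errors = []
    if not password.strip():
        errors.append("whitespace-only")
    if not (MIN_LENGTH <= len(password) <= MAX_LENGTH):
        errors.append("length")
    lowered = password.lower()
    # Reject passwords containing user/service context words (e.g. email local part).
    if any(word.lower() in lowered for word in context_words if word):
        errors.append("context")
    return errors

print(validate_password_sketch("alice2024!pass", ["alice", "rhesis"]))  # ['context']
```

The backend additionally applies the zxcvbn strength score and the optional breach check, which a client-side sketch like this cannot replicate.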

### Policy values and defaults

| Environment variable | Default | Description |
| --- | --- | --- |
| `PASSWORD_MIN_LENGTH` | `12` | Minimum password length |
| `PASSWORD_MAX_LENGTH` | `128` | Maximum password length |
| `PASSWORD_MIN_STRENGTH_SCORE` | `2` | Minimum zxcvbn score (0-4) |
| `PASSWORD_CHECK_BREACHED` | `true` | Enables HaveIBeenPwned k-anonymity breach check |

### Frontend policy discovery

The frontend reads password policy from:

- `GET /auth/providers` -> `password_policy.min_length`
- `GET /auth/providers` -> `password_policy.max_length`
- `GET /auth/providers` -> `password_policy.min_strength_score`

All three fields are returned by the same `GET /auth/providers` response under `password_policy`.

This lets clients validate early while preserving backend enforcement as the source of truth.

### Security behavior

- Passwords must not be whitespace-only.
- Passwords must not include context words from user identity and service context.
- Breach checks use HaveIBeenPwned k-anonymity (`/range` API), sending only the first 5 SHA-1 hash chars.
- If the breach API is unavailable, registration and reset do not hard-fail; a transient API error alone never blocks the request.
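The k-anonymity mechanics described above can be sketched as follows. The function names are illustrative, but the hashing and matching steps follow the `/range` API contract: only the first 5 hex characters of the SHA-1 digest are sent, and the suffix is matched locally against the returned `SUFFIX:COUNT` lines:

```python
import hashlib

def hibp_prefix_suffix(password: str) -> tuple[str, str]:
    """Split the uppercase SHA-1 hex digest into the 5-char prefix sent to
    the HaveIBeenPwned /range API and the suffix kept for local matching."""
    digest = hashlib.sha1(password.encode("utf-8")).hexdigest().upper()
    return digest[:5], digest[5:]

def is_breached(suffix: str, range_response: str) -> bool:
    """Match the local suffix against 'SUFFIX:COUNT' response lines;
    the full password hash never leaves the machine."""
    for line in range_response.splitlines():
        candidate, _, _count = line.partition(":")
        if candidate.strip() == suffix:
            return True
    return False

prefix, suffix = hibp_prefix_suffix("password")
print(prefix)  # 5BAA6
```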

## Token System

The authentication system uses multiple token types:
5 changes: 5 additions & 0 deletions docs/content/contribute/environment-variables.mdx
@@ -24,11 +24,16 @@ Create a `.env` file in `apps/backend/` with the following variables:
['`DB_ENCRYPTION_KEY`', '**Required**', '32-byte URL-safe base64-encoded encryption key for database field encryption. Generate with: `python -c "from cryptography.fernet import Fernet; print(Fernet.generate_key().decode())"`. **Never commit to version control**'],
['`LOG_LEVEL`', 'Default: `DEBUG`', 'Logging level: `DEBUG`, `INFO`, `WARNING`, `ERROR`'],
['`JWT_SECRET_KEY`', '**Required**', 'Secret key for JWT token signing. Generate with: `openssl rand -hex 64`'],
['`SESSION_SECRET_KEY`', '**Required**', 'Secret key used by backend `SessionMiddleware` for session cookie signing. Must be distinct from `JWT_SECRET_KEY` (key separation)'],
['`JWT_ALGORITHM`', 'Default: `HS256`', 'JWT signing algorithm'],
['`JWT_ACCESS_TOKEN_EXPIRE_MINUTES`', 'Default: `15`', 'JWT access token expiration time in minutes'],
['`JWT_REFRESH_TOKEN_EXPIRE_DAYS`', 'Default: `7`', 'Refresh token expiration time in days'],
['`AUTH_EMAIL_PASSWORD_ENABLED`', 'Default: `true`', 'Enable email/password authentication'],
['`AUTH_REGISTRATION_ENABLED`', 'Default: `true`', 'Enable new user registration via email'],
['`PASSWORD_MIN_LENGTH`', 'Default: `12`', 'Minimum password length for registration and password reset'],
['`PASSWORD_MAX_LENGTH`', 'Default: `128`', 'Maximum password length for registration and password reset'],
['`PASSWORD_MIN_STRENGTH_SCORE`', 'Default: `2`', 'Minimum zxcvbn password strength score (0-4)'],
['`PASSWORD_CHECK_BREACHED`', 'Default: `true`', 'Enable HaveIBeenPwned k-anonymity breach check during password validation'],
['`GOOGLE_CLIENT_ID`', 'Optional', 'Google OAuth client ID. Enables Google sign-in when configured'],
['`GOOGLE_CLIENT_SECRET`', 'Optional', 'Google OAuth client secret'],
['`GH_CLIENT_ID`', 'Optional', 'GitHub OAuth client ID. Enables GitHub sign-in when configured'],
1 change: 1 addition & 0 deletions docs/content/docs/test-sets/_meta.tsx
@@ -2,6 +2,7 @@ import type { MetaRecord } from "nextra";

const meta: MetaRecord = {
index: "Overview",
"adaptive-testing": "Adaptive Testing",
"import-from-file": "Import from File",
"import-from-garak": "Import from Garak",
};
113 changes: 113 additions & 0 deletions docs/content/docs/test-sets/adaptive-testing.mdx
@@ -0,0 +1,113 @@
import { CodeBlock } from '@/components/CodeBlock'

# Adaptive Testing

Adaptive Testing is a topic-based workflow for expanding and maintaining single-turn test sets over time.
Instead of treating a test set as static, you can organize tests into topic trees, generate outputs in bulk,
evaluate results with a selected metric, and iterate with AI-generated suggestions.

## When to Use Adaptive Testing

Use Adaptive Testing when you need to:

- Grow coverage across specific risk areas or product domains
- Re-evaluate existing tests without regenerating all outputs
- Curate suggestions before saving them to your test set
- Keep one evolving test set instead of creating many one-off sets

## End-to-End Workflow

Adaptive Testing in the UI follows this loop:

1. Create an adaptive test set
2. Organize tests in topics
3. Generate outputs from a selected endpoint
4. Evaluate tests with a selected metric
5. Generate and review suggestions
6. Accept selected suggestions into the test set

<Callout type="info">
Topic operations are hierarchical. Renaming a topic cascades to child topics and tests; removing a topic
removes subtopics and moves tests to the parent topic.
</Callout>

## Generate Outputs and Evaluate with Overwrite Control

Two actions drive most iteration cycles:

- **Generate outputs**: invoke an endpoint for each test input and store the output in test metadata
- **Evaluate**: run a selected metric against test input/output pairs and store the label and score in test metadata

Both actions support an `overwrite` option.

| Parameter | Type | Default | Behavior |
| --- | --- | --- | --- |
| `topic` | `string \| null` | `null` | Limits processing to a topic |
| `include_subtopics` | `boolean` | `true` | Includes descendant topics when `topic` is set |
| `overwrite` | `boolean` | `false` | Replaces existing outputs/results instead of skipping |
| `test_ids` | `string[] \| null` | `null` | Optional explicit subset of tests |

<Callout type="info">
With `overwrite=false`, tests that already have outputs or evaluation labels are skipped. The response includes
`generated` or `evaluated`, plus `skipped` and `failed` counts.
</Callout>
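The skip-versus-overwrite accounting can be sketched in a few lines. The data shape and helper name are illustrative, not the API's internals, but the counting rule matches the behavior described above:

```python
def plan_generation(tests: list[dict], overwrite: bool) -> dict:
    """Illustrative accounting: with overwrite=False, tests that already
    have an output are skipped; with overwrite=True they are reprocessed."""
    generated = skipped = 0
    for test in tests:
        has_output = test.get("output") is not None
        if has_output and not overwrite:
            skipped += 1
        else:
            generated += 1
    return {"generated": generated, "skipped": skipped}

tests = [{"output": "cached"}, {"output": None}, {"output": "cached"}]
print(plan_generation(tests, overwrite=False))  # {'generated': 1, 'skipped': 2}
print(plan_generation(tests, overwrite=True))   # {'generated': 3, 'skipped': 0}
```

The same rule applies to evaluation, with `evaluated` in place of `generated`.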

### Generate Outputs API Example

<CodeBlock filename="generate_outputs.sh" language="bash">
{`curl -X POST "$RHESIS_BASE_URL/adaptive_testing/$TEST_SET_ID/generate_outputs" \\
-H "Authorization: Bearer $RHESIS_API_TOKEN" \\
-H "Content-Type: application/json" \\
-d '{
"endpoint_id": "00000000-0000-0000-0000-000000000000",
"topic": "Safety/Jailbreak",
"include_subtopics": true,
"overwrite": false
}'`}
</CodeBlock>

### Evaluate API Example

<CodeBlock filename="evaluate_tests.sh" language="bash">
{`curl -X POST "$RHESIS_BASE_URL/adaptive_testing/$TEST_SET_ID/evaluate" \\
-H "Authorization: Bearer $RHESIS_API_TOKEN" \\
-H "Content-Type: application/json" \\
-d '{
"metric_names": ["answer_relevancy"],
"topic": "Safety/Jailbreak",
"include_subtopics": true,
"overwrite": true
}'`}
</CodeBlock>

## Suggestions Workflow

Results from the suggestion endpoints are not persisted until you accept them in the UI.

1. `POST /adaptive_testing/\{id\}/generate_suggestions`
2. `POST /adaptive_testing/\{id\}/generate_suggestion_outputs`
3. `POST /adaptive_testing/\{id\}/evaluate_suggestions`
4. Accept selected suggestions to create real tests in the set

| Suggestion parameter | Type | Default | Notes |
| --- | --- | --- | --- |
| `num_examples` | `int` | `10` | Existing tests sampled as examples |
| `num_suggestions` | `int` | `20` | Requested number of suggestions |
| `topic` | `string \| null` | `null` | Optional topic focus |

<CodeBlock filename="generate_suggestions.sh" language="bash">
{`curl -X POST "$RHESIS_BASE_URL/adaptive_testing/$TEST_SET_ID/generate_suggestions" \\
-H "Authorization: Bearer $RHESIS_API_TOKEN" \\
-H "Content-Type: application/json" \\
-d '{
"topic": "Safety/Jailbreak",
"num_examples": 10,
"num_suggestions": 20
}'`}
</CodeBlock>

## Related Pages

- [Test Sets Overview](/docs/test-sets)
- [Tests](/docs/tests)
- [Metrics](/docs/metrics)
58 changes: 58 additions & 0 deletions docs/content/sdk/metrics/conversational.mdx
@@ -105,6 +105,64 @@ print(conversation.get_assistant_tool_calls())`}
These fields are optional. If they are omitted for a turn, helper methods return `None` for that position.
</Callout>

### Formatting a conversation transcript (v0.6.10+)

Use `ConversationHistory.format_conversation()` when you want a structured, numbered transcript that keeps
assistant `context`, `metadata`, and `tool_calls` attached to the correct turn.

This is especially useful for custom conversational judges and prompt templates.

<CodeBlock filename="format_conversation.py" language="python">
{`from rhesis.sdk.metrics import ConversationHistory

conversation = ConversationHistory.from_messages([
{"role": "user", "content": "Find policy details for claim #A-123."},
{
"role": "assistant",
"content": None,
"tool_calls": [{"id": "call_1", "function": {"name": "lookup_claim"}}],
"metadata": {"latency_ms": 420, "model": "rhesis-default"},
"context": [{"source": "claims_db", "id": "A-123"}],
},
{"role": "assistant", "content": "I found the claim and policy summary."},
])

print(conversation.format_conversation())`}
</CodeBlock>

Expected output shape:

<CodeBlock filename="formatted_output.txt" language="text">
{`Turn 1:
User: Find policy details for claim #A-123.
Context: [
{
"source": "claims_db",
"id": "A-123"
}
]
Metadata: {
"latency_ms": 420,
"model": "rhesis-default"
}
Tool Calls: [
{
"id": "call_1",
"function": {
"name": "lookup_claim"
}
}
]

Turn 2:
Assistant: I found the claim and policy summary.`}
</CodeBlock>

<Callout type="info">
`to_text()` returns a simpler role-prefixed transcript and excludes `metadata`, `context`, and `tool_calls`.
Use `format_conversation()` when those fields must be visible to the evaluating model.
</Callout>

## Quick Start

### Turn Relevancy