This project provides a LangSmith evaluation setup for Torah Q&A systems using various AI models.
- Install dependencies:

  ```bash
  uv sync
  ```

- Create a `.env` file with your API keys:

  ```bash
  LANGSMITH_TRACING=true
  LANGSMITH_API_KEY=your_langsmith_api_key
  ANTHROPIC_API_KEY=your_anthropic_api_key
  ```

List the available targets:

```bash
uv run langsmith_evaluation.py list
```

Run an evaluation against a specific target:

```bash
uv run langsmith_evaluation.py anthropic_sonnet
uv run langsmith_evaluation.py anthropic_js_api
uv run langsmith_evaluation.py simple_template
```

Restrict the run to specific evaluators by passing a comma-separated list:

```bash
# Use only correctness and helpfulness evaluators
uv run langsmith_evaluation.py anthropic_sonnet correctness,helpfulness

# Use only Torah-specific evaluators
uv run langsmith_evaluation.py anthropic_sonnet torah_citations,hebrew_handling
```

Or run the script with no arguments:

```bash
uv run langsmith_evaluation.py
```

The evaluation system supports multiple target functions defined in the `targets/` directory:
- `anthropic_sonnet`: Uses Claude 3.5 Sonnet via the Python SDK (high quality)
- `anthropic_js_api`: Uses the Anthropic API via a JavaScript server with distributed tracing
- `simple_template`: Template-based baseline responses
The `anthropic_js_api` target requires running a separate JavaScript server:

```bash
# Navigate to the anthropic-js directory
cd targets/anthropic-js

# Install dependencies
npm install

# Set up .env with your API keys
cp ../../.env .env  # or create manually

# Start the server
PORT=8334 npm start
```

The server runs on http://localhost:8334 and provides distributed tracing integration.
The `anthropic_js_api` target implements distributed tracing to maintain evaluation context across the HTTP boundary:

- Trace Continuity: The Python evaluation framework passes LangSmith tracing headers to the JavaScript server
- Automatic Context: The server automatically extracts trace headers using `RunTree.fromHeaders()`
- Usage Tracking: API calls, tokens, and response metadata are tracked within the trace context
- Seamless Integration: No additional configuration is needed; tracing works automatically when both the Python and JavaScript components have LangSmith configured with `LANGSMITH_TRACING=true` and `LANGSMITH_API_KEY`
This enables complete visibility into the evaluation pipeline across different technologies while maintaining performance and cost tracking.
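Concretely, the Python side of that handoff could look roughly like the sketch below. It assumes the target uses `requests` plus the LangSmith SDK's `get_current_run_tree()`/`to_headers()` helpers; the actual target implementation in `targets/` may differ.

```python
import requests
from langsmith import traceable
from langsmith.run_helpers import get_current_run_tree


@traceable(name="anthropic_js_api")
def anthropic_js_api(inputs: dict) -> dict:
    """Forward the question to the JS server while propagating the LangSmith trace."""
    headers = {"Content-Type": "application/json"}

    # Serialize the current run context into HTTP headers so the JavaScript
    # server can pick the trace up again with RunTree.fromHeaders().
    run_tree = get_current_run_tree()
    if run_tree is not None:
        headers.update(run_tree.to_headers())

    response = requests.post(
        "http://localhost:8334/chat",
        json={"question": inputs["question"]},
        headers=headers,
        timeout=60,
    )
    response.raise_for_status()
    return response.json()
```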
The system includes several evaluators to comprehensively assess Torah Q&A responses:
- `correctness`: Compares output against the reference answer (requires ground truth)
- `helpfulness`: Measures how well the response addresses the input question
- `torah_citations`: Checks if responses include proper source citations and follow scholarly conventions
- `hebrew_handling`: Evaluates correct interpretation of Hebrew/Aramaic text and Jewish concepts
- `depth_analysis`: Assesses the depth and sophistication of Torah analysis
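For context, langsmith_evaluation.py presumably wires the chosen target and evaluator functions into LangSmith's evaluation API roughly as in the sketch below; the dataset name, experiment prefix, and use of `Client.evaluate` are assumptions here, so check the script for the actual wiring.

```python
from langsmith import Client

from evaluators import EVALUATOR_FUNCTIONS
from targets import get_target_function

client = Client()

# Run one target over the dataset with a subset of evaluators, mirroring
# `uv run langsmith_evaluation.py anthropic_sonnet correctness,helpfulness`.
results = client.evaluate(
    get_target_function("anthropic_sonnet"),
    data="Q1-dataset",  # assumed name of the LangSmith dataset built from Q1-dataset.json
    evaluators=[
        EVALUATOR_FUNCTIONS["correctness"],
        EVALUATOR_FUNCTIONS["helpfulness"],
    ],
    experiment_prefix="torah-qa",  # hypothetical experiment prefix
)
```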
To add a new target function:
- Create a new Python file in the `targets/` directory (e.g., `my_target.py`)
- Implement a function that takes `inputs: dict` and returns `outputs: dict`
- Import it in `targets/__init__.py`
- Add it to the `TARGET_FUNCTIONS` registry
Example in `targets/my_target.py`:

```python
def my_new_target(inputs: dict) -> dict:
    # Your implementation here
    return {"answer": "Some response"}
```

Then in `targets/__init__.py`:

```python
from .my_target import my_new_target

TARGET_FUNCTIONS = {
    # ... existing targets
    "my_target": my_new_target,
}
```
To add a custom evaluator:
- Open `evaluators.py`
- Create a new evaluator function
- Add it to the `EVALUATOR_FUNCTIONS` registry
Example:
```python
def my_custom_evaluator(inputs: dict, outputs: dict, reference_outputs: dict):
    # Your evaluation logic here
    return {"key": "my_metric", "score": True, "comment": "Good response"}

# Add to registry
EVALUATOR_FUNCTIONS["my_metric"] = my_custom_evaluator
```

The evaluation uses `Q1-dataset.json`, which contains Hebrew Torah scholarship questions and reference answers.
You can test individual target functions before running full evaluations:
```python
# Test simple template target
from targets import get_target_function

simple_target = get_target_function('simple_template')
result = simple_target({'question': 'What does Divrei Yoel teach about prayer?'})
print(result['answer'])

# Test anthropic sonnet target (requires ANTHROPIC_API_KEY)
anthropic_target = get_target_function('anthropic_sonnet')
result = anthropic_target({'question': 'What is the meaning of Bereishit?'})
print(result['answer'])

# Test anthropic JS API target (requires server running on localhost:8334)
js_target = get_target_function('anthropic_js_api')
result = js_target({'question': 'What is the Jewish view on charity?'})
print(result['answer'])
print(result.get('usage_metadata', 'No usage data'))
```

To exercise the JavaScript server API directly:

```bash
# Start the server
cd targets/anthropic-js
npm start
# In another terminal, test the API directly
curl -X POST http://localhost:8334/chat \
-H "Content-Type: application/json" \
-d '{"question": "What does the Torah say about kindness?"}'
# Check server health
curl http://localhost:8334/health
```

After running an evaluation, you'll get a link to view the results in the LangSmith UI, where you can compare different target functions' performance.