diff --git a/docs/ai-assessment.md b/docs/ai-assessment.md
new file mode 100644
index 0000000..7d36d7d
--- /dev/null
+++ b/docs/ai-assessment.md
@@ -0,0 +1,74 @@
+# AI Review – Prompt Best Practices
+
+**Service:** Invoice Copilot
+**Model:** gpt-4o-mini (prod)
+**Repo scan date:** 2025-08-20
+**Assessor:** handit assess (repo-only, no production data)
+
+## 1) Scorecard
+
+| Criterion                  | Data Analysis Assistant |
+|----------------------------|-------------------------|
+| Context & Rationale (0–5)  | 2                       |
+| Format/Output Contract     | 1                       |
+| Examples & Edge Cases      | 0                       |
+| Determinism & Guardrails   | 1                       |
+| Testability                | 1                       |
+
+**Overall Risk:** 🔴 High (single prompt)
+
+## 2) Risk Heatmap (top issues)
+
+- **Lack of explicit output format** → May lead to inconsistent outputs (**🔴 High**)
+- **No examples or edge cases provided** → Reduces reliability in varied scenarios (**🔴 High**)
+- **Minimal context and rationale** → Limits alignment with user needs (**🔶 Medium**)
+
+## 3) Findings & Improvement Levers
+
+### A. Prompt: Data Analysis Assistant
+
+#### Original
+
+```
+You are a helpful assistant that specializes in data analysis.
+...
+```
+
+#### Strengths
+
+- Assigns a clear role or persona to the AI, which can help in maintaining a consistent style and expertise.
+
+#### Risks
+
+- The prompt lacks explicit instructions on the desired output format, which can lead to inconsistent responses.
+- No examples or edge cases are provided, reducing the model's ability to handle varied scenarios effectively.
+- Minimal context and rationale are given, which may limit the model's alignment with specific user needs.
+- Absence of determinism and guardrails could lead to unpredictable outputs.
+
+#### Improvement Levers (not fixes)
+
+- **Add explicit output format instructions**: Specify the structure or sections required in the response to improve consistency (Best Practice 5).
+- **Include examples and edge cases**: Provide few-shot examples to guide the model in handling different scenarios (Best Practice 3).
+- **Enhance context and rationale**: Offer more background information to better align the model's responses with user expectations (Best Practice 2).
+- **Implement guardrails**: Set explicit constraints on length, tone, and format to ensure predictable outputs (Best Practice 9).
+
+#### Evidence (static scan)
+The prompt is minimalistic and lacks detailed instructions or examples, which are critical for ensuring reliable and consistent AI behavior.
+→ Potential production impact includes inconsistent data analysis outputs and reduced user satisfaction.
+
+## 4) Next Steps (with handit)
+
+1. **Connect handit** (1-line setup)
+2. **Run baseline evals** on your real logs: Evaluate consistency and alignment with user needs.
+3. **handit will experiment** with improvements automatically.
+4. **If a candidate passes** tests on your real data, handit opens the PR.
+
+## 5) Business Impact (expected if improved)
+
+- **Increased consistency and reliability** in AI outputs, leading to higher user satisfaction and trust.
+- **Improved handling of diverse scenarios** through the inclusion of examples and edge cases, enhancing the model's versatility.
+- **Better alignment with user needs** by providing more context and rationale, potentially increasing the adoption and effectiveness of the service.
+
+---
+
+*This PR was automatically generated by handit.ai Autonomous engineer*
\ No newline at end of file
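Review note on the "explicit output format" and "guardrails" levers in section 3: one way to make them testable is to have the prompt request a JSON object and check each reply against a small contract. The sketch below is only an illustration under stated assumptions and is not part of the handit assessment or the Invoice Copilot code. The field names (`summary`, `findings`), the 500-character limit, and the helper `validate_response` are all hypothetical.

```python
import json

# Hypothetical output contract (assumption: the prompt would ask for a JSON
# object with these fields; none of these names come from the repository).
REQUIRED_FIELDS = {"summary": str, "findings": list}
MAX_SUMMARY_CHARS = 500  # assumed length guardrail


def validate_response(raw: str) -> list[str]:
    """Return a list of contract violations; an empty list means the reply passes."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return ["top-level value must be a JSON object"]

    problems = []
    # Format check: every required field is present with the expected type.
    for name, expected_type in REQUIRED_FIELDS.items():
        if name not in data:
            problems.append(f"missing field: {name}")
        elif not isinstance(data[name], expected_type):
            problems.append(f"wrong type for {name}")

    # Guardrail check: keep the summary within the assumed length limit.
    summary = data.get("summary")
    if isinstance(summary, str) and len(summary) > MAX_SUMMARY_CHARS:
        problems.append("summary too long")
    return problems
```

A check like this could run in a unit test against recorded replies, which would also speak to the low Testability score. The actual contract would need to come from whatever structure the service's consumers expect.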