Description of the bug
The tailor stage sets max_tokens=2048 when calling the LLM. For candidates with extensive work history, the prompt alone can exceed 5,000 tokens, and thinking models like gemini-2.5-flash consume additional tokens for reasoning. This leaves insufficient tokens for the full JSON response, causing truncated output and repeated EXHAUSTED_RETRIES failures.
To Reproduce
Set up a profile with 15+ years of work history and run the tailor stage with gemini-2.5-flash. Every job fails with EXHAUSTED_RETRIES, with finishReason: MAX_TOKENS visible in the API logs; the response cuts off mid-JSON after only 82–418 output tokens.
"finishReason": "MAX_TOKENS",
"candidatesTokenCount": 82,
"promptTokenCount": 5380,
"thoughtsTokenCount": 7769
Expected behavior
The max_tokens limit should be high enough to accommodate long resumes, or, better yet, be configurable in profile.json or via a CLI flag so users can tune it for their situation.
Fix
In tailor.py around line 403, change:
raw = client.chat(messages, max_tokens=2048, temperature=0.4)
to:
raw = client.chat(messages, max_tokens=16384, temperature=0.4)
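To make the limit configurable as suggested above, the value could come from a CLI flag, then profile.json, then a default. A sketch under assumed names (the `tailor_max_tokens` key and `--max-tokens` flag are hypothetical, not existing project options):

```python
# Resolve the output-token budget with precedence:
# CLI flag > profile.json key > built-in default.
import argparse
import json
from pathlib import Path

DEFAULT_MAX_TOKENS = 16384  # proposed new default

def resolve_max_tokens(profile_path: str, argv: list[str]) -> int:
    parser = argparse.ArgumentParser()
    parser.add_argument("--max-tokens", type=int, default=None)
    # parse_known_args lets this coexist with the tool's other flags
    args, _ = parser.parse_known_args(argv)
    if args.max_tokens is not None:
        return args.max_tokens
    profile = json.loads(Path(profile_path).read_text())
    return profile.get("tailor_max_tokens", DEFAULT_MAX_TOKENS)
```

The call site would then become `raw = client.chat(messages, max_tokens=resolve_max_tokens(profile_path, sys.argv[1:]), temperature=0.4)`, keeping the hard bump as a fallback default.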
Environment
- Resume length: 30 years of experience
- Model: gemini-2.5-flash
- Observed prompt token count: ~5,700 tokens