Field Report: Running ApplyPilot End-to-End
I've been running ApplyPilot as my primary job search tool for about two days (Senior/Staff Backend/Platform Engineer, Seattle/Remote). This is a detailed field report covering setup experience, pipeline results, bugs I encountered and fixed, and suggestions for improvement.
Note: I generated this report myself, and it references features I built that are not integrated into ApplyPilot. I'll need time to clean up my fork before publishing since it has substantial changes; they're detailed in this report as well. I'm also building the next pipeline: a markdown + CLI based job tracker that auto-enriches with information.
Also, for whoever is reading this: if you want to interview or hire me, you can reach out to me on LinkedIn: https://www.linkedin.com/in/elninja/
TL;DR: The concept is excellent and the architecture is sound. I got 112 successful applications out of 1,503 discovered jobs (7.5%
end-to-end conversion). The Tier 1→2 pipeline works great after some fixes. Tier 3 (auto-apply) is where most friction lives — Workday login
walls block ~70% of apply attempts.
Pipeline Results
Funnel
| Stage | Count | Drop-off |
|---|---|---|
| Discovered | 1,503 | — |
| Enriched | 1,420 | 5.5% lost (detail errors) |
| Scored | 1,420 | 0% (all enriched jobs scored) |
| Score 7+ | 543 | 62% filtered (working as intended) |
| Tailored | 543 | 0% (100% of 7+ tailored) |
| Cover Letter | 543 | 0% (100% covered) |
| Applied | 112 | 79% blocked at apply stage |
Score Distribution
| Score | Count |
|---|---|
| 10 | 98 |
| 9 | 249 |
| 8 | 106 |
| 7 | 90 |
| 6 | 148 |
| 5 | 185 |
| 4 | 157 |
| 3 | 171 |
| 2 | 70 |
| 1 | 146 |
Scoring skews high — 36% of jobs scored 7+. For my profile (10+ years, Go/Kotlin/Python/K8s/AWS), this seems reasonable but I wonder if the
scoring prompt could be tighter.
Sources
| Source | Jobs |
|---|---|
| LinkedIn | 601 |
| Indeed | 235 |
| Thomson Reuters (Workday) | 174 |
| Netflix (Workday) | 103 |
| Dice | 81 |
| SimplyHired | 77 |
| Moderna (Workday) | 46 |
| Motorola (Workday) | 38 |
| NVIDIA (Workday) | 34 |
| Talent.com | 26 |
| HN "Who is Hiring" | ~30 |
| Others | ~58 |
Apply Results
112 successful applications to companies including Netflix (7), Airbnb, HubSpot, Mastercard, NVIDIA, Grafana Labs, Twilio, Rippling, and others.
Apply errors (589 total):
| Error | Count | % of Errors |
|---|---|---|
| `workday_login_required` | 470 | 80% |
| `not_eligible_location` | 40 | 7% |
| `expired` | 26 | 4% |
| `email_verification` | 10 | 2% |
| `stuck` / `captcha` / `account_required` | 18 | 3% |
| Other | 25 | 4% |
The Workday login wall is the #1 blocker. 470 out of 589 apply errors (80%) are because Workday requires an authenticated session that the agent
can't handle.
Artifacts Generated
- 2,019 tailored resume files (txt + pdf pairs)
- 1,083 cover letter files (txt + pdf pairs)
- 336 apply agent log files
Time Estimates Per Stage
These are rough estimates from running the pipeline across multiple sessions:
| Stage | Time | Notes |
|---|---|---|
| Discovery | ~5-10 min | JobSpy + Workday scraping in parallel. Workday is the bottleneck. |
| Enrichment | ~15-20 min | 1,420 jobs, mostly fast. Some sites need Playwright fallback. |
| Scoring | ~20-30 min | Gemini Flash handles this well. Rate limits add some wait time. |
| Tailoring | ~45-60 min | Quality model (Gemini Pro), validation loop adds retries. Most time-intensive Tier 2 stage. |
| Cover Letters | ~20-30 min | Similar to scoring in complexity. |
| Auto-Apply | ~4-6 hours total | 2 workers, ~2-5 min per job. Chrome startup + form navigation is slow. |
| Total | ~6-8 hours | Spread across multiple sessions over ~1 week |
The `--stream` flag for running score + tailor + cover concurrently is a huge time saver.
Bugs I Encountered and Fixed
1. Gemini Thinking Token Budget (Related to #12)
Gemini 2.5+ models use "thinking tokens" that consume the max_tokens budget. The default 2048 was far too low — a simple scoring response needs
~30 visible tokens but the model burns 1200+ on thinking. I had to increase to:
- Scoring: 8,192
- Tailoring validation: 4,096, generation: 16,384
- Cover letters: 8,192
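A per-task budget table is one way to make these limits explicit instead of relying on a single global default. This is a minimal sketch; the key names are illustrative, not ApplyPilot's actual config keys.

```python
# Per-task max_tokens budgets with headroom for thinking tokens.
# A global default of 2048 starves thinking models; these values
# reflect what worked for me in practice.
TOKEN_BUDGETS = {
    "scoring": 8192,
    "tailoring_validation": 4096,
    "tailoring_generation": 16384,
    "cover_letter": 8192,
}

def max_tokens_for(task: str, default: int = 8192) -> int:
    """Look up the budget for a task, falling back to a safe default."""
    return TOKEN_BUDGETS.get(task, default)
```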
2. LLM Client Singleton / Stale Environment (Related to #9)
llm.py reads API keys at module import time. If config.load_env() isn't called before importing llm, the client has no keys. I restructured
the import order to ensure env loading happens first.
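An alternative to reordering imports is to make the client lazy, so the key is read at first use rather than at import time. A rough sketch of that pattern (the client dict here is a stand-in for the real SDK client, and the error message is illustrative):

```python
import os

_client = None

def get_client():
    """Create the LLM client on first use instead of at module import,
    so config.load_env() only needs to run before the first call."""
    global _client
    if _client is None:
        key = os.environ.get("GEMINI_API_KEY")
        if not key:
            raise RuntimeError(
                "GEMINI_API_KEY not set; call config.load_env() first"
            )
        _client = {"api_key": key}  # stand-in for the real SDK client
    return _client
```

This removes the silent-failure mode entirely: a missing key becomes a loud error at the first LLM call rather than an empty client created at import.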
3. Model Fallback Chain Needed Updating
The original model list included deprecated Gemini models. I rebuilt the cascade:
- Fast: `gemini-2.5-flash` → `gemini-3-flash` → `gemini-2-flash` → `gemini-2-flash-lite` → `gpt-4.1-nano` → `gpt-4.1-mini` → `claude-haiku-4-5`
- Quality: `gemini-3.1-pro-preview` → `gemini-2.5-pro` → `gemini-3-pro` → `gemini-2.5-flash` → `gpt-4.1-mini` → `gpt-4.1-nano` → `claude-sonnet-4-5` → `claude-haiku-4-5`
The 429 rate-limit handling (mark model exhausted for 5 min, fall to next) works great once the chain is populated.
4. Docker MCP Toolkit Interference
If Docker Desktop with MCP Toolkit is installed, it exposes mcp__MCP_DOCKER__browser_* tools that shadow the local Playwright MCP server. These
Docker-based tools can't access host files, breaking resume/cover letter uploads. Fix: pass --strict-mcp-config to the Claude subprocess in
launcher.py.
5. URL Normalization
Many Workday scraped URLs were relative (e.g., /en/sites/CX/job/12345). These broke enrichment. I added URL normalization at insert time using
base URLs from sites.yaml.
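The normalization itself is a one-liner with `urllib.parse.urljoin`. A minimal sketch, assuming base URLs keyed by site slug (the dict below is illustrative; the real values live in sites.yaml):

```python
from urllib.parse import urljoin

# Illustrative stand-in for the base URLs loaded from sites.yaml.
BASE_URLS = {"netflix": "https://netflix.wd1.myworkdayjobs.com"}

def normalize_url(url: str, site: str) -> str:
    """Resolve relative scraped URLs against the site's base URL;
    pass absolute URLs through unchanged."""
    if url.startswith(("http://", "https://")):
        return url
    return urljoin(BASE_URLS[site], url)
```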
6. Company Extraction
Jobs from aggregators (Indeed, LinkedIn) had no company field, making it hard to spread applications across employers. I added company
extraction from application_url domains (patterns for Workday, Greenhouse, Lever, iCIMS, Ashby).
7. ANTHROPIC_API_KEY Leaking to Subprocess
When the apply launcher spawns claude subprocesses, if ANTHROPIC_API_KEY is in the environment, it overrides Max plan auth and bills to the
API key instead. I added explicit env stripping in launcher.py.
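The stripping amounts to building a child environment without the key before spawning. A sketch of the shape of the fix (function names are mine, not launcher.py's):

```python
import os
import subprocess

def clean_env(env=None):
    """Copy the environment minus ANTHROPIC_API_KEY so the spawned
    claude subprocess falls back to Max plan auth instead of billing
    the API key."""
    env = dict(os.environ if env is None else env)
    env.pop("ANTHROPIC_API_KEY", None)
    return env

def spawn_claude(args):
    """Launch claude with the sanitized environment."""
    return subprocess.Popen(["claude", *args], env=clean_env())
```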
8. Fabrication in Cover Letters
One cover letter for SeatGeek fabricated a company name from my resume ("Underground Elephant" was a real company I worked at, but the LLM used
it in the wrong context). The resume_facts system helps but isn't bulletproof.
9. Banned Word False Positives (Related to #10)
The fabrication watchlist used substring matching — "rust" matched "TrustSec", "dedicated" matched a legitimate resume phrase. I changed banned
words to warnings rather than hard errors, letting the LLM judge handle tone.
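The substring-to-word-boundary change is a one-regex fix. A minimal sketch, with a two-word watchlist for illustration:

```python
import re

# Illustrative subset of the watchlist.
BANNED = ["rust", "dedicated"]

def banned_word_warnings(text: str) -> list[str]:
    """Return watchlist words that appear as whole words (word-boundary
    match, case-insensitive) — reported as warnings, not hard errors."""
    return [
        word
        for word in BANNED
        if re.search(rf"\b{re.escape(word)}\b", text, re.IGNORECASE)
    ]
```

With `\b` anchors, "TrustSec" no longer trips "rust" and "dedication" no longer trips "dedicated", while genuine whole-word uses still get flagged.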
10. Chrome Extension Path Resolution
The apply agent loads uBlock Origin and 1Password from the user's Chrome profile. Extension paths include version directories that change on
updates. I added dynamic resolution that picks the latest version directory and silently skips uninstalled extensions.
What We Built On Top
Beyond bug fixes, here are features I added to my fork:
- Hacker News "Who is Hiring" scraper — Parses monthly threads, deobfuscates emails, creates synthetic URLs for contact-only posts
- HTML Dashboard (`applypilot dashboard`) — Rich dashboard with Active/Archive/Applied tabs, fit score badges, company grouping, one-click links to applications
- Company-aware apply prioritization — `ROW_NUMBER() PARTITION BY company` in the job acquisition query spreads applications across employers instead of applying to 10 Netflix jobs in a row
- Two-tier model strategy — Flash models for speed-critical tasks, Pro models for quality writing
- Streaming pipeline mode — `applypilot run score tailor cover --stream` runs stages concurrently
- Chrome extension loading — uBlock (faster page loads) + 1Password (credential auto-fill) loaded dynamically
- Workday employer registry — `employers.yaml` with 48 Workday employer portals for direct scraping
- Smart Extract fallback — AI-powered extraction when JSON-LD and CSS selectors fail
- Apply agent verification — Post-submission confidence scoring to verify applications actually went through
Suggestions for Improvement
High Priority
- Workday auth strategy — 80% of apply failures are `workday_login_required`. Options:
  - Pre-authenticated browser session (load cookies/profile)
  - Account creation flow before application
  - Mark Workday jobs as "manual apply" and generate a manual actions list
- Model config should be externalized — Hardcoded model lists break when Google deprecates models. A `models.yaml` config would let users update without code changes.
- `max_tokens` should scale with task — Default 2048 is too low for thinking models. The project should detect thinking-model capabilities and auto-adjust, or at minimum document this prominently.
- Apply error categorization — Currently errors are free-text strings. A structured error taxonomy would enable better retry logic (permanent vs. transient errors) and reporting.
Medium Priority
- Resume validation strictness should default to `normal` or `lenient` — Strict mode causes excessive retries (related to #4 and #14: the "must fit 1 page" hard rule conflicts with the preserved_companies requirement, causing EXHAUSTED_RETRIES). Most "failures" are false positives from substring matching.
- Duplicate PDF filename collisions (related to #11, #17) — Jobs with identical titles overwrite each other's PDFs. Hash the URL into the filename.
- Company field should be first-class — Add company extraction at discovery/enrichment time, not just from `application_url`. This enables better deduplication and employer diversity.
- Dashboard should be a long-running web server — The current `applypilot dashboard` generates a static HTML file. A live dashboard with auto-refresh would be much more useful during active pipeline runs.
Nice to Have
- Job deduplication across sources — Same job appears on LinkedIn, Indeed, and the company's Workday portal. Fuzzy matching on title + company could reduce noise.
- Apply success verification — After submitting, check for confirmation emails or "application received" pages to verify success beyond the agent's self-reported confidence.
- Metrics/analytics — Track conversion rates over time, cost per application, which sources yield the best fit scores, etc.
- Config validation (`applypilot doctor`) — v0.3.0 added this, which is great. Expanding it to validate API key quotas, model availability, and browser setup would help a lot with onboarding.
Setup Notes for Other Users
Things I wish I knew before starting:
- Call `config.load_env()` before importing `llm` — The LLM client reads API keys at import time. Get the order wrong and you get silent failures.
- Set high `max_tokens` — If you're using Gemini 2.5+, thinking tokens eat your budget. 2048 is not enough.
- `pip install -e .` — Editable install means source edits take effect immediately. Great for iterating.
- Docker MCP Toolkit — If you have Docker Desktop, disable MCP Toolkit or use `--strict-mcp-config` for apply.
- Workday jobs are the majority — Many "LinkedIn" and "Indeed" jobs link to Workday portals. Expect login walls.
- The Gemini free tier works — But you'll hit 429s frequently. The fallback chain handles it, just takes longer.
Summary
ApplyPilot is an impressive project that delivers on its core promise — I went from zero to 112 real job applications in about a week with
minimal manual intervention. The Tier 1→2 pipeline (discover → enrich → score → tailor → cover) is solid. The Tier 3 auto-apply works but is
bottlenecked by Workday login requirements.
The architecture is well-designed and extensible. I was able to add significant features (HN scraper, dashboard, company-aware prioritization,
two-tier models, Chrome extensions) without fighting the codebase. The three-tier separation of concerns is clean.
Thank you @Pickle-Pixel for open-sourcing this! Happy to contribute any of my fixes/features back upstream if there's interest. Note that it may take me a while: I made a lot of changes while using the pipeline at the same time I was building on it, and I still need to clean up the fork and sanitize it of my personal information.