Field report: 1,503 jobs discovered, 112 applied — real-world pipeline results, fixes, and improvement ideas #22

@ibarrajo


Field Report: Running ApplyPilot End-to-End

I've been running ApplyPilot as my primary job search tool for about two days (Senior/Staff Backend/Platform Engineer, Seattle/Remote). This is a detailed field report covering the setup experience, pipeline results, bugs I encountered and fixed, and suggestions for improvement.

Note: this is a report I generated, and it references features I built that are not yet integrated into ApplyPilot. My fork has substantial changes, so I'll need time to clean it up before publishing; the changes are detailed in this report as well. I'm also building the next pipeline stage: a markdown + CLI based job tracker that auto-enriches entries with information.

Also, a note for whoever is reading this: if you'd like to interview or hire me, you can reach me via LinkedIn: https://www.linkedin.com/in/elninja/

TL;DR: The concept is excellent and the architecture is sound. I got 112 successful applications out of 1,503 discovered jobs (7.5% end-to-end conversion). The Tier 1→2 pipeline works great after some fixes. Tier 3 (auto-apply) is where most friction lives — Workday login walls block ~70% of apply attempts.


Pipeline Results

Funnel

| Stage | Count | Drop-off |
| --- | --- | --- |
| Discovered | 1,503 | |
| Enriched | 1,420 | 5.5% lost (detail errors) |
| Scored | 1,420 | 0% (all enriched jobs scored) |
| Score 7+ | 543 | 62% filtered (working as intended) |
| Tailored | 543 | 0% (100% of 7+ tailored) |
| Cover Letter | 543 | 0% (100% covered) |
| Applied | 112 | 79% blocked at apply stage |

Score Distribution

| Score | Count |
| --- | --- |
| 10 | 98 |
| 9 | 249 |
| 8 | 106 |
| 7 | 90 |
| 6 | 148 |
| 5 | 185 |
| 4 | 157 |
| 3 | 171 |
| 2 | 70 |
| 1 | 146 |

Scoring skews high — 36% of jobs scored 7+. For my profile (10+ years, Go/Kotlin/Python/K8s/AWS), this seems reasonable but I wonder if the
scoring prompt could be tighter.

Sources

| Source | Jobs |
| --- | --- |
| LinkedIn | 601 |
| Indeed | 235 |
| Thomson Reuters (Workday) | 174 |
| Netflix (Workday) | 103 |
| Dice | 81 |
| SimplyHired | 77 |
| Moderna (Workday) | 46 |
| Motorola (Workday) | 38 |
| NVIDIA (Workday) | 34 |
| Talent.com | 26 |
| HN "Who is Hiring" | ~30 |
| Others | ~58 |

Apply Results

112 successful applications to companies including Netflix (7), Airbnb, HubSpot, Mastercard, NVIDIA, Grafana Labs, Twilio, Rippling, and others.

Apply errors (589 total):

| Error | Count | % of Errors |
| --- | --- | --- |
| workday_login_required | 470 | 80% |
| not_eligible_location | 40 | 7% |
| expired | 26 | 4% |
| email_verification | 10 | 2% |
| stuck / captcha / account_required | 18 | 3% |
| Other | 25 | 4% |

The Workday login wall is the #1 blocker. 470 out of 589 apply errors (80%) are because Workday requires an authenticated session that the agent
can't handle.

Artifacts Generated

  • 2,019 tailored resume files (txt + pdf pairs)
  • 1,083 cover letter files (txt + pdf pairs)
  • 336 apply agent log files

Time Estimates Per Stage

These are rough estimates from running the pipeline across multiple sessions:

| Stage | Time | Notes |
| --- | --- | --- |
| Discovery | ~5-10 min | JobSpy + Workday scraping in parallel. Workday is the bottleneck. |
| Enrichment | ~15-20 min | 1,420 jobs, mostly fast. Some sites need Playwright fallback. |
| Scoring | ~20-30 min | Gemini Flash handles this well. Rate limits add some wait time. |
| Tailoring | ~45-60 min | Quality model (Gemini Pro), validation loop adds retries. Most time-intensive Tier 2 stage. |
| Cover Letters | ~20-30 min | Similar to scoring in complexity. |
| Auto-Apply | ~4-6 hours total | 2 workers, ~2-5 min per job. Chrome startup + form navigation is slow. |
| Total | ~6-8 hours | Spread across multiple sessions over ~1 week |

The --stream flag for running score + tailor + cover concurrently is a huge time saver.


Bugs I Encountered and Fixed

1. Gemini Thinking Token Budget (Related to #12)

Gemini 2.5+ models use "thinking tokens" that consume the max_tokens budget. The default 2048 was far too low — a simple scoring response needs
~30 visible tokens but the model burns 1200+ on thinking. I had to increase to:

  • Scoring: 8,192
  • Tailoring validation: 4,096, generation: 16,384
  • Cover letters: 8,192

2. LLM Client Singleton / Stale Environment (Related to #9)

llm.py reads API keys at module import time. If config.load_env() isn't called before importing llm, the client has no keys. I restructured
the import order to ensure env loading happens first.
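A minimal sketch of the ordering fix (the real load_env lives in the project's config module; this standalone version just shows the idea):

```python
import os

def load_env(path=".env"):
    """Populate os.environ from a .env file. Must run before any
    module that reads API keys at import time."""
    try:
        with open(path) as fh:
            for line in fh:
                line = line.strip()
                if line and not line.startswith("#") and "=" in line:
                    key, _, value = line.partition("=")
                    os.environ.setdefault(key.strip(), value.strip())
    except FileNotFoundError:
        pass

# Correct order: env first, *then* the module that reads keys at import.
load_env()
# import llm  # llm's module-level key lookup now sees the values
```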

3. Model Fallback Chain Needed Updating

The original model list included deprecated Gemini models. I rebuilt the cascade:

  • Fast: gemini-2.5-flash → gemini-3-flash → gemini-2-flash → gemini-2-flash-lite → gpt-4.1-nano → gpt-4.1-mini → claude-haiku-4-5
  • Quality: gemini-3.1-pro-preview → gemini-2.5-pro → gemini-3-pro → gemini-2.5-flash → gpt-4.1-mini → gpt-4.1-nano → claude-sonnet-4-5 → claude-haiku-4-5

The 429 rate-limit handling (mark model exhausted for 5 min, fall to next) works great once the chain is populated.
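The cooldown logic is simple enough to sketch; this hypothetical ModelCascade mirrors the behavior described (the actual class and method names in the codebase differ):

```python
import time

class ModelCascade:
    """Fall through a model list; a 429 marks the model exhausted
    for a cooldown window (5 minutes in my fork)."""
    def __init__(self, models, cooldown=300):
        self.models = models
        self.cooldown = cooldown
        self.exhausted = {}  # model -> unix time when it becomes usable again

    def pick(self, now=None):
        now = time.time() if now is None else now
        for model in self.models:
            if self.exhausted.get(model, 0) <= now:
                return model
        return None  # every model is currently rate-limited

    def mark_rate_limited(self, model, now=None):
        now = time.time() if now is None else now
        self.exhausted[model] = now + self.cooldown
```

pick() returns the first non-exhausted model; after mark_rate_limited, the next model in the chain takes over until the cooldown expires.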

4. Docker MCP Toolkit Interference

If Docker Desktop with MCP Toolkit is installed, it exposes mcp__MCP_DOCKER__browser_* tools that shadow the local Playwright MCP server. These
Docker-based tools can't access host files, breaking resume/cover letter uploads. Fix: pass --strict-mcp-config to the Claude subprocess in
launcher.py.

5. URL Normalization

Many scraped Workday URLs were relative (e.g., /en/sites/CX/job/12345), which broke enrichment. I added URL normalization at insert time using
base URLs from sites.yaml.
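The normalization itself is a one-liner with urljoin; a sketch, assuming each employer's base URL is looked up from sites.yaml:

```python
from urllib.parse import urljoin, urlparse

def normalize_url(url, base_url):
    """Resolve relative Workday paths against the employer's base URL
    at insert time; absolute URLs pass through unchanged."""
    if urlparse(url).scheme in ("http", "https"):
        return url
    return urljoin(base_url, url)
```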

6. Company Extraction

Jobs from aggregators (Indeed, LinkedIn) had no company field, making it hard to spread applications across employers. I added company
extraction from application_url domains (patterns for Workday, Greenhouse, Lever, iCIMS, Ashby).
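The ATS URL shapes are regular enough for a small pattern table. The patterns below are illustrative (real ATS URL formats vary by tenant): the company slug sits in the subdomain for Workday/iCIMS and in the first path segment for Greenhouse, Lever, and Ashby.

```python
import re
from urllib.parse import urlparse

ATS_PATTERNS = [
    (r"(\w[\w-]*)\.(?:wd\d+\.)?myworkdayjobs\.com", "subdomain"),  # Workday
    (r"(\w[\w-]*)\.icims\.com", "subdomain"),                      # iCIMS
    (r"boards\.greenhouse\.io", "path"),                           # Greenhouse
    (r"jobs\.lever\.co", "path"),                                  # Lever
    (r"jobs\.ashbyhq\.com", "path"),                               # Ashby
]

def company_from_url(url):
    """Best-effort company slug from an application_url domain."""
    parsed = urlparse(url)
    host, path = parsed.netloc, parsed.path.strip("/")
    for pattern, where in ATS_PATTERNS:
        m = re.search(pattern, host)
        if not m:
            continue
        if where == "subdomain":
            return m.group(1)
        return path.split("/")[0] if path else None
    return None
```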

7. ANTHROPIC_API_KEY Leaking to Subprocess

When the apply launcher spawns claude subprocesses, if ANTHROPIC_API_KEY is in the environment, it overrides Max plan auth and bills to the
API key instead. I added explicit env stripping in launcher.py.
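The stripping amounts to a few lines in the spawn path (function names here are mine, not the codebase's):

```python
import os
import subprocess

def stripped_env():
    """Copy of the environment without ANTHROPIC_API_KEY, so the claude
    subprocess falls back to Max-plan auth instead of API-key billing."""
    return {k: v for k, v in os.environ.items() if k != "ANTHROPIC_API_KEY"}

def spawn_claude(cmd):
    return subprocess.Popen(cmd, env=stripped_env())
```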

8. Fabrication in Cover Letters

One cover letter for SeatGeek fabricated a company name from my resume ("Underground Elephant" was a real company I worked at, but the LLM used
it in the wrong context). The resume_facts system helps but isn't bulletproof.

9. Banned Word False Positives (Related to #10)

The fabrication watchlist used substring matching — "rust" matched "TrustSec", "dedicated" matched a legitimate resume phrase. I changed banned
words to warnings rather than hard errors, letting the LLM judge handle tone.
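Word-boundary matching fixes the false positives; a sketch of the warning-level check:

```python
import re

def banned_word_hits(text, watchlist):
    """Whole-word matching so 'rust' no longer flags 'TrustSec'.
    Hits are surfaced as warnings for the LLM judge, not hard errors."""
    hits = []
    for word in watchlist:
        if re.search(rf"\b{re.escape(word)}\b", text, re.IGNORECASE):
            hits.append(word)
    return hits
```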

10. Chrome Extension Path Resolution

The apply agent loads uBlock Origin and 1Password from the user's Chrome profile. Extension paths include version directories that change on
updates. I added dynamic resolution that picks the latest version directory and silently skips uninstalled extensions.
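The version-directory dance looks roughly like this; the only assumption is Chrome's on-disk layout of profile/Extensions/extension-id/version/:

```python
import re
from pathlib import Path

def latest_extension_dir(profile_dir, extension_id):
    """Pick the newest version directory for an extension; return None
    (silently skip) if the extension isn't installed."""
    ext_root = Path(profile_dir) / "Extensions" / extension_id
    if not ext_root.is_dir():
        return None
    versions = [d for d in ext_root.iterdir() if d.is_dir()]
    if not versions:
        return None
    # Version dirs look like "1.52.2_0"; compare numerically, not lexically.
    def key(d):
        return [int(p) for p in re.split(r"[._]", d.name) if p.isdigit()]
    return max(versions, key=key)
```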


What We Built On Top

Beyond bug fixes, here are features I added to my fork:

  1. Hacker News "Who is Hiring" scraper — Parses monthly threads, deobfuscates emails, creates synthetic URLs for contact-only posts
  2. HTML Dashboard (applypilot dashboard) — Rich dashboard with Active/Archive/Applied tabs, fit score badges, company grouping, one-click
    links to applications
  3. Company-aware apply prioritization — ROW_NUMBER() PARTITION BY company in the job acquisition query spreads applications across
    employers instead of applying to 10 Netflix jobs in a row
  4. Two-tier model strategy — Flash models for speed-critical tasks, Pro models for quality writing
  5. Streaming pipeline mode — applypilot run score tailor cover --stream runs stages concurrently
  6. Chrome extension loading — uBlock (faster page loads) + 1Password (credential auto-fill) loaded dynamically
  7. Workday employer registry — employers.yaml with 48 Workday employer portals for direct scraping
  8. Smart Extract fallback — AI-powered extraction when JSON-LD and CSS selectors fail
  9. Apply agent verification — Post-submission confidence scoring to verify applications actually went through
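The company-aware prioritization (item 3) comes down to one window function. A self-contained sketch against a hypothetical jobs table (column names are assumptions; needs SQLite ≥ 3.25 for window functions):

```python
import sqlite3

# Interleave employers: rank each company's jobs by fit score, then
# round-robin by rank so we don't submit 10 Netflix applications in a row.
QUERY = """
SELECT url, company, score
FROM (
    SELECT url, company, score,
           ROW_NUMBER() OVER (
               PARTITION BY company ORDER BY score DESC
           ) AS rank_in_company
    FROM jobs
    WHERE score >= 7 AND applied = 0
)
ORDER BY rank_in_company, score DESC
"""

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (url TEXT, company TEXT, score INT, applied INT)")
conn.executemany("INSERT INTO jobs VALUES (?, ?, ?, 0)", [
    ("n1", "netflix", 10), ("n2", "netflix", 9), ("n3", "netflix", 8),
    ("a1", "airbnb", 9), ("t1", "twilio", 8),
])
rows = conn.execute(QUERY).fetchall()
# One job per company first (n1, a1, t1), then second-best picks, etc.
```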

Suggestions for Improvement

High Priority

  1. Workday auth strategy — 80% of apply failures are workday_login_required. Options:

    • Pre-authenticated browser session (load cookies/profile)
    • Account creation flow before application
    • Mark Workday jobs as "manual apply" and generate a manual actions list
  2. Model config should be externalized — Hardcoded model lists break when Google deprecates models. A models.yaml config would let users
    update without code changes.

  3. max_tokens should scale with task — Default 2048 is too low for thinking models. The project should detect thinking model capabilities and
    auto-adjust, or at minimum document this prominently.

  4. Apply error categorization — Currently errors are free-text strings. A structured error taxonomy would enable better retry logic
    (permanent vs transient errors) and reporting.
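A structured taxonomy could be as small as an enum with a transient flag. A sketch: the error names come from my apply-error counts above, but the transient/permanent split here is my guess:

```python
from enum import Enum

class ApplyError(Enum):
    """Structured apply errors instead of free-text strings; the
    transient flag drives retry logic."""
    WORKDAY_LOGIN_REQUIRED = ("workday_login_required", False)
    NOT_ELIGIBLE_LOCATION = ("not_eligible_location", False)
    EXPIRED = ("expired", False)
    EMAIL_VERIFICATION = ("email_verification", True)
    CAPTCHA = ("captcha", True)
    STUCK = ("stuck", True)

    def __init__(self, code, transient):
        self.code = code
        self.transient = transient

def should_retry(error):
    return error.transient
```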

Medium Priority

  1. Resume validation strictness should default to normal or lenient — The strict mode causes excessive retries (related to #4 and #14: the
    "Must fit 1 page" hard rule conflicts with the preserved_companies requirement, causing EXHAUSTED_RETRIES). Most "failures" are false
    positives from substring matching.

  2. Duplicate PDF filename collisions (related to #11, #17) — Jobs with identical titles overwrite each other's PDFs. Hash the URL into the
    filename.

  3. Company field should be first-class — Add company extraction at discovery/enrichment time, not just from application_url. This enables
    better deduplication and employer diversity.

  4. Dashboard should be a long-running web server — The current applypilot dashboard generates a static HTML file. A live dashboard with
    auto-refresh would be much more useful during active pipeline runs.
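The filename-collision fix from item 2 is tiny; a sketch that slugs the title and appends a short URL hash:

```python
import hashlib
import re

def artifact_filename(job_title, job_url, suffix=".pdf"):
    """Slugify the title and append a short URL hash so two jobs with
    identical titles never overwrite each other's PDFs."""
    slug = re.sub(r"[^a-z0-9]+", "-", job_title.lower()).strip("-")
    digest = hashlib.sha256(job_url.encode()).hexdigest()[:8]
    return f"{slug}-{digest}{suffix}"
```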

Nice to Have

  1. Job deduplication across sources — Same job appears on LinkedIn, Indeed, and the company's Workday portal. Fuzzy matching on title +
    company could reduce noise.

  2. Apply success verification — After submitting, check for confirmation emails or "application received" pages to verify success beyond the
    agent's self-reported confidence.

  3. Metrics/analytics — Track conversion rates over time, cost per application, which sources yield the best fit scores, etc.

  4. Config validation (applypilot doctor) — v0.3.0 added this, which is great. Expanding it to validate API key quotas, model availability,
    and browser setup would help a lot with onboarding.
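For the cross-source deduplication in item 1, stdlib difflib gets surprisingly far; a sketch that fuzzy-matches titles within a normalized company (threshold is a guess to tune):

```python
from difflib import SequenceMatcher

def is_duplicate(job_a, job_b, threshold=0.9):
    """Fuzzy title match within the same normalized company; aimed at
    the LinkedIn/Indeed/Workday triple-listing case."""
    if job_a["company"].strip().lower() != job_b["company"].strip().lower():
        return False
    ratio = SequenceMatcher(
        None, job_a["title"].lower(), job_b["title"].lower()
    ).ratio()
    return ratio >= threshold
```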


Setup Notes for Other Users

Things I wish I knew before starting:

  1. Call config.load_env() before importing llm — The LLM client reads API keys at import time. Get the order wrong and you get silent
    failures.
  2. Set high max_tokens — If you're using Gemini 2.5+, thinking tokens eat your budget. 2048 is not enough.
  3. pip install -e . — Editable install means source edits take effect immediately. Great for iterating.
  4. Docker MCP Toolkit — If you have Docker Desktop, disable MCP Toolkit or use --strict-mcp-config for apply.
  5. Workday jobs are the majority — Many "LinkedIn" and "Indeed" jobs link to Workday portals. Expect login walls.
  6. The Gemini free tier works — But you'll hit 429s frequently. The fallback chain handles it, just takes longer.

Summary

ApplyPilot is an impressive project that delivers on its core promise — I went from zero to 112 real job applications in about a week with
minimal manual intervention. The Tier 1→2 pipeline (discover → enrich → score → tailor → cover) is solid. The Tier 3 auto-apply works but is
bottlenecked by Workday login requirements.

The architecture is well-designed and extensible. I was able to add significant features (HN scraper, dashboard, company-aware prioritization,
two-tier models, Chrome extensions) without fighting the codebase. The three-tier separation of concerns is clean.

Thank you @Pickle-Pixel for open-sourcing this! Happy to contribute any of my fixes/features back upstream if there's interest. Just note it might take me a while: I made a lot of changes while using the pipeline as I built on it, and I need to clean the fork up and sanitize it of my personal information.

