Field Report: Running ApplyPilot End-to-End
I've been running ApplyPilot as my primary job search tool for about two days (Senior/Staff Backend/Platform Engineer, Seattle/Remote). This is a detailed field report covering setup experience, pipeline results, bugs I encountered and fixed, and suggestions for improvement.
Note: I generated this report myself, and it references features I built that are not integrated into ApplyPilot. I'll need time to clean up my fork before publishing since it has substantial changes; they're detailed in this report as well. I'm also building the next pipeline: a markdown + CLI based job tracker that auto-enriches with information.
Also, for whoever is reading this: if you want to interview or hire me, you can reach out to me on LinkedIn: https://www.linkedin.com/in/elninja/
TL;DR: The concept is excellent and the architecture is sound. I got 112 successful applications out of 1,503 discovered jobs (7.5%
end-to-end conversion). The Tier 1→2 pipeline works great after some fixes. Tier 3 (auto-apply) is where most friction lives — Workday login
walls block ~70% of apply attempts.
Pipeline Results
Funnel
| Stage | Count | Drop-off |
|---|---|---|
| Discovered | 1,503 | — |
| Enriched | 1,420 | 5.5% lost (detail errors) |
| Scored | 1,420 | 0% (all enriched jobs scored) |
| Score 7+ | 543 | 62% filtered (working as intended) |
| Tailored | 543 | 0% (100% of 7+ tailored) |
| Cover Letter | 543 | 0% (100% covered) |
| Applied | 112 | 79% blocked at apply stage |
Score Distribution
| Score | Count |
|---|---|
| 10 | 98 |
| 9 | 249 |
| 8 | 106 |
| 7 | 90 |
| 6 | 148 |
| 5 | 185 |
| 4 | 157 |
| 3 | 171 |
| 2 | 70 |
| 1 | 146 |
Scoring skews high — 36% of jobs scored 7+. For my profile (10+ years, Go/Kotlin/Python/K8s/AWS), this seems reasonable but I wonder if the
scoring prompt could be tighter.
Sources
| Source | Jobs |
|---|---|
| LinkedIn | 601 |
| Indeed | 235 |
| Thomson Reuters (Workday) | 174 |
| Netflix (Workday) | 103 |
| Dice | 81 |
| SimplyHired | 77 |
| Moderna (Workday) | 46 |
| Motorola (Workday) | 38 |
| NVIDIA (Workday) | 34 |
| Talent.com | 26 |
| HN "Who is Hiring" | ~30 |
| Others | ~58 |
Apply Results
112 successful applications to companies including Netflix (7), Airbnb, HubSpot, Mastercard, NVIDIA, Grafana Labs, Twilio, Rippling, and others.
Apply errors (589 total):
| Error | Count | % of Errors |
|---|---|---|
| `workday_login_required` | 470 | 80% |
| `not_eligible_location` | 40 | 7% |
| `expired` | 26 | 4% |
| `email_verification` | 10 | 2% |
| `stuck` / `captcha` / `account_required` | 18 | 3% |
| Other | 25 | 4% |
The Workday login wall is the #1 blocker. 470 out of 589 apply errors (80%) are because Workday requires an authenticated session that the agent
can't handle.
Artifacts Generated
- 2,019 tailored resume files (txt + pdf pairs)
- 1,083 cover letter files (txt + pdf pairs)
- 336 apply agent log files
Time Estimates Per Stage
These are rough estimates from running the pipeline across multiple sessions:
| Stage | Time | Notes |
|---|---|---|
| Discovery | ~5-10 min | JobSpy + Workday scraping in parallel. Workday is the bottleneck. |
| Enrichment | ~15-20 min | 1,420 jobs, mostly fast. Some sites need Playwright fallback. |
| Scoring | ~20-30 min | Gemini Flash handles this well. Rate limits add some wait time. |
| Tailoring | ~45-60 min | Quality model (Gemini Pro), validation loop adds retries. Most time-intensive Tier 2 stage. |
| Cover Letters | ~20-30 min | Similar to scoring in complexity. |
| Auto-Apply | ~4-6 hours total | 2 workers, ~2-5 min per job. Chrome startup + form navigation is slow. |
| Total | ~6-8 hours | Spread across multiple sessions over ~1 week |
The `--stream` flag for running score + tailor + cover concurrently is a huge time saver.
Bugs I Encountered and Fixed
1. Gemini Thinking Token Budget (Related to #12)
Gemini 2.5+ models use "thinking tokens" that consume the max_tokens budget. The default 2048 was far too low — a simple scoring response needs
~30 visible tokens but the model burns 1200+ on thinking. I had to increase to:
- Scoring: 8,192
- Tailoring validation: 4,096, generation: 16,384
- Cover letters: 8,192
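A per-task budget table is one way to make these limits explicit instead of relying on a single global default. This is a minimal sketch; the key names are illustrative, not ApplyPilot's actual config keys.

```python
# Per-task max_tokens budgets with headroom for thinking tokens.
# A global default of 2048 starves thinking models; these values
# reflect what worked for me in practice.
TOKEN_BUDGETS = {
    "scoring": 8192,
    "tailoring_validation": 4096,
    "tailoring_generation": 16384,
    "cover_letter": 8192,
}

def max_tokens_for(task: str, default: int = 8192) -> int:
    """Look up the budget for a task, falling back to a safe default."""
    return TOKEN_BUDGETS.get(task, default)
```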
2. LLM Client Singleton / Stale Environment (Related to #9)
llm.py reads API keys at module import time. If config.load_env() isn't called before importing llm, the client has no keys. I restructured
the import order to ensure env loading happens first.
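An alternative to reordering imports is to make the client lazy, so the key is read at first use rather than at import time. A rough sketch of that pattern (the client dict here is a stand-in for the real SDK client, and the error message is illustrative):

```python
import os

_client = None

def get_client():
    """Create the LLM client on first use instead of at module import,
    so config.load_env() only needs to run before the first call."""
    global _client
    if _client is None:
        key = os.environ.get("GEMINI_API_KEY")
        if not key:
            raise RuntimeError(
                "GEMINI_API_KEY not set; call config.load_env() first"
            )
        _client = {"api_key": key}  # stand-in for the real SDK client
    return _client
```

This removes the silent-failure mode entirely: a missing key becomes a loud error at the first LLM call rather than an empty client created at import.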
3. Model Fallback Chain Needed Updating
The original model list included deprecated Gemini models. I rebuilt the cascade:
- Fast: `gemini-2.5-flash` → `gemini-3-flash` → `gemini-2-flash` → `gemini-2-flash-lite` → `gpt-4.1-nano` → `gpt-4.1-mini` → `claude-haiku-4-5`
- Quality: `gemini-3.1-pro-preview` → `gemini-2.5-pro` → `gemini-3-pro` → `gemini-2.5-flash` → `gpt-4.1-mini` → `gpt-4.1-nano` → `claude-sonnet-4-5` → `claude-haiku-4-5`
The 429 rate-limit handling (mark model exhausted for 5 min, fall to next) works great once the chain is populated.
4. Docker MCP Toolkit Interference
If Docker Desktop with MCP Toolkit is installed, it exposes mcp__MCP_DOCKER__browser_* tools that shadow the local Playwright MCP server. These
Docker-based tools can't access host files, breaking resume/cover letter uploads. Fix: pass --strict-mcp-config to the Claude subprocess in
launcher.py.
5. URL Normalization
Many Workday scraped URLs were relative (e.g., /en/sites/CX/job/12345). These broke enrichment. I added URL normalization at insert time using
base URLs from sites.yaml.
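The normalization itself is a one-liner with `urllib.parse.urljoin`. A minimal sketch, assuming base URLs keyed by site slug (the dict below is illustrative; the real values live in sites.yaml):

```python
from urllib.parse import urljoin

# Illustrative stand-in for the base URLs loaded from sites.yaml.
BASE_URLS = {"netflix": "https://netflix.wd1.myworkdayjobs.com"}

def normalize_url(url: str, site: str) -> str:
    """Resolve relative scraped URLs against the site's base URL;
    pass absolute URLs through unchanged."""
    if url.startswith(("http://", "https://")):
        return url
    return urljoin(BASE_URLS[site], url)
```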
6. Company Extraction
Jobs from aggregators (Indeed, LinkedIn) had no company field, making it hard to spread applications across employers. I added company
extraction from application_url domains (patterns for Workday, Greenhouse, Lever, iCIMS, Ashby).
7. ANTHROPIC_API_KEY Leaking to Subprocess
When the apply launcher spawns claude subprocesses, if ANTHROPIC_API_KEY is in the environment, it overrides Max plan auth and bills to the
API key instead. I added explicit env stripping in launcher.py.
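The stripping amounts to building a child environment without the key before spawning. A sketch of the shape of the fix (function names are mine, not launcher.py's):

```python
import os
import subprocess

def clean_env(env=None):
    """Copy the environment minus ANTHROPIC_API_KEY so the spawned
    claude subprocess falls back to Max plan auth instead of billing
    the API key."""
    env = dict(os.environ if env is None else env)
    env.pop("ANTHROPIC_API_KEY", None)
    return env

def spawn_claude(args):
    """Launch claude with the sanitized environment."""
    return subprocess.Popen(["claude", *args], env=clean_env())
```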
8. Fabrication in Cover Letters
One cover letter for SeatGeek fabricated a company name from my resume ("Underground Elephant" was a real company I worked at, but the LLM used
it in the wrong context). The resume_facts system helps but isn't bulletproof.
9. Banned Word False Positives (Related to #10)
The fabrication watchlist used substring matching — "rust" matched "TrustSec", "dedicated" matched a legitimate resume phrase. I changed banned
words to warnings rather than hard errors, letting the LLM judge handle tone.
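The substring-to-word-boundary change is a one-regex fix. A minimal sketch, with a two-word watchlist for illustration:

```python
import re

# Illustrative subset of the watchlist.
BANNED = ["rust", "dedicated"]

def banned_word_warnings(text: str) -> list[str]:
    """Return watchlist words that appear as whole words (word-boundary
    match, case-insensitive) — reported as warnings, not hard errors."""
    return [
        word
        for word in BANNED
        if re.search(rf"\b{re.escape(word)}\b", text, re.IGNORECASE)
    ]
```

With `\b` anchors, "TrustSec" no longer trips "rust" and "dedication" no longer trips "dedicated", while genuine whole-word uses still get flagged.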
10. Chrome Extension Path Resolution
The apply agent loads uBlock Origin and 1Password from the user's Chrome profile. Extension paths include version directories that change on
updates. I added dynamic resolution that picks the latest version directory and silently skips uninstalled extensions.
What We Built On Top
Beyond bug fixes, here are features I added to my fork:
- Hacker News "Who is Hiring" scraper — Parses monthly threads, deobfuscates emails, creates synthetic URLs for contact-only posts
- HTML Dashboard (`applypilot dashboard`) — Rich dashboard with Active/Archive/Applied tabs, fit score badges, company grouping, one-click links to applications
- Company-aware apply prioritization — `ROW_NUMBER() PARTITION BY company` in the job acquisition query spreads applications across employers instead of applying to 10 Netflix jobs in a row
- Two-tier model strategy — Flash models for speed-critical tasks, Pro models for quality writing
- Streaming pipeline mode — `applypilot run score tailor cover --stream` runs stages concurrently
- Chrome extension loading — uBlock (faster page loads) + 1Password (credential auto-fill) loaded dynamically
- Workday employer registry — `employers.yaml` with 48 Workday employer portals for direct scraping
- Smart Extract fallback — AI-powered extraction when JSON-LD and CSS selectors fail
- Apply agent verification — Post-submission confidence scoring to verify applications actually went through
Suggestions for Improvement
High Priority
- Workday auth strategy — 80% of apply failures are `workday_login_required`. Options:
  - Pre-authenticated browser session (load cookies/profile)
  - Account creation flow before application
  - Mark Workday jobs as "manual apply" and generate a manual actions list
- Model config should be externalized — Hardcoded model lists break when Google deprecates models. A `models.yaml` config would let users update without code changes.
- `max_tokens` should scale with task — Default 2048 is too low for thinking models. The project should detect thinking-model capabilities and auto-adjust, or at minimum document this prominently.
- Apply error categorization — Currently errors are free-text strings. A structured error taxonomy would enable better retry logic (permanent vs. transient errors) and reporting.
Medium Priority
- Resume validation strictness should default to `normal` or `lenient` — Strict mode causes excessive retries (related to #4 and #14: the "must fit 1 page" hard rule conflicts with the preserved_companies requirement, causing EXHAUSTED_RETRIES). Most "failures" are false positives from substring matching.
- Duplicate PDF filename collisions (related to #11, #17) — Jobs with identical titles overwrite each other's PDFs. Hash the URL into the filename.
- Company field should be first-class — Add company extraction at discovery/enrichment time, not just from `application_url`. This enables better deduplication and employer diversity.
- Dashboard should be a long-running web server — The current `applypilot dashboard` generates a static HTML file. A live dashboard with auto-refresh would be much more useful during active pipeline runs.
Nice to Have
- Job deduplication across sources — Same job appears on LinkedIn, Indeed, and the company's Workday portal. Fuzzy matching on title + company could reduce noise.
- Apply success verification — After submitting, check for confirmation emails or "application received" pages to verify success beyond the agent's self-reported confidence.
- Metrics/analytics — Track conversion rates over time, cost per application, which sources yield the best fit scores, etc.
- Config validation (`applypilot doctor`) — v0.3.0 added this, which is great. Expanding it to validate API key quotas, model availability, and browser setup would help a lot with onboarding.
Setup Notes for Other Users
Things I wish I knew before starting:
- Call `config.load_env()` before importing `llm` — The LLM client reads API keys at import time. Get the order wrong and you get silent failures.
- Set high `max_tokens` — If you're using Gemini 2.5+, thinking tokens eat your budget. 2048 is not enough.
- `pip install -e .` — Editable install means source edits take effect immediately. Great for iterating.
- Docker MCP Toolkit — If you have Docker Desktop, disable MCP Toolkit or use `--strict-mcp-config` for apply.
- Workday jobs are the majority — Many "LinkedIn" and "Indeed" jobs link to Workday portals. Expect login walls.
- The Gemini free tier works — But you'll hit 429s frequently. The fallback chain handles it, just takes longer.
Summary
ApplyPilot is an impressive project that delivers on its core promise — I went from zero to 112 real job applications in about a week with
minimal manual intervention. The Tier 1→2 pipeline (discover → enrich → score → tailor → cover) is solid. The Tier 3 auto-apply works but is
bottlenecked by Workday login requirements.
The architecture is well-designed and extensible. I was able to add significant features (HN scraper, dashboard, company-aware prioritization,
two-tier models, Chrome extensions) without fighting the codebase. The three-tier separation of concerns is clean.
Thank you @Pickle-Pixel for open-sourcing this! Happy to contribute any of my fixes/features back upstream if there's interest. Note that it may take me a while: I made a lot of changes while using the pipeline at the same time I was building on it, and I still need to clean up the fork and sanitize it of my personal information.