
Zanwen (Ryan) Fu

ML Engineer · Founder
I build agentic AI systems that actually work in the real world —
reliable, observable, and designed to survive production, not just demos.

Website LinkedIn Email


Incoming MLE at Robinhood (Agentic AI) · M.S. CS (AI/ML) at Duke · B.Comp. CS (Hons, Distinction) at NUS

The agent itself is the easy part. The harness that makes it work overnight, unsupervised, and recoverable when it fails is what actually matters. I wrote about this in Beyond the Harness: An Operating System for AI Agents — mapping OS primitives onto multi-agent infrastructure: git worktree as memory, branches as execution contexts, CLI-first tool discovery, and a transparency dashboard that turns every agent decision into an auditable commit.
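As a rough illustration of the worktree-as-memory idea from the post (the helper names and layout below are mine, not the post's actual code):

```python
# Sketch: one branch + worktree per agent = an isolated execution context
# whose commit history doubles as durable, auditable memory.
import pathlib
import subprocess
import tempfile

# Identity flags so commits work without global git config (illustrative).
GIT_ID = ["-c", "user.name=agent", "-c", "user.email=agent@localhost"]

def spawn_agent_workspace(repo: str, agent_id: str) -> pathlib.Path:
    """Create a dedicated branch and worktree for one agent."""
    path = pathlib.Path(tempfile.mkdtemp(prefix="agents-")) / agent_id
    subprocess.run(
        ["git", "-C", repo, "worktree", "add", "-b", f"agent/{agent_id}", str(path)],
        check=True, capture_output=True,
    )
    return path

def record_decision(workspace: pathlib.Path, note: str) -> None:
    """Every agent decision lands as a commit, so `git log` is the audit trail."""
    subprocess.run(
        ["git", "-C", str(workspace), *GIT_ID, "commit", "--allow-empty", "-m", note],
        check=True, capture_output=True,
    )
```

Because each context is just a branch, the "transparency dashboard" view falls out of plain `git log` over the agent branches.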


What I've shipped

VYNN AI — Agentic financial analyst platform  ·  sole engineer  ·  ~500 users  ·  50K+ LOC

LangGraph supervisor orchestrates 5 specialized agents for end-to-end equity research: data scraping → DCF modeling (6 sector strategies) → news intelligence → report generation — all in under 7 minutes.

The hard part wasn't the LLM calls; it was making the numbers trustworthy. Strict semantic-symbolic separation: the LLM never touches a number. All financial computation is deterministic Python; the LLM writes prose around immutable FixedNumbers. The recommendation engine uses a 3-layer trust architecture: deterministic math (RecommendationCalculator) → LLM narrative with citation IDs → regex-based validator that blocks publication if citation coverage drops below 95%.

Built a custom 1,293-line Excel formula evaluator so the DCF workbook and downstream JSON stay perfectly consistent without requiring Excel. Nightly CI runs a golden-dataset regression suite across 100 QQQ companies and blocks deployment if valuations drift beyond a threshold.
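A minimal sketch of the semantic-symbolic separation and the coverage gate; `FixedNumber` and the 95% threshold come from the description above, but this implementation is illustrative, not VYNN's actual code:

```python
# The LLM never produces a number: it may only cite immutable FixedNumbers
# by id, and a regex validator gates publication on citation coverage.
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class FixedNumber:
    """An immutable, deterministically computed value with a citation id."""
    cite_id: str
    value: float
    unit: str = ""

    def render(self) -> str:
        # What gets interpolated into the narrative, tagged for validation.
        return f"{self.value:,.2f}{self.unit} [#{self.cite_id}]"

CITATION = re.compile(r"\[#(\w+)\]")

def citation_coverage(prose: str, facts: dict[str, FixedNumber]) -> float:
    """Fraction of deterministic facts the narrative actually cites."""
    cited = set(CITATION.findall(prose))
    return len(cited & set(facts)) / len(facts)

def validate(prose: str, facts: dict[str, FixedNumber],
             threshold: float = 0.95) -> bool:
    # Publication is blocked when coverage drops below the threshold.
    return citation_coverage(prose, facts) >= threshold
```

A narrative like `f"Fair value is {fv.render()}."` passes; prose with uncited or fabricated figures fails the gate before it ever reaches a user.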

Full technical retrospective: Building VYNN AI: 50,000 Lines of Code, One Engineer, and Everything I Learned

stock-analyst (agent backend) · vynnai-web (platform frontend) · api-runner (API layer)


AutoCodeRover — Autonomous code repair agent  ·  core technology acquired by Sonar

Designed the Self-Fix Agent: when a patch fails, an LLM-as-a-Judge diagnoses which pipeline stage caused the failure, generates corrective feedback, and replays from that stage — preserving upstream state via UUID-targeted responses. Built a stateful replay mechanism so developers can inject feedback on any intermediate reasoning step and trigger selective re-execution downstream. Result: 51.6% on SWE-bench Verified (up from 38.4%), 1.8× patch precision over the next-best open-source agent. Sonar's Foundation Agent, built on the AutoCodeRover core, has since reached 79.2% (#1 on the SWE-bench leaderboard, Feb 2026).
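A hedged sketch of the stage-addressed replay idea; the class and method names here are mine, not ACR's actual interfaces:

```python
# Stages are keyed by UUID; a replay re-executes from the diagnosed stage
# onward while upstream stages are served from cached state.
import uuid
from typing import Any, Callable

class ReplayPipeline:
    def __init__(self, stages: list[tuple[str, Callable[[Any, Any], Any]]]):
        self.stages = stages
        self.ids = {name: str(uuid.uuid4()) for name, _ in stages}
        self.cache: dict[str, Any] = {}  # stage uuid -> last output

    def run(self, inp: Any, replay_from: str = None, feedback: str = None) -> Any:
        out = inp
        started = replay_from is None
        for name, fn in self.stages:
            if not started and name != replay_from:
                out = self.cache[self.ids[name]]  # upstream state preserved
                continue
            started = True
            # Corrective feedback (e.g. from the LLM judge) is injected only
            # at the stage being replayed; downstream stages run normally.
            out = fn(out, feedback if name == replay_from else None)
            self.cache[self.ids[name]] = out
        return out
```

On a failure, the judge names the offending stage (say `replay_from="patch"`); only that stage and everything downstream re-executes, with the feedback in context.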

I also built the JetBrains IDE plugin end-to-end in Kotlin to bring ACR into the developer workflow. A GumTree 3-way AST merge reconciles your live edits with the agent's patch at the tree-node level (Git diffs lines; GumTree diffs AST nodes, so a renamed variable and an added null-check merge cleanly). PSI-based context enrichment extracts symbol references, cursor history, and open files to narrow the agent's search scope before it starts. Embedded SonarLint Core 10.3.0 provides one-click static analysis → autonomous fix.
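Not the plugin's GumTree code, but a tiny Python stand-in for the claim above: the rename and the null-check collide on the same *lines*, yet live on orthogonal *tree* dimensions (identifiers vs. structure), which is why a node-level merge composes them cleanly.

```python
import ast
import difflib

base = """def total(items):
    s = 0
    for item in items:
        s += item.price
    return s
"""

# Your live edit: rename the accumulator.
yours = """def total(items):
    acc = 0
    for item in items:
        acc += item.price
    return acc
"""

# The agent's patch: add a null-check around the same statement.
agent = """def total(items):
    s = 0
    for item in items:
        if item.price is not None:
            s += item.price
    return s
"""

def touched(a: str, b: str) -> set:
    """Base-file line indices a textual diff marks as changed."""
    sm = difflib.SequenceMatcher(None, a.splitlines(), b.splitlines())
    return {i for tag, i1, i2, _, _ in sm.get_opcodes()
            if tag != "equal" for i in range(i1, i2)}

class Anon(ast.NodeTransformer):
    """Alpha-rename every variable so only structure remains."""
    def visit_Name(self, node):
        node.id = "_"
        return node

def skeleton(src: str) -> str:
    return ast.dump(Anon().visit(ast.parse(src)))

# Line view: both edits touch the same region -> textual 3-way merge conflicts.
assert touched(base, yours) & touched(base, agent)
# Node view: the rename changes no structure; the null-check changes only
# structure -- orthogonal edits an AST merge can apply side by side.
assert skeleton(base) == skeleton(yours)
assert skeleton(base) != skeleton(agent)
```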

Full design process: From Research Agent to Acquired Product

auto-code-rover (agent backend) · jetbrains-ide-plugin (Kotlin, end-to-end)


AutoEvolve — Autonomous research agent for OpenAI's Parameter Golf

Instead of hand-tuning training scripts for the 16MB model challenge, I built an autonomous research loop — inspired by Karpathy's autoresearch — that treats GPU hours as a scarce resource and maintains structured scientific memory across experiments. The agent proposes targeted mutations to a single training script, runs bounded experiments on H100, classifies outcomes, and iterates.

The design decisions that matter: an incumbent/frontier search model where only measured improvements replace the current best, but near-misses can open short-lived exploratory branches with bounded follow-on budget. A structured memory dossier generated deterministically from git-committed experiment state — organized by proposal family, outcome classification, and timing telemetry — instead of dumping raw logs back into context. A repeat-family guard that blocks the agent from re-proposing failed approaches unless it provides a concrete mechanism for why this time is different. Two-stage research strategy: discover plausible ideas under a longer 1×H100 proxy budget, promote only the strongest to real 8×H100 final validation. Designed for unattended overnight runs with automatic rollback and per-experiment git commits.
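The acceptance rule and repeat-family guard described above can be sketched as follows; thresholds, names, and the margin value are illustrative stand-ins, not AutoEvolve's actual parameters:

```python
# Incumbent/frontier search: only measured improvements replace the best;
# near-misses earn a short-lived branch with a bounded follow-on budget.
from dataclasses import dataclass, field

@dataclass
class IncumbentSearch:
    best_score: float
    near_miss_margin: float = 0.02            # assumed: what counts as "near"
    branch_budget: int = 2                    # bounded runs per exploratory branch
    frontier: list = field(default_factory=list)
    failed_families: set = field(default_factory=set)

    def propose_ok(self, family: str, mechanism: str = None) -> bool:
        """Repeat-family guard: a failed approach needs a concrete mechanism
        for why this time is different before it can be re-proposed."""
        return family not in self.failed_families or mechanism is not None

    def observe(self, exp_id: str, family: str, score: float) -> str:
        if score > self.best_score:           # only measured wins promote
            self.best_score = score
            return "promote"
        if score > self.best_score - self.near_miss_margin:
            self.frontier.append((exp_id, self.branch_budget))
            return "branch"                   # short-lived exploration
        self.failed_families.add(family)
        return "discard"
```

The two-stage strategy then runs `observe` under the cheap 1×H100 proxy budget and promotes only `"promote"` outcomes to the 8×H100 final validation.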

autoevolve


LUMINA — Multi-agent citation screening for medical systematic reviews  ·  first author

Four-agent pipeline that mirrors the human peer-review process: classifier triage → PICOS-guided Chain-of-Thought screening → LLM-as-a-Judge reviewer → self-correction agent that re-screens when the reviewer and screener disagree. Evaluated across 15 SRMAs (~150K citations from BMJ, JAMA, Lancet). 98.2% sensitivity (10 of 15 at perfect 100%) with 35× fewer missed studies vs. prior baselines, at <$0.01/article.
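A minimal sketch of the four-agent control flow; the callables below stand in for the paper's LLM agents and are not LUMINA's actual interfaces:

```python
# classifier triage -> PICOS-guided CoT screening -> LLM-as-a-Judge review
# -> self-correction re-screen only when screener and reviewer disagree.
from typing import Callable

def screen_citation(
    citation: str,
    triage: Callable[[str], bool],                  # cheap classifier gate
    screener: Callable[[str], str],                 # PICOS CoT include/exclude
    judge: Callable[[str, str], str],               # second-opinion reviewer
    self_correct: Callable[[str, str, str], str],   # re-screen on disagreement
) -> str:
    if not triage(citation):
        return "exclude"          # obvious exclusions never reach the LLM screener
    verdict = screener(citation)
    review = judge(citation, verdict)
    if review != verdict:
        # Disagreement: the self-correction agent re-screens with both
        # opinions in context before a final verdict is issued.
        verdict = self_correct(citation, verdict, review)
    return verdict
```

The disagreement branch is what drives sensitivity: a borderline citation gets a second, adversarially informed pass instead of a single shot.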


What I think about

  • The harness is the bottleneck, not the model. VYNN's regression suite, ACR's eval loop, and AutoEvolve's memory dossier taught me the same lesson: catching agent failures and learning from them matters more than improving the agent itself. I wrote about what a real agent infrastructure layer should look like — borrowing from OS design — in Beyond the Harness.

  • Context engineering is the real leverage. Most agent failures I've debugged trace back to what the agent didn't know, not what it reasoned poorly about. PSI-based enrichment in the ACR plugin, blackboard-pattern state sharing in VYNN, AutoEvolve's memory dossier over raw transcripts — all different bets on the same problem: right information, right time, minimum noise.

  • "Usually right" isn't good enough. VYNN's 3-layer recommendation validator exists because LLMs fabricate financial numbers. ACR's GumTree merge exists because git apply fails when code has diverged. AutoEvolve's two-stage proxy→validation exists because GPU hours are finite. I'm drawn to the engineering that makes agents trustworthy enough to run unsupervised.


Writing

Beyond the Harness: An Operating System for AI Agents · Git worktree as agent memory, CLI-first tool discovery, and why everyone stops at the OS metaphor.
From Research Agent to Acquired Product · AST-level patch merging, interactive feedback loops, and the gap between benchmarks and developer UX.
Building VYNN AI · Semantic-symbolic separation, architectural mistakes, and what 500 real users teach you about agent reliability.

Open to full-time SWE & ML engineering roles for 2027. Last updated: Mar 2026
