ML Engineer · Founder
I build agentic AI systems that actually work in the real world —
reliable, observable, and designed to survive production, not just demos.
Incoming MLE at Robinhood (Agentic AI) · M.S. CS (AI/ML) at Duke · B.Comp. CS (Hons, Distinction) at NUS
The agent itself is the easy part. The harness that makes it work overnight, unsupervised, and recoverable when it fails is what actually matters. I wrote about this in Beyond the Harness: An Operating System for AI Agents — mapping OS primitives onto multi-agent infrastructure: git worktree as memory, branches as execution contexts, CLI-first tool discovery, and a transparency dashboard that turns every agent decision into an auditable commit.
VYNN AI — Agentic financial analyst platform · sole engineer · ~500 users · 50K+ LOC
LangGraph supervisor orchestrates 5 specialized agents for end-to-end equity research: data scraping → DCF modeling (6 sector strategies) → news intelligence → report generation — all in under 7 minutes. The hard part wasn't the LLM calls; it was making the numbers trustworthy. Strict semantic-symbolic separation: the LLM never touches a number. All financial computation is deterministic Python; the LLM writes prose around immutable FixedNumbers. The recommendation engine uses a 3-layer trust architecture: deterministic math (RecommendationCalculator) → LLM narrative with citation IDs → regex-based validator that blocks publication if coverage drops below 95%. Built a custom 1,293-line Excel formula evaluator so the DCF workbook and downstream JSON stay perfectly consistent without requiring Excel. Nightly CI runs a golden-dataset regression suite across 100 QQQ companies and blocks deployment if valuations drift beyond threshold.
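The third trust layer can be sketched in a few lines. This is a minimal illustration, not VYNN's actual validator: the `[C1]`-style citation-ID format and the `validate_coverage` helper are hypothetical names, but the idea is the same — any sentence that asserts a number must carry citation IDs that resolve to deterministically computed figures, and the report is blocked if coverage falls below the threshold.

```python
import re

CITATION = re.compile(r"\[(C\d+)\]")  # hypothetical citation-ID format, e.g. "[C1]"

def validate_coverage(narrative: str, valid_ids: set[str], threshold: float = 0.95) -> bool:
    """Block publication unless enough numeric claims cite known computed values."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", narrative) if s.strip()]
    covered = checked = 0
    for s in sentences:
        ids = CITATION.findall(s)
        # Does the sentence assert a number once citations are stripped out?
        has_number = bool(re.search(r"\d", CITATION.sub("", s)))
        if not has_number and not ids:
            continue  # pure prose, nothing to verify
        checked += 1
        # Covered only if it cites at all AND every ID maps to a real computed figure
        if ids and all(i in valid_ids for i in ids):
            covered += 1
    return checked == 0 or covered / checked >= threshold
```

The design point is that the validator never judges whether the numbers are *right* — the deterministic layer already guarantees that — only whether the prose is traceable back to them.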
Full technical retrospective: Building VYNN AI: 50,000 Lines of Code, One Engineer, and Everything I Learned
→ stock-analyst (agent backend) · vynnai-web (platform frontend) · api-runner (API layer)
AutoCodeRover — Autonomous code repair agent · core technology acquired by Sonar
Designed the Self-Fix Agent: when a patch fails, an LLM-as-a-Judge diagnoses which pipeline stage caused the failure, generates corrective feedback, and replays from that stage — preserving upstream state via UUID-targeted responses. Built a stateful replay mechanism so developers can inject feedback on any intermediate reasoning step and trigger selective re-execution downstream. Result: 51.6% on SWE-bench Verified (up from 38.4%), 1.8× patch precision over next-best open-source agent. Sonar's Foundation Agent, built on the AutoCodeRover core, has since reached 79.2% — #1 on the SWE-bench leaderboard (Feb 2026).
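The replay mechanic reduces to a simple control loop. This is an illustrative sketch only — `run_with_self_fix`, `StageResult`, and the judge's verdict schema are names I'm inventing here, not ACR's API — but it shows the core move: on failure, a judge names the offending stage, upstream results are kept, and only downstream stages re-execute.

```python
import uuid
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class StageResult:
    stage: str
    output: object
    uid: str = field(default_factory=lambda: str(uuid.uuid4()))  # replay target

def run_with_self_fix(stages: list[tuple[str, Callable]], task, judge, max_retries: int = 2):
    """Run the pipeline; on failure, replay from the judge-diagnosed stage,
    preserving all upstream results. (The real system also injects the judge's
    corrective feedback into the retried stage; elided here for brevity.)"""
    results: list[StageResult] = []
    start = 0
    for _attempt in range(max_retries + 1):
        state = results[start - 1].output if start else task
        for name, fn in stages[start:]:
            state = fn(state)
            results.append(StageResult(name, state))
        verdict = judge(results)  # e.g. {"ok": bool, "failed_stage": str, "feedback": str}
        if verdict["ok"]:
            return state
        # Truncate to just before the diagnosed stage; upstream state survives.
        start = next(i for i, (n, _) in enumerate(stages) if n == verdict["failed_stage"])
        results = results[:start]
    return None
```

The interesting property is that a retry costs only the failed suffix of the pipeline, which is what makes judge-in-the-loop repair affordable.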
I also built the JetBrains IDE plugin end-to-end in Kotlin to bring ACR into the developer workflow. It ships a GumTree 3-way AST merge that reconciles your live edits with the agent's patch at the tree-node level: Git diffs lines; GumTree diffs AST nodes, so a renamed variable and an added null-check merge cleanly. PSI-based context enrichment extracts symbol references, cursor history, and open files to narrow the agent's search scope before it starts, and embedded SonarLint Core 10.3.0 enables one-click static analysis → autonomous fix.
Full design process: From Research Agent to Acquired Product
→ auto-code-rover (agent backend) · jetbrains-ide-plugin (Kotlin, end-to-end)
AutoEvolve — Autonomous research agent for OpenAI's Parameter Golf
Instead of hand-tuning training scripts for the 16MB model challenge, I built an autonomous research loop — inspired by Karpathy's autoresearch — that treats GPU hours as a scarce resource and maintains structured scientific memory across experiments. The agent proposes targeted mutations to a single training script, runs bounded experiments on H100, classifies outcomes, and iterates.
The design decisions that matter: an incumbent/frontier search model where only measured improvements replace the current best, but near-misses can open short-lived exploratory branches with bounded follow-on budget. A structured memory dossier generated deterministically from git-committed experiment state — organized by proposal family, outcome classification, and timing telemetry — instead of dumping raw logs back into context. A repeat-family guard that blocks the agent from re-proposing failed approaches unless it provides a concrete mechanism for why this time is different. Two-stage research strategy: discover plausible ideas under a longer 1×H100 proxy budget, promote only the strongest to real 8×H100 final validation. Designed for unattended overnight runs with automatic rollback and per-experiment git commits.
LUMINA — Multi-agent citation screening for medical systematic reviews · first author
Four-agent pipeline that mirrors the human peer-review process: classifier triage → PICOS-guided Chain-of-Thought screening → LLM-as-a-Judge reviewer → self-correction agent that re-screens when the reviewer and screener disagree. Evaluated across 15 SRMAs (~150K citations from BMJ, JAMA, Lancet). 98.2% sensitivity (10 of 15 at perfect 100%) with 35× fewer missed studies vs. prior baselines, at <$0.01/article.
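The disagreement-driven control flow is compact enough to show directly. A hedged sketch — the function names and result dicts are illustrative, and the classifier-triage stage is omitted: the screener decides, the reviewer judges that decision, and only a disagreement triggers the self-correction re-screen with both rationales in context.

```python
def screen_citation(citation, screener, reviewer, self_corrector):
    """Screener decides; reviewer audits; on disagreement, a self-correction
    agent re-screens with both rationales available as context."""
    first = screener(citation)                     # {"include": bool, "rationale": str}
    check = reviewer(citation, first)              # {"agrees": bool, "rationale": str}
    if check["agrees"]:
        return first                               # consensus: no extra LLM call
    return self_corrector(citation, first, check)  # re-screen only contested cases
```

Because most citations are uncontested, the expensive self-correction pass runs on a small fraction of the corpus, which is how the per-article cost stays under a cent.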
- The harness is the bottleneck, not the model. VYNN's regression suite, ACR's eval loop, and AutoEvolve's memory dossier taught me the same lesson: catching agent failures and learning from them matters more than improving the agent itself. I wrote about what a real agent infrastructure layer should look like — borrowing from OS design — in Beyond the Harness.
- Context engineering is the real leverage. Most agent failures I've debugged trace back to what the agent didn't know, not what it reasoned poorly about. PSI-based enrichment in the ACR plugin, blackboard-pattern state sharing in VYNN, AutoEvolve's memory dossier over raw transcripts — all different bets on the same problem: right information, right time, minimum noise.
- "Usually right" isn't good enough. VYNN's 3-layer recommendation validator exists because LLMs fabricate financial numbers. ACR's GumTree merge exists because git apply fails when code has diverged. AutoEvolve's two-stage proxy→validation exists because GPU hours are finite. I'm drawn to the engineering that makes agents trustworthy enough to run unsupervised.
| Writing | What it covers |
| --- | --- |
| Beyond the Harness: An Operating System for AI Agents | Git worktree as agent memory, CLI-first tool discovery, and why everyone stops at the OS metaphor. |
| From Research Agent to Acquired Product | AST-level patch merging, interactive feedback loops, and the gap between benchmarks and developer UX. |
| Building VYNN AI | Semantic-symbolic separation, architectural mistakes, and what 500 real users teach you about agent reliability. |
Open to full-time SWE & ML engineering roles for 2027. Last updated: Mar 2026


