What Happened After My GTM Agent Learned to Improve Itself
A follow-up to How I Built a Custom GTM Agent.
A few months ago I wrote about Conrad, the AI agent that runs go-to-market for my startup, VentureScope. The short version: he is a Cloudflare Worker powered by Claude who discovers investors, sends outreach, reads the replies, drafts responses for my approval, turns user feedback into GitHub PRs, and runs the whole thing on a cron schedule. I called him my AI head of business development.
That post ended with three words: the loop is closed.
This is what happened after. The loop started improving itself.
The first version of Conrad did the work. This version gets better at the work. It remembers what happened, grades itself across five dimensions, runs its own A/B tests, and once a week it will rewrite its own instructions if the numbers say it should. None of that existed when I hit publish in March. Here is what I built, what broke, and the parts that still feel a little strange to me.
If you did not read the first post, the architecture in one breath: a hub-and-spoke team of agents on Cloudflare D1 (SQLite), KV, R2, Vectorize, and Durable Objects, with a human approval gate on anything that leaves the building.
First, the boring but important update: the models changed
When I wrote the original post the team looked like this: Strategist on Opus 4.6, Executor and Ops on Sonnet 4.6, Coder on an OpenAI model. The current lineup:
- Strategist (hub) is now Claude Opus 4.8. It is the CEO. Every task routes through it.
- Executor and Ops (spokes) stay on Claude Sonnet 4.6. Drafting, research, CRM, pipeline.
- Coder (spoke) is GPT-5.5 Codex. It writes code to the VentureScope repo.
- A handful of cheap, high-frequency jobs (reply triage, memory reranking, a security classifier) run on Haiku 4.5.
The interesting part is not the version bump. Opus 4.8 is both more capable and cheaper per token than the 4.6 it replaced, which quietly made the most expensive parts of the system (the Opus-level review and coordination jobs) cost less while doing more. More on cost at the end.
The reason the lineup matters for the rest of this post is that the whole self-improvement story rests on one idea: use the expensive, smart model only at the moments that actually need it, and use cheap models everywhere else. That idea finally got a real implementation.
The advisor tool: Opus on tap without paying Opus prices
The single most useful thing I built this round is the advisor pattern.
Here's the problem it solves. Sonnet is great at drafting an email or running a pipeline update. But a handful of moments in any task are genuinely high-stakes: should we qualify this deal, is this the right pricing call, is this draft actually good enough to send. For those moments you want your best model. For the other ninety percent of the work, paying for it is a waste.
So now every worker agent can call an advisor() tool mid-task. When it does, the entire conversation so far, every tool call and every result, gets handed to Opus 4.8 for a second opinion. The advisor sees everything the agent has done and responds with short, specific direction. The agent takes that and keeps going.
This runs server-side through Anthropic's Managed Agents API, which was the other big architectural shift this round. Instead of hand-rolling the agent loop myself (build the prompt, serialize the tools, parse the tool calls, execute them, repeat), the API manages session state and tool routing. I register the three Anthropic agents once, then talk to them through stateful sessions. The Coder stays on the old hand-rolled path, since OpenAI does not offer the same session API, so Conrad runs a dual-path setup behind a feature flag. The hand-rolled path is still there, still tested, still the fallback.
The economics are the whole point. A typical Executor session might burn 50,000 tokens of Sonnet on research and drafting, and 2,000 tokens of Opus on the two judgment calls that actually mattered. The Strategist gets unlimited advisor access. The workers get a budget of two advisor calls per session, because if a worker needs Opus more than twice it should probably just hand the task back up the chain.
I tuned the advisor triggers per role, because "when should I ask for a second opinion" is a different question for an agent that writes emails than for one that mutates the CRM. The Executor calls the advisor before finalizing a draft, or when research turns up conflicting signals. Ops calls it before any bulk operation touching 50 or more records, or when discovery scores cluster in a way that looks off. The Strategist calls it on every pricing or qualification decision. All three share one rule: call before the work, not after, and give the answer real weight.
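The budget rule is simple enough to sketch. This is an illustration in Python rather than Conrad's actual Worker code, and every name in it (`ADVISOR_BUDGET`, `AdvisorSession`, `request_advice`) is hypothetical. It only shows the counting: workers get two calls per session, the Strategist is uncapped, and a worker that runs out hands the task back up.

```python
from dataclasses import dataclass

# Hypothetical per-role budgets; None means unlimited (the Strategist).
ADVISOR_BUDGET = {"strategist": None, "executor": 2, "ops": 2}


@dataclass
class AdvisorSession:
    role: str
    calls_made: int = 0

    def request_advice(self) -> str:
        """Return "advise" if a call is allowed, else "escalate"."""
        budget = ADVISOR_BUDGET[self.role]
        if budget is not None and self.calls_made >= budget:
            # Out of budget: hand the task back up the chain.
            return "escalate"
        self.calls_made += 1
        return "advise"
```

A third call from an Executor session returns `"escalate"` instead of reaching Opus.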
A memory that actually remembers
In the first post, Conrad's "memory" was a vector database he wrote things to. It worked, but it was a junk drawer. This round it became a real system.
Retrieval is hybrid now. Every memory is stored two ways: as an embedding in Vectorize for semantic search, and as a row in a D1 full-text index for keyword search. When Conrad recalls, both run in parallel and the results merge with Reciprocal Rank Fusion. Semantic search finds the conceptually related stuff ("how should I approach the Series A partner at Sequoia" surfaces notes about tier-one institutional investors even with no word overlap). Keyword search catches what embeddings fumble: exact company names, deal IDs, email addresses. You want both, and RRF balances them without me hand-tuning weights.
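For anyone who has not met Reciprocal Rank Fusion, a minimal sketch of the merge step follows. It is not the production code: the function name is mine, the memory IDs are made up, and `k=60` is the constant commonly used with RRF rather than a value taken from Conrad.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Merge ranked lists of memory IDs (each ordered best first)."""
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, memory_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every item it returns.
            scores[memory_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the order is stable.
    return sorted(scores, key=lambda m: (-scores[m], m))


semantic_hits = ["m3", "m1", "m7"]  # hypothetical Vectorize results
keyword_hits = ["m1", "m9", "m3"]   # hypothetical full-text results
fused = reciprocal_rank_fusion([semantic_hits, keyword_hits])
```

Only ranks matter here, never raw scores, which is why the two retrievers do not need to be calibrated against each other.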
Memories age on a clock that fits their type. Not everything decays at the same speed. A conversation summary from two weeks ago is noise. A strategic insight from six months ago might be the most valuable thing Conrad knows. So each memory type has its own half-life:
| Memory type | Half-life |
|---|---|
| Conversation summary | 30 days |
| Template performance | 45 days |
| Prospect insight | 90 days |
| Objection response | 120 days |
| Deal outcome | 180 days |
| Competitive intel | 180 days |
| Strategy learning | 365 days |
The decay is exponential, so a strategy learning from six months ago keeps about 71% of its relevance while a conversation summary from the same day keeps about 1.6% (six half-lives). Conrad forgets trivia and holds onto strategy, with no garbage collector to babysit.
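The decay itself is one line of arithmetic. A sketch using the half-lives from the table; the dictionary keys and the function name are my own labels, not identifiers from the codebase.

```python
# Half-lives in days, copied from the table above.
HALF_LIFE_DAYS = {
    "conversation_summary": 30,
    "template_performance": 45,
    "prospect_insight": 90,
    "objection_response": 120,
    "deal_outcome": 180,
    "competitive_intel": 180,
    "strategy_learning": 365,
}


def relevance(memory_type: str, age_days: float) -> float:
    """Fraction of original relevance left after age_days."""
    return 0.5 ** (age_days / HALF_LIFE_DAYS[memory_type])
```

Multiplying a retrieval score by `relevance(...)` is one plausible way to apply it, though how the decay is combined with search ranking is an assumption on my part here.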
Memories get consolidated, like a sleep cycle. Twice a day, a job batches up a handful of raw memories and asks Sonnet to synthesize them: what do these reveal together, what is the one actionable insight, how do they connect. The output is stored as a durable strategy-learning memory and the raw inputs are marked as consolidated. Over time the consolidated insights themselves get consolidated. Raw experience turns into tactics turns into strategy.
And memory cannot be poisoned. This is the part I am most glad I got right. If Conrad has read any external web content during a session (taint level above "none"), auto-storage is skipped entirely. That closes an attack I genuinely worried about: someone plants text on a web page, Conrad scrapes it during research, and it quietly becomes a "memory" that warps his behavior forever. External content can inform the current task. It cannot rewrite what Conrad believes.
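The gate is small. A sketch under assumptions: the only taint level named above is "none", so the `EXTERNAL` level and the `Session` class are hypothetical stand-ins. The property that matters is that taint only ever goes up within a session.

```python
from enum import IntEnum


class Taint(IntEnum):
    NONE = 0
    EXTERNAL = 1  # hypothetical name for "has read outside content"


class Session:
    def __init__(self):
        self.taint = Taint.NONE
        self.memories = []

    def read_external(self, text: str) -> str:
        # Reading web content raises the taint level; nothing lowers it.
        self.taint = max(self.taint, Taint.EXTERNAL)
        return text

    def auto_store(self, memory: str) -> bool:
        """Store a memory only if the session is untainted."""
        if self.taint > Taint.NONE:
            return False
        self.memories.append(memory)
        return True
```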
The three loops, now nested
The original Conrad had three loops: outbound outreach, inbound replies, and feedback to PRs. Those still run. On top of them now sit three feedback loops at different timescales, and they feed each other.
Loop one runs in minutes. A session does a task, calls tools, consults the advisor, stores memories, tracks cost. This is just doing the work.
Loop two runs daily. Every morning a retrospective reads the last 24 hours of activity, the 7-day reply rate, and the active experiments, then asks one question: what is the single most concrete thing to improve today. It implements that one thing and logs it. One change a day, not a heroic rewrite. Compounding beats big-bang.
In the same daily band, Conrad runs autonomous A/B tests with a Bayesian engine. Ops proposes a hypothesis ("subject lines under five words beat ones over eight"), the system splits a control and treatment group, logs sends and replies and meetings, and evaluates with a Beta-Binomial Monte Carlo simulation. The nice property of going Bayesian is that it says "not enough data yet" instead of handing me a false positive off twelve emails. Winners get promoted into Conrad's learnings file. Losers go into a "what does not work" section so the same bad idea does not get retried.
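Here is roughly what a Beta-Binomial Monte Carlo evaluation looks like. This is a sketch, not Conrad's engine: the uniform Beta(1, 1) prior, the 0.95 decision threshold, the draw count, and both function names are assumptions.

```python
import random


def prob_treatment_beats_control(c_replies, c_sends, t_replies, t_sends,
                                 draws=20000, seed=0):
    """Estimate P(treatment reply rate > control reply rate)."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior is Beta(1 + replies, 1 + non-replies) under a uniform prior.
        p_control = rng.betavariate(1 + c_replies, 1 + c_sends - c_replies)
        p_treatment = rng.betavariate(1 + t_replies, 1 + t_sends - t_replies)
        wins += p_treatment > p_control
    return wins / draws


def verdict(probability, threshold=0.95):
    if probability >= threshold:
        return "promote treatment"
    if probability <= 1 - threshold:
        return "keep control"
    return "not enough data yet"
```

With twelve emails split six and six, one reply against two lands well inside the undecided band, which is the behavior I wanted.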
Also daily: the Strategist, on Opus 4.8, reviews the workers' output from the day before. It grades the work A through F, writes per-area observations, and stores corrective guidance as memories that surface in the workers' next sessions. The agents are coached, every day, by a smarter version of themselves.
Loop three runs weekly, and this is the one that still feels slightly uncanny to me. Conrad scores himself across five dimensions, each pulled from a different data source:
| Dimension | Weight | Where it comes from |
|---|---|---|
| Outreach quality | 0.30 | Opus review grade + 7-day reply rate |
| Discovery conversion | 0.20 | Promoted candidates / enriched candidates |
| Pipeline hygiene | 0.20 | Deals updated in 14 days / active deals |
| Research thoroughness | 0.15 | Review observations in the research area |
| Operational efficiency | 0.15 | Cost vs budget + error rate |
Each dimension is a number between 0 and 1, stored daily, building a time series. A linear regression over the trailing two weeks tells Conrad whether each one is improving, flat, or declining.
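A sketch of the scoring and trend step. The weights come from the table; the key names, the least-squares helper, and the `tolerance` that separates flat from moving are assumptions for illustration.

```python
WEIGHTS = {
    "outreach_quality": 0.30,
    "discovery_conversion": 0.20,
    "pipeline_hygiene": 0.20,
    "research_thoroughness": 0.15,
    "operational_efficiency": 0.15,
}


def fitness(scores):
    """Weighted sum of the five dimension scores, each in [0, 1]."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


def slope(values):
    """Least-squares slope of daily values against day index 0..n-1."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    denominator = sum((x - mean_x) ** 2 for x in range(n))
    return numerator / denominator


def trend(values, tolerance=0.002):
    """Classify a trailing window (at least two days) of one dimension."""
    s = slope(values)
    if s > tolerance:
        return "improving"
    if s < -tolerance:
        return "declining"
    return "flat"
```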
When a dimension has been declining for three days straight, the evolution engine wakes up. It pulls the last week of failure traces (escalated errors, review warnings, discarded experiments, times the system tried to edit itself and got blocked), maps them to the declining dimension, and asks Sonnet to generate three candidate rewrites of the relevant instruction document. Each candidate gets benchmarked against a held-out set of scenarios scored by Claude-as-judge. The best candidate that does not regress gets deployed. If none of them beat what is already there, all three are thrown away. If a deploy regresses on the post-check, it rolls back automatically from a snapshot.
In other words: when the small daily improvements stop being enough, Conrad redesigns his own playbook, tests the redesign, and ships it or discards it without me in the loop. The first time I watched it deploy a new version of its own outreach guidance and then roll it back forty seconds later because the benchmark dropped, I just sat there for a minute.
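The selection and rollback logic reduces to two small decisions. A sketch with hypothetical names, where `post_check` stands in for whatever benchmark runs after a deploy:

```python
def pick_candidate(baseline_score, candidate_scores):
    """Index of the best candidate that strictly beats the baseline, or None."""
    best_index = None
    best_score = baseline_score
    for index, score in enumerate(candidate_scores):
        if score > best_score:
            best_index, best_score = index, score
    return best_index


def deploy_with_rollback(current_doc, candidate_doc, post_check):
    """Deploy the candidate; restore the snapshot if the post-check regresses."""
    snapshot = current_doc
    if post_check(candidate_doc) < post_check(snapshot):
        return snapshot, "rolled_back"
    return candidate_doc, "deployed"
```

If `pick_candidate` returns `None`, all three rewrites are discarded and nothing is deployed.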
Guardrails, because a system that edits itself will eventually try something stupid
Letting an agent rewrite its own instructions is the kind of feature that sounds great until it quietly deletes a section it needed. So every write to an identity document passes through five gates before it touches storage:
- Size. The learnings file cannot exceed 15KB, the skills file 25KB. No unbounded growth from daily appends.
- Growth. A single change cannot add more than 500 bytes. Daily systems make small edits or none.
- Dedup. New content is embedded and compared against recent entries. Above 85% similarity, it is rejected. This stops Conrad from "discovering" the same insight every week and bloating the document.
- Structure. Required section headers have to survive the edit. You cannot accidentally delete "What Doesn't Work."
- Snapshot. The current version is saved to R2 before any write, last seven kept, so rollback is always possible.
The size, growth, and structure gates are hard stops. The snapshot and dedup gates fail open: if the snapshot write fails or the embedding API is down, the change still goes through, because a protective measure should not block a real improvement. I learned that boundary the hard way, which is the theme of every "what I got wrong" section I have ever written.
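The gates can be sketched as one function. Assumptions are flagged in the comments: KB is taken as 1024 bytes, and `similarity` and `take_snapshot` are hypothetical callables standing in for the embedding comparison and the R2 write. The gate order here is also mine.

```python
MAX_BYTES = {"learnings": 15 * 1024, "skills": 25 * 1024}  # assumes 1 KB = 1024 bytes
MAX_GROWTH_BYTES = 500
DEDUP_THRESHOLD = 0.85


def check_write(doc_name, old_text, new_text, required_headers,
                similarity=None, take_snapshot=None):
    """Return (allowed, reason) for a proposed edit to an identity document."""
    old_size = len(old_text.encode("utf-8"))
    new_size = len(new_text.encode("utf-8"))
    # Hard stops: a failure here always blocks the write.
    if new_size > MAX_BYTES[doc_name]:
        return False, "size"
    if new_size - old_size > MAX_GROWTH_BYTES:
        return False, "growth"
    if any(header not in new_text for header in required_headers):
        return False, "structure"
    # Dedup rejects near-duplicates, but fails open if the scorer errors.
    if similarity is not None:
        try:
            if similarity(new_text) > DEDUP_THRESHOLD:
                return False, "dedup"
        except Exception:
            pass
    # Snapshot also fails open: a failed backup does not block the write.
    if take_snapshot is not None:
        try:
            take_snapshot(old_text)
        except Exception:
            pass
    return True, "ok"
```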
Selling smarter, not just faster
The original post was mostly about throughput: more outreach, faster replies. This round added a layer of actual sales intelligence. A few pieces are worth calling out.
Buying-signal detection. The old qualification score answered one question: is this the right kind of firm (verified email, investor title, focus area). It could not answer the more important one: is this firm in a buying window right now. A fund that closed $500M last month and a fund that has been dormant for two years scored identically. Now Conrad runs a couple of targeted searches per candidate looking for timing signals and weights them:
| Signal | Why it matters | Weight |
|---|---|---|
| New fund raised | They have to deploy capital | 0.25 |
| Hiring investors | Deal volume going up, need tooling | 0.15 |
| New partner | New priorities, open to new tools | 0.15 |
| Portfolio exit | Fund-cycle milestone, often precedes a raise | 0.10 |
| Active content | Publicly branding, more receptive | 0.05 |
A well-timed cold email to a firm that just closed Fund V beats a perfectly personalized email to a firm that is not deploying. The outreach queue is now sorted by fit and timing, not fit alone.
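A sketch of how the timing weights could feed the queue. The weights are the ones in the table. The signal keys, the function names, and especially the choice to simply add fit and timing are assumptions; the real combination may differ.

```python
SIGNAL_WEIGHTS = {
    "new_fund_raised": 0.25,
    "hiring_investors": 0.15,
    "new_partner": 0.15,
    "portfolio_exit": 0.10,
    "active_content": 0.05,
}


def timing_score(signals):
    """Sum the weights of the distinct signals detected for one firm."""
    return sum(SIGNAL_WEIGHTS.get(signal, 0.0) for signal in set(signals))


def rank_queue(candidates):
    """Sort (name, fit_score, signals) tuples by fit plus timing, best first."""
    return sorted(candidates,
                  key=lambda c: c[1] + timing_score(c[2]),
                  reverse=True)
```

Under these assumptions a firm with slightly lower fit and a fresh fund outranks a better-fitting firm with no timing signals.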
Conrad also keeps an explicit graph of people, firms, and the relationships between them. One tool runs a two-hop search for a warm-intro route between two contacts, ranked by the strength of the path. No path into a target firm gets flagged as a coverage gap. A path that exists means the Executor can draft an intro request instead of a cold email.
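The two-hop search is small enough to show. A sketch over a plain dictionary graph; scoring a path as the product of its tie strengths is my assumption, and the names are illustrative.

```python
def warm_intro_paths(graph, source, target):
    """Find paths of at most two hops from source to target.

    graph maps person -> {neighbor: tie strength in [0, 1]}.
    Returns (path, strength) pairs, strongest first. An empty list
    means there is no warm route, i.e. a coverage gap.
    """
    paths = []
    neighbors = graph.get(source, {})
    if target in neighbors:
        paths.append(([source, target], neighbors[target]))
    for middle, first_tie in neighbors.items():
        if middle in (source, target):
            continue
        second_tie = graph.get(middle, {}).get(target)
        if second_tie is not None:
            # Path strength as the product of ties (an assumption).
            paths.append(([source, middle, target], first_tie * second_tie))
    return sorted(paths, key=lambda item: item[1], reverse=True)
```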
Drop in a meeting transcript and one tool call cascades into five durable outputs: a prospect-insight memory, follow-up tasks with owners and deadlines, engagement events, new edges in the knowledge graph, and any fresh objection patterns added to a shared objection library (marked "pending curation" until a human sharpens them). One call, five places it shows up later.
A win/loss loop that teaches discovery. Every deal that closes, won or lost, snapshots the firm's attributes: fund stage, geography, seat count, focus. Those snapshots update per-attribute weights with Laplace smoothing, so a single outcome does not swing the model. Over time the discovery scorer teaches itself which attribute combinations actually convert. If "Series A generalists in NYC" wins three and loses eight, that weight drifts negative. If "boutique family offices with a fintech mandate" wins six and loses zero, it drifts positive. The more Conrad sells, the better he gets at picking who to sell to next.
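Laplace smoothing is what keeps one lucky deal from dominating. A sketch: the add-one smoothing is standard, but turning the smoothed win rate into a signed weight via log-odds is my assumption about how "drifts negative" and "drifts positive" could be realized.

```python
import math


def smoothed_win_rate(wins, losses, alpha=1.0):
    """Win rate with alpha pseudo-wins and alpha pseudo-losses added."""
    return (wins + alpha) / (wins + losses + 2 * alpha)


def attribute_weight(wins, losses, alpha=1.0):
    """Log-odds of the smoothed win rate: below zero means it tends to lose."""
    p = smoothed_win_rate(wins, losses, alpha)
    return math.log(p / (1 - p))
```

An attribute with no history gets a weight of exactly zero, and a single win moves it to log 2 instead of an unbounded value.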
There is more in this layer (engagement scoring from behavioral signals, account multi-threading, renewal and referral sweeps), but the pattern is the same: turn things that happened into weights that change what happens next.
The latest layer: making it reliable
The most recent stretch of work was less glamorous and, honestly, more important than any single feature. A self-improving system is only as trustworthy as its plumbing, and a few pieces needed redoing.
Structured outputs replaced fragile parsing. A lot of Conrad's internal jobs ask a model for JSON and then parse it. The old way was a regular expression fishing a JSON blob out of prose, which breaks the moment a model adds a sentence of preamble. Those call sites now use the API's structured-output mode, which constrains the model to a schema. Every one falls back to a safe default if something goes wrong, so a parsing hiccup degrades gracefully instead of throwing.
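The safe-default half of that can be sketched without touching any API. This does not show the structured-output call itself, only a client-side safety net around whatever comes back. The schema, the default, and the function name are hypothetical.

```python
import json

# Hypothetical reply-triage schema and its safe default.
TRIAGE_SCHEMA = {"category": str, "confidence": (int, float)}
TRIAGE_DEFAULT = {"category": "needs_human", "confidence": 0.0}


def parse_or_default(raw, schema, default):
    """Parse a JSON object and check field types; never raise."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return dict(default)
    if not isinstance(data, dict):
        return dict(default)
    for key, expected_type in schema.items():
        if not isinstance(data.get(key), expected_type):
            return dict(default)
    # Keep only the fields the schema names.
    return {key: data[key] for key in schema}
```

A response with a sentence of preamble in front of the JSON returns the default here rather than raising, so the job routes the item to a human instead of crashing.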
Executor and Ops do real reasoning now, the qualifying and drafting and pipeline analysis, so they run with adaptive thinking: the model decides how much to deliberate per task. Cheap on the easy ones, deeper on the hard ones.
The nightly discovery funnel used to enrich and qualify candidates one at a time, each call waiting on the last. Those loops now run with bounded concurrency, which cut the discovery phase substantially without hammering the upstream APIs.
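Bounded concurrency is a semaphore around each call. A sketch with `asyncio`; the limit of 2 and the `fake_enrich` stand-in are illustrative, not the real enrichment client or its actual limit.

```python
import asyncio


async def map_bounded(items, worker, limit=5):
    """Run worker over items with at most `limit` calls in flight."""
    semaphore = asyncio.Semaphore(limit)

    async def run_one(item):
        async with semaphore:
            return await worker(item)

    # gather preserves input order in its results.
    return await asyncio.gather(*(run_one(item) for item in items))


async def demo():
    in_flight = 0
    peak = 0

    async def fake_enrich(candidate):
        # Stand-in for an upstream API call; tracks concurrent calls.
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0.01)
        in_flight -= 1
        return candidate.upper()

    results = await map_bounded(["a", "b", "c", "d", "e", "f"], fake_enrich, limit=2)
    return results, peak
```

Running `asyncio.run(demo())` returns the results in input order along with the peak number of concurrent calls, which never exceeds the limit.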
Research is cited. When Conrad pulls in web search or content, the results are wrapped so the model attributes each claim back to its source URL. For a system whose entire value is investor research, knowing where a claim came from is not optional.
And there is a real test gate. Before any of this ships, a suite of evals runs on every commit: the security taint model, the content rules (including the no-em-dash rule), delegation routing, the structured-output schemas, and a guard that checks the self-editing system actually rejects bad mutations. That last one is the point. A system that rewrites its own instructions needs a test proving the guardrails hold, or you are one bad self-edit away from a quiet regression you will not catch for two weeks.
The numbers, updated
Everything still runs on Cloudflare. The self-improvement and intelligence stack added real spend, but less than you might guess, because most of it is D1 queries and heuristics with zero model cost, and because Opus 4.8 is cheaper than the 4.6 it replaced.
| Component | Monthly cost |
|---|---|
| Cloudflare (Workers, D1, KV, R2, Vectorize) | ~$7 |
| Anthropic API (agents, advisor, all cron jobs) | ~$60-80 |
| Self-improvement stack (review, evolution, benchmarks, consolidation) | ~$12-15 |
| Exa.ai (discovery + buying signals) | ~$5 |
| Hunter.io (enrichment) | ~$49 |
| Total | ~$130-150/month |
The most expensive single self-improvement job is the daily Opus review, at roughly $4-5 a month. It is also the most valuable, because it is the quality signal that drives the entire fitness and evolution chain. Everything downstream depends on that grade being honest.
For a system that discovers prospects, times outreach to buying windows, drafts and tracks everything, ingests meetings, scores deals, and rewrites its own playbook when the numbers slip, $130 a month still feels faintly absurd to type.
What I got wrong this round
I let the system edit itself before I built the guardrails. For about a week, the retrospective could append to the learnings file with nothing but a size check. Nothing catastrophic happened, but it did rediscover the same "insight" three days running and bloat the file with near-duplicates. The dedup gate exists because of that. Build the constraints before you hand over the pen, not after.
Fail-open is a decision you make per system, not once. Some things should block on failure (the size and structure gates). Some things should wave the change through on failure (the snapshot, the dedup check). I initially made them all fail the same way and got the worst of both: improvements blocked by a flaky embedding endpoint, and a protective check I was relying on silently skipped. Decide, for each guard, whether its absence is worse than a missed improvement.
Cron triggers are a scarce resource and I planned for it too late. Cloudflare caps the number of cron triggers per worker. Every new background job I wanted (fitness, benchmarks, evolution, consolidation, engagement recompute) wanted its own schedule. They do not get one. Almost everything piggybacks on the handful of triggers I already had, which forced some genuinely useful discipline about what runs when. If I were starting over I would budget cron slots like I budget money.
Regex parsing of model output was a slow-motion bug. It worked until a model phrased things slightly differently, and then it failed in a way that looked like the model was wrong when really my parser was. Structured outputs should have been there from the first line of internal JSON. I replaced it everywhere this round and the class of "weird intermittent parsing failure" bug just disappeared.
The audit log turned out to be the most valuable thing in the system, and I almost treated it as exhaust. It started as observability. It is now the training data for the retrospective and the failure-trace input for the evolution engine. The system learns from its own log. If I had known that on day one I would have logged more, and more structured, from the start.
Where this leaves me
The change that surprised me most is that I touch the system less than I did three months ago, not more, even though it does substantially more. The daily retrospective handles the small stuff. The weekly evolution handles the structural stuff. My job has quietly shifted from operating Conrad to reviewing the handful of things he escalates, and occasionally vetoing a self-edit before it ships.
I still approve every email before it sends. I still review every PR before it merges. The human gates that mattered in the first post matter more now, not less, precisely because the autonomy underneath them got deeper. A system that can rewrite its own instructions is exactly the system you want a person standing at the boundary of.
What I want to break next
Two things I said I would not build still stand: no self-editing without an approval gate, and no integration surface that no one is asking for. Discipline is mostly saying no to the clever thing that has not earned its keep. But there are a few clever things I keep circling back to.
The knowledge graph is the piece I most want to push. Today it finds the shortest warm-intro path between two people and ranks it by tie strength. Shortest is not the same as strongest, and I want it reasoning about why a path is good, not just how few hops it is. The version in my head also turns every coverage gap into an outreach target on its own, instead of leaving the gap flagged for me to notice.
The evolution engine only rewrites prose right now: the learnings file, the skills file, the instruction docs. The obvious next surface is letting it propose changes to the tools and the routing logic, not just the playbook. I have not built that, and the reason is that the whole guardrail story gets harder when the thing being edited is code instead of text. A bad sentence in a playbook is a bad day. A bad edit to a tool is a bad week. That is also roughly the point where an MCP server stops being surface area for its own sake and starts being the clean way to let Conrad reach further. I can feel that consumer getting closer.
The one I cannot stop thinking about is the smallest in code and the largest in implication. Conrad is one agent that improves himself. What happens when there are several, each learning locally from its own corner of the work, and they share what they learn up to a layer they all read from? One agent's "Series A generalists in NYC do not convert" becomes every agent's prior before any of them has to lose eight deals to learn it. That could be a force multiplier. It could also be the fastest way to teach all of them the same wrong lesson at once. I do not know which, and that is exactly why it is the next thing I want to run.
The first post was about closing the loop. This one was about teaching it to learn. The next one, if it works, is about what happens when the loop starts learning from loops it has never met. I do not know yet whether that is the next good idea or the next thing I get wrong. In my experience those two feel identical right up until they don't.
VentureScope is an AI-powered deal flow management and due diligence platform for venture capital firms. Conrad Reeve is the AI agent that runs its go-to-market motion, and increasingly, improves it.