Writing on Dave Marquard

Six months building an agentic OS. What it changed about how I think about product orgs.

2026-05-29T09:00:00-07:00

Six months in, the moment I knew the way I thought about product orgs had changed came on a Tuesday morning. I was writing the rubric for an eval — the test that would decide whether the next version of one of my agents shipped — and I realized the rubric was the spec. The PRD I would have spent two days writing in 2024 was unnecessary. The eval was the artifact. Everything else was scaffolding around it.

I’ve been operating my work life on top of an agentic OS for about six months. A chief-of-staff agent reads my inbox overnight, drafts replies (never sends — drafts only), and writes a brief I read with coffee. Pre-reads for every meeting more than 72 hours out land on a private site over Tailscale. External comms run through a voice-match pass before they go anywhere I’d be embarrassed if a stranger read them. Daily, weekly, monthly — the system gets quieter and more load-bearing.

I’d describe what I have less as “a product” and more as a house. I built the foundation, the rooms have wiring and plumbing, I keep adding fixtures. Living in it for six months changed four things about how I think about building products with agents.

1. The eval is the spec

The PRD-as-sole-artifact model breaks down once a team is shipping agentic features. You can’t reason about an agent the way you reason about a deterministic feature. You can’t write acceptance criteria in prose and have them mean the same thing twice. The only durable artifact is the rubric you grade outputs against.

Once I internalized that, the work changed. I stopped writing two-page docs and started writing fifteen-line eval rubrics. The rubric forces every fuzzy intent into a falsifiable test. If I can’t write the test, I don’t actually know what I want.

Three rubrics I actually run in my own system:

Voice-match, applied to every external draft before it reaches me. A draft fails if it contains an emoji, the words “delighted” or “thrilled” or “excited,” more than one hedging phrase, eager-alignment language (“exactly the kind of thing…”), or any claim about my background not verifiable against resume.md. It passes when sentence rhythm varies, the greeting and sign-off match the relationship register, and the closing line is forward-looking but not effusive. About a dozen lines. Falsifiable. The model grades itself before I see the draft.

Pre-read for an upcoming meeting. Did it identify every attendee by name, role, and tenure? Does it include the three most recent company news items from the last thirty days, each with a primary-source link? Is there a one-line “my angle” tied to a thread I’m already working on with this person? If any answer is “no,” the rubric fails and the draft regenerates.

Daily brief. Calendar items present and ordered. Each open job-pipeline status reconciled against the last 24 hours of Sent mail. Anything I owe somebody flagged as such. News section trimmed to items in my actual market. No item where “I’m not sure” silently becomes “I’ll mention it anyway.”
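The mechanical half of the voice-match rubric can be sketched in a few lines. This is a minimal sketch, not the actual rubric: the hedging-phrase list and the emoji pattern are assumptions made for illustration, and the rhythm, register, and resume.md grounding checks are left out because they need a model or a file this example does not have.

```python
import re

# Banned words come from the rubric text; the hedging list is hypothetical.
BANNED_WORDS = ("delighted", "thrilled", "excited")
HEDGES = ("i think", "perhaps", "sort of", "it seems")
EAGER = ("exactly the kind of thing",)
# Rough emoji match covering common emoji and symbol blocks, not exhaustive.
EMOJI = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def voice_match_failures(draft: str) -> list[str]:
    """Return the rubric lines the draft fails; an empty list means pass."""
    text = draft.lower()
    failures = []
    if EMOJI.search(draft):
        failures.append("contains emoji")
    for word in BANNED_WORDS:
        if re.search(rf"\b{word}\b", text):
            failures.append(f"banned word: {word}")
    hedge_count = sum(text.count(h) for h in HEDGES)
    if hedge_count > 1:  # the rubric allows at most one hedging phrase
        failures.append(f"too many hedges: {hedge_count}")
    if any(phrase in text for phrase in EAGER):
        failures.append("eager-alignment language")
    return failures
```

Each check maps to one rubric line, so a failing draft reports which line it broke instead of a bare pass/fail.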

The rubric is the spec. The prose document is the receipt.

A fifteen-line rubric for a personal agent isn’t a PRD for a payments platform. But the principle doesn’t change with scale. If your team isn’t running rigorous offline evals against business outcomes, you don’t have a product motion in agents yet. You have a demo loop. The bench is the bar.

2. Prompt caching changed unit economics. Anthropic just made the meter visible.

Frontier-model prompt caching gives you up to a 90% discount on the cached portion of a prompt. For most agent workflows that’s the long part: the system prompt, the context bundle, the few-shot examples. That’s not a 10% optimization. It changes which features are economically possible.
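A rough worked example of that arithmetic, under stated assumptions: the $3 per million input tokens is a hypothetical list price, the discount is the best-case 90% figure, and cache-write surcharges and output tokens are ignored.

```python
def input_cost(prompt_tokens: int, cached_fraction: float,
               price_per_mtok: float = 3.00,
               cache_discount: float = 0.90) -> float:
    """Dollar cost of one request's input tokens.

    price_per_mtok is a hypothetical price; cache_discount is the
    best-case figure from the text. Cache writes and output tokens
    are ignored to keep the sketch small.
    """
    cached = prompt_tokens * cached_fraction
    fresh = prompt_tokens - cached
    billable = fresh + cached * (1.0 - cache_discount)
    return billable / 1_000_000 * price_per_mtok


# A 50k-token prompt, first with no cache hits, then with 95% served from cache.
uncached = input_cost(50_000, cached_fraction=0.0)
mostly_cached = input_cost(50_000, cached_fraction=0.95)
print(f"uncached: ${uncached:.4f}  cached: ${mostly_cached:.4f}")
```

With these assumed numbers the cached request bills 7,250 token-equivalents instead of 50,000, about 14.5% of the uncached input cost.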

Features you would have killed in Q1 for cost reasons are now fine. Features you would have built lean now have headroom for richer context. Workflows that looked like “we’ll run this once a week” are now “we run this on every request.” Most PMs haven’t internalized this because the shift is invisible at the feature level. It shows up only in the budget.

The cost surface is about to get a lot more visible. On June 15, Anthropic is unbundling Agent SDK and claude -p usage from Claude subscription limits. Pro subscribers get a separate $20/month credit for programmatic and agent usage; spend past it is metered. Subscription limits stay reserved for human-interactive Claude. Anthropic’s framing is direct: one agent can “generate thousands of requests, run tests continuously, browse the web, and recursively call models.” That’s not the cost shape of a human typing prompts, and they’re done pretending it is.

This is the forcing function. The all-you-can-eat era for agent workflows is ending. If you’re shipping an agent inside a subscription wrapper — yours or someone else’s — your team is about to have a P&L conversation that did not exist in Q1. The PM who can answer “how much did this workflow cost yesterday, and which step is most expensive?” is going to ship features the rest of the org can’t justify. The PM who can’t is going to watch their budget get reallocated to someone who can.

Cache hit rate, tokens-per-task, and verifier-cost-versus-failure-cost are now product-management vocabulary, not engineering trivia. If your team isn’t fluent yet, this is the quarter to fix it. The June 15 meter will do the teaching either way; better to be ahead of it.
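Answering “how much did this workflow cost yesterday, and which step is most expensive?” needs little more than a per-call log. The sketch below assumes a hypothetical log record (the `Call` fields are invented for illustration) and treats the verifier question as a plain expected-value comparison, which is one reasonable reading of verifier-cost-versus-failure-cost rather than a standard definition.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Call:
    """One model call, as a hypothetical log record."""
    task_id: str
    step: str
    input_tokens: int
    cached_tokens: int  # portion of input_tokens served from cache
    output_tokens: int
    cost_usd: float


def summarize(calls: list[Call]) -> dict:
    """Total cost, most expensive step, cache hit rate, and tokens per task."""
    if not calls:
        raise ValueError("no calls logged")
    cost_by_step: dict[str, float] = defaultdict(float)
    for c in calls:
        cost_by_step[c.step] += c.cost_usd
    total_input = sum(c.input_tokens for c in calls)
    total_tokens = sum(c.input_tokens + c.output_tokens for c in calls)
    tasks = {c.task_id for c in calls}
    return {
        "total_cost_usd": sum(cost_by_step.values()),
        "most_expensive_step": max(cost_by_step, key=cost_by_step.get),
        "cache_hit_rate": sum(c.cached_tokens for c in calls) / total_input,
        "tokens_per_task": total_tokens / len(tasks),
    }


def verifier_pays_off(verifier_cost: float, failure_rate: float,
                      catch_rate: float, failure_cost: float) -> bool:
    """True when the verifier costs less than the failures it is expected to catch."""
    return verifier_cost < failure_rate * catch_rate * failure_cost
```

Logging one record per call is the design choice that matters here: every metric in the paragraph above falls out of a single pass over that log.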

3. The gap between a demo and a system that survives a week is the entire job

Anyone can build an agent that does a thing once. Five times out of ten, it’ll even do the thing well. The gap between that and “this runs every morning at 6am for six months without me touching it” is where most of the product surface lives.

A non-theoretical list, drawn from my own system in the last quarter:

The 6am cron fired but my laptop was asleep, so the brief silently didn’t run. The first signal was me wondering at 7:30 where my coffee reading was.
A skill called a downstream API and got rate-limited; the resulting pre-read shipped with an empty meetings section, and I didn’t notice until I was already at the meeting reading on my phone.
A draft email cited a job at a company I had never worked at, because an old memory file had been wrong about an employer. The draft would have gone out under my name if I’d hit send without reading. Drafts under my name now run through a resume-grounding check before they ever reach me.
Voice fidelity drifted across a few weeks as the example pool accumulated noise. The drafts got a little flatter, a little more LinkedIn-shaped. Nothing failed loudly. The output just slowly stopped sounding like me.

None of these were demoable. All of them are what makes the difference between a toy and something I actually rely on. The demo is the first sliver of the work. The rest — the part nobody can demo — is the job a real product team has to do, and most don’t yet know how big it is.

If your team’s agent roadmap doesn’t have line items for drift detection, output verifiers, graceful failure when a downstream tool is unavailable, and observability of what the agent actually said and did, you’re not building a product yet. You’re maintaining a demo.
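Two of the failures listed earlier, the brief that silently never ran and the pre-read that shipped with an empty meetings section, are the kind a small check can catch. This is a minimal sketch with hypothetical names and thresholds (the 26-hour window and the section names are assumptions), not the checks my system actually runs.

```python
from datetime import datetime, timedelta


def brief_is_stale(last_success: datetime, now: datetime,
                   max_age: timedelta = timedelta(hours=26)) -> bool:
    """True when the last successful run is older than max_age.

    This has to run somewhere other than the machine that produces the
    brief, since a check on a sleeping laptop sleeps too.
    """
    return now - last_success > max_age


def missing_sections(pre_read: dict[str, list[str]],
                     required: tuple[str, ...] = ("attendees", "meetings", "news"),
                     ) -> list[str]:
    """Return the required sections that are absent or empty."""
    return [name for name in required if not pre_read.get(name)]


# Usage: hold the pre-read back instead of shipping it half-empty.
draft = {"attendees": ["A. Example"], "meetings": [], "news": ["item"]}
problems = missing_sections(draft)
if problems:
    print(f"pre-read held back, empty sections: {problems}")
```

Neither check is clever. The point is that each one turns a silent failure into a loud one before the output reaches a reader.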

4. Coordination cost collapsed. Org design hasn’t caught up.

Customer interview synthesis used to take a senior PM half a day. With tools like Granola transcribing every meeting into a searchable corpus, plus a tuned agent loop that reads across that corpus and a stack of past briefs, it’s twenty minutes plus a review pass. The synthesis is better than what I used to produce by hand, because the agent doesn’t get bored on interview number eight.

Competitive briefs, same shape. Strategy-doc first drafts: two to three times the throughput, with cleaner source citation than I used to bother with. Voice-matched stakeholder comms at scale, run as a draft surface I can edit instead of a blank page I have to fill. The first time you watch an inbox triage produce twelve coherent reply drafts in eight minutes — each one cited against the thread it’s replying to — you start to reconsider what “a product manager’s day” should actually look like.

The first-order effect is that one operator carries more product surface than they could two years ago. The second-order effect is that the org chart no longer reflects coordination cost the way it used to. Teams sized around 2024 throughput assumptions are now overstaffed in some places, dangerously thin in others, and most leaders haven’t done the math.

This is the lesson I’m least confident about. The throughput shift is real — I’m living it. The org-design implication is downstream of a lot of “if this trend continues” assumptions. But if you’re a CPO running a team of 30 in 2026, the right team in 2027 is probably not a team of 30. It’s almost certainly not 30 of the same people.

What I’d tell a PM who hasn’t built with agents this year

Stop reading about it. Build one thing end-to-end. Pick the smallest workflow in your own life that you’d pay to automate — a personal-finance digest, a weekly status roll-up, a literature scan, something that takes you an hour a week and that you wouldn’t be embarrassed to keep tinkering with on a Sunday afternoon. Build an agent that does it.

A few specific things to do on the way:

Write the eval rubric first. Before the first prompt, before the first integration, write down how you’d grade the output. Twelve lines is enough. If you can’t grade it, you don’t know what you want.

Watch your unit economics from day one. Log token counts. Look at your cache hit rate. When the June 15 Anthropic change lands, the meter will be visible whether you want it to be or not. Treat that as a teaching tool, not an antagonist. The PMs who walk into 2027 fluent in the cost shape of their own workflows are going to look very different from the ones who don’t.

Run it for a real week. Not a demo run. A week. Cron timing will surprise you. Rate limits will surprise you. State will go stale in ways you didn’t anticipate. Voice will drift. Things that worked on Tuesday won’t work on Saturday. Each of those failures is the thing your org will hit at scale next year — better to learn it now, on a workflow that doesn’t matter, than next year on one that does.

Then build the second one and connect them. Compounding only kicks in once you have two surfaces and they share context. One agent is a tool. Two agents that share state are the start of a system. That transition is where the interesting product questions actually live.

You won’t ship the most useful thing you ever built. You’ll ship the most useful product education you ever got. The PMs who’ll be load-bearing in 2026 — and the ones doing the hiring two years from now — are the ones who have shipped, broken, and re-shipped at least one agent.

The "we have AI" moat is gone. Ad tech sells outcomes pricing next.

2026-05-27T09:00:00-07:00

Three things happened last week. Google shipped Gemini Omni — an “anything-to-anything” multimodal model whose Flash tier generates video from a still image in seconds. DeepSeek made its V4-Pro 75% price cut permanent, pulling frontier-model token economics another notch toward zero. And the FTC fined Cox Media, MindSift, and 1010 Digital $930K for selling a fabricated AI voice-listening ad-targeting product — the first major FTC action specifically against fake AI-targeting capability claims.

Different stories. Same vector.

The ad-tech narrative that powered the last three years of valuation premiums — “we have ML / we have AI / we have an agent” — is no longer load-bearing. The frontier-model layer is commoditizing on price while capabilities advance. The marketers we sell to now have Gemini-Omni-grade tools to make their own creative and measurement guesses. And regulators just signaled that vague AI-targeting claims are not free anymore.

If “we have AI” was the pitch, the pitch is over.

What replaces it

The durable moat in ad tech — what’s left when the model layer goes flat — is the closed loop. Three things, ordered by how hard they are to copy:

Proprietary signal density. First-party data with consented identity, at scale, across the right surface. Hard to copy, slow to build, expensive to maintain.
An eval-and-iteration loop tuned to advertiser outcomes, not platform metrics. The number of ad-tech companies running rigorous offline evals against business outcomes — not impressions, not click-through, not the platform’s own optimization signal — is tiny.
Outcomes pricing. Not CPM. Not even CPA. A pricing surface where the buyer pays for a measured business outcome and the seller takes the variance. The CFO-side conversation that’s been “coming next year” since 2019.

Of those three, the first two are necessary preconditions. But the third is the unlock. It’s how performance ad tech grew up in mobile — and it’s the conversation CTV has been ducking for the entire decade.

The buy side is finally asking

The signal that the moment has arrived isn’t from the sellers. It’s from the buyers. Two industry papers landed last week that I keep re-reading:

The Coalition for Innovative Media Measurement (CIMM) published “Quality Matters,” which TV Tech summed up as “not all impressions are equal,” arguing for re-weighting impressions by quality, attention, and context. Standards-body papers don’t change pricing on their own. But CIMM is downstream of the major buyers, and CIMM moving means the buyers are demanding a vocabulary for paying differently for different impressions.
MediaVillage’s “How to price media quality for CTV” is the mechanism paper. It covers Adelaide’s new attention-quality (AU) scores for US CTV by daypart and format, which show a 23-point spread between the highest- and lowest-quality CTV segments that often transact at similar CPMs, and walks through how buyers use AU as a threshold tied to a specific KPI, with cost-per-AU as a (client-specific) market marker. Quiet, technical, almost boring. Also the most pipeline-relevant document I read all week.
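To see why a 23-point spread matters at similar CPMs, here is a small worked example. Everything in it is an assumption for illustration: the CPMs and AU scores are invented, the threshold is hypothetical, and cost-per-AU is taken to mean CPM divided by AU score, which may not match the paper's exact definition.

```python
def cost_per_au(cpm: float, au: float) -> float:
    """CPM divided by attention score (assumed definition, see lead-in)."""
    return cpm / au


# Two hypothetical CTV segments at the same $25 CPM, 23 AU points apart.
segments = {"segment_a": (25.0, 53.0), "segment_b": (25.0, 30.0)}
au_floor = 40.0  # hypothetical KPI-driven threshold set by the buyer

# Keep only segments at or above the AU floor, then price them per AU.
eligible = {
    name: round(cost_per_au(cpm, au), 3)
    for name, (cpm, au) in segments.items()
    if au >= au_floor
}
print(eligible)
```

Under these invented numbers, segment_b falls below the floor despite costing the same CPM, and its cost per AU would be roughly 77% higher than segment_a's.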

The Publicis–LiveRamp deal and the Viant–TVision close — both in the last three weeks — sit on the same axis. Identity infrastructure consolidating under the buy side. Attention measurement consolidating inside a DSP. Both are moves to put the measured-quality argument where it can actually be priced.

The CFO-side question coming next: if I can pay you less when the impression is lower-quality and more when it converts, why am I paying you a flat CPM for either?

What this means for product leaders selling AI in ad tech

If you’re running product at a measurement vendor, a DSP, an SSP, a retail-media network, or a CTV publisher, three things change:

Re-audit your AI claims against what the FTC just priced. Vague “AI-driven” language is now a regulatory exposure, not a marketing flourish. If you can’t show the buyer the model, the signal, and the eval, take it out of the deck.
Build outcomes-pricing optionality into your sales motion now, even if no one is buying it yet. The first measurement vendor that ships a credible “measured-quality CPM” surface — even as an experiment — captures the conversation. Two years from now, every RFP includes the question.
Treat the model layer as commodity. Don’t differentiate on “we use GPT-5 / Claude Opus / Gemini Omni.” Differentiate on what you do that they don’t: your signal, your eval, the closed loop. The model is the engine, not the car.

Moloco’s performance-CTV launch in April is the cleanest recent example. The pitch isn’t “we have AI.” It’s “we apply mobile-grade performance ML to CTV inventory, and we’ll show you the closed loop down to the install.” Whether Moloco is the company that wins this — that’s the actual question. The framing is right.

The thing that doesn’t survive last week’s tape is the pitch that ad tech has spent three years rehearsing. Performance buyers are going to pay for outcomes, regulators are going to price fake AI claims, and the model layer is going to keep getting cheaper. The product motion that survives builds around all three.

The companies that get this right will look like the early performance-mobile DSPs, ten years later, in a channel that’s twice the size and starting from zero on the primitives. The ones that don’t are about to have a really hard conversation with their CFOs.