Six months building an agentic OS. What it changed about how I think about product orgs.

The moment I knew my thinking about product orgs had changed came on a Tuesday morning, six months in. I was writing the rubric for an eval, the test that would decide whether the next version of one of my agents shipped, and I realized the rubric was the spec. The PRD I would have spent two days writing in 2024 was unnecessary. The eval was the artifact. Everything else was scaffolding around it.

I’ve been operating my work life on top of an agentic OS for about six months. A chief-of-staff agent reads my inbox overnight, drafts replies (never sends — drafts only), and writes a brief I read with coffee. Pre-reads for every meeting more than 72 hours out land on a private site over Tailscale. External comms run through a voice-match pass before they go anywhere I’d be embarrassed if a stranger read them. Daily, weekly, monthly — the system gets quieter and more load-bearing.

I’d describe what I have less as “a product” and more as a house. I built the foundation, the rooms have wiring and plumbing, I keep adding fixtures. Living in it for six months changed four things about how I think about building products with agents.

1. The eval is the spec

The PRD-as-sole-artifact model breaks down once a team is shipping agentic features. You can’t reason about an agent the way you reason about a deterministic feature. You can’t write acceptance criteria in prose and have them mean the same thing twice. The only durable artifact is the rubric you grade outputs against.

Once I internalized that, the work changed. I stopped writing two-page docs and started writing fifteen-line eval rubrics. The rubric forces every fuzzy intent into a falsifiable test. If I can’t write the test, I don’t actually know what I want.

Three rubrics I actually run in my own system:

Voice-match, applied to every external draft before it reaches me. A draft fails if it contains an emoji, the words “delighted” or “thrilled” or “excited,” more than one hedging phrase, eager-alignment language (“exactly the kind of thing…”), or any claim about my background not verifiable against resume.md. It passes when sentence rhythm varies, the greeting and sign-off match the relationship register, and the closing line is forward-looking but not effusive. About a dozen lines. Falsifiable. The model grades itself before I see the draft.

Pre-read for an upcoming meeting. Did it identify every attendee by name, role, and tenure? Does it list the three most recent company news items, all from the last thirty days, each with a primary-source link? Is there a one-line “my angle” tied to a thread I’m already working on with this person? If any answer is “no,” the rubric fails and the draft regenerates.

Daily brief. Calendar items present and ordered. Each open job-pipeline status reconciled against the last 24 hours of Sent mail. Anything I owe somebody flagged as such. News section trimmed to items in my actual market. No item where “I’m not sure” silently becomes “I’ll mention it anyway.”
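To make “falsifiable” concrete, here is a minimal sketch of the mechanical half of the voice-match rubric in Python. It is an illustration, not the system described above: the banned words come from the rubric text, but the hedging-phrase list, the emoji pattern, and the function name voice_match_failures are assumptions made for the example. The judgment calls (sentence rhythm, register, resume grounding) still need a model or a human grader and are left out.

```python
import re

# Banned words are taken from the rubric; the hedging and eager-alignment
# phrase lists below are illustrative assumptions.
BANNED_WORDS = ("delighted", "thrilled", "excited")
HEDGING_PHRASES = ("i think", "maybe", "perhaps", "sort of")
EAGER_PHRASES = ("exactly the kind of thing",)

# Rough emoji detection: covers common emoji blocks, not the full Unicode set.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def voice_match_failures(draft: str) -> list[str]:
    """Return the rubric lines the draft fails; an empty list means pass."""
    text = draft.lower()
    failures = []
    if EMOJI_RE.search(draft):
        failures.append("contains an emoji")
    for word in BANNED_WORDS:
        if re.search(rf"\b{word}\b", text):
            failures.append(f"contains banned word: {word}")
    # Simple substring counts; the rubric allows at most one hedging phrase.
    hedges = sum(text.count(phrase) for phrase in HEDGING_PHRASES)
    if hedges > 1:
        failures.append(f"{hedges} hedging phrases (max 1)")
    for phrase in EAGER_PHRASES:
        if phrase in text:
            failures.append(f"eager-alignment language: {phrase}")
    return failures
```

A draft passes this part of the rubric only when the returned list is empty. Anything else is a named, specific failure that can be fed straight back into a regeneration step.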

The rubric is the spec. The prose document is the receipt.

A fifteen-line rubric for a personal agent isn’t a PRD for a payments platform. But the principle doesn’t change with scale. If your team isn’t running rigorous offline evals against business outcomes, you don’t have a product motion in agents yet. You have a demo loop. The bench is the bar.

2. Prompt caching changed unit economics. Anthropic is about to make the meter visible.

Frontier-model prompt caching gives you up to a 90% discount on the cached portion of a prompt whenever a later request reuses it. For most agent workflows that’s the long part: the system prompt, the context bundle, the few-shot examples. That’s not a 10% optimization. It changes which features are economically possible.
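The arithmetic is worth doing once by hand. The sketch below is a simplified cost model under stated assumptions: the price per million tokens is a made-up number, the 90% figure is treated as a flat discount on cache hits, and cache-write premiums and output tokens are ignored. The function name prompt_cost and its parameters are hypothetical.

```python
def prompt_cost(cached_tokens: int, fresh_tokens: int, price_per_mtok: float,
                hit_rate: float, cache_discount: float = 0.90) -> float:
    """Expected input cost in dollars for one request.

    cached_tokens: the reusable prefix (system prompt, context bundle, examples)
    fresh_tokens:  the part that changes on every request
    hit_rate:      fraction of requests that find the prefix in cache
    """
    discounted_price = price_per_mtok * (1 - cache_discount)
    # On a hit the prefix is billed at the discounted rate, on a miss at full price.
    prefix_cost = cached_tokens * (
        hit_rate * discounted_price + (1 - hit_rate) * price_per_mtok
    )
    fresh_cost = fresh_tokens * price_per_mtok
    return (prefix_cost + fresh_cost) / 1_000_000
```

With a 20,000-token prefix, 1,000 fresh tokens, and a hypothetical $3 per million input tokens, a request costs about $0.063 with no cache hits and about $0.009 when every request hits the cache. That is a sevenfold difference on the input side, which is the gap between a weekly job and an every-request feature.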

Features you would have killed in Q1 for cost reasons are now fine. Features you would have built lean now have headroom for richer context. Workflows that looked like “we’ll run this once a week” are now “we run this on every request.” Most PMs haven’t internalized this because the shift is invisible at the feature level. It shows up only in the budget.

The cost surface is about to get a lot more visible. On June 15, Anthropic is unbundling Agent SDK and claude -p usage from Claude subscription limits. Pro subscribers get a separate $20/month credit for programmatic and agent usage; spend past it is metered. Subscription limits stay reserved for human-interactive Claude. Anthropic’s framing is direct: one agent can “generate thousands of requests, run tests continuously, browse the web, and recursively call models.” That’s not the cost shape of a human typing prompts, and they’re done pretending it is.

This is the forcing function. The all-you-can-eat era for agent workflows is ending. If you’re shipping an agent inside a subscription wrapper — yours or someone else’s — your team is about to have a P&L conversation that did not exist in Q1. The PM who can answer “how much did this workflow cost yesterday, and which step is most expensive?” is going to ship features the rest of the org can’t justify. The PM who can’t is going to watch their budget get reallocated to someone who can.

Cache hit rate, tokens-per-task, verifier-cost-versus-failure-cost are now product-management vocabulary, not engineering trivia. If your team isn’t fluent yet, this is the quarter to fix it. The June 15 meter will do the teaching either way; better to be ahead of it.
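For a PM who wants to see those three numbers rather than just name them, here is a small sketch that computes them from per-call logs. The CallLog record and its field names are assumptions made for the example, not any provider's actual usage schema, and the verifier check is the simplest possible expected-value comparison.

```python
from dataclasses import dataclass


@dataclass
class CallLog:
    task_id: str
    input_tokens: int         # all input tokens, cached or not
    cached_input_tokens: int  # subset of input_tokens served from cache
    output_tokens: int


def cache_hit_rate(logs: list[CallLog]) -> float:
    """Share of input tokens that were served from cache."""
    total = sum(log.input_tokens for log in logs)
    cached = sum(log.cached_input_tokens for log in logs)
    return cached / total if total else 0.0


def tokens_per_task(logs: list[CallLog]) -> float:
    """Mean total tokens (input plus output) per distinct task."""
    totals: dict[str, int] = {}
    for log in logs:
        totals[log.task_id] = (
            totals.get(log.task_id, 0) + log.input_tokens + log.output_tokens
        )
    return sum(totals.values()) / len(totals) if totals else 0.0


def verifier_pays_off(verifier_cost: float, failure_cost: float,
                      catch_probability: float) -> bool:
    """True when a verifier costs less than the failures it is expected to catch.

    catch_probability: chance that a given run contains a failure
    the verifier would catch.
    """
    return verifier_cost < failure_cost * catch_probability
```

The last function is the one that tends to change roadmap decisions: a one-cent verifier is cheap insurance against a five-dollar failure even if it only catches something on 2% of runs.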

3. The gap between a demo and a system that survives a week is the entire job

Anyone can build an agent that does a thing once. Five times out of ten, it’ll even do the thing well. The gap between that and “this runs every morning at 6am for six months without me touching it” is where most of the product surface lives.

A non-theoretical list, drawn from my own system in the last quarter:

The 6am cron fired but my laptop was asleep, so the brief silently didn’t run. The first signal was me wondering at 7:30 where my coffee reading was.
A skill called a downstream API and got rate-limited; the resulting pre-read shipped with an empty meetings section, and I didn’t notice until I was already at the meeting reading on my phone.
A draft email cited a job at a company I had never worked at, because an old memory file had been wrong about an employer. The draft would have gone out under my name if I’d hit send without reading. Drafts under my name now run through a resume-grounding check before they ever reach me.
Voice fidelity drifted across a few weeks as the example pool accumulated noise. The drafts got a little flatter, a little more LinkedIn-shaped. Nothing failed loudly. The output just slowly stopped sounding like me.

None of these were demoable. All of them are what makes the difference between a toy and something I actually rely on. The demo is the first sliver of the work. The rest — the part nobody can demo — is the job a real product team has to do, and most don’t yet know how big it is.

If your team’s agent roadmap doesn’t have line items for drift detection, output verifiers, graceful failure when a downstream tool is unavailable, and observability of what the agent actually said and did, you’re not building a product yet. You’re maintaining a demo.
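Two of those line items are small enough to sketch. The example below shows one possible shape for graceful failure and for a staleness check, assuming a hypothetical fetch_meetings callable and a stored timestamp for the last successful brief. The names build_preread and brief_is_stale are illustrative, not the system described above.

```python
from datetime import datetime, timedelta


def build_preread(fetch_meetings) -> dict:
    """Assemble a pre-read, marking it degraded instead of shipping an empty section.

    fetch_meetings is a hypothetical callable that returns a list of meetings
    and may raise if the downstream API is rate-limited or unavailable.
    """
    try:
        meetings = fetch_meetings()
    except Exception as exc:
        return {"status": "degraded", "meetings": [],
                "note": f"meetings unavailable: {exc}"}
    if not meetings:
        return {"status": "degraded", "meetings": [],
                "note": "meetings section came back empty"}
    return {"status": "ok", "meetings": meetings, "note": ""}


def brief_is_stale(last_run: datetime, now: datetime,
                   max_age: timedelta = timedelta(hours=24)) -> bool:
    """True when the last successful run is older than max_age."""
    return now - last_run > max_age
```

The design choice is that a missing section is reported as a status the reader can see, rather than rendered as an empty page. The staleness check has to run somewhere other than the job it watches, since a cron that never fired cannot report its own absence.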

4. Coordination cost collapsed. Org design hasn’t caught up.

Customer interview synthesis used to take a senior PM half a day. With tools like Granola transcribing every meeting into a searchable corpus, plus a tuned agent loop that reads across that corpus and a stack of past briefs, it’s twenty minutes plus a review pass. The synthesis is better than what I used to produce by hand, because the agent doesn’t get bored on interview number eight.

Competitive briefs, same shape. Strategy-doc first drafts: two to three times the throughput, with cleaner source citation than I used to bother with. Voice-matched stakeholder comms at scale, run as a draft surface I can edit instead of a blank page I have to fill. The first time you watch an inbox triage produce twelve coherent reply drafts in eight minutes — each one cited against the thread it’s replying to — you start to reconsider what “a product manager’s day” should actually look like.

The first-order effect is that one operator carries more product surface than they could two years ago. The second-order effect is that the org chart no longer reflects coordination cost the way it used to. Teams sized around 2024 throughput assumptions are now overstaffed in some places, dangerously thin in others, and most leaders haven’t done the math.

This is the lesson I’m least confident about. The throughput shift is real — I’m living it. The org-design implication is downstream of a lot of “if this trend continues” assumptions. But if you’re a CPO running a team of 30 in 2026, the right team in 2027 is probably not a team of 30. It’s almost certainly not 30 of the same people.

What I’d tell a PM who hasn’t built with agents this year

Stop reading about it. Build one thing end-to-end. Pick the smallest workflow in your own life that you’d pay to automate — a personal-finance digest, a weekly status roll-up, a literature scan, something that takes you an hour a week and that you wouldn’t be embarrassed to keep tinkering with on a Sunday afternoon. Build an agent that does it.

A few specific things to do on the way:

Write the eval rubric first. Before the first prompt, before the first integration, write down how you’d grade the output. Twelve lines is enough. If you can’t grade it, you don’t know what you want.

Watch your unit economics from day one. Log token counts. Look at your cache hit rate. When the June 15 Anthropic change lands, the meter will be visible whether you want it to be or not. Treat that as a teaching tool, not an antagonist. The PMs who walk into 2027 fluent in the cost shape of their own workflows are going to look very different from the ones who don’t.

Run it for a real week. Not a demo run. A week. Cron timing will surprise you. Rate limits will surprise you. State will go stale in ways you didn’t anticipate. Voice will drift. Things that worked on Tuesday won’t work on Saturday. Each of those failures is the thing your org will hit at scale next year — better to learn it now, on a workflow that doesn’t matter, than next year on one that does.

Then build the second one and connect them. Compounding only kicks in once you have two surfaces and they share context. One agent is a tool. Two agents that share state are the start of a system. That transition is where the interesting product questions actually live.

You won’t ship the most useful thing you ever built. You’ll ship the most useful product education you ever got. The PMs who’ll be load-bearing in 2026 — and the ones doing the hiring two years from now — are the ones who have shipped, broken, and re-shipped at least one agent.