Manual agent juggling
5 tabs, 5 agents, merge conflicts everywhere
Every week a new model drops. Your team manually benchmarks it. What if that happened automatically — with PRs when something's better?
New models released per month
Time to benchmark each one
Tools that auto-PR improvements
| Company | What They Do | Gap |
|---|---|---|
| Portkey | AI gateway, routing, 1600+ LLMs | No auto-benchmarking against YOUR stack |
| Unify ($8M) | Finds best LLM for the job | Router-first, not benchmark-first |
| Braintrust ($36M, $150M val) | Eval-driven development | Reactive, not proactive |
| Us | Watch → Auto-benchmark → PR when better | — |
Connect your AI stack. Define your eval suite (or we help you build one).
We monitor every model release across all providers. Automatically.
When something beats your current setup, you get a PR with benchmarks.
Watch → Auto-benchmark → PR when better
Nobody does this for AI models. We do.
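The "PR when better" loop above reduces to a simple decision rule. The sketch below is illustrative only: the `Model` shape, `should_open_pr` name, and thresholds are assumptions, not our actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Model:
    name: str
    score: float  # eval-suite score on the customer's own data (0-1)
    cost: float   # dollars per 1M tokens

def should_open_pr(current: Model, candidate: Model, min_gain: float = 0.01) -> bool:
    """The 'PR when better' rule: propose a switch only when the candidate
    is meaningfully better, or at least as good but cheaper."""
    better = candidate.score >= current.score + min_gain
    cheaper_same_quality = candidate.score >= current.score and candidate.cost < current.cost
    return better or cheaper_same_quality
```

The `min_gain` margin matters: without it, eval noise would generate a stream of churn PRs every model release.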
Software engineering solved "did my change break things?" 20 years ago. AI engineering still ships blind.
Push prompt change → Hope it works → Find out in production
Push prompt change → Eval runs → PR blocked if quality drops
The gap isn't that teams lack evals; Braintrust, Humanloop, and DSPy already provide those.
The gap is that evals aren't wired into deployment pipelines as blocking gates, the way unit tests are.
Think: Braintrust's eval engine + Dependabot's automation + GitHub Actions' CI/CD — fused into one opinionated product.
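A blocking eval gate can be as small as a script whose exit code CI respects, just like a failing unit test. This is a minimal sketch; the baseline and tolerance numbers are illustrative, and a real gate would load the baseline from the repo rather than hard-code it.

```python
# Illustrative numbers: a real gate would read the committed baseline.
BASELINE_SCORE = 0.87
TOLERANCE = 0.02  # absorb eval noise so flaky runs don't block PRs

def gate(score: float) -> int:
    """Return a CI exit code: 0 lets the PR merge, 1 blocks it."""
    if score < BASELINE_SCORE - TOLERANCE:
        print(f"BLOCKED: eval score {score:.2f} regressed below baseline {BASELINE_SCORE:.2f}")
        return 1
    print(f"OK: eval score {score:.2f}")
    return 0
```

Wired into GitHub Actions (or any CI), a nonzero return from this script is what turns "hope it works" into "PR blocked if quality drops."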
| Dimension | Aemon | Us |
|---|---|---|
| Purpose | Discover new optimal solutions | Protect existing quality + incrementally improve |
| Posture | Offensive R&D | Defensive Ops |
| Buyer | R&D Lead / ML Researcher | Engineering Manager / Platform Team |
| Integration | Standalone tool | Lives in your CI/CD |
$2K – $20K/mo
Based on eval runs & endpoints
LMArena raised $150M at $1.7B valuation on public evals. Enterprises need private evals on their own data.
LMArena valuation (public evals)
Private enterprise eval market
Hugging Face's YourBench is the open-source precursor, but it's a DIY tool that requires significant ML expertise. We productize it.
| Aemon | Private LMArena |
|---|---|
| Evolves novel algorithms | Evaluates existing models/configs |
| Research | Intelligence |
$10K – $100K/mo
Enterprise contracts
Companies spend $85K+/mo on AI infrastructure. Nobody knows if they're overpaying for quality they don't need.
Avg monthly AI spend
YoY growth
Visibility into cost-quality tradeoff
| Tool | What It Does | Missing |
|---|---|---|
| Portkey | Routing, fallbacks | No cost-quality optimization |
| Unify | Cheapest model that meets threshold | Not continuous, not production data |
| Us | Continuously optimize cost-quality frontier across entire AI stack | — |
An agent that sits on top of your AI gateway:
$2K – $15K/mo
Pays for itself from savings
⚡ Easiest ROI story of all these ideas
Building good evals is harder than building the AI features themselves. We build the oracle.
Braintrust's thesis: "If your eval is right, every decision becomes simple."
DSPy's framework depends on having good metrics to optimize against.
The bottleneck in the entire AI development loop is knowing what "good" looks like.
✓ Datasets
✓ Scoring rubrics
✓ Automated judges
Output plugs into Braintrust, DSPy, or your own CI/CD.
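An automated judge in miniature: score an output against a rubric and return a number any downstream tool can consume. In production the rubric would be the prompt to an LLM judge; the keyword checks below are a deterministic stand-in, and the rubric items are invented for illustration.

```python
# Hypothetical rubric; real ones are built from the customer's own data.
RUBRIC = {
    "mentions_refund_policy": 0.5,
    "polite_tone": 0.5,
}

def judge(output: str) -> float:
    """Score a model output against the rubric, returning 0.0-1.0.
    Deterministic keyword checks stand in for an LLM judge call."""
    text = output.lower()
    score = 0.0
    if "refund" in text:
        score += RUBRIC["mentions_refund_policy"]
    if any(w in text for w in ("please", "thanks", "thank you")):
        score += RUBRIC["polite_tone"]
    return score
```

Because the output is just a score per example, it plugs into Braintrust, DSPy, or a plain CI gate without coupling to any of them.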
| Aemon | Eval-as-a-Service |
|---|---|
| Assumes you have a good eval function | Creates the eval function |
| Optimizer | Oracle |
| Depends on eval quality | Is the prerequisite to everything else |
If you own the eval layer, you become the foundation every optimization tool depends on.
$5K – $30K/mo
Per eval suite built + maintenance
We operate fleets of AI agents that deliver results. Customers get outcomes. We get playbooks. Playbooks become platform.
A fat startup ships outcomes, not features. It bundles software, data, and human ops into one integrated product that actually gets the job done.
— Andrew Lee, a16z Speedrun Partner
Spinning up AI agents is now trivial. Managing them is the new bottleneck.
OpenClaw, dockerized instances, cloud GPUs
GPT-5, Claude 4, open-source alternatives
$0.01-0.10 per task, not $50/hr
Babysitting agents instead of running business
Every hour on agent issues ≠ hour on actual work
They want outcomes, not infrastructure
The Insight
Founders are too busy to become AI ops engineers. We absorb that complexity so they can focus on their actual business.
We started in sales. Then customers kept asking for more.
Lead gen + qualification
Synthetic data pipelines
Literature review + synthesis
Outbound + meeting booking
Every customer had the same problem:
"I tried spinning up agents myself. Then I spent all my time debugging them instead of running my business."
— Pattern across customers
They didn't want to manage AI. They wanted outcomes.
Why Tools Aren't Enough
Companies don't want to become AI operations experts. They want someone to absorb the complexity and just deliver results.
We operate agent fleets. Customers get outcomes. We encode playbooks.
Become pseudo-IT for AI
Setup, config, debugging
No guarantee of outcomes
You focus on your business
We've done this before (playbooks)
Pay for results, not effort
Starting with sales because the outcome is measurable: meetings booked.
Meetings booked = revenue
70% AI SDR churn = customers looking for alternatives
$5-10K/month for what works
50% of our revenue is SDR/BDR
Deep prospect intelligence
Personalized messaging
Score and prioritize leads
Book the meeting
Expansion Path
Sales → Research/Intel → Operations → Content. Each vertical = new playbook, same infrastructure.
Every engagement encodes a playbook. Playbooks make the next engagement faster. This is how we build the moat.
Every engagement becomes encoded knowledge:
What steps work for each use case
Messaging that actually converts
Which models, tools, and sequences
What breaks and how to prevent it
Figure everything out from scratch
Apply existing playbook + customize
Playbook is battle-tested
Playbooks become product
The Fat Startup Advantage
We're getting paid to build our moat. Every dollar of revenue = more encoded knowledge. Competitors starting later start from zero.
We're productizing the research consensus on what actually works.
Declarative orchestration beats autonomous agents (Microsoft, 2024-25 surveys)
Human edits train intervention policies (ReHAC, EMNLP 2024)
Prompts + tool-use are parameters to optimize (AVATAR, NeurIPS 2024)
Transparency + oversight for multi-agent systems (Nature, 2026)
Versioned configs, not imperative code
Every edit = structured training signal
Prompts, branching, model routing improve over time
Can't be prompt-injected, auditable
We log trajectories, human edits, and outcomes, then update prompts, branching logic, and model routing so the same business objective is achieved more reliably over time. The playbook is the learned policy space.
— Our technical thesis
The internal system that makes agent workflows repeatable and efficient.
Which tools? Which prompts? Which models?
Learn the same lessons repeatedly
More clients = more eng hours
Verified, tested, reusable primitives
Learnings feed back into system
More clients = richer library = faster
The Compounding Effect
Workflow #1 takes a week. Workflow #10 takes a day. Workflow #100 takes hours. The library IS the moat.
Co-Founder + Advisor
$140M exit (Mystery) · a16z funded · Built consumer products used by millions
Co-Founder & CTO
CTO of Because · $3M Seed · Deep agent infrastructure experience
Co-Founder
yapthis.com · Agentic architecture · Shipped production agent systems
Dog-fooding our own infrastructure daily
ClawView for agent monitoring
Agent Seatbelt for safety
$4K MRR, +$2K this week
What This Proves
Companies will pay for AI-powered outcomes when someone else manages the complexity. The demand is real. The model works.
Scale agent fleet + engineering team
Prove the playbooks at scale
Turn proven playbooks into self-serve templates
Agents just became capable enough
70% churn = customers looking for what works
Every month we operate = more encoded knowledge
Customers get results. We get playbooks. Playbooks become platform.
An AI that knows your context, anticipates your needs, and takes action on your behalf—not a chatbot you have to prompt.
The Vision
Imagine an AI that actually knows you—your work, your preferences, your patterns. It doesn't wait for commands. It proactively handles tasks, flags important things, and learns from every interaction.
Personal AI assistants are about to explode.
AI personal agents will arrive soon. What we do now with apps—manually, and in piecemeal fashion—will be done automatically. If a flight is cancelled, an AI agent will rebook the flight, reschedule meetings, and order food.
— Goldman Sachs, "What to Expect from AI in 2026"
Siri, Alexa, and Google Assistant lost the AI race. Here's why.
Context resets after 2-3 turns. They forget everything.
Wait for commands. Never anticipate needs.
Can't connect your email, calendar, work, and life.
"I can't do that" is their signature phrase.
Remembers weeks of interactions. Learns your patterns.
Anticipates what you need before you ask.
Sees your whole digital life—with your permission.
Browser, shell, files, messages—actual work gets done.
Microsoft's CEO called AI assistants "dumb as a rock." The truth is, they've stagnated while chatbots evolved.
— Industry Analysis, 2023-2024
Why dedicated AI devices keep failing—and what we learned.
The Lesson
Hardware failed because it created friction instead of removing it. The winning approach: software that works with your existing devices—phone, laptop, wearables—not another gadget to carry.
Both Rabbit R1 and Humane AI Pin missed a crucial opportunity: integrating with existing user bases. Why create a separate device when you could leverage smartphones and their vast ecosystem?
— Medium Analysis, July 2024
The fundamental shift in how AI should work for you.
"Hey Siri, add milk to my shopping list"
"ChatGPT, summarize this document"
You initiate every interaction. You remember to ask.
"You're almost out of milk. Added to cart—confirm?"
"Your flight changed. I rebooked + rescheduled 2 meetings."
AI monitors context. Surfaces what matters. Acts with permission.
Gartner predicts 40% of enterprise apps will embed task-specific AI agents by 2026, evolving assistants into proactive workflow partners.
— Forbes, "Agentic AI Takes Over," Dec 2025
Four converging forces make this the moment.
Models finally capable of real reasoning
Memory across weeks of interaction
Agents can control apps natively
$0.01-0.10 per task, not $50/hr
Plan to increase agentic AI budgets
Enterprise GenAI agents 2025 → 2027
95% frustrated with current assistants
Apple Intelligence proves local AI demand
From surveys, Reddit, and academic research.
"Remember what I told you last week"
"Remind me before I forget"
"Know my preferences without asking"
"My data stays mine"
93% of respondents predict agentic AI will enable more personalized, proactive, and predictive services.
— Cisco 2025 AI Study
An assistant that knows you. The future of personal assistants is when the helper learns from your data, documents, and writing style.
— AI Industry Forecast 2026
Always-on AI that learns, anticipates, and acts.
Deep prospect intelligence
Personalized messaging
Meeting coordination
Triage, draft, follow-up
Deep work on autopilot
The tasks you hate, automated
What makes this different from Siri/Alexa/Google Assistant?
Generic. Lowest common denominator.
Your context trains their models.
Only works in their ecosystem.
Lost the AI race years ago.
Deep personalization for serious work.
Local-first. You control what's shared.
Works with your existing tools.
GPT-5, Claude 4, always the best.
The Positioning
We're not competing with Siri for "set a timer." We're building the second brain for knowledge workers—people who will pay for AI that actually makes them more effective.
Co-Founder + Advisor
$140M exit (Mystery) · a16z funded · Built consumer products used by millions
Co-Founder & CTO
CTO of Because · $3M Seed · Deep agent infrastructure
Co-Founder
yapthis.com · Agentic architecture · Shipped production agent systems
Scale agent infrastructure + team
Prove the Personal AI OS at scale
Personal AI for everyone
Dogfooding OpenClaw constantly
ClawView for agent monitoring
Agent Seatbelt for guardrails
Proving demand before pitching
An AI that knows you, anticipates your needs, and takes action—not just another chatbot waiting for prompts.
The Thesis in One Line
The shift from reactive AI to proactive AI is a $56B market. We're building the operating system for it.
The SaaS pricing model is breaking. AI does the work now—so why pay for human logins? We deliver outcomes and charge when they happen.
AI is driving a shift toward outcome-based pricing. Per-seat is no longer the atomic unit of software. If AI can handle a sizable proportion of customer support, companies will need far fewer human agents, and therefore fewer software seats.
— a16z Enterprise Newsletter, December 2024
SaaS pricing is undergoing its biggest shift since the cloud. AI is killing the per-seat model.
Seat-based pricing may not fit when AI is doing the work. If an agent replaces a human task, customers will expect to pay based on outcomes, not log-ons.
— Bain Technology Report 2025
The logic of per-seat pricing breaks when AI replaces the humans who need seats.
Per-seat pricing undervalues the automation
70% churn when outcomes don't follow
2025 pilots hitting 2026 renewals—"are we really getting value?"
Not for access to tools
Customers calculate value instantly: $X per meeting = clear math
We only win when you win
The Bessemer Thesis
AI-native companies are abandoning seat-based SaaS pricing in favor of usage-, output-, and outcome-based models that directly align revenue with measurable results.
— Bessemer Venture Partners, "The AI Pricing and Monetization Playbook" (Feb 2026)
The market leaders are proving outcome-based AI pricing works at scale.
Customer Support AI
$0.99 per resolution
65% resolution rate. Aligns every team around one outcome: resolved tickets. Now deployed in 99% of conversations.
Customer Support AI
Outcome-based pricing
"First in CX industry to offer outcome-based pricing for AI agents" — August 2024 announcement.
Legal AI
Per demand package
AI + legal experts generate personal injury demand letters. Per output pricing, not hourly.
Enterprise AI Support
Per-conversation + per-resolution
Hybrid model. Usage (conversations) + outcome (resolutions). Featured in a16z podcast.
Employee Support AI
ROI-based (tickets closed)
Shifted from consumption → outcomes. Customers gained clearer ROI, business accelerated.
Data Labeling → Platform
$13.8B valuation
Started as labeling services. Became infrastructure. Services → outcomes → platform.
The Pattern
Every major AI-native company is moving toward outcome-based pricing. This isn't experimentation—it's convergence.
43% of enterprise buyers consider outcome-based pricing a significant factor in purchase decisions.
"$X per meeting booked" = CFO-ready math. No spreadsheet gymnastics.
If it doesn't work, you don't pay. Risk transferred to vendor.
More meetings = more spend = more value captured. Natural expansion.
You're paying for results. Why churn from something that works?
"Why should we pay $X per user if we could pay $Y per outcome? Aligning price with realized value improves the ROI calculus."
— Enterprise buyer sentiment (Industry research)
"The fundamental shift is to stop charging for access and start charging for work done."
— Bain Technology Report 2025
Deloitte 2026 Prediction
"Outcome- or value-based pricing is based on the real business results that SaaS applications with AI agents produce. There will be a gradual move toward a future powered by integrated, autonomous multi-agent systems."
We operate AI agent fleets that book qualified sales meetings. You pay only when meetings happen.
Outcome-based pricing isn't charity—it's better economics for everyone.
Customer pays on outcome
AI compute + tooling + human oversight
Healthy unit economics, scales with volume
Each meeting → better templates → lower cost
$250-500 per meeting is a no-brainer
Start small, scale with proof
Cost tracks linearly with value
CFO loves outcome-based spend
The Intercom Lesson
"Intercom's $0.99 per resolution aligns every team around one outcome: resolved tickets. If Fin resolves a ticket in three messages or thirty, the customer pays the same. The risk is real—but the reward is equally real: customers know exactly what they're getting, and they can calculate ROI in their sleep."
— Bessemer, Feb 2026
Outcome-based pricing has real risks. Here's how we mitigate them.
Some meetings cost more than others
Customer usage varies month to month
"Did your AI really book this?"
Customers gaming the system
Base retainer + outcome fees = floor
Cost per outcome drops with scale
Contractually defined: what counts
Every action logged, no disputes
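The "base retainer + outcome fees" mitigation is simple arithmetic. A sketch with illustrative numbers (drawn from the $250-500/meeting range elsewhere in the deck; the defaults are not our actual pricing):

```python
def monthly_invoice(meetings: int, retainer: float = 2000.0, per_meeting: float = 350.0) -> float:
    """Hybrid outcome pricing: the retainer gives the vendor a revenue
    floor against month-to-month usage swings, while the per-meeting fee
    keeps the bill proportional to delivered outcomes."""
    return retainer + meetings * per_meeting
```

A zero-meeting month still covers fixed costs; a strong month scales the invoice with the value delivered.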
Industry Standard Emerging
"Agreements around basic definitions for things like 'an agent,' 'a task,' 'a process,' 'an interaction,' and 'an outcome' should be clearly defined, communicated, and agreed upon contractually." — Deloitte TMT Predictions 2026
They pay for results → they get results → no reason to leave
Every invoice shows exactly what they got
"It's working—give me more"
Our wedge: sales meetings
Synthetic data pipelines
University lab assets
When you only pay for results, there's no reason to churn. Aligned incentives = sticky customers. This is why Intercom's outcome-based Fin has 99% deployment.
Co-Founder + Advisor
$140M exit (Mystery) · a16z funded · Built consumer products used by millions
Co-Founder & CTO
CTO of Because · $3M Seed · Deep agent infrastructure experience
Co-Founder
yapthis.com · Agentic architecture · Shipped production agent systems
Running on OpenClaw infrastructure
Every engagement → better templates
Customers love it → lower CAC, zero churn
Better AI = more margin for us
You post a bounty: "$500 per meeting booked." AI agents compete. Whoever performs best gets paid. We already do this with bug bounties, Kaggle, hackathons. Why not for AI agents?
— Macy Mills, a16z Speedrun Partner
61% → 30%+ outcome-based adoption wave
70% churn = customers looking for what works
43% prefer outcome-based pricing
Services → outcomes → platform
Bookkeeping outcomes, not seats
$0.99/resolution, 99% deployment
AI agents that deliver results. You only pay when they do. The future of how work gets priced.
📚 Sources
a16z Enterprise Newsletter (Dec 2024) • Bessemer "AI Pricing Playbook" (Feb 2026) • Bain Technology Report 2025 • Deloitte TMT Predictions 2026 • OpenView SaaS Benchmarks • Gartner • EY "SaaS Transformation with GenAI" (Nov 2025) • BetterCloud "AI and SaaS Industry 2026" • Intercom Fin pricing page • Zendesk AI Agents announcement (Aug 2024)
50-70% churn rates. LinkedIn bans. Domain blacklists. The "autonomous AI SDR" thesis failed. Human-in-the-loop is winning.
"AI SDRs don't work—biggest bubble in tech." — LinkedIn comment with 400+ likes
"Their AI continuously hallucinated, getting things wrong about what my company does, the industry we are in, what products we sell. 1 positive reply, 1 demo, thousands of prospects touched, $7.5K down the drain."
— r/SaaS, Dec 2025
"A CRO from a publicly traded company disclosed that while an AI SDR helped generate a substantial volume of leads over a nine-month period, it did not lead to actual sales."
— Tomasz Tunguz, Theory Ventures
"Reports emerged of Artisan accounts, including those of team members and founders, facing restrictions or bans for suspected spam and automation violations."
— Quasa.io, Jan 2026
2x the churn of human SDRs (a role notorious for turnover) — Common Room
Platform ramped up AI detection, restricting automation-heavy accounts
Gmail filtering tightened. Sender reputations destroyed in weeks.
GDPR fines up to 4% revenue. TCPA: $500-1,500 per message.
"Permanent brand damage from being publicly associated with spam" — NUACOM
TechCrunch: "AI sales rep startups are booming. So why are VCs wary?"
"When one studies any of these startups individually, it's like 'wow, that's stunning product market fit.' When all 10 of them have stunning product market fit, it's hard to answer 'How is that going to play out?'"
— Shardul Shah, Partner, Index Ventures (hasn't invested)
"Without access to differentiated data, AI SDR startups risk being overtaken by incumbents like Salesforce, HubSpot, and ZoomInfo."
— Chris Farmer, CEO, SignalFire
"Investors are not surprised by the rapid adoption of AI SDRs; they are just doubting that adoption is sticky."
— TechCrunch, Dec 2024
$1.5B → 30% Layoffs
Jasper, the AI copywriting unicorn, ran into speed bumps and had to lay off 30% of staff after ChatGPT launched. AI SDRs face the same commoditization risk.
Built on commoditized LinkedIn data = undifferentiated output
Black boxes that create more work, not less
Incumbents (Salesforce, HubSpot) can bundle this free
"The AI SDR is dead, long live the AI SDR: How the future is Human-in-the-Loop"
Can't read tone, context, or cultural nuance essential in enterprise sales
Scraped data without consent → GDPR/CCPA violations
When AI misleads, your company bears the liability
"More volume on a bad message is not a strategy. It is self-sabotage."
"Commenting on someone's hoodie feels forced because it's a hollow observation"
"Teams that use AI to support human insight consistently outperform teams trying to replace humans entirely. It's not even close."
— Matthew Metros, The AI SDR is Dead
Data mining, signal detection, prospect prioritization
Judgment, trust, closing
MarketBetter (human oversight): 4.97/5 G2 rating
"Human-in-the-loop platforms consistently outperform fully autonomous ones"
We're not building another AI SDR. We're building what should have been built from the start.
"Autonomous AI employee"
"6,000 contacts/month"
$5-10K/mo regardless of results
Become pseudo-IT for AI
No outcome guarantees
AI research + human checkpoints
Right message, right person, right time
Pay for meetings, not seats
You focus on your business
Outcomes or you don't pay
AI handles the research. Humans make the decisions. You get meetings.
Intent signals, company news, technographics, pain points
Who to contact and why, right now
Personalized outreach based on real signals
Email, LinkedIn (safely), follow-ups
Review before sending to high-value prospects
When a prospect engages, humans take over
Define who you want to reach and why
Edge cases, sensitive prospects, brand protection
Their 50-70% churn is our customer acquisition channel.
Spent $5-10K/mo, got spam complaints
Need to rebuild sender trust
The problem didn't go away
Educated by failure
Only pay for meetings that happen
Human oversight prevents embarrassments
We've learned what works across verticals
They don't manage agents, they get results
Aligned Incentives
When customers only pay for results, there's no reason to churn. If we don't deliver meetings, they don't pay. Simple.
vs. AI SDR Churn
AI SDRs charge $5-10K/mo whether or not they work. When they don't deliver, customers leave. Misaligned incentives = 50-70% churn.
Co-Founder + Advisor
$140M exit (Mystery) · a16z funded · Built consumer products used by millions
Co-Founder & CTO
CTO of Because · $3M Seed · Deep agent infrastructure
Co-Founder
yapthis.com · Agentic architecture · Production agent systems
Scale human oversight operations + agent infrastructure
Prove the anti-AI-SDR thesis at scale
Turn proven playbooks into self-serve platform
50-70% churn = massive displaced customer base
Highest G2 ratings go to human-oversight tools
Position as the safe alternative before market consolidates
AI SDRs promised automation. They delivered spam, bans, and brand damage. We deliver meetings — with human judgment where it matters. Their 50-70% churn is our customer acquisition channel.
📚 Sources
Common Room "The AI SDR is dead" (Feb 2025) · TechCrunch "AI sales rep startups are booming. So why are VCs wary?" (Dec 2024) · Reddit r/SaaS AI SDR complaints · Quasa.io Artisan LinkedIn bans (Jan 2026) · Pipeline Group "Hidden Dangers of AI SDRs" · Theory Ventures SaaStr Talk · MarketBetter G2 Reviews
Browser-layer guardrails that block irreversible AI actions before they happen.
AI agents fail not from bad models, but from bad guardrails. 84% of companies deploying agents have zero safety boundaries defined.
— GenDigital Agent Trust Hub Research, 2026
$47K overnight cloud bills
AI SDR emails competitors
Deleted production data
Pricing sent to wrong channel
Block LinkedIn "Follow" for AI SDRs
Read vs. Write vs. Irreversible
Require confirmation for risky ops
Prevent runaway loops
Chrome extension that intercepts agent browser actions
Why Browser Layer
Framework-agnostic. Works with any AI agent (OpenClaw, LangChain, AutoGen, custom). Install once, protect everything.
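The Read vs. Write vs. Irreversible tiers above amount to a small policy table plus a fail-closed check. A sketch of that logic in Python (the extension itself would run this classification in the browser; the action names and tiers here are hypothetical examples):

```python
READ, WRITE, IRREVERSIBLE = "read", "write", "irreversible"

# Hypothetical policy table; a real product ships per-tool defaults
# and lets each customer tune them.
ACTION_TIERS = {
    "scroll": READ,
    "extract_text": READ,
    "fill_form": WRITE,
    "send_email": IRREVERSIBLE,
    "delete_record": IRREVERSIBLE,
}

def allow(action: str, human_confirmed: bool = False) -> bool:
    """Reads pass, writes pass, irreversible actions require confirmation.
    Unknown actions are treated as irreversible: fail closed."""
    tier = ACTION_TIERS.get(action, IRREVERSIBLE)
    return tier != IRREVERSIBLE or human_confirmed
```

Failing closed on unknown actions is the seatbelt: new agent capabilities are blocked by default until someone classifies them.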
Autonomous agents exploding
Enterprise worried about agent security
Regulatory tailwinds for safety
Just launched - validates market
Browser-layer = framework-agnostic
Chrome extension ships fast
The seatbelt you install before giving AI the keys.
🔗 Supports These Pitches
Fat Startup • AWS of AI Work • Control Plane
Part of the human oversight layer that makes agent work reliable.
When your AI employee sends the wrong email at 3am, you'll know exactly why.
The Problem
Companies are deploying autonomous AI agents that run 24/7. When something goes wrong—and it will—they have no idea why. Current tools are built for request-response, not proactive agents.
User sends message, LLM responds
LangChain-specific, not agent-native
Built for chatbots, not employees
24/7 agents taking proactive actions
Why did it make that choice?
Shell, browser, files, messages
"The agent sent the wrong email. Logs show it ran. No idea why."
"Step 3: Agent assumed X because of context Y. Here's how to prevent this class of error."
Every decision. Every action. Every assumption. Full causal tracing.
🔗 Supports These Pitches
Fat Startup • AWS of AI Work • Control Plane
Observability layer — see what agents are doing before they go wrong.
⚠️ Why This is a Feature, Not a Company
Langfuse, LangSmith, Arize are well-funded. But none are built for autonomous agents. ClawView is our internal observability layer, not a separate product pitch.
Audit trails. Approval workflows. Compliance automation. The control layer enterprises need.
AI agents fail not from bad models, but from bad guardrails. The unlock isn't better agents—it's better safety rails.
— Industry consensus, 2026
What did the agent do at 3am?
High-stakes actions go unsupervised
EU AI Act enforcement coming
Humans can't supervise at machine speed
Every action, every decision, timestamped
Human gates for high-stakes actions
EU AI Act ready, audit reports generated
Validator agents checking worker agents
McKinsey Insight
"Organizations are moving from human in the loop to human on the loop—above the loop for strategic oversight." AgentGov enables this transition safely.
Audit trails. Approval workflows. Compliance automation. Trust at machine speed.
🔗 Supports These Pitches
Fat Startup • AWS of AI Work • Control Plane
Governance + compliance layer — enables enterprise trust.
🔬 Key Research
Gravitee 2026: Only 14.4% have full security approval for agents. 88% reported incidents.
EU AI Act: Enforcement begins 2026, mandates audit trails.
Zenity: $38M Series B validates market (but they're low-code focused, not agent-native).
10 layers an AI employee needs to fulfill an entire job description. We're building the unified platform.
The Thesis
An AI employee's value lies in performing EVERYTHING in a job description—not just one workflow. This requires a complete infrastructure stack.
What's Missing (⭐)
Layers 8-10 are the critical gaps. Everyone's building capabilities. Nobody's building supervision, agent-to-agent communication, and compliance.
Current landscape is fragmented
A unified platform that manages the full AI employee lifecycle.
All 10 layers, one platform
Job description → Working AI employee
Built-in compliance, audit, oversight
The unified platform for deploying, managing, and governing AI employees.
🔗 Framework For These Pitches
Fat Startup • AWS of AI Work • Control Plane
The 10-layer framework is how we think about what AI employees need.
⚠️ Why This is a Framework, Not a Pitch
Building all 10 layers is massive. We focus on Layers 8-10 (supervision, communication, compliance) because that's the critical gap. The framework informs strategy, not the pitch itself.
Verified working code. Real benchmarks. Pay-per-snippet micropayments. Documentation that actually works.
Claude Code chose Whisper V1 — near-deprecated — over Groq (200x faster, 10x cheaper) because OpenAI's docs are cleaner. Agents pick tools by doc quality, not performance.
— Garry Tan, YC Partner, Feb 2026
Even the best dev tools don't let you sign up via API. This is a big miss in the claude code age — claude can't sign up on its own.
— Jared Friedman, YC Partner, Feb 2026
Despite our best efforts, they will always hallucinate. That will never go away.
— Amr Awadallah, Vectara CEO, 2026
Agents pick whatever has most examples
APIs change, snippets break
Agent can't know if code actually runs
No cost/perf data to guide decisions
Code tested continuously, timestamped
"Transcribe video" → 10 services compared
Cost, latency, quality scores
$0.05 per verified snippet
Kill the API Key
No signup. No rate limits. No accounts. Agent pays per-request, gets verified code. Native to how agents want to consume services.
Groq, Deepgram, Whisper. Zero x402 servers. Garry Tan moment.
Kling, Runway, Wan. Parameter chaos unsolved.
Model selection based on task + budget
Email + phone + wallet in one API
Verified against real APIs
Same format across providers
Real numbers, updated hourly
"For fast+cheap → use Groq"
✓ They have this
✓ They have this
No continuous testing
No cost/perf data
Free only, no agent-native billing
$43M+ processed, 35M+ txns
OpenClaw: 9K→60K stars
AI agent infrastructure
Verification is table stakes soon
The x402 Thesis
25,000+ developers building on x402. Google, Cloudflare, Stripe adopting. Machine-to-machine payments are the rails for agent economy.
Real-time data from x402scan.com shows a booming agent economy — with a clear gap for developer tooling.
| Facilitator | 30d Txns | 30d Vol | What They Do |
|---|---|---|---|
| Dexter | 1.65M | $79.5K | Agent economy platform |
| Coinbase | 722K | $288.5K | Official CDP facilitator |
| Virtuals Protocol | 412K | $1.34M | AI agent tokenization |
| PayAI | 1.31M | $43.3K | Micropayments |
| RelAI | 66K | $84K | Agent payments (Solana) |
| Meridian | 19K | $315K | High-value transactions |
| Thirdweb | ~10K | ~$2K | Web3 dev platform |
| OpenX402 | 6.6K | $38.6K | Open-source facilitator |
| Polymer | 6.4K | $770 | Proof generation |
| AnySpend | ~3K | ~$5K | Multi-asset spending |
+ Corbits, OpenFacilitator, CustomPay, AgentPay (emerging)
Source: x402scan.com, Feb 27 2026
Data APIs, AI services, crypto tools, social data
Verified code snippets, curated docs, developer knowledge
Be the Stack Overflow layer on x402 rails
Why We Can Win
Top services (StableEnrich, LowPaymentFee) aggregate APIs — they don't verify code quality.
AgentDocs: Premium pricing ($0.05-0.10) justified by verification + benchmarks.
Target: 1,000+ requests/day = $2,100+/month revenue from agent micropayments alone.
Verified snippets. Real benchmarks. Agent-native payments. Stack Overflow, but for machines.
🔗 Supports These Pitches
Better documentation → better agent outputs → more reliable outcomes.
📍 Current Progress
Live: agentdocs-api.holly-3f6.workers.dev
Snippets: 15 use cases, 21 verified snippets
Status: Dogfooding internally, expanding library
AI agents can write code, deploy apps, and manage infrastructure. But they can't sign up for a Stripe account. We fix that.
Even the best developer tools mostly still don't let you sign up for an account via API. This is a big miss in the claude code age because it means that claude can't sign up on its own. Putting all your account management functions in your API should be table stakes now.
— Jared Friedman, YC Partner, Feb 27 2026
Hit this exact wall last week. Claude Code can scaffold an entire project, write tests, deploy to staging, but needs me to manually sign up for a third party service and paste in an API key. The last mile of developer tooling is still stuck in 2019.
— @advikjain_, replying to Jared
What developers said in response to Jared's tweet
"This is a real friction point for agentic workflows. The auth layer is always manual. Companies that figure out API-first account provisioning will eat the ones stuck in dashboard-only onboarding."
— @thebasedcapital
"I've watched AI tools fail at basic integration tasks because they hit the 'create account manually' wall. We're debating whether Claude can replace junior devs but it can't even sign up for Stripe."
— @OneManSaas
"Signup is just the tip. Billing, permissions, onboarding — everything assumes a human in a UI. Devtools that go full API-first for the entire lifecycle get a massive edge when agents pick their own stack."
— @wildpinesai (tagging @paulg)
"Bigger issue than just signup. Most SaaS still treats APIs as a feature for power users, not the primary interface. When your biggest customer is an agent, the whole product surface needs to be API-first."
— @twitter user
The Skeptics (and why they're wrong)
"Won't this enable bot spam?" — Valid concern, but x402 payments solve this. Agents pay real money per signup. Spam bots won't pay $1 per account.
"Companies don't want bot signups" — They want PAYING customers. Agent-initiated signups that convert to revenue are valuable.
We provision agent-{id}@portal.viewholly.com
Agent brings their own email (AgentMail, etc.)
Agent payments are live. Portal fits perfectly.
StableEnrich, httpay
Virtuals ACP ($163K/day)
StableSocial, TweetX402
StableEmail (314 txns)
Nobody solving this
Wide open
Jared's exact point
| Service | Complexity | Price | Est. Time | Status |
|---|---|---|---|---|
| Resend | Simple | $0.50 | 20s | MVP |
| Railway | Simple | $0.50 | 25s | MVP |
| Vercel | Email verify | $1.00 | 30s | Week 2 |
| Supabase | Email verify | $1.00 | 35s | Week 2 |
| Cloudflare | Email verify | $1.00 | 30s | Week 2 |
| Stripe | 2FA / Complex | $2.00 | 60s | Phase 2 |
Revenue Model
1,000 signups/day × $1 avg = $30K/month
Infrastructure cost: ~$500/month (Cloudflare Workers + browser worker fleet)
Gross margin: 98%
┌─────────────────────────────────────────────────────────────┐
│ AGENT │
│ (Claude Code, OpenClaw, any AI) │
└─────────────────────────────────────────────────────────────┘
│
│ POST /signup (x402 $1)
▼
┌─────────────────────────────────────────────────────────────┐
│ PORTAL API (CF Workers) │
│ Hono + @x402/hono + D1 job queue │
│ Returns job_id in <100ms │
└─────────────────────────────────────────────────────────────┘
│
│ Workers poll
▼
┌─────────────────────────────────────────────────────────────┐
│ WORKER FLEET (OpenClaw Instances) │
│ ┌────────┐ ┌────────┐ ┌────────┐ ┌────────┐ │
│ │Worker 1│ │Worker 2│ │Worker 3│ │Worker 4│ (4+) │
│ │Browser │ │Browser │ │Browser │ │Browser │ │
│ └────────┘ └────────┘ └────────┘ └────────┘ │
└─────────────────────────────────────────────────────────────┘
│
│ Encrypted credentials
▼
┌─────────────────────────────────────────────────────────────┐
│ CREDENTIAL VAULT (KV) │
│ One-time retrieval • 5-min TTL • Encrypted │
└─────────────────────────────────────────────────────────────┘
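The vault's semantics (one-time retrieval, 5-minute TTL) can be sketched with an in-memory store. This is a simplified model, not the production code; on Cloudflare Workers KV the same guarantees would come from `put(..., { expirationTtl: 300 })` plus a delete on first read:

```typescript
// Sketch of the one-time, 5-minute-TTL credential vault from the diagram.
// A plain Map stands in for Workers KV to illustrate the semantics.

const TTL_MS = 5 * 60 * 1000; // 5-minute TTL

type Entry = { ciphertext: string; expiresAt: number };

class CredentialVault {
  private store = new Map<string, Entry>();

  put(jobId: string, ciphertext: string, now = Date.now()): void {
    this.store.set(jobId, { ciphertext, expiresAt: now + TTL_MS });
  }

  // One-time retrieval: the entry is deleted on first read,
  // and expired entries are never returned.
  take(jobId: string, now = Date.now()): string | null {
    const entry = this.store.get(jobId);
    this.store.delete(jobId);
    if (!entry || now > entry.expiresAt) return null;
    return entry.ciphertext;
  }
}
```

Delete-on-read means a leaked job_id is useless once the agent has collected its credentials; the TTL bounds exposure if the agent never collects them.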
"I can only imagine allowing full automation when there's a direct path to monetisation. Maybe when we have a more reliable API for charging agents for specific actions automatically."
— @Everlier, replying to Jared
x402 IS that reliable API. Portal is the first service to use it for signup.
Agents can do everything except onboard to services. Portal fixes the last mile of agent autonomy.
$30-40B poured into generative AI. 95% of enterprise pilots fail to deliver. We're building the missing infrastructure that makes them actually work.
Companies are pouring billions into AI agents. Almost none deliver measurable returns.
Companies are pouring $30–40 billion into generative AI, yet an MIT study finds that 95% of enterprise pilots deliver zero measurable return.
— MIT NANDA: The GenAI Divide, 2025
The pattern is consistent. It's not the models—it's the infrastructure.
Teams reinvent every agent from scratch. Same failures, different companies.
Agents run unsupervised. High-stakes errors go uncaught. Trust collapses.
Each company learns the same lessons. No accumulated knowledge.
Multi-agent systems collapse. Stanford CooperBench: 25% success rate.
Proven prompts, integrations, and sequences. Encoded from real deployments.
Smart escalation. Approval queues. Humans handle edge cases.
What breaks and how to prevent it. Compound learning across clients.
Coordinate multi-agent work. Handle failures gracefully.
The Unlock
The 5% that succeed have infrastructure. Templates. Oversight. Failure patterns. We're building that infrastructure as a service.
The most valuable infrastructure companies started by doing the work themselves.
Data Labeling → AI Infrastructure
Started labeling images for self-driving cars (2016). Now the "Data Foundry" powering OpenAI, Meta, Google. 50% gross margins from tech-enabled services.
Bookkeeping Services → Financial Infra
"AWS for SMB accounting." Started doing bookkeeping. Now processes $3B+ in transactions. Jeff Bezos led funding.
Payments API → Financial Infrastructure
Started with simple payment processing (2010). Expanded to Connect, Radar, Atlas. Infrastructure that grows as customers grow.
The Pattern
Do the work → Encode the patterns → Become the platform. Services fund the R&D. Each engagement builds the moat. Competitors starting later start from zero.
Their journey is our playbook. Same model, different layer.
Started labeling images for AV companies. Revenue from day one.
Built pre-labeling ML that made each human 10x more efficient.
Each correction improved their models. More data = better automation.
Nucleus, Validate, Launch—from labeling to full ML lifecycle.
Operating AI agent workflows for clients. Revenue from day one.
Workflow templates + orchestration that make agents reliable.
Each engagement encodes learnings. More workflows = better templates.
Guardrails, Observability, Governance—full agent lifecycle.
Scale AI is not a traditional BPO company. It is a Data Foundry. Their technology layer is their moat—human workforce augmented by proprietary software that compounds in value.
— Takafumi Endo, "Scale AI: Deconstructing the Foundry"
Each engagement encodes a playbook. Playbooks become the platform.
Figure everything out from scratch
Apply existing playbook + customize
Playbook is battle-tested
Playbooks become product
What actually works for each use case
Which models for which tasks (cost/quality)
Integrations, APIs, credentials patterns
What to block, what to escalate
Application companies fight for customers. Infrastructure companies power the ecosystem.
Race to the bottom. Easy to copy.
Each customer = new acquisition cost
Commodity software pricing
Customers can leave anytime
Mission-critical. Hard to replicate.
Templates improve → more value → more customers
Scale AI: ~18x revenue multiple ($29B valuation on $1.5B ARR). Stripe: higher.
Workflows built on your templates
Network effects are the underlying principle behind the success of companies like AWS, Stripe, and Salesforce. Higher network density means the product value increases.
— NFX: The Network Effects Manual
AI agents are the fastest-growing category in enterprise software. We're building the infrastructure layer.
If the AI agent market reaches $50B, infrastructure captures 20-30% of stack value:
Every month = more encoded knowledge
Services fund the platform
Failure patterns competitors don't have
Four layers that make AI agents reliable. We're building all four.
Browser-layer guardrails that block irreversible actions
Observability for autonomous agents. See what they do.
Governance, compliance, audit trails
Verified code snippets for agent tool use
Lead gen + qualification workflows
Synthetic data pipeline workflows
Literature review + synthesis workflows
Outbound + meeting booking workflows
Fat Startup Thesis
We're getting paid to build our moat. Every dollar of revenue = more encoded knowledge. Competitors starting later start from zero.
"A fat startup ships outcomes, not features. It bundles software, data, and human ops into one integrated product that actually gets the job done."
— Andrew Lee, a16z Speedrun
Prove unit economics at scale
Across 5+ verticals
Guardrails, Observability, Governance
Deploy without our team
GPT-5, Claude 4—agents can work
70-80% churn = customers seeking alternatives
No dominant player yet. First-mover wins.
EU AI Act mandates oversight, audit trails
Infrastructure that makes AI agents reliable. Workflow templates. Orchestration. Human oversight.
Every company deploying agents will need this. We're building it.
Co-Founder + Advisor
$140M exit (Mystery) · a16z funded
Co-Founder & CTO
CTO of Because · $3M Seed
Co-Founder
yapthis.com · Shipped production agents
MIT NANDA Study: 95% AI failure rate, 171% ROI when successful
MarketsandMarkets: $7.8B → $52.6B AI agents market (2025-2030)
Scale AI (Sacra): $1.5B ARR, $29B valuation, 50% gross margins
Pilot (CNBC/TechCrunch): $1.2B valuation, Bezos-backed
11x/Artisan: 70-80% churn within months (Broadn research)
RAND Corporation: 80% AI project failure rate
Post an outcome. AI agents compete. Pay only for results. We're building the outcome marketplace for the AI economy.
This is the exact model a16z partners are calling for in 2026.
Say you need 50 qualified sales meetings. Instead of buying another AI tool, you post a bounty: "$500 per meeting booked." AI agents compete. Whoever performs best gets paid. We already do this with bug bounties, Kaggle, hackathons. Why not for AI agents going after real business outcomes?
— Macy Mills, a16z Speedrun, "14 Big Ideas for 2026"
I'm especially excited about products that use AI to make previously expensive services cheaper and more accessible, sometimes using human-in-the-loop to start.
— Kenan Saleh, a16z Speedrun, "14 Big Ideas for 2026"
A fat startup ships outcomes, not features. It bundles software, data, and human ops into one integrated product that actually gets the job done.
— Andrew Lee, a16z Speedrun Partner
The freelance marketplace is $1.5T. It's about to be disrupted by AI agents.
Pay humans by the hour. Hope they deliver.
Fixed-price gigs. Still human-dependent.
Wait days. Pay premium. Quality varies.
$X per meeting, $Y per video, $Z per lead.
AI agents work 24/7. Instant scale.
More agents = better matching = better outcomes.
The Paradigm Shift
As we move to a future based on outcome-based pricing that perfectly aligns incentives between vendors and users, we'll first move away from time-based billing.
— a16z Big Ideas 2026
Bounties + Escrow + AI Agents = Outcome Marketplace
"Book qualified meeting" or "Generate product video"
Pay what the outcome is worth to you
Funds held in escrow. Pay only on delivery.
Match capabilities to opportunities
Success rate → more bounties → more revenue
Verified outcome → automatic payout
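A minimal sketch of the escrow lifecycle above, assuming a simple fraction-verified settlement rule; the partial-credit policy and amounts are illustrative placeholders, not a production payout contract:

```typescript
// Minimal state machine for a bounty with escrowed funds:
// open → claimed → settled (paid or refunded).

type BountyState = "open" | "claimed" | "paid" | "refunded";

class EscrowBounty {
  state: BountyState = "open";

  constructor(readonly outcome: string, readonly escrowed: number) {}

  claim(): void {
    if (this.state !== "open") throw new Error("bounty not open");
    this.state = "claimed";
  }

  // Verified outcome → automatic payout; partial credit for partial delivery.
  // verifiedFraction is the share of the outcome that passed verification.
  settle(verifiedFraction: number): { payout: number; refund: number } {
    if (this.state !== "claimed") throw new Error("nothing to settle");
    const payout = Math.round(this.escrowed * verifiedFraction);
    const refund = this.escrowed - payout; // unearned funds return to the buyer
    this.state = payout > 0 ? "paid" : "refunded";
    return { payout, refund };
  }
}
```

The buyer's funds sit in escrow from posting to settlement, so the agent is paid automatically on verified delivery and the buyer is refunded automatically on failure, with no invoicing step in between.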
Proven in bug bounties, open source, and ML competitions. Now it's time for AI work.
Imagine a tool where you describe your problem and get a solution built for you. Today we're introducing Bounties, a marketplace where you work with top creators and bring your software ideas to life.
— Replit, on launching Bounties
Replit proved bounties work for code. We're proving it works for any AI-deliverable outcome.
Over the past 5 years we've supported the funding of public goods. Started with bounties for open source, evolved to quadratic funding.
— GitCoin: $60M+ distributed
GitCoin proved bounties + crypto payments = massive coordination. We're applying this to AI agent work.
70% of tech value comes from network effects. Here's how we build them.
Network effects have been responsible for 70% of all the value created in technology since 1994. Founders who deeply understand how they work will be better positioned to build category-defining companies.
— NFX, "The Network Effects Bible"
Attracts more agents to the platform
Faster delivery, higher quality outcomes
Word of mouth, lower prices, faster delivery
What works, what fails, edge cases
Route bounties to best-fit agents
Compound knowledge competitors can't replicate
Metcalfe's Law
The value of a network grows in proportion to N² (nodes squared). With agents AND buyers, we get cross-side network effects that compound faster than single-sided platforms.
The missing infrastructure for AI agent marketplaces.
Who built it, what it can do, audit trail
Track record based on actual outcomes, not reviews
"This agent is 94% on sales meetings, 78% on video"
Funds released only on verified delivery
Human or AI arbitration for edge cases
Partial credit for partial delivery
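The reputation layer above can be sketched as a per-category running score over verified outcomes, where credit between 0 and 1 lets partial delivery count. Category names and the clamping rule are illustrative assumptions:

```typescript
// Per-category reputation from verified outcomes, not reviews.
// Each outcome records a credit in [0, 1]; 1 = full delivery,
// fractional values = partial credit for partial delivery.

class AgentReputation {
  private totals = new Map<string, { credit: number; count: number }>();

  record(category: string, credit: number): void {
    const t = this.totals.get(category) ?? { credit: 0, count: 0 };
    t.credit += Math.max(0, Math.min(1, credit)); // clamp to [0, 1]
    t.count += 1;
    this.totals.set(category, t);
  }

  // Success rate per category ("94% on sales meetings, 78% on video").
  rate(category: string): number | null {
    const t = this.totals.get(category);
    return t ? t.credit / t.count : null;
  }
}
```

Because every data point is a verified, paid-out outcome, the score is harder to game than star ratings, and bounty routing can read it per category rather than as one global number.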
Like Uber: start premium, then open the platform.
Quality control, learn playbooks
Customers paying for outcomes
Escrow, verification, reputation
Vetted builders, revenue share
Anyone can compete for bounties
Like Uber, Airbnb, marketplace standard
The Uber Playbook
Uber started with black cars (premium, managed) before opening to UberX (open marketplace). We start with our agents, prove economics, then open to all. Services fund the platform build.
Services → Platform is a proven path to massive outcomes.
Data labeling for ML companies
Tools, workflows, quality systems
Services funded the infrastructure
Platform economics, not services multiples
Data labeling for ML
Sales, content, research, ops...
Every white-collar task that can be AI'd
GPT-5, Claude 4 can do real work
Agents can transact autonomously
OpenClaw, MCP, agent frameworks
70% churn = buyers want outcomes
Companies spending on AI, getting nothing
No AI-native outcome marketplace yet
Emerging primitives like x402 make payment settlement programmable and reactive. Smart contracts can settle a dollar payment globally in seconds. In 2026, this becomes the rails for agent commerce.
— a16z Big Ideas 2026, Part 3
Co-Founder + Advisor
$140M exit (Mystery) · a16z funded · Built consumer products used by millions
Co-Founder & CTO
CTO of Because · $3M Seed · Deep agent infrastructure experience
Co-Founder
yapthis.com · Agentic architecture · Shipped production agent systems
What Traction Proves
Companies pay for outcomes. 0% churn because incentives align. This is the business model for AI work.
Scale agent capacity, build marketplace infra
Prove economics before opening marketplace
Partner agents, then fully open
We run agents daily, know what breaks
ClawView, guardrails, workflows
$4K MRR proves the model
Post an outcome. AI agents compete. Pay for results. The marketplace that makes AI actually deliver.
🔧 Infrastructure We're Building
🛡️ Guardrails • 📊 ClawView • 🏛️ AgentGov
Trust layer that makes marketplace outcomes reliable.
📚 Sources
a16z: "14 Big Ideas for 2026" (Macy Mills, Andrew Lee, Kenan Saleh) • "Big Ideas 2026 Part 1-3" • NFX: "The Network Effects Bible" (70% of tech value) • Market Data: Scale AI ($13.8B), Upwork ($1.67B), GitCoin ($60M+ distributed) • Replit: Bounties marketplace launch
Everyone's building autonomous agents. We're building the layer that makes them actually work: purpose-built infrastructure for human oversight at scale.
The research is clear—and the industry is learning the hard way.
"Multi-agent architectures, despite their promise, can fall short on efficiency, reliability, and even accuracy... performance often degrades as coordination complexity increases."
— Berkeley/DeepMind "Why Multi-Agent LLM Systems Fail", 2025
ChatDev on ProgramDev benchmark
Across autonomous agent frameworks
In uncoordinated "bag of agents"
"42% of companies abandoned most of their AI initiatives in 2024, up from 17% the previous year. The average organization scrapped 46% of AI proof-of-concepts."
— S&P Global Research, 2024
MIT Research on enterprise deployments
RAND Corporation AI project study
AI projects vs standard software
Why This Matters
The industry is betting billions on fully autonomous agents. The research says they don't work. Someone needs to build the layer that makes them work.
The largest AI research org in the world just validated our thesis.
"We argue that human-in-the-loop agentic systems offer a promising path forward, combining human oversight and control with AI efficiency to unlock productivity from imperfect systems."
— Microsoft Research, Magentic-UI (July 2025)
Lightweight intervention, massive improvement
Minimal interaction overhead
Human + agent collaborate on plan before execution
Seamless handoff between human and agent control
Human approval for high-stakes actions
Learn from past interactions to improve
Microsoft's Conclusion
"Even as tomorrow's agents become more capable and reliable, we believe that human involvement will remain essential for preserving human agency, resolving unforeseen ambiguities, and guiding agents in adapting to an ever-changing world."
Real-world data from millions of Claude Code sessions reveals how humans actually oversee agents.
Experienced users let Claude run autonomously
They intervene more often, not less
From approving everything to watching for problems
On complex tasks vs simple ones
On the most difficult tasks
They can (and should) ask for help
"Effective oversight doesn't require approving every action but being in a position to intervene when it matters... our central conclusion is that effective oversight of agents will require new forms of post-deployment monitoring infrastructure and new human-AI interaction paradigms."
— Anthropic Research, "Measuring AI Agent Autonomy in Practice" (Feb 2026)
The Deployment Overhang
Anthropic found that "the autonomy models are capable of handling exceeds what they exercise in practice." The bottleneck isn't model capability—it's the oversight infrastructure.
The analogy everyone is converging on—and what it means for product design.
"Think of agents within your multi-agent system as the airplanes. The agents have their own autonomy to act. But air traffic control provides guardrails, coordination, and human oversight for the whole system."
— Jason Bryant, AI in Pharma (Jan 2026)
Pilots make real-time decisions
Routing, conflicts, emergencies
Technology can't modify standard procedures
Incidents become new procedures
Not 1:1 human-to-agent
Nor vice versa—complementary roles
Edge cases require human judgment
ATC isn't going away
The Thesis
As AI agents proliferate, every company will need an "air traffic control" system for their agent fleet. That's the control plane we're building.
Existing tools weren't designed for the human-agent oversight problem.
Conversational, not workflow-oriented. Can't manage 100 agents. No approval queues. No batch operations. You'd need a chat window per agent.
Great for developers. Useless for ops teams. Can't approve actions in real-time. No visual understanding of agent state or intent.
Ad hoc approvals. No context. Alert fatigue. Doesn't learn from decisions. Can't see what agent plans to do next.
Read-only visibility. No intervention capability. See problems after they happen. Can't modify agent plans mid-execution.
"Only 14.4% of enterprises have full security approval for AI agents. 88% reported agent-related incidents. The interface problem is also a governance problem."
— Gravitee State of AI Agents Report, 2026
The Gap
There's no purpose-built interface for humans to oversee AI agents at scale. Not dashboards. Not chat. Not alerts. A new category needs to exist.
Distilled from Microsoft, Anthropic research, and our own deployments.
See what agent intends to do before it acts. Edit plans. Add constraints.
Define allowed domains, tools, actions. Agent can't exceed boundaries.
Start from proven patterns. Don't reinvent for every task.
See agent actions as they happen. Browser view. Code execution. API calls.
Pause any agent instantly. Take control. Hand back.
Automatic pause for high-stakes actions. Configurable thresholds.
All pending approvals across all agents in one view.
Approve/reject patterns across many agents at once.
Route different decisions to different humans by expertise.
Human approvals become future patterns. Rejections become rules.
Auto-adjust when to ask humans based on outcomes.
Workflows improve with every human intervention.
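The guard → approval queue → learning loop path above can be sketched as one small state machine. Tool names and the high-stakes list are assumptions for illustration, not a production policy:

```typescript
// Action guard with an approval queue and a learning loop:
// allowed tools define the boundary, high-stakes actions pause for a human,
// and human rejections become standing rules for next time.

type Action = { tool: string; target: string };
type Verdict = "allowed" | "blocked" | "needs_approval";

class ControlPlane {
  private queue: Action[] = [];
  private learnedBlocks = new Set<string>(); // rejections become rules

  constructor(
    private allowedTools: Set<string>,
    private highStakes: Set<string>,
  ) {}

  guard(a: Action): Verdict {
    const key = `${a.tool}:${a.target}`;
    if (!this.allowedTools.has(a.tool) || this.learnedBlocks.has(key)) {
      return "blocked"; // outside boundaries, or previously rejected
    }
    if (this.highStakes.has(a.tool)) {
      this.queue.push(a); // pause and wait for a human decision
      return "needs_approval";
    }
    return "allowed";
  }

  // Human decision on the oldest queued action.
  decide(approve: boolean): void {
    const a = this.queue.shift();
    if (a && !approve) this.learnedBlocks.add(`${a.tool}:${a.target}`);
  }
}
```

The key property is that the human is only consulted for the high-stakes slice, and each rejection shrinks that slice, which is how the oversight ratio improves over time.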
Every complex system has a control plane. AI agents need one too.
Control plane for infrastructure. See what's happening. Alert when things break. Intervene.
Control plane for containers. Orchestrate workloads. Handle failures. Scale automatically.
Control plane for identity. Who can access what. Audit trails. Compliance.
What agents are doing. Approvals & intervention. Learning & guardrails. This category doesn't exist yet.
"The control plane provides management and orchestration across an organization's environment. It's akin to air traffic control for applications."
— Vectra AI definition
The Opportunity
Infrastructure got Datadog. Containers got Kubernetes. Identity got Okta. AI agents need their control plane. We're building it.
The VC objection—and why it's wrong.
"If humans are in the loop, doesn't that kill unit economics? Isn't the whole point to remove humans?"
Human labelers + AI. Humans as oversight.
Human bookkeepers + AI. Humans as QA.
Human analysts + AI. Humans as strategists.
"Humans as OVERSIGHT, not labor. AI does the work, humans QA. The ratio improves over time."
1 human oversees 10 agents. Heavy QA.
System learns. Fewer interventions needed.
Humans handle edge cases only. Still critical.
The Avi Medical Case Study
81% automation rate. 93% cost savings. Humans handle complex cases. HITL doesn't kill unit economics—it enables them.
Everyone's zigging toward full autonomy. We're zagging toward control.
Demo well. Break in production.
Better models. More tools. Same failure modes.
17x error amplification, per DeepMind.
The dream that keeps failing.
Makes ANY agent more reliable.
Complementary strengths. Better outcomes.
Turns bag-of-agents into functional team.
Exception handling. Strategic oversight.
"I'm especially excited about products that use AI to make previously expensive services cheaper and more accessible, sometimes using human-in-the-loop to start."
— Kenan Saleh, a16z Speedrun Partner
Our Position
We're not betting against agent capabilities improving. We're betting that oversight infrastructure will always be needed—and no one is building it well.
Co-Founder + Advisor
$140M exit (Mystery) · a16z funded · Built consumer products used by millions
Co-Founder & CTO
CTO of Because · $3M Seed · Deep agent infrastructure experience
Co-Founder
yapthis.com · Agentic architecture · Shipped production agent systems
OpenAI Operator, Anthropic Claude Code, 1000+ agent startups
95% pilot failure is now common knowledge
Microsoft, Anthropic, DeepMind all pointing to HITL
EU AI Act mandates audit trails & oversight
Proving the thesis with real customers
Dogfooding our own control plane daily
Components of the full control plane
Purpose-built infrastructure for human oversight of AI agents at scale. Plan review. Action guards. Approval queues. Learning loops. The missing layer that makes agents actually work.
Build the full control plane product
Prove control plane scales across customers
Be "Datadog for AI agents"
No one owns "AI agent control plane" yet
Microsoft, Anthropic, DeepMind alignment
Horizontal opportunity across industries
🔧 Infrastructure We're Building
🛡️ Guardrails • 📊 ClawView • 🏛️ AgentGov • 🤖 Employee OS
The Control Plane integrates all infrastructure layers into one human-facing interface.
🔬 Research Foundation
MIT: 95% of AI pilots fail · DeepMind: 17x error amplification in multi-agent · Microsoft Magentic-UI: 71% accuracy improvement with HITL · Anthropic: "New oversight infrastructure needed" · Berkeley: "Why Do Multi-Agent Systems Fail?" · S&P Global: 42% of AI initiatives abandoned
"Vibe coding" revolutionized app development—describe what you want, AI builds it. Now apply this to business outcomes. Describe the result, AI + humans deliver it.
What started as a meme became a paradigm shift. Now it's evolving beyond code.
"There's a new kind of coding I call 'vibe coding', where you fully give in to the vibes, embrace exponentials, and forget that the code even exists."
— Andrej Karpathy, Feb 2025 (coined the term)
Karpathy's early prediction about LLM capabilities
Cursor, Replit, Claude Code—describe → build
Research, writing, reporting, file operations, "glue work"
"What changed in early 2026 is that vibe coding is no longer confined to software development; it is spreading into research, writing, reporting, spreadsheet wrangling, file operations, and 'glue work' that usually fragments attention."
— Ken Huang, "The Vibe Shift" (Jan 2026)
The Pattern
Vibe coding showed that natural language → complex software works. Now we're applying the same pattern to natural language → business outcomes.
The next evolution: describe what you want to achieve, not what you want built.
$5-10K/month
Import lists, write sequences, set rules
Fix errors, adjust settings, babysit
70% churn in 3 months when it doesn't work
"50 qualified sales meetings with Series A fintech founders"
Research, outreach, qualification, scheduling
Review, approve, handle edge cases
$X per meeting delivered
The Thesis
Vibe coding proved that intent → artifact works for software. Vibe outcomes proves it works for business results. The "vibes" are the goal—the execution is handled by well-orchestrated HITL agent workflows.
Describe outcome → Agents execute → Humans QA → Outcome delivered
"Book 50 qualified meetings with Series A fintech founders in Q1"
Identifies prospects, signals, contact info
Drafts personalized messages
Approves messaging before send
Books the meeting when prospect replies
"Process this month's invoices and flag anomalies"
Pulls data from PDFs, emails, systems
Matches to POs, identifies discrepancies
Approves exceptions, flags fraud
Processed invoices, exception report
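The invoice workflow above can be sketched as a reconciliation pass with a human-review queue; the 1% match tolerance is an illustrative assumption:

```typescript
// Extract → match to PO → route: clean matches process automatically,
// discrepancies and unmatched invoices go to the human approval queue.

type Invoice = { id: string; po: string; amount: number };
type PO = { po: string; amount: number };

type Result = { autoProcessed: Invoice[]; needsHumanReview: Invoice[] };

function reconcile(invoices: Invoice[], pos: PO[], tolerance = 0.01): Result {
  const byPo = new Map<string, number>();
  for (const p of pos) byPo.set(p.po, p.amount);

  const result: Result = { autoProcessed: [], needsHumanReview: [] };
  for (const inv of invoices) {
    const expected = byPo.get(inv.po);
    const ok =
      expected !== undefined &&
      Math.abs(inv.amount - expected) <= expected * tolerance;
    (ok ? result.autoProcessed : result.needsHumanReview).push(inv);
  }
  return result;
}
```

The deliverable maps directly onto the output: `autoProcessed` is the processed-invoices file, `needsHumanReview` is the exception report a human approves.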
Pure AI can't deliver reliable business outcomes. The research is clear.
"Multi-agent architectures, despite their promise, can fall short on efficiency, reliability, and even accuracy... performance often degrades as coordination complexity increases."
— Berkeley/DeepMind, 2025
AI can be confidently wrong about business-critical decisions
Business has nuance AI can't anticipate
Brand damage, legal liability, lost deals
"Hybrid AI workflows, which combine automation with human oversight, are not a fallback; they're the modern standard for reliability, trust, and scalability in 2026."
— Parseur, Dec 2025
AI does 90% of work, humans verify critical decisions
System learns when to ask, when to proceed
Microsoft found lightweight intervention = massive improvement
This is the UX for the AI-native agency, control plane, and marketplace pitches.
Conversational, not outcome-oriented. Can't manage complex multi-step workflows. No approval queues.
Read-only visibility. No intervention. See problems after they happen. Can't modify plans mid-execution.
Ad hoc. No context. Alert fatigue. Can't see what agent plans to do next.
"I need X" → system figures out how
See what's happening toward your goal
Review decisions that matter
Course-correct mid-execution
Clear metrics: delivered vs requested
🔗 This Powers Our Other Pitches
⚡ Fat Startup: Vibe outcomes is how customers interact with us
🚗 Uber for AI Work: Natural language bounty posting
🎛️ Control Plane: The human oversight layer
☁️ AWS of AI Work: Workflow templates activated by intent
The shift from "tools" to "outcomes" is creating massive new markets.
70% AI SDR churn = customers seeking alternatives
95% pilot failure = demand for what works
Want outcomes, not another tool to learn
"Per-seat is no longer the atomic unit of software. When AI can handle ticket resolution, the natural pricing metric becomes successful outcomes."
— a16z Enterprise Newsletter, Dec 2024
$X per meeting, $Y per processed invoice, $Z per video
Who else is thinking about natural language → outcomes?
Sell tools. Charge per seat. You manage agents. 70% churn.
❌ Not outcome-based
Infrastructure for developers. Build your own workflows.
❌ Not outcomes, just primitives
Workflow automation. You design the flows.
❌ Not AI-native, not outcome-based
Services + HITL → platform. "We need labeled data" → delivered.
✓ Outcome-based, HITL model
"Do my bookkeeping" → done. Humans + AI.
✓ Outcome-based, HITL model
AI support priced per successful outcome.
✓ Outcome-based pricing model
Our Differentiation
Horizontal, not vertical. Scale AI = data labeling. Pilot = bookkeeping. We're building the general-purpose vibe outcomes platform—natural language to any deliverable business result.
Proving the thesis with real customers and real outcomes.
SDR/BDR for construction, startups (50% of revenue)
ML training data pipelines (30% of revenue)
University lab literature synthesis (20% of revenue)
"When you only pay for outcomes, there's no reason to churn. We deliver meetings, they pay. We don't deliver, they don't pay. Aligned incentives = sticky customers."
vs. AI Tool Churn
AI SDRs charge $5-10K/mo whether or not they work. When they don't deliver, customers leave. Misaligned incentives = 70% churn.
Technology, market, and cultural convergence make this the moment.
GPT-5, Claude 4 can execute real business workflows
OpenClaw, MCP, tool-use protocols
Agents can transact autonomously (a16z Big Ideas 2026)
Microsoft, Anthropic, DeepMind all pointing same direction
70% AI SDR churn. 95% pilot failure. Customers want what works.
Companies spending billions on AI, getting nothing
30%+ enterprise SaaS moving to outcome-based
Natural language → results is now understood
"2025 was widely labeled 'the year of AI agents.' In reality, it was the year we learned what agents can and cannot do. 2026 is the year we build systems that work reliably, repeatedly, and in production."
— Human-in-the-Loop Newsletter, Dec 2025
Co-Founder + Advisor
$140M exit (Mystery) · a16z funded · Built consumer products used by millions
Co-Founder & CTO
CTO of Because · $3M Seed · Deep agent infrastructure experience
Co-Founder
yapthis.com · Agentic architecture · Shipped production agent systems
Running OpenClaw infrastructure ourselves
Browser-layer guardrails
Agent observability
Playbooks that compound
$4K MRR from real deliverables
Built the infrastructure, not just the agents
Encoded in playbooks from real experience
Describe the outcome you want. AI agents + human QA deliver it. Pay only for results. The interaction layer for the AI economy.
Scale agent capacity + build the interface
Prove vibe outcomes across multiple verticals
Anyone can describe outcomes and get them
"Vibe outcomes" platform doesn't exist yet
Vibe coding is mainstream—extend it to business
AI agents + outcome-based pricing converging
📚 Research Foundation
Karpathy: Coined "vibe coding" Feb 2025 · RAND: 80% AI project failure · Microsoft Magentic-UI: 71% accuracy improvement with HITL · CooperBench: 30% lower success in multi-agent without coordination · a16z: Outcome-based pricing shift · Gartner: 30%+ enterprise SaaS with outcome pricing by 2025 · Bessemer: AI Pricing Playbook (Feb 2026)
🔗 Related Pitches
⚡ Fat Startup • 💰 Outcome-Based • 🚗 Uber for AI Work • 🎛️ Control Plane
Vibe Coding Outcomes is the UX/interaction layer that powers all of these.
Series A-B companies ($13M-$160M raised) with specific research they could implement but haven't.
NYC Tech Startups
Combined Funding
Research Opportunities
Building "Wall Street's first AI analyst" — LLMs for financial reasoning
R&D Opportunities:
Hook: "Your financial reasoning models could be 40% more accurate on tabular data with Chain-of-Table"
AI for finance — valuation models, deal analysis, Excel/PPT generation
R&D Opportunities:
Hook: "SpreadsheetLLM could cut your Excel generation errors by 30%"
GenAI for financial professionals — broker research, earnings calls, filings
R&D Opportunities:
Hook: "LongLoRA could let you process 10x longer earnings calls without quality loss"
Marketplace for curated AI-ready datasets (Insights Exchange)
R&D Opportunities:
Hook: "DataComp benchmarking could become your quality certification"
AI for cancer precision medicine — analyzes data to identify optimal treatments
R&D Opportunities:
Hook: "CancerGPT's few-shot approach could expand your drug combination predictions 5x faster"
AI + IoT for senior care — AUGi device for fall detection and patient monitoring
R&D Opportunities:
Hook: "RT-DETR could cut your fall detection latency by 40% while running entirely on-device"
AI for mental health — "Ash" chatbot simulates therapist-like conversations
R&D Opportunities:
Hook: "Constitutional AI could reduce harmful responses by 80% while maintaining therapeutic value"
Healthcare payment automation — streamlines insurance reimbursement
R&D Opportunities:
Hook: "Medical coding LLMs could auto-fill 60% of your claims forms"
AI-powered payroll platform for multi-state compliance
R&D Opportunities:
Open-source network automation platform
R&D Opportunities:
AI marketing for home service businesses
R&D Opportunities:
AI for sales personalization — integrates 100+ data sources
R&D Opportunities:
Hook: "Buyer intent prediction could 3x your users' reply rates"
AI search optimization — helps brands appear in AI-generated responses
R&D Opportunities:
Influencer commerce platform
R&D Opportunities:
AI for regulatory compliance — automates review of legal documents
R&D Opportunities:
Document AI — searches large document sets with citations
R&D Opportunities:
Hook: "Self-RAG could improve your citation accuracy by 25%"
SMB cybersecurity
Reforestation + carbon credits
Silicon anodes for EV batteries
P2P sports betting
High-protein nutrition bars
Laundry/dry-cleaning SaaS
Subject: Quick R&D idea for [Company] — [specific technique]
Hi [Name],
Congrats on [recent news/funding]. I've been researching [specific paper/technique] that could help with [their specific problem].
Quick version: [1-sentence benefit with number]
I put together a 2-page brief showing how this could work for [Company]. Want me to send it over?
The real market pain is downstream from R&D — it's about shipping AI to production.
AI projects fail to reach production (RAND)
GenAI pilots failing (MIT/Fortune 2025)
The gap isn't finding the right model. It's shipping AI to production.
From r/MLQuestions — 688 upvotes, Nov 2025
"I'll interview someone who can explain LoRA fine-tuning in detail but has never deployed anything beyond a Jupyter notebook."
— Startup co-founder hiring ML engineers
From Cleanlab's survey of 95 teams with AI in production
Teams satisfied with observability
Plan to improve observability next year
Rebuild AI stack every 3 months
Even among the 5% of companies that reach production, most remain early in maturity. They can't reliably know when their agents are right, wrong, or uncertain.
| ❌ OLD: "AI R&D Engineer" | ✅ NEW: "Production AI Engineer" | |
|---|---|---|
| Vibes | Research, experimentation | Deployment, reliability |
| Perception | Nice-to-have | Need-to-have |
| Target | Teams with resources | Teams with stuck projects |
| Job-to-be-done | "Find the best model" | "Ship to production this month" |
Aemon = the optimization engine
You = the shipping engine
Pain: "We have 3 AI features in Jira blocked for months"
Pain: "We want AI in our product but don't know where to start"
Pain: "Platform team of 5 supporting 20 feature teams — we're bottlenecked"
Pain: "Can't deploy AI without compliance sign-off"
Based on Garry Tan's YC video insight: agents pick tools based on doc quality, not actual performance. The Whisper/Groq problem.
Claude Code defaulted to Whisper V1, a near-deprecated model, because its documentation is better than Groq's, even though Groq is 200x faster and 10x cheaper.
— Garry Tan, YC Partner, Feb 2026
The Insight
Agents pick tools based on doc quality, not actual performance — and that's exactly the gap AgentDocs exploits.
| Wedge | Market | Pain | Competition | Fit | x402 | Timing | Total |
|---|---|---|---|---|---|---|---|
| 🥇 LLM / Model Routing | 5 | 5 | 3 | 5 | 4 | 5 | 27/30 |
| 🥈 Video Gen | 5 | 5 | 3 | 5 | 5 | 4 | 27/30 |
| 🥉 Audio / Transcription | 3 | 5 | 2 | 4 | 4 | 5 | 23/30 |
| Deployment / Hosting | 5 | 4 | 4 | 4 | 3 | 4 | 24/30 |
| Agent Identity (email/phone) | 4 | 5 | 4 | 3 | 3 | 5 | 24/30 |
| Databases | 5 | 4 | 3 | 3 | 2 | 4 | 21/30 |
| Image Gen | 4 | 3 | 5 | 3 | 4 | 3 | 22/30 |
What agents are ACTUALLY spending on today (x402scan.com, Feb 2026)
692 servers — dominant vertical
486 servers — led by Virtuals ACP ($163K/day)
216 servers — StableEnrich, httpay
203 servers — alpha signals
Zero servers — Garry Tan example!
42 txns — essentially nothing
Nothing
Nothing
| Wedge | x402 Now | Holly Fit | Verdict |
|---|---|---|---|
| Multi-API aggregation + capability layer | ✓ 3 players, no AgentDocs | ✓ Direct fit | Best immediate wedge |
| Agent-to-agent coordination | ✓ $163K/day (Virtuals) | ✓ Holly as orchestrator | Most validated demand |
| Social data for agents | ✓ StableSocial live | ✓ Fits Wurk agents | Niche but real |
| Transcription (Whisper/Groq) | ❌ Zero on x402 | ✓ Strong routing layer | 6–12 months early |
| Video gen | ❌ Near zero | ✓ Strong dogfood | 12–18 months early |
Key Insight
Absence of transcription/video/deployment on x402scan is opportunity signal, not rejection. StableEnrich proved the model: wrap existing APIs behind x402, get thousands of transactions immediately.
Zero servers on x402. Garry Tan moment 6 days ago. First-mover window open.
AgentDocs value: Verified schema {input, model: "groq|deepgram|whisper", output}
Groq at $0.02/min → charge $0.03/min
"Your agent would have chosen Whisper V1. Ours chose Groq."
Parameter chaos problem (Kling uses cfg_guidance, Runway uses guidance_scale). Genuinely unsolved.
AgentDocs value: Agent sends {prompt, style, duration, budget}, Holly resolves params
Already dogfood — Holly generates video
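The parameter-resolution step can be sketched as a canonical-to-provider mapping. The provider parameter names (`cfg_guidance`, `guidance_scale`) come from the deck's Kling/Runway example; the canonical keys and values are hypothetical.

```python
# Hypothetical canonical->provider parameter map for video generation.
PARAM_MAP = {
    "kling":  {"guidance": "cfg_guidance",   "length": "duration"},
    "runway": {"guidance": "guidance_scale", "length": "duration_seconds"},
}

def resolve(provider: str, canonical: dict) -> dict:
    """Translate one canonical request into provider-specific params,
    dropping anything the provider doesn't understand."""
    mapping = PARAM_MAP[provider]
    return {mapping[k]: v for k, v in canonical.items() if k in mapping}

req = {"guidance": 7.5, "length": 5}
assert resolve("kling", req) == {"cfg_guidance": 7.5, "duration": 5}
assert resolve("runway", req) == {"guidance_scale": 7.5, "duration_seconds": 5}
```

One canonical request, N provider dialects: the agent never sees the chaos.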
Agent says {task: "transcribe", latency: "fast"} → gets best provider with pricing + ready API call.
The purest AgentDocs wedge
Garry Tan: "Has anybody built Twilio for agents yet?"
Email + phone + wallet in one API call
Jared Friedman (YC): "Even the best dev tools don't let you sign up via API. This is a big miss in the claude code age — claude can't sign up on its own."
Aggregate APIs (Apollo + Firecrawl + Grok + Serper) behind one x402 endpoint.
"Throw money at endpoint, get data back"
OpenHolly becomes the first x402-native capability registry for non-crypto agent needs — the "Stack Overflow for agents" that makes every new category agent-accessible from day one.
For agents, "discovery" = machine-interpretable services, not human landing pages.
Services expose structured metadata (endpoints, pricing, chains). Facilitators crawl and index.
Layer of facilitators that index x402 services, maintain up-to-date pricing/metadata.
Coinbase agentic wallets pre-integrated with x402. Discovery APIs built-in.
"Internet of Agents" research: agents announce capabilities in machine-interpretable form.
The agent doesn't "Google" a platform; it queries its facilitator ("find a market-data API with latency <100ms and price <0.5¢/request"), receives candidates with structured metadata, picks one, then talks HTTP+402 with it.
— Perplexity Research, Feb 2026
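A facilitator query like the one above can be sketched as a filter over structured service metadata. Service names and numbers are invented; the constraint values mirror the quote's "latency <100ms and price <0.5¢/request" example.

```python
# Hypothetical facilitator index of x402 services with structured metadata.
SERVICES = [
    {"name": "marketdata-a", "kind": "market-data", "latency_ms": 80,  "price_usd": 0.004},
    {"name": "marketdata-b", "kind": "market-data", "latency_ms": 250, "price_usd": 0.001},
    {"name": "imagegen-x",   "kind": "image-gen",   "latency_ms": 50,  "price_usd": 0.02},
]

def query(kind: str, max_latency_ms: int, max_price_usd: float) -> list[str]:
    """Return candidate services meeting the agent's constraints."""
    return [s["name"] for s in SERVICES
            if s["kind"] == kind
            and s["latency_ms"] < max_latency_ms
            and s["price_usd"] < max_price_usd]

# "market-data API with latency <100ms and price <0.5¢/request"
assert query("market-data", 100, 0.005) == ["marketdata-a"]
```

The agent then speaks HTTP+402 to the winning candidate; discovery itself is just structured filtering.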
Video models can generate stunning visuals but can't follow precise instructions. The bottleneck isn't compute or architecture — it's training data with exact state trajectories.
Data scaling plateaus at 200K-400K samples. The persistent ~15% gap between in-domain and out-of-domain performance isn't solvable with more data — it requires architectural changes AND structured training data.
— VBVR Paper (Wang et al.), Feb 2026
Learned "everything moves together" — can't isolate changes
Can't represent "object A moved, B stayed" explicitly
Step 1 error → Step 2 error → reasoning chain breaks
Frame-by-frame ground truth of what changed
Same action, different contexts — curriculum learning
Genesis/Isaac Sim backends for real dynamics
The Robot Arm Analogy
You can't teach a robot to cook if it knocks over the salt every time it reaches for the pepper. Same with video models: if they can't execute precise state transitions, chaining multi-step reasoning becomes impossible. Controllability is the prerequisite.
Pre-built scene generators for:
Genesis claims:
faster than real-time simulation
Direct fine-tuning:
The "data factory" — parameterized generators + distributed workers — is the real competitive advantage. No productized version exists for vertical industries.
Network effect: Each vertical adds templates → attracts more customers → funds more verticals
Gross margins >80% (compute is cheap vs. real data collection)
Timing
Genesis open-sourced Dec 2024. NVIDIA Cosmos launched Jan 2025. π0 open-sourced Feb 2025. The infrastructure just became available — but nobody has built the vertical data factory layer yet.
| Company | Focus | Gap |
|---|---|---|
| NVIDIA Cosmos | Foundation models | Not vertical-specific data |
| Genesis AI | Physics engine | No data pipeline layer |
| Physical Intelligence | Robot foundation model | Consumes data, doesn't sell it |
| Scale AI | Data labeling | Labels real data, doesn't generate |
| Data Factory (Us) | Synthetic video data | Full vertical pipeline ✓ |
The dirty secret of robotics AI is that real-world data collection costs $100-1000/hour when you include robot time, human supervision, and failure recovery. Synthetic data at $0.01/clip changes the economics completely.
— Industry estimate
VLM-as-a-judge is expensive and non-reproducible. IntPhys shows models at chance level. Everyone's flying blind on what their world models actually understand.
Most models perform at chance levels (50%), in stark contrast to human performance, which achieves near-perfect accuracy (97%+). Current video understanding benchmarks do not capture intuitive physics.
— IntPhys 2 (Meta FAIR), Jun 2025
Expensive ($0.10-1.00/sample), non-reproducible, biased
Cherry-picked videos, no systematic testing
Academic benchmarks ≠ production reliability
Rule-based, reproducible, cheap to run
VBVR-Bench achieves ρ > 0.9 with human judgment
Robotics, driving, medical — each needs own benchmarks
The VBVR Breakthrough
VBVR-Bench demonstrates that rule-based evaluation can match human judgment (ρ > 0.9 correlation). But it's research code, not a product. Domain-specific versions don't exist.
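The ρ > 0.9 claim is a Spearman rank correlation between rule-based scores and human judgments. A toy sketch, with invented scores for five generated clips and a hand-rolled Spearman (no ties), shows the shape of that validation:

```python
def spearman(xs: list[float], ys: list[float]) -> float:
    """Spearman rank correlation; assumes no tied values (toy sketch)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n**2 - 1))

# Invented scores: rule-based checks vs human judgment on 5 clips.
rule_scores  = [0.9, 0.4, 0.7, 0.2, 0.8]
human_scores = [0.9, 0.5, 0.65, 0.1, 0.8]
rho = spearman(rule_scores, human_scores)
assert rho > 0.9  # the agreement level VBVR-Bench reports
```

If cheap, reproducible rules track human judgment this closely, the expensive VLM-as-a-judge step can be dropped from the inner loop.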
$5B+ has been raised by world-model companies with no standardized evaluation. Every company is building its own benchmarks internally. That's waste.
Comparable: ML evaluation market ~$500M (2024), growing 25%+ YoY
The gap between demo videos and production reliability is massive. Objects disappear, physics drifts, game logic is brittle over longer sessions. We need systematic evaluation, not cherry-picked demos.
— GradientFlow Analysis, 2026
IntPhys-style physics, VBVR-style controllability, basic API
Robotics suite (π0 compatible), driving suite (Wayve/Comma style)
Like HuggingFace but for world models. Attract submissions, build community
"World Model certified for X domain" — becomes industry standard
Training video/world models costs 10-100x more than training LLMs. The infrastructure layer is missing. We build the "AWS for embodied AI."
Embodied AI training requires tight integration of simulation, rendering, and ML. Current cloud offerings are designed for LLMs. The infrastructure gap is massive.
— Industry observation
H100s, A100s, optimized networking
PyTorch, JAX, distributed training
Text ingestion, tokenization, streaming
Physics engine + renderer + ML in one loop
Frame extraction, state annotation, streaming
Domain randomization, reality gap tools
Why This Matters Now
π0 just open-sourced. Genesis just launched. Cosmos is available. The building blocks exist but nobody has assembled them into a platform. Every robotics startup is duct-taping their own stack.
Managed Genesis/Isaac Sim instances
Video + state trajectory storage
Optimized for video models
| World model companies (funded) | $5B+ raised |
| Robot startups (π0 ecosystem) | 100+ companies |
| AV companies (simulation needs) | 50+ companies |
| Enterprise robotics adoption | Growing 30%+ YoY |
Conservative estimate: $2B addressable market for embodied AI infrastructure by 2028
Target: 40-60% gross margins (better than pure GPU cloud)
| Player | Sim | Data | Train | Eval | Deploy |
|---|---|---|---|---|---|
| CoreWeave/Lambda | ✗ | ✗ | ✓ | ✗ | ✗ |
| NVIDIA Omniverse | ✓ | ~ | ✗ | ✗ | ✗ |
| Genesis (OSS) | ✓ | ✗ | ✗ | ✗ | ✗ |
| Weights & Biases | ✗ | ~ | ~ | ✓ | ✗ |
| Us (Full Stack) | ✓ | ✓ | ✓ | ✓ | ✓ |
The Integration Thesis
Embodied AI requires tight coupling between simulation, data, and training. Point solutions create friction. An integrated platform captures the full workflow — and the full margin.
Deploy, coordinate, and govern fleets of Claude Code, Cursor, and Codex—so your team ships 10x faster with verification.
We fix that.
Enterprise spent $30-40B on AI pilots. Most failed—not because the models are bad, but because nobody built the infrastructure to run them safely.
MIT study: lack of context, poor verification, no adaptation. The agent isn't the problem—the harness is.
Answer.AI study: only 3/20 tasks completed. Best autonomous agent still needs orchestration.
Cursor's hidden API fees surprise users. One team spent $8,000 on a "$200" plan.
No coding agent learns from failures. Same mistakes repeat every session. Zero organizational memory.
"The bottleneck is now having multiple agents at once."
Every failed AI pilot breaks down on one of these. Mission Control solves all four.
Does the agent understand what you actually want? Intent verification, scope control, semantic diff between ask and interpretation.
How do you know it's right? Automated pipelines: tests, security, quality gates. Verified before it touches production.
Can the agent actually do this? Route tasks to best-fit agent. Claude Code for reasoning, Cursor for flow, Codex for CI/CD.
Does the system learn? Auto-generate rules from corrections. One person's fix helps the whole team.
The agent is commodity. The harness is moat. Everyone's building better agents—nobody's building the infrastructure to run 50 of them safely on a production codebase. We are.
From chaos to coordination in four steps.
Complex request → atomic, verifiable steps with dependency graph
Route each subtask to best-fit agent based on capability profiling
Multiple agents work simultaneously, conflicts prevented automatically
Automated verification, corrections become team-wide rules
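The four steps above can be sketched as a tiny scheduler: decompose into a dependency graph, route each subtask to a best-fit agent by capability, execute in dependency order. Task names, skills, and agent assignments are all hypothetical; the verification gate is elided.

```python
# Hypothetical task graph for one feature request.
TASKS = {
    "write-tests": {"deps": [], "skill": "reasoning"},
    "implement":   {"deps": ["write-tests"], "skill": "flow"},
    "wire-ci":     {"deps": ["implement"], "skill": "ci"},
}
# Routing table from the deck: Claude Code for reasoning,
# Cursor for flow, Codex for CI/CD.
AGENTS = {"reasoning": "claude-code", "flow": "cursor", "ci": "codex"}

def schedule(tasks: dict) -> list[tuple[str, str]]:
    """Topological order over the dependency graph, with each
    subtask routed to its best-fit agent."""
    done, order = set(), []
    while len(order) < len(tasks):
        for name, t in tasks.items():
            if name not in done and all(d in done for d in t["deps"]):
                order.append((name, AGENTS[t["skill"]]))
                done.add(name)
    return order

plan = schedule(TASKS)
assert plan == [("write-tests", "claude-code"),
                ("implement", "cursor"),
                ("wire-ci", "codex")]
```

Independent subtasks (same depth in the graph) are what run in parallel; conflicts are prevented because the graph, not the human, serializes the dependent ones.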
5 tabs, 5 agents, merge conflicts everywhere
45+ minutes per agent PR, AI slop review fatigue
No learning between sessions or team members
$20/mo → $400/mo actual spend
All agents in one view, automatic coordination
12-minute reviews, confidence scores, semantic diffs
Auto-generate .cursorrules, team-wide improvement
Route to cheapest capable agent, budget alerts
VS Code 1.109 enables multi-agent dev. GitHub Copilot + Claude + Codex side-by-side. No one owns orchestration.
25% of YC companies have almost entirely AI-generated codebases. They need this.
Claude Code at 72.5% SWE-bench—best in class. The gap is now coordination, not capability.
| Tool | Gap |
|---|---|
| Langfuse | Observes, doesn't orchestrate |
| LangSmith | LangChain lock-in, no self-host |
| Cursor/Devin | Single-agent, no coordination |
| Linear/Jira | Track humans, not agents |
| CrewAI | Framework-specific, not code-native |
We're agent-agnostic, code-native, verification-first.
Three things compound: (1) Routing models improve with every task—network effects. (2) CI/CD integration is sticky—once you're in, you stay. (3) Verification layer builds trust—that's years of R&D, not a feature.
The AI recruiting agency that takes roles from intake to offer — no humans in the loop until the interview.
One AI agent that owns the entire candidate journey — source to offer.
Expand to 50 agencies by end of year
Consulting is 70% intelligence work (research, analysis, modeling) and 30% judgment work. We automate the 70%—the nights and weekends work—so consultants can focus on what actually matters.
"At McKinsey, we spent over 90% of our time on manual work—reading reports, building Excel models, creating presentations." — Grasp Founders
"AI generates generic fluff, not MBB-quality output. Copilot can't do action titles. No tool understands Pyramid Principle or MECE structures."
Operand says "AI to Kill McKinsey." We say: give McKinsey analysts superpowers. The winning model is augmentation—Grasp has 200 customers and 3.5x ARR growth proving it.
The Killer Gap: Verifiability. Consultants don't trust AI outputs. Every claim needs traceable sources, explicit assumptions, confidence intervals. No one does this yet.
Multi-agent system that turns days of research into client-ready deliverables in hours. Research → Analysis → Deliverable—one unified workflow with human-in-the-loop verification.
Why Now: Multi-modal AI finally good enough. Reasoning models enable complex analysis. Consulting under cost pressure. Talent arbitrage—ex-MBB available to train AI and verify outputs.
10-50 person consulting shops. Ex-MBB founders who know what "good" looks like. Fast decision cycles, desperate for competitive advantage, can't build in-house AI.
Expansion Path: Boutiques → In-house corporate strategy teams → PE portfolio companies → Big Four individual teams → Enterprise. Grasp already serves "most of the Big Four" after starting narrow.
We're Cursor for consultants—AI that actually understands strategy, not just generates slides. Targeting $2M seed to build the verifiable, consulting-grade AI workflow that doesn't exist yet.
One-Liner: "We're building AI analysts for consulting firms. Our multi-agent system turns days of research into client-ready deliverables in hours—with every claim verifiable and humans always in control."
We're building AI agents that do the work of procurement BPOs at 1/20th the cost. Pay-per-outcome, not seats.
Enterprises spend over $180 billion annually on procurement talent, compared to roughly $10 billion on procurement software, reflecting how much work still happens manually around existing systems.
— Lio PR / Industry Analysis
When companies spend 18x more on people than software, the work is clearly being done by humans. That's our opening.
Annual spend on procurement talent
Annual spend on procurement software
The Insight
Software hasn't automated procurement — it's just digitized paperwork. The work is still done by humans. AI agents can now do that work.
SAP Ariba is the market leader. Users hate it. This is our opening.
SAP Ariba — 98 reviews, near-universal hate
"This software makes me wanna quit my job. This should not exist."
— SAP Ariba User, Trustpilot
"Logs me out 5720937 times a day... like software from 1980"
— SAP Ariba User, Trustpilot
"If you are a supplier, THIS WILL HURT YOUR BUSINESS."
— SAP Ariba User, Trustpilot
Built pre-cloud, can't rebuild without breaking everything
Revenue from supplier fees, not buyer value
6-12 month implementations, constant IT involvement
"Tell me it's not their department"
What Buyers Actually Want
The winners in this space have proven wedges and go-to-market strategies we can learn from.
Before building anything, founders DMed hundreds of procurement managers on LinkedIn — not to sell, but to learn.
Why It Worked
Built for a "boring, massive" market with "sleepy incumbents" and "low NPS"
Zero upfront cost. "You never pay for seats or shelfware." Implementation in days, not months.
Sequoia Thesis
"In the old world, SaaS sold the promise of ROI. In the new world, AI actually delivers it."
Started with "long tail" suppliers enterprises never negotiate with — low risk, high volume.
Positioned as replacing outsourcing, not software. Competes against BPOs (20x cost of software).
The four-pillars framework shows exactly where agentic AI outperforms existing tools.
| Area | Legacy Gap | Agentic Opportunity |
|---|---|---|
| Intake & Approval | Email chains, manual routing | Autonomous triage and routing |
| Negotiation | Human-only, can't scale | AI negotiation at scale (2000 suppliers simultaneously) |
| Contract Compliance | Periodic audits, 2% value leakage | Continuous monitoring, proactive alerts |
| Supplier Management | Reactive, manual tracking | Proactive risk sensing, autonomous action |
| Invoice Processing | OCR + human review | End-to-end matching and payment |
Coupa too expensive, Ariba too painful
Wide open — Magentic early stage
Tacto owns Germany, US gap
Healthcare, construction procurement
Like our SDR offering, but for procurement. We operate the agents, customers get outcomes.
LLMs can now read contracts, negotiate, and execute end-to-end. Pactum proved it at Walmart scale.
COVID, Ukraine, tariffs created urgency. "The old answers—another dashboard or SaaS tool—are spent."
58% struggle to find/retain procurement talent. BPO arbitrage is ending as AI gets cheaper than offshore labor.
"We're entering a phase in the enterprise where AI moves beyond workflow co-pilots to autonomous, multi-agent execution."
— Seema Amble, a16z
$180B of human labor waiting to be automated.
Incumbents at 1.2/5 stars. 90% of CPOs exploring agents.
✅ High intelligence-to-judgment ratio • ✅ $5.8B BPO proves willingness to pay • ✅ Measurable P&L impact • ✅ Sequoia + a16z conviction
HIGHLY ATTRACTIVE vertical for AI autopilot investment
39,000 agencies run on email and spreadsheets. We're building the AI that replaces their back office — delivering outcomes, not tools.
Getting one of these businesses insured takes ~50 steps over two weeks. The broker's actual judgment matters for maybe 5% of the process. The other 95% is pure coordination.
— Panta (YC W26)
Sequoia, Emergence, Khosla, and YC are funding AI-native brokerages. YC's 2026 RFS explicitly calls out "AI-Native Agencies."
"Agencies of the future will look more like software companies, with software margins. And they'll scale far bigger than any agencies that exist in these fragmented markets today."
— YC Request for Startups, Spring 2026
700+ clients in 18 months • Growth startups (GoPuff, Bombas, EightSleep)
5,000+ clients • Middle America SMBs (daycares, dealerships, restaurants)
Hard-to-place E&S risks • Trucking, nightclubs, construction
Pattern
All three prove the same thesis: AI can handle the 95% coordination layer, freeing brokers for the 5% that actually requires judgment.
The industry runs on copy-paste, portal juggling, and endless email chasing.
ACORD forms, questionnaires, documentation
Submit to 5-20 carriers per risk
Follow-up emails, answer questions
Quote comparison, gap analysis
COIs, endorsements, renewals
Same 50 steps, every single time
PDFs, forms, emails — all parseable
1,000+ carrier APIs via IVANS
Email/chat = agent-native
Speed = competitive advantage
We don't sell software. We deliver outcomes: placements, renewals, COIs — done.
Become pseudo-IT for AI
Tools don't deliver outcomes
Generic tools, generic results
You focus on advising clients
Pay per placement, not per seat
ACORD forms, carrier APIs, COIs
One broker: 400 clients. One AI-augmented broker: 4,000 clients.
99% placement rate vs industry ~60%. Quote turnaround: days → hours.
35,000 agencies with <$2M revenue have no AI tools. They can't afford Zywave. We're their answer.
Mid-market agencies ($1-10M revenue)
WithCoverage Playbook
"Thousands of calls, travel across dozens of states. Offer free insurance analysis showing overpayment. In 18 months, 700+ of the fastest growing companies switched to us." — Max Brenner, CEO
The playbook is proven. The market is massive. The incumbents are asleep.
We automate 50% of IT helpdesk tickets in week one. Incumbents are hated. 67% of orgs can't hire enough techs. The market is ready.
"Faster to automate a task forever than to do it manually once."
— Serval's Core Pitch ($127M raised, $1B valuation)
MSPs can't hire enough techs. Their tools are outdated. AI can finally solve L1.
The Insight
Can't hire fast enough → must automate. L1 is now automatable. The gap is who does it for SMBs.
ConnectWise and Kaseya have created massive dissatisfaction. MSPs are actively seeking alternatives.
"ConnectWise is less bad overall"
— r/msp (damning with faint praise)
Structural issues: PE ownership → cost-cutting, poor support, billing games, technical debt. They can't rebuild AI-native. We can.
Global MSP TAM
Tickets automatable (proven by Serval)
Serval valuation (enterprise only)
| Segment | Status |
|---|---|
| Enterprise ($100K+ ACV) | Serval owns this |
| SMB Direct (50-500 employees) | Blue ocean |
| Small MSPs (5-20 techs) | Good entry wedge |
GPT-4+ enables reliable action execution
ConnectWise/Kaseya losing share, MSPs looking
Code-gen enables custom workflows
Serval proved it with enterprise. We do it for the rest of the market.
90%+ automatable
Okta, Google Groups, SCIM
Standard apps, self-service
Day 1 automation
| Us | Them |
|---|---|
| AI-native architecture | Bolt-on AI to legacy |
| Code-based workflows (auditable) | Black-box AI |
| Outcome pricing (per ticket) | Seat-based (pay even if unused) |
| Deploy in days | 6-month implementations |
Distribution: r/msp, IT Nation, MSP peer groups
Start free, pay when we hit 30% automation.
Every SMB outsources IT to MSPs. MSPs are dissatisfied with ConnectWise/Kaseya. 50%+ of tickets are automatable.
Nobody is selling "your IT just runs" directly to SMBs as an outcome.
Serval raised $127M doing this for enterprise. The SMB market is unowned.
"The IT team that scales with you, without headcount."
We recover millions in denied claims. AI agents that fight back against payer denials — so hospitals can focus on patients, not paperwork.
One insurer allegedly denied 300,000 claims in under two months using AI. Providers need their own AI to fight back.
— Healthcare Industry Report, 2024
Hospitals are drowning in denials. Payers deploy AI to reject claims faster. Providers still fight with spreadsheets.
Payers are deploying AI to deny claims faster. Medicare Advantage denials up 4.8% YoY. Providers need AI to fight back — or they lose.
Massive existing spend, ripe for automation
89% saw PA requirements increase in 2024
Massive labor shortage, no backfill coming
Finally capable of clinical document understanding
AI accuracy approaching human coders
Regulatory tailwind forcing digitization
The Insight
Outcome-based pricing is native to healthcare — providers already pay % of collections. We align incentives: we only win when they recover money.
Legacy vendors are slow. AI-native players are enterprise-only. The mid-market is wide open.
| Company | Focus | Strengths | Gap |
|---|---|---|---|
| Anterior ($64M+) | Prior auth (payer-side) | 99.24% accuracy, KLAS validated | Payer-only, not provider-side |
| AKASA | Provider RCM (enterprise) | Cleveland Clinic, Stanford | 12+ month sales cycles, expensive |
| Fathom | Medical coding only | 95.5/100 KLAS score | Narrow focus, no denial mgmt |
| Waystar / R1 | Legacy platforms | Scale, integrations | "AI" is mostly marketing, slow |
| OpenHolly | Denial recovery | Outcome-based, fast deploy | — |
Start with the most measurable outcome: dollars recovered from denials.
15-20% of recovered revenue. We only win when you recover money.
AI agents that read charts, identify denial root causes, generate appeals, and submit — automatically.
AI identifies why denial happened — missing documentation, coding error, or arbitrary payer rule.
Highlights relevant clinical passages from 50K-word records in seconds.
Generates payer-specific appeal letters with guideline citations. Human reviews in 5 minutes.
Orthopedics, cardiology, oncology (high-value procedures)
Low risk entry — AI is checking, not deciding
Then coding QA, prior auth automation
15-20%
of recovered revenue from denials
You only pay when we recover money. Zero risk.
3-5 specialty practices with 500+ denials/month
We'll recover $100K+ in year one — or you pay nothing.
340K accountants left the profession. 75% of CPAs are retiring. The close still takes 6+ days. We're building the autopilot.
"Accounting is structured, high-stakes, and essential to every business on earth. It's also one of the most underbuilt areas in technology."
— Basis founders (valued at $1.15B)
A profession in crisis meets primitive tooling. Something has to give.
50% cite it as key reason close is slow
Only 18% close in 3 days or less
#1 bottleneck — 3-5 systems just to match
"Bench raised ~$160M. Shut down Dec 2024. Human-heavy bookkeeping models can't scale profitably."
— Industry lesson
🔥 "Bench Refugees" = Urgent Demand
Thousands of abandoned customers actively looking for alternatives. Distrust human-heavy models. Ready for AI-first.
Basis raised $100M Series B at $1.15B valuation in Feb 2026. The market is real.
30% of Top 25 US firms already using
20% of Top 150 firms
First AI agent to complete end-to-end 1065 tax return autonomously
200+ customers, Series B
AI-native ERP — not AI bolted onto legacy
"Go live in weeks" vs months
Top 50 accounting firm partners
100+ customers
"Absorbs 47% of month-end close tasks"
The TAM
$50-80B market for accounting automation
Every business needs accounting. It's recurring. It's essential. And it's still done by hand.
AI that runs overnight, completes 90% of month-end tasks, and generates an exception report for morning review.
Auto-matching across all sources (95% accuracy)
90%+ accuracy with LLMs, learns from corrections
Auto-generated with supporting documentation
AI explains anomalies, humans review
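A minimal sketch of the overnight pass: auto-match bank lines to ledger entries on (reference, amount), and queue everything else as exceptions for the morning review. All references and amounts are invented.

```python
# Hypothetical bank feed and ledger for one night's reconciliation run.
bank = [{"ref": "INV-101", "amount": 1200.0},
        {"ref": "INV-102", "amount": 450.0},
        {"ref": "FEE-9",   "amount": 35.0}]
ledger = [{"ref": "INV-101", "amount": 1200.0},
          {"ref": "INV-102", "amount": 455.0}]  # $5 off -> exception

def reconcile(bank: list[dict], ledger: list[dict]):
    """Exact-match on (ref, amount); everything else goes to the
    morning exception report for human review."""
    keyed = {(e["ref"], e["amount"]) for e in ledger}
    matched = [b for b in bank if (b["ref"], b["amount"]) in keyed]
    exceptions = [b for b in bank if (b["ref"], b["amount"]) not in keyed]
    return matched, exceptions

matched, exceptions = reconcile(bank, ledger)
assert [m["ref"] for m in matched] == ["INV-101"]
assert [e["ref"] for e in exceptions] == ["INV-102", "FEE-9"]
```

Humans never see the matched 90%; they only see the exception report, which is the inversion the close-time table below depends on.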
| Metric | Before | After |
|---|---|---|
| Close Time | 6+ days | <3 days |
| Cash Rec Hours | 20-50 hrs | <5 hrs |
| Manual Tasks | 80% | 10% |
| Error Rate | 1-5% | <0.5% |
Everything is auditable. Everything is documented. AI generates, humans verify.
QuickBooks architecture is rigid
Rillet raised $50M+ because "rebuilding from scratch" is the only way
Services business, not software. Internal innovation killed by billable hour model.
Urgent need, distrust of human-heavy, ready for AI-first
$10-100M revenue, 2-5 person finance teams drowning in close
Fast adoption, price sensitive, willing to try
Per-entity: most common for accounting software
Volume tiers: common for AP/AR automation
Outcome-based: emerging (% of cost savings)
$2K – $15K/mo
Based on entity count + transaction volume
Outgrowing QuickBooks, avoiding NetSuite
Drowning in close process
SaaS, e-commerce, hospitality
The talent crisis is permanent. The tooling is primitive. The AI is ready.
We're building the autopilot for accounting.
We sell the adjustment, not the software. 400K workers retiring. $50-80B market. AI-native TPAs are the future.
Services: The New Software. The biggest opportunity isn't selling tools to adjusters — it's replacing what adjusters do.
— Sequoia Capital, March 2026
Structural workforce crisis meets broken incumbent tech.
"Half a billion between software, personnel, and opportunity cost" for Guidewire implementations that still fail.
— Industry Analysis
The Reality
Within 15 years, a large portion of today's adjusters will have retired — and there won't be enough people to replace them.
Smart money is flooding in. Exits are happening. The window is open.
December 2024. AI claims automation is now a proven exit category.
| Company | Model | Funding | Traction |
|---|---|---|---|
| Strala | AI-native TPA | Founders Fund (13x oversubscribed) | 26 US clients, UK expansion |
| Pace | Operations automation | Sequoia $10M Series A | Prudential multi-year deal |
| Elysian | Complex commercial claims | AmFam Ventures $6M | State Farm pitch winner, Lloyd's Lab |
| Tractable | Photo AI (point solution) | $1B+ unicorn | Auto insurers, property |
The Gap
Tractable sells photo AI. Shift sells fraud detection. Nobody sells the full claim outcome — FNOL to settlement, end-to-end.
Start where decisions are fast and budgets already exist.
Budget line exists. Vendor swap, not new category.
Not 12-24 month enterprise sales cycles
Per-claim pricing below legacy TPA rates
Use TPA wins to land large carriers
Start with FNOL/triage → Hybrid deployments → Full TPA as trust builds
"The answer can't always be more people." — Strala
Outcome-based pricing. Clear ROI story.
Strala claims a 1-point loss-ratio improvement. That's the number carriers care about most.
— Industry benchmark
The workforce is retiring. We're what comes next.
75% of CPAs are at or near retirement age. 340,000 accountants have left the profession since 2020. But tax work is 80-90% pure intelligence work, exactly the work AI agents do best. We're building the autopilot for tax preparation.
"Endless hours, stressed teams, client overload, constant risk of missing deadlines." 42% of firms report retention issues from burnout. The people who do stay work 60-80 hour weeks for months.
"Difficulties with state returns came up repeatedly in 'dislike' responses. Multi-state complexity multiplies fast, and manual tracking of different state rules becomes impossible at scale."
Incumbents charge $60K+/year for seat licenses. Blue J achieved 12x revenue growth via CPA.com distribution. But nobody has cracked multi-state complexity—nexus determination, varying apportionment rules, threshold tracking.
The Killer Gap: Research tools sell per-seat. Preparation is still manual. Nobody sells completed returns. The outcome-based pricing model is wide open.
Multi-agent system: reads documents, applies firm's tax strategy, enters data into systems. What takes 4 hours becomes 15 minutes of review. Every citation verifiable. Human signs, AI does the work.
Why Now: GPT-4 enabled Blue J's 12x growth. Filed claims a 30-50% reduction in review cycles. Avalara is building "agentic tax" for transaction compliance. The capability inflection is here.
Firms with 6-50 preparers: fast decisions, acute talent pain, no ability to build in-house. Outcome-based pricing: firms pay for completed returns, not software seats. CPA society partnerships for distribution.
Expansion Path: Mid-market firms → State CPA society endorsements → Enterprise (Top 100) → Big Four white-label. Basis already has 30% of Top 25 with enterprise-first approach.
Blue J sells research tools to accountants. We sell completed tax returns to firms. Outcome-based pricing aligned with Sequoia's "sell the work" thesis. The demographic crisis is now—we're the solution.
One-Liner: "AI agents that prepare tax returns from scratch—firms pay per return, not per seat. We automate the 80% of tax work that's pure intelligence, so the retiring 75% of CPAs don't take the industry with them."
The autonomous legal team for scaling companies — starting with NDAs.
One AI agent that handles contracts from intake to signed — Slack-native, no lawyers needed for routine work.
Own the autonomous legal stack for scaling companies