Dependabot for AI Models
Every week a new model drops. Your team manually benchmarks it. What if that happened automatically — with PRs when something's better?
New models released per month
Time to benchmark each one
Tools that auto-PR improvements
The Problem
- A new model every week from OpenAI, Anthropic, Google, Meta, Mistral, and open-source labs
- Teams manually benchmark against their stack — takes days per model
- By the time you finish testing, three more models have dropped
- No one knows if they're running the best model for their use case
Competitive Landscape
| Company | What They Do | Gap |
|---|---|---|
| Portkey | AI gateway, routing, 1600+ LLMs | No auto-benchmarking against YOUR stack |
| Unify ($8M) | Finds best LLM for the job | Router-first, not benchmark-first |
| Braintrust ($36M, $150M val) | Eval-driven development | Reactive, not proactive |
| Us | Watch → Auto-benchmark → PR when better | — |
How It Works
1. Connect
Connect your AI stack. Define your eval suite (or we help you build one).
2. Watch
We monitor every model release across all providers. Automatically.
3. PR
When something beats your current setup, you get a PR with benchmarks.
The Dependabot Pattern
Watch → Auto-benchmark → PR when better
Nobody does this for AI models. We do.
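The loop above can be sketched in a few lines. This is a minimal illustration, not an implementation: `run_eval_suite` and `open_pull_request` are hypothetical placeholders for the customer's eval harness and a PR-creation hook, and the improvement thresholds are arbitrary assumptions.

```python
"""Sketch of the watch -> auto-benchmark -> PR loop.
run_eval_suite and open_pull_request are hypothetical stand-ins."""

from dataclasses import dataclass

@dataclass
class BenchmarkResult:
    model: str
    score: float        # mean score on the customer's eval suite, 0..1
    cost_per_1k: float  # USD per 1k tokens

def is_improvement(candidate: BenchmarkResult, current: BenchmarkResult,
                   min_gain: float = 0.02) -> bool:
    """PR only when quality improves by a margin and cost doesn't blow up."""
    return (candidate.score >= current.score + min_gain
            and candidate.cost_per_1k <= current.cost_per_1k * 1.1)

def watch_loop(current, new_releases, run_eval_suite, open_pull_request):
    for model in new_releases:                  # 1. Watch: provider release feeds
        candidate = run_eval_suite(model)       # 2. Auto-benchmark: YOUR eval suite
        if is_improvement(candidate, current):  # 3. PR: only when measurably better
            open_pull_request(candidate)
```

The key design choice is that the trigger is a release event, not a human remembering to benchmark.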
ICP & Pricing
🎯 Target Customer
- Any team with AI in production
- 3+ AI features deployed
- Series B+ (or well-funded Series A)
- Engineering-led decision making
💰 Pricing
- Starter $2K/mo — up to 5 endpoints
- Growth $8K/mo — up to 20 endpoints
- Enterprise $20K+/mo — unlimited
Why Now?
- Model release velocity is accelerating — impossible to keep up manually
- LMArena raised $150M at a $1.7B valuation, proving model evaluation is a venture-scale business
- Braintrust proved enterprises pay for evals ($36M Series A)
- Nobody has combined continuous monitoring + proactive optimization
CI/CD for AI
Software engineering solved "did my change break things?" 20 years ago. AI engineering still ships blind.
🔴 AI Today
Push prompt change → Hope it works → Find out in production
🟢 With Us
Push prompt change → Eval runs → PR blocked if quality drops
The Insight
The gap isn't that people don't have evals; Braintrust, Humanloop, and DSPy already provide them.
The Real Gap
Evals aren't integrated as blocking gates in deployment pipelines the way unit tests are.
What We Build
GitHub Action + CI Integration
- Automatically runs your eval suite against every PR that touches AI code
- Prompts, model configs, RAG pipelines — all covered
- If eval score drops → PR is blocked
- If new model improves score → PR is auto-generated
Think: Braintrust's eval engine + Dependabot's automation + GitHub Actions' CI/CD — fused into one opinionated product.
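The blocking gate reduces to one rule: exit nonzero when the eval score drops, and CI fails the PR. A minimal sketch of the gate script a CI step might invoke, assuming your own suite produces per-case scores (the numbers and tolerance here are illustrative):

```python
"""Sketch of the CI eval gate: a nonzero exit code blocks the PR.
The scores would come from running the team's real eval suite."""

def gate(scores: list[float], baseline: float, tolerance: float = 0.01) -> int:
    """Return a process exit code: 0 passes the check, 1 blocks the PR."""
    mean = sum(scores) / len(scores)
    if mean < baseline - tolerance:
        print(f"eval score {mean:.3f} fell below baseline {baseline:.3f}: blocking PR")
        return 1
    print(f"eval score {mean:.3f} meets baseline {baseline:.3f}")
    return 0

# A CI step could run something like: python eval_gate.py || exit 1
```

Exit-code semantics are the whole trick: any CI system already knows how to block a merge on a failing step, so evals inherit the same enforcement machinery as unit tests.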
Aemon vs Us
| Dimension | Aemon | Us |
|---|---|---|
| Purpose | Discover new optimal solutions | Protect existing quality + incrementally improve |
| Posture | Offensive R&D | Defensive Ops |
| Buyer | R&D Lead / ML Researcher | Engineering Manager / Platform Team |
| Integration | Standalone tool | Lives in your CI/CD |
ICP & Pricing
🎯 Target Customer
- 3+ AI features in production
- Series B+ companies
- Engineering-led sale
- Already using GitHub/GitLab CI
💰 Pricing
$2K – $20K/mo
Based on eval runs & endpoints
Private LMArena
LMArena raised $150M at $1.7B valuation on public evals. Enterprises need private evals on their own data.
LMArena valuation (public evals)
Private enterprise eval market
The Problem with Public Benchmarks
- Companies have been caught gaming LMArena scores
- Public benchmarks don't reflect YOUR use cases
- Generic evals ≠ production performance for YOUR data
- Enterprises need proprietary intelligence
What We Build
Enterprise Model Intelligence Platform
- Define eval suites from your production data
- Continuously benchmark every new model release
- Test every prompt variation, RAG config automatically
- Output: Private leaderboard + recommended actions
Hugging Face's YourBench is the open-source precursor, but it's a DIY tool requiring significant ML expertise. We productize it.
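The private-leaderboard output reduces to ranking models by their mean score on the customer's own suite. A toy sketch (the data and scoring are illustrative, not a real benchmark):

```python
"""Sketch of a private leaderboard: rank models by mean score on YOUR
eval suite instead of a public benchmark. Data is illustrative."""

def leaderboard(results: dict[str, list[float]]) -> list[tuple[str, float]]:
    """results maps model name -> per-case scores from the private suite."""
    ranked = [(model, sum(s) / len(s)) for model, s in results.items()]
    return sorted(ranked, key=lambda r: r[1], reverse=True)
```

The point is that the ranking is derived from proprietary cases, so it can't be gamed the way public arena scores can.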
Aemon vs Us
| Aemon | Private LMArena |
|---|---|
| Evolves novel algorithms | Evaluates existing models/configs |
| Research | Intelligence |
ICP & Pricing
🎯 Target Customer
- 10+ AI features in production
- $50K+/mo on AI infrastructure
- VP of Engineering or Head of AI
- Fintech, ad-tech, e-commerce, healthtech
💰 Pricing
$10K – $100K/mo
Enterprise contracts
AI Model FinOps
Companies spend $85K+/mo on AI infrastructure. Nobody knows if they're overpaying for quality they don't need.
Avg monthly AI spend
YoY growth
Visibility into cost-quality tradeoff
The Gap
| Tool | What It Does | Missing |
|---|---|---|
| Portkey | Routing, fallbacks | No cost-quality optimization |
| Unify | Cheapest model that meets threshold | Not continuous, not production data |
| Us | Continuously optimize cost-quality frontier across entire AI stack | — |
What We Build
FinOps + Quality Optimization Layer
An agent that sits on top of your AI gateway:
- Continuously profiles every AI call (model, cost, latency, quality)
- Uses your production data as the eval
- Generates actionable recommendations:
"Your RAG pipeline on endpoint Y is underperforming — here's an optimized config"
ICP & Pricing
🎯 Target Customer
- $20K+/mo on LLM APIs
- CFO / VP Eng sale
- Any industry with AI in production
💰 Pricing
$2K – $15K/mo
Pays for itself from savings
⚡ Easiest ROI story of all these ideas
Eval-as-a-Service
Building good evals is harder than building the AI features themselves. We build the oracle.
The Insight
The Bottleneck Isn't Optimization
Braintrust's thesis: "If your eval is right, every decision becomes simple."
DSPy's framework depends on having good metrics to optimize against.
The bottleneck in the entire AI development loop is knowing what "good" looks like.
What We Build
Eval Generation Agent
- Takes your production AI traces
- Analyzes failure modes
- Interviews domain experts (async, Slack-based)
- Generates calibrated eval suites:
✓ Datasets
✓ Scoring rubrics
✓ Automated judges
Output plugs into Braintrust, DSPy, or your own CI/CD.
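A generated suite could look like the sketch below: cases, a rubric, and an automated judge. The keyword judge is a deliberately naive stand-in for an LLM judge, and all names are illustrative rather than a real Braintrust or DSPy interface.

```python
"""Sketch of a generated eval suite: dataset + rubric + automated judge.
The keyword-matching judge is a naive placeholder for an LLM judge."""

from dataclasses import dataclass, field

@dataclass
class EvalCase:
    input: str
    must_mention: list[str]  # rubric items distilled from expert interviews

@dataclass
class EvalSuite:
    name: str
    cases: list[EvalCase] = field(default_factory=list)

    def judge(self, case: EvalCase, output: str) -> float:
        """Fraction of rubric items the output satisfies."""
        hits = sum(1 for kw in case.must_mention if kw.lower() in output.lower())
        return hits / len(case.must_mention)

    def run(self, model_fn) -> float:
        """Mean score across all cases for a given model function."""
        return sum(self.judge(c, model_fn(c.input)) for c in self.cases) / len(self.cases)
```

Because the suite is just data plus a scoring function, the same artifact can plug into any downstream optimizer or CI gate.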
Aemon vs Us
| Aemon | Eval-as-a-Service |
|---|---|
| Assumes you have a good eval function | Creates the eval function |
| Optimizer | Oracle |
| Depends on eval quality | Is the prerequisite to everything else |
If you own the eval layer, you become the foundation every optimization tool depends on.
ICP & Pricing
🎯 Target Customer
- Same as Braintrust's customers
- AI product teams at Series B+
- Earlier in journey — before they've figured out evals
💰 Pricing
$5K – $30K/mo
Per eval suite built + maintenance
AI-Powered Outcomes.
Not Tools. Not Reports.
We operate fleets of AI agents that deliver results. Customers get outcomes. We get playbooks. Playbooks become platform.
A fat startup ships outcomes, not features. It bundles software, data, and human ops into one integrated product that actually gets the job done.
— Andrew Lee, a16z Speedrun Partner
The Shift
Spinning up AI agents is now trivial. Managing them is the new bottleneck.
What's Easy Now
One-click agent deployment
OpenClaw, dockerized instances, cloud GPUs
Capable models
GPT-5, Claude 4, open-source alternatives
Economics work
$0.01-0.10 per task, not $50/hr
What's Still Hard
People become pseudo-IT
Babysitting agents instead of running business
Debugging eats time
Every hour on agent issues ≠ hour on actual work
No one wants to manage agents
They want outcomes, not infrastructure
The Insight
Founders are too busy to become AI ops engineers. We absorb that complexity so they can focus on their actual business.
How We Got Here
We started in sales. Then customers kept asking for more.
The Variety We've Delivered
SDR for construction companies
Lead gen + qualification
Video generation for ML training
Synthetic data pipelines
Research assets for universities
Literature review + synthesis
BDR for startups
Outbound + meeting booking
The Common Thread
Every customer had the same problem:
"I tried spinning up agents myself. Then I spent all my time debugging them instead of running my business."
— Pattern across customers
They didn't want to manage AI. They wanted outcomes.
The Market Reality
Why Tools Aren't Enough
Companies don't want to become AI operations experts. They want someone to absorb the complexity and just deliver results.
The Model: Managed AI Operations
We operate agent fleets. Customers get outcomes. We encode playbooks.
DIY / SaaS Tools
You manage the agents
Become pseudo-IT for AI
Weeks to figure out
Setup, config, debugging
Hope it works
No guarantee of outcomes
OpenHolly (Us)
We manage the agents
You focus on your business
Results in days
We've done this before (playbooks)
Outcomes guaranteed
Pay for results, not effort
Current Focus: GTM/Sales
Starting with sales because the outcome is measurable: meetings booked.
Why Sales First
Clear success metric
Meetings booked = revenue
Broken market
70% AI SDR churn = customers looking for alternatives
High willingness to pay
$5-10K/month for what works
We have traction
50% of our revenue is SDR/BDR
What We Deliver
Research Agent
Deep prospect intelligence
Outreach Agent
Personalized messaging
Qualification Agent
Score and prioritize leads
Scheduling Agent
Book the meeting
Expansion Path
Sales → Research/Intel → Operations → Content. Each vertical = new playbook, same infrastructure.
The Unlock: Playbooks Compound
Every engagement encodes a playbook. Playbooks make the next engagement faster. This is how we build the moat.
What's In A Playbook
Every engagement becomes encoded knowledge:
Workflow sequences
What steps work for each use case
Prompt templates
Messaging that actually converts
Agent configurations
Which models, tools, and sequences
Failure patterns
What breaks and how to prevent it
The Compounding Effect
Customer 1: 2 weeks
Figure everything out from scratch
Customer 5: 3 days
Apply existing playbook + customize
Customer 10: Hours
Playbook is battle-tested
Eventually: Self-serve
Playbooks become product
The Fat Startup Advantage
We're getting paid to build our moat. Every dollar of revenue = more encoded knowledge. Competitors starting later start from zero.
Technical Insight
We're productizing the research consensus on what actually works.
The Research Convergence
Workflow-First Architecture
Declarative orchestration beats autonomous agents (Microsoft, 2024-25 surveys)
HITL as Training Signal
Human edits train intervention policies (ReHAC, EMNLP 2024)
Playbooks as Optimization Surface
Prompts + tool-use are parameters to optimize (AVATAR, NeurIPS 2024)
Guardrails are Required
Transparency + oversight for multi-agent systems (Nature, 2026)
Our Implementation
Declarative playbooks
Versioned configs, not imperative code
Logged human checkpoints
Every edit = structured training signal
Continuous optimization
Prompts, branching, model routing improve over time
Action-layer guardrails
Can't be prompt-injected, auditable
We log trajectories, human edits, and outcomes, then update prompts, branching logic, and model routing so the same business objective is achieved more reliably over time. The playbook is the learned policy space.
— Our technical thesis
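A minimal sketch of that thesis: a declarative, versioned playbook where every human checkpoint edit is stored as a structured (before, after) pair. Field names and structure are illustrative assumptions, not our actual schema.

```python
"""Sketch of a declarative playbook with logged human checkpoints.
Every edit becomes a structured training signal for later optimization."""

from dataclasses import dataclass, field

@dataclass
class PlaybookStep:
    name: str
    model: str           # routing decision, tunable over time
    prompt_template: str

@dataclass
class Playbook:
    version: int
    steps: list[PlaybookStep]
    edit_log: list[dict] = field(default_factory=list)

    def record_human_edit(self, step_name: str, before: str, after: str):
        """Each checkpoint edit is data for prompt/routing optimization."""
        self.edit_log.append({"step": step_name, "before": before, "after": after})

    def bump(self) -> "Playbook":
        """New version after optimization: config changes, not imperative code."""
        return Playbook(self.version + 1, self.steps, [])
```

Treating the playbook as versioned data is what makes "the playbook is the learned policy space" more than a slogan: the optimizer rewrites configs, and humans only ever edit through logged checkpoints.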
The Compound Library
The internal system that makes agent workflows repeatable and efficient.
Without This System
Reinvent every time
Which tools? Which prompts? Which models?
Slow iteration
Learn the same lessons repeatedly
Linear scaling
More clients = more eng hours
With The Compound Library
Compose from proven
Verified, tested, reusable primitives
Each engagement adds
Learnings feed back into system
Sublinear scaling
More clients = richer library = faster
The Compounding Effect
Workflow #1 takes a week. Workflow #10 takes a day. Workflow #100 takes hours. The library IS the moat.
Why Us
Team
Keith Schacht
Co-Founder + Advisor
$140M exit (Mystery) · a16z funded · Built consumer products used by millions
[CTO]
Co-Founder & CTO
CTO of Because · $3M Seed · Deep agent infrastructure experience
Yasir
Co-Founder
yapthis.com · Agentic architecture · Shipped production agent systems
Unfair Advantages
We're running on OpenClaw
Dog-fooding our own infrastructure daily
We've built observability
ClawView for agent monitoring
We've built guardrails
Agent Seatbelt for safety
Revenue already
$4K MRR, +$2K this week
Traction
What This Proves
Companies will pay for AI-powered outcomes when someone else manages the complexity. The demand is real. The model works.
The Ask
What We Need
$[X] Pre-Seed
Scale agent fleet + engineering team
12-month goal: $1M ARR
Prove the playbooks at scale
Then: Productize
Turn proven playbooks into self-serve templates
Why Now
OpenClaw + GPT-5 + Claude 4
Agents just became capable enough
AI SDR market burned
70% churn = customers looking for what works
First-mover on playbooks
Every month we operate = more encoded knowledge
OpenHolly: AI-Powered Outcomes
Customers get results. We get playbooks. Playbooks become platform.
Your Personal AI OS
An AI that knows your context, anticipates your needs, and takes action on your behalf—not a chatbot you have to prompt.
The Vision
Imagine an AI that actually knows you—your work, your preferences, your patterns. It doesn't wait for commands. It proactively handles tasks, flags important things, and learns from every interaction.
The $56B Opportunity
Personal AI assistants are about to explode.
AI personal agents will arrive soon. What we do now with apps—manually, and in piecemeal fashion—will be done automatically. If a flight is cancelled, an AI agent will rebook the flight, reschedule meetings, and order food.
— Goldman Sachs, "What to Expect from AI in 2026"
Why Current Assistants Fail
Siri, Alexa, and Google Assistant lost the AI race. Here's why.
❌ The Problem
No Persistent Memory
Context resets after 2-3 turns. They forget everything.
Reactive, Not Proactive
Wait for commands. Never anticipate needs.
Siloed Knowledge
Can't connect your email, calendar, work, and life.
Limited Actions
"I can't do that" is their signature phrase.
✓ Personal AI OS
128K+ Token Context
Remembers weeks of interactions. Learns your patterns.
Proactive Intelligence
Anticipates what you need before you ask.
Connected Context
Sees your whole digital life—with your permission.
Real Actions
Browser, shell, files, messages—actual work gets done.
Microsoft's CEO called AI assistants "dumb as a rock." The truth is, they've stagnated while chatbots evolved.
— Industry Analysis, 2023-2024
The Hardware Graveyard
Why dedicated AI devices keep failing—and what we learned.
The Lesson
Hardware failed because it created friction instead of removing it. The winning approach: software that works with your existing devices—phone, laptop, wearables—not another gadget to carry.
Both Rabbit R1 and Humane AI Pin missed a crucial opportunity: integrating with existing user bases. Why create a separate device when you could leverage smartphones and their vast ecosystem?
— Medium Analysis, July 2024
Proactive vs. Reactive
The fundamental shift in how AI should work for you.
Reactive (Siri/ChatGPT)
"Hey Siri, add milk to my shopping list"
"ChatGPT, summarize this document"
You initiate every interaction. You remember to ask.
Proactive (Personal AI OS)
"You're almost out of milk. Added to cart—confirm?"
"Your flight changed. I rebooked + rescheduled 2 meetings."
AI monitors context. Surfaces what matters. Acts with permission.
Gartner predicts 40% of enterprise apps will embed task-specific AI agents by 2026, evolving assistants into proactive workflow partners.
— Forbes, "Agentic AI Takes Over," Dec 2025
Why Now?
Four converging forces make this the moment.
Technology Ready
GPT-5 / Claude 4
Models finally capable of real reasoning
128K+ Context Windows
Memory across weeks of interaction
MCP + Tool Use
Agents can control apps natively
Economics Work
$0.01-0.10 per task, not $50/hr
Market Ready
96% Enterprise Expansion
Plan to increase agentic AI budgets
25% → 50% Adoption
Enterprise GenAI agents 2025 → 2027
Siri Fatigue
95% frustrated with current assistants
Privacy Tailwinds
Apple Intelligence proves local AI demand
What Users Actually Want
From surveys, Reddit, and academic research.
Desires
Memory That Persists
"Remember what I told you last week"
Proactive Help
"Remind me before I forget"
Deep Personalization
"Know my preferences without asking"
Privacy Control
"My data stays mine"
Evidence
93% of respondents predict agentic AI will enable more personalized, proactive, and predictive services.
— Cisco 2025 AI Study
An assistant that knows you. The future of personal assistants is when the helper learns from your data, documents, and writing style.
— AI Industry Forecast 2026
How It Works
Always-on AI that learns, anticipates, and acts.
Current Focus: SDR/BDR
Research Agent
Deep prospect intelligence
Outreach Agent
Personalized messaging
Scheduling Agent
Meeting coordination
Platform Vision
Email Intelligence
Triage, draft, follow-up
Research & Analysis
Deep work on autopilot
Ops & Admin
The tasks you hate, automated
The Unique Wedge
What makes this different from Siri/Alexa/Google Assistant?
Big Tech Assistants
Built for mass market
Generic. Lowest common denominator.
Data goes to them
Your context trains their models.
Walled garden
Only works in their ecosystem.
Stagnant development
Lost the AI race years ago.
Personal AI OS
Built for power users
Deep personalization for serious work.
Your data stays yours
Local-first. You control what's shared.
Cross-platform
Works with your existing tools.
Cutting-edge models
GPT-5, Claude 4, always the best.
The Positioning
We're not competing with Siri for "set a timer." We're building the second brain for knowledge workers—people who will pay for AI that actually makes them more effective.
Traction & Team
Keith Schacht
Co-Founder + Advisor
$140M exit (Mystery) · a16z funded · Built consumer products used by millions
[CTO]
Co-Founder & CTO
CTO of Because · $3M Seed · Deep agent infrastructure
Yasir
Co-Founder
yapthis.com · Agentic architecture · Shipped production agent systems
The Ask
What We Need
$[X] Pre-Seed
Scale agent infrastructure + team
12-month goal: $1M ARR
Prove the Personal AI OS at scale
Then: Consumer launch
Personal AI for everyone
Why This Team
We use it daily
Dogfooding OpenClaw constantly
Built observability
ClawView for agent monitoring
Built safety
Agent Seatbelt for guardrails
Already have revenue
Proving demand before pitching
OpenHolly: Your Personal AI OS
An AI that knows you, anticipates your needs, and takes action—not just another chatbot waiting for prompts.
The Thesis in One Line
The shift from reactive AI to proactive AI is a $56B market. We're building the operating system for it.
Pay Per Meeting,
Not Per Seat
The SaaS pricing model is breaking. AI does the work now—so why pay for human logins? We deliver outcomes and charge when they happen.
AI is driving a shift toward outcome-based pricing. Per-seat is no longer the atomic unit of software. If AI can handle a sizable proportion of customer support, companies will need far fewer human agents, and therefore fewer software seats.
— a16z Enterprise Newsletter, December 2024
The Pricing Revolution
SaaS pricing is undergoing its biggest shift since the cloud. AI is killing the per-seat model.
Seat-based pricing may not fit when AI is doing the work. If an agent replaces a human task, customers will expect to pay based on outcomes, not log-ons.
— Bain Technology Report 2025
Why Seats Are Dying
The logic of per-seat pricing breaks when AI replaces the humans who need seats.
The Broken Math
AI replaces 10 analysts with 1 agent
Per-seat pricing undervalues the automation
$5-10K/month regardless of results
70% churn when outcomes don't follow
Soft ROI = death at renewal
2025 pilots hitting 2026 renewals—"are we really getting value?"
The New Model
Pay for work completed
Not for access to tools
ROI in their sleep
Customers calculate value instantly: $X per meeting = clear math
Aligned incentives
We only win when you win
The Bessemer Thesis
AI-native companies are abandoning seat-based SaaS pricing in favor of usage-, output-, and outcome-based models that directly align revenue with measurable results.
— Bessemer Venture Partners, "The AI Pricing and Monetization Playbook" (Feb 2026)
Who's Already Winning
The market leaders are proving outcome-based AI pricing works at scale.
Intercom Fin
Customer Support AI
$0.99 per resolution
65% resolution rate. Aligns every team around one outcome: resolved tickets. Now deployed across 99% of conversations.
Zendesk AI Agents
Customer Support AI
Outcome-based pricing
"First in CX industry to offer outcome-based pricing for AI agents" — August 2024 announcement.
EvenUp
Legal AI
Per demand package
AI + legal experts generate personal injury demand letters. Per output pricing, not hourly.
Decagon
Enterprise AI Support
Per-conversation + per-resolution
Hybrid model. Usage (conversations) + outcome (resolutions). Featured in a16z podcast.
Leena AI
Employee Support AI
ROI-based (tickets closed)
Shifted from consumption → outcomes. Customers gained clearer ROI, business accelerated.
Scale AI
Data Labeling → Platform
$13.8B valuation
Started as labeling services. Became infrastructure. Services → outcomes → platform.
The Pattern
Every major AI-native company is moving toward outcome-based pricing. This isn't experimentation—it's convergence.
Why Enterprises Love It
43% of enterprise buyers consider outcome-based pricing a significant factor in purchase decisions.
Buyer Psychology
Instant ROI Calculation
"$X per meeting booked" = CFO-ready math. No spreadsheet gymnastics.
Zero Implementation Risk
If it doesn't work, you don't pay. Risk transferred to vendor.
Scales With Value
More meetings = more spend = more value captured. Natural expansion.
No Renewal Anxiety
You're paying for results. Why churn from something that works?
What Buyers Say
"Why should we pay $X per user if we could pay $Y per outcome? Aligning price with realized value improves the ROI calculus."
— Enterprise buyer sentiment (Industry research)
"The fundamental shift is to stop charging for access and start charging for work done."
— Bain Technology Report 2025
Deloitte 2026 Prediction
"Outcome- or value-based pricing is based on the real business results that SaaS applications with AI agents produce. There will be a gradual move toward a future powered by integrated, autonomous multi-agent systems."
Our Model: Pay Per Meeting
We operate AI agent fleets that book qualified sales meetings. You pay only when meetings happen.
❌ Traditional AI SDR
✓ OpenHolly Outcome Model
Unit Economics That Work
Outcome-based pricing isn't charity—it's better economics for everyone.
Our Economics
$250-500 per meeting
Customer pays on outcome
$30-80 cost to deliver
AI compute + tooling + human oversight
3-7x margin
Healthy unit economics, scales with volume
Playbooks compound
Each meeting → better templates → lower cost
Customer Economics
Meeting = $5K-50K deal potential
$250-500 per meeting is a no-brainer
Zero upfront commitment
Start small, scale with proof
Budget predictability
Cost tracks linearly with value
Easy internal approval
CFO loves outcome-based spend
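The margin math above checks out as a simple ratio. At the conservative $80 delivery cost, a $250-500 price per meeting gives roughly a 3.1x-6.3x multiple, which is where the quoted "3-7x" appears to come from; at a $30 cost the upper end is far higher.

```python
"""Quick arithmetic check of the stated per-meeting unit economics."""

def margin_multiple(price: float, cost: float) -> float:
    """Gross multiple: customer price divided by delivery cost."""
    return price / cost

conservative = margin_multiple(250, 80)  # 3.125x at low price, high cost
upper = margin_multiple(500, 80)         # 6.25x at high price, high cost
```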
The Intercom Lesson
"Intercom's $0.99 per resolution aligns every team around one outcome: resolved tickets. If Fin resolves a ticket in three messages or thirty, the customer pays the same. The risk is real—but the reward is equally real: customers know exactly what they're getting, and they can calculate ROI in their sleep."
— Bessemer, Feb 2026
Managing the Risks
Outcome-based pricing has real risks. Here's how we mitigate them.
The Risks
Cost variability
Some meetings cost more than others
Revenue unpredictability
Customer usage varies month to month
Attribution disputes
"Did your AI really book this?"
Abuse potential
Customers gaming the system
Our Mitigations
Minimum commitments
Base retainer + outcome fees = floor
Playbook compounding
Cost per outcome drops with scale
Clear outcome definitions
Contractually defined: what counts
Full audit trail
Every action logged, no disputes
Industry Standard Emerging
"Agreements around basic definitions for things like 'an agent,' 'a task,' 'a process,' 'an interaction,' and 'an outcome' should be clearly defined, communicated, and agreed upon contractually." — Deloitte TMT Predictions 2026
Traction
Why Zero Churn
Aligned incentives
They pay for results → they get results → no reason to leave
Clear value
Every invoice shows exactly what they got
Natural expansion
"It's working—give me more"
Customer Mix
50% SDR/BDR
Our wedge: sales meetings
30% Video/ML
Synthetic data pipelines
20% Research
University lab assets
When you only pay for results, there's no reason to churn. Aligned incentives = sticky customers. This is why Intercom's outcome-based Fin has 99% deployment.
Team
Keith Schacht
Co-Founder + Advisor
$140M exit (Mystery) · a16z funded · Built consumer products used by millions
[CTO]
Co-Founder & CTO
CTO of Because · $3M Seed · Deep agent infrastructure experience
Yasir
Co-Founder
yapthis.com · Agentic architecture · Shipped production agent systems
Unfair Advantages
Dog-fooding daily
Running on OpenClaw infrastructure
Playbooks compounding
Every engagement → better templates
Why Outcome-Based Wins
We absorb the risk
Customers love it → lower CAC, zero churn
We're incentivized to deliver
Better AI = more margin for us
The Thesis
You post a bounty: "$500 per meeting booked." AI agents compete. Whoever performs best gets paid. We already do this with bug bounties, Kaggle, hackathons. Why not for AI agents?
— Macy Mills, a16z Speedrun Partner
Why Now
Market timing
61% → 30%+ outcome-based adoption wave
AI SDR burnout
70% churn = customers looking for what works
Enterprise demand
43% prefer outcome-based pricing
Comparable Outcomes
Scale AI: $13.8B
Services → outcomes → platform
Pilot: $1.2B
Bookkeeping outcomes, not seats
Intercom Fin
$0.99/resolution, 99% deployment
OpenHolly: Pay Per Outcome
AI agents that deliver results. You only pay when they do. The future of how work gets priced.
📚 Sources
a16z Enterprise Newsletter (Dec 2024) • Bessemer "AI Pricing Playbook" (Feb 2026) • Bain Technology Report 2025 • Deloitte TMT Predictions 2026 • OpenView SaaS Benchmarks • Gartner • EY "SaaS Transformation with GenAI" (Nov 2025) • BetterCloud "AI and SaaS Industry 2026" • Intercom Fin pricing page • Zendesk AI Agents announcement (Aug 2024)
The $500M AI SDR Market
Is Imploding. We're the Fix.
50-70% churn rates. LinkedIn bans. Domain blacklists. The "autonomous AI SDR" thesis failed. Human-in-the-loop is winning.
The AI SDR Disaster: Real Data
"AI SDRs don't work—biggest bubble in tech." — LinkedIn comment with 400+ likes
💀 What's Actually Happening
"Their AI continuously hallucinated, getting things wrong about what my company does, the industry we are in, what products we sell. 1 positive reply, 1 demo, thousands of prospects touched, $7.5K down the drain."
— r/SaaS, Dec 2025
"A CRO from a publicly traded company disclosed that while an AI SDR helped generate a substantial volume of leads over a nine-month period, it did not lead to actual sales."
— Tomasz Tunguz, Theory Ventures
"Reports emerged of Artisan accounts, including those of team members and founders, facing restrictions or bans for suspected spam and automation violations."
— Quasa.io, Jan 2026
📊 The Numbers Don't Lie
50-70% Annual Churn
2x the churn of human SDRs (a role notorious for turnover) — Common Room
LinkedIn Bans Spreading
Platform ramped up AI detection, restricting automation-heavy accounts
Domain Blacklisting
Gmail filtering has tightened. Sender reputations destroyed in weeks.
Legal Exposure
GDPR fines up to 4% revenue. TCPA: $500-1,500 per message.
Brand Damage
"Permanent brand damage from being publicly associated with spam" — NUACOM
Even VCs Are Calling It
TechCrunch: "AI sales rep startups are booming. So why are VCs wary?"
"When one studies any of these startups individually, it's like 'wow, that's stunning product market fit.' When all 10 of them have stunning product market fit, it's hard to answer 'How is that going to play out?'"
— Shardul Shah, Partner, Index Ventures (hasn't invested)
"Without access to differentiated data, AI SDR startups risk being overtaken by incumbents like Salesforce, HubSpot, and ZoomInfo."
— Chris Farmer, CEO, SignalFire
"Investors are not surprised by the rapid adoption of AI SDRs; they are just doubting that adoption is sticky."
— TechCrunch, Dec 2024
The Jasper Cautionary Tale
$1.5B → 30% Layoffs
Jasper, the AI copywriting unicorn, ran into speed bumps and had to lay off 30% of staff after ChatGPT launched. AI SDRs face the same commoditization risk.
Why Adoption Isn't Sticky
Garbage In, Garbage Out
Built on commoditized LinkedIn data = undifferentiated output
Ops is Afterthought
Black boxes that create more work, not less
Feature, Not Product
Incumbents (Salesforce, HubSpot) can bundle this for free
The Fundamental Flaw: Autonomous ≠ Better
"The AI SDR is dead, long live the AI SDR: How the future is Human-in-the-Loop"
❌ Why Autonomous Fails
No Emotional Intelligence
Can't read tone, context, or cultural nuance essential in enterprise sales
No Real Consent
Scraped data without consent → GDPR/CCPA violations
No Accountability
When AI misleads, your company bears the liability
Volume Over Value
"More volume on a bad message is not a strategy. It is self-sabotage."
Fake Personalization
"Commenting on someone's hoodie feels forced because it's a hollow observation"
✓ What Actually Works
"Teams that use AI to support human insight consistently outperform teams trying to replace humans entirely. It's not even close."
— Matthew Metros, The AI SDR is Dead
AI Does Research (90%)
Data mining, signal detection, prospect prioritization
Humans Do Relationships (10%)
Judgment, trust, closing
Human-in-Loop = Higher Ratings
MarketBetter (human oversight): 4.97/5 G2 rating
Better Outcomes
"Human-in-the-loop platforms consistently outperform fully autonomous ones"
OpenHolly: The Anti-AI-SDR
We're not building another AI SDR. We're building what should have been built from the start.
❌ 11x / Artisan / AiSDR
Replace human judgment
"Autonomous AI employee"
Optimize for volume
"6,000 contacts/month"
Per-seat pricing
$5-10K/mo regardless of results
You manage the tool
Become pseudo-IT for AI
Hope it works
No outcome guarantees
✓ OpenHolly
Augment human judgment
AI research + human checkpoints
Optimize for quality
Right message, right person, right time
Outcome-aligned pricing
Pay for meetings, not seats
We manage the agents
You focus on your business
Results guaranteed
Outcomes or you don't pay
How OpenHolly Works
AI handles the research. Humans make the decisions. You get meetings.
What AI Handles (90%)
Deep Prospect Research
Intent signals, company news, technographics, pain points
Lead Scoring & Prioritization
Who to contact and why, right now
Draft Generation
Personalized outreach based on real signals
Multi-channel Execution
Email, LinkedIn (safely), follow-ups
What Humans Handle (10%)
Approval Gates
Review before sending to high-value prospects
Live Conversations
When a prospect engages, humans take over
Strategy & ICP
Define who you want to reach and why
Judgment Calls
Edge cases, sensitive prospects, brand protection
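The 90/10 split above hinges on the approval gate: AI-drafted outreach to high-value prospects waits for a human decision before anything is sent. A minimal sketch, where `send_fn`, `request_review_fn`, and the dollar threshold are illustrative assumptions:

```python
"""Sketch of the approval gate: high-value prospects get a human
checkpoint before send. Names and threshold are illustrative."""

from dataclasses import dataclass

@dataclass
class Draft:
    prospect: str
    deal_value: float
    body: str

def dispatch(draft: Draft, send_fn, request_review_fn,
             review_threshold: float = 10_000):
    """Route high-value drafts to a human; auto-send the rest."""
    if draft.deal_value >= review_threshold:
        request_review_fn(draft)  # human approves, edits, or rejects
    else:
        send_fn(draft)
```

The design choice is that the human is in the loop exactly where judgment matters (expensive, sensitive prospects) and out of the loop everywhere else, which is what keeps the 90/10 economics intact.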
The Market Opportunity: Fix AI SDR
Their 50-70% churn is our customer acquisition channel.
The Churned Customer Profile
Burned by AI SDR tools
Spent $5-10K/mo, got spam complaints
Domain reputation damaged
Need to rebuild sender trust
Still need meetings
The problem didn't go away
Now understand quality > volume
Educated by failure
Why They'll Choose Us
Outcome-based pricing
Only pay for meetings that happen
Brand protection
Human oversight prevents embarrassments
Proven playbooks
We've learned what works across verticals
We absorb the complexity
They don't manage agents, they get results
Traction: The Thesis Is Working
Why Zero Churn
Aligned Incentives
When customers only pay for results, there's no reason to churn. If we don't deliver meetings, they don't pay. Simple.
vs. AI SDR Churn
AI SDRs charge $5-10K/mo whether or not they work. When they don't deliver, customers leave. Misaligned incentives = 50-70% churn.
Team
Keith Schacht
Co-Founder + Advisor
$140M exit (Mystery) · a16z funded · Built consumer products used by millions
[CTO]
Co-Founder & CTO
CTO of Because · $3M Seed · Deep agent infrastructure
Yasir
Co-Founder
yapthis.com · Agentic architecture · Production agent systems
The Ask
What We Need
$[X] Pre-Seed
Scale human oversight operations + agent infrastructure
12-month goal: $1M ARR
Prove the anti-AI-SDR thesis at scale
Then: Productize
Turn proven playbooks into self-serve platform
Why Now
AI SDR market imploding
50-70% churn = massive displaced customer base
Human-in-loop proven
Highest G2 ratings go to human-oversight tools
First-mover on "fix"
Position as the safe alternative before market consolidates
OpenHolly: The Anti-AI-SDR
AI SDRs promised automation. They delivered spam, bans, and brand damage. We deliver meetings — with human judgment where it matters. Their 50-70% churn is our customer acquisition channel.
📚 Sources
Common Room "The AI SDR is dead" (Feb 2025) · TechCrunch "AI sales rep startups are booming. So why are VCs wary?" (Dec 2024) · Reddit r/SaaS AI SDR complaints · Quasa.io Artisan LinkedIn bans (Jan 2026) · Pipeline Group "Hidden Dangers of AI SDRs" · Theory Ventures SaaStr Talk · MarketBetter G2 Reviews
The Safety Layer
Before AI Gets the Keys
Browser-layer guardrails that block irreversible AI actions before they happen.
The "$39K Gone in a Blink" Problem
AI agents fail not from bad models, but from bad guardrails. 84% of companies deploying agents have zero safety boundaries defined.
— GenDigital Agent Trust Hub Research, 2026
What Goes Wrong
Runaway API costs
$47K overnight cloud bills
Wrong recipients
AI SDR emails competitors
Irreversible actions
Deleted production data
Credential leaks
Pricing sent to wrong channel
What We Block
Site-specific rules
Block LinkedIn "Follow" for AI SDRs
Action classification
Read vs. Write vs. Irreversible
Human approval gates
Require confirmation for risky ops
Rate limiting
Prevent runaway loops
How It Works
Chrome extension that intercepts agent browser actions
Why Browser Layer
Framework-agnostic. Works with any AI agent (OpenClaw, LangChain, AutoGen, custom). Install once, protect everything.
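A minimal sketch of the guard logic described above: classify an intercepted action as read/write/irreversible, apply site-specific rules, and gate risky operations behind human approval. The type names, `SITE_RULES` shape, and classification heuristic are all illustrative assumptions, not the extension's actual API.

```typescript
// Hypothetical sketch: classify an intercepted browser action and decide
// whether to allow it, require human approval, or block it outright.
// All names here are illustrative, not a real Agent Seatbelt API.

type ActionClass = "read" | "write" | "irreversible";
type GuardDecision = "allow" | "require_approval" | "block";

interface AgentAction {
  url: string;        // page the agent is acting on
  method: string;     // e.g. "GET", "POST", "DELETE"
  selector?: string;  // DOM element the agent is about to click
}

// Site-specific rules, e.g. block LinkedIn "Follow" clicks for AI SDRs.
const SITE_RULES: { host: string; selectorPattern: RegExp; decision: GuardDecision }[] = [
  { host: "linkedin.com", selectorPattern: /follow/i, decision: "block" },
];

// A deliberately crude heuristic: HTTP method as a proxy for reversibility.
function classify(action: AgentAction): ActionClass {
  if (action.method === "GET") return "read";
  if (action.method === "DELETE") return "irreversible";
  return "write";
}

function decide(action: AgentAction): GuardDecision {
  const host = new URL(action.url).hostname.replace(/^www\./, "");
  for (const rule of SITE_RULES) {
    if (host.endsWith(rule.host) && action.selector && rule.selectorPattern.test(action.selector)) {
      return rule.decision;
    }
  }
  // Human approval gate for anything irreversible.
  return classify(action) === "irreversible" ? "require_approval" : "allow";
}
```

Because the decision function is pure, the same rules can be evaluated in a Chrome extension's request interceptor or unit-tested offline.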
Market & Competitive Position
Why Now
OpenClaw: 9K → 60K stars
Autonomous agents exploding
CyberArk security concerns
Enterprise worried about agent security
EU AI Act
Regulatory tailwinds for safety
Competition
GenDigital Agent Trust Hub
Just launched - validates market
Our Angle
Browser-layer = framework-agnostic
MVP Achievable
Chrome extension ships fast
Agent Seatbelt
The seatbelt you install before giving AI the keys.
🔗 Supports These Pitches
Fat Startup • AWS of AI Work • Control Plane
Part of the human oversight layer that makes agent work reliable.
Datadog for
Autonomous Agents
When your AI employee sends the wrong email at 3am, you'll know exactly why.
The Problem
Companies are deploying autonomous AI agents that run 24/7. When something goes wrong—and it will—they have no idea why. Current tools are built for request-response, not proactive agents.
Current Tools Miss Autonomous Agents
LangSmith / Langfuse / Arize
Request-response patterns
User sends message, LLM responds
Chain tracing
LangChain-specific, not agent-native
No proactive agent support
Built for chatbots, not employees
ClawView
Autonomous operation
24/7 agents taking proactive actions
Decision tracing
Why did it make that choice?
Multi-channel + tools
Shell, browser, files, messages
The "Oh Shit" Demo
Without ClawView
"The agent sent the wrong email. Logs show it ran. No idea why."
With ClawView
"Step 3: Agent assumed X because of context Y. Here's how to prevent this class of error."
ClawView: See What Your Agents Actually Do
Every decision. Every action. Every assumption. Full causal tracing.
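What "full causal tracing" could look like in practice: each agent step records what it observed, what it assumed, and which prior step caused it, so the "wrong email at 3am" question becomes a chain walk. The record shape and `explain` helper are hypothetical, not ClawView's actual schema.

```typescript
// Illustrative sketch of a causal decision trace: every agent step records
// its context, its assumption, and a link to the step that led here.
// Field names are hypothetical, not ClawView's real data model.

interface TraceStep {
  id: number;
  parentId: number | null;   // causal link to the preceding step
  action: string;            // what the agent did
  context: string;           // what it observed at the time
  assumption?: string;       // the inference it made
}

// Walk the causal chain backwards from a failing step to its root cause.
function explain(trace: TraceStep[], failedId: number): string[] {
  const byId = new Map(trace.map((s) => [s.id, s]));
  const chain: string[] = [];
  let cur = byId.get(failedId);
  while (cur) {
    chain.unshift(
      `Step ${cur.id}: ${cur.action}` + (cur.assumption ? ` (assumed: ${cur.assumption})` : "")
    );
    cur = cur.parentId !== null ? byId.get(cur.parentId) : undefined;
  }
  return chain;
}
```

Run against the "wrong email" example, `explain` would surface "Step 3: Agent assumed X because of context Y" instead of a bare "it ran" log line.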
🔗 Supports These Pitches
Fat Startup • AWS of AI Work • Control Plane
Observability layer — see what agents are doing before they go wrong.
⚠️ Why This is a Feature, Not a Company
Langfuse, LangSmith, Arize are well-funded. But none are built for autonomous agents. ClawView is our internal observability layer, not a separate product pitch.
Governance for
AI Employees
Audit trails. Approval workflows. Compliance automation. The control layer enterprises need.
The Governance Gap
AI agents fail not from bad models, but from bad guardrails. The unlock isn't better agents—it's better safety rails.
— Industry consensus, 2026
What's Missing
No audit trails
What did the agent do at 3am?
No approval workflows
High-stakes actions go unsupervised
No compliance framework
EU AI Act enforcement coming
No agent-on-agent supervision
Humans can't supervise at machine speed
AgentGov Provides
Immutable audit trails
Every action, every decision, timestamped
Approval workflows
Human gates for high-stakes actions
Compliance automation
EU AI Act ready, audit reports generated
AI supervision layer
Validator agents checking worker agents
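"Immutable audit trails" can be made tamper-evident with a hash chain: each entry's hash covers the previous entry's hash, so altering or deleting any record breaks verification. A minimal sketch, assuming an illustrative entry shape (not AgentGov's actual format) and Node's built-in crypto:

```typescript
// Sketch of an append-only, tamper-evident audit trail. Each entry's hash
// covers the previous entry's hash, so edits anywhere break the chain.
import { createHash } from "node:crypto";

interface AuditEntry {
  timestamp: string;
  agent: string;
  action: string;
  prevHash: string;
  hash: string;
}

function sha256(s: string): string {
  return createHash("sha256").update(s).digest("hex");
}

function appendEntry(log: AuditEntry[], agent: string, action: string, timestamp: string): AuditEntry {
  const prevHash = log.length ? log[log.length - 1].hash : "genesis";
  const entry = { timestamp, agent, action, prevHash, hash: sha256(prevHash + timestamp + agent + action) };
  log.push(entry);
  return entry;
}

// Verify no entry was altered or removed after the fact.
function verifyChain(log: AuditEntry[]): boolean {
  return log.every((e, i) => {
    const prevHash = i === 0 ? "genesis" : log[i - 1].hash;
    return e.prevHash === prevHash && e.hash === sha256(prevHash + e.timestamp + e.agent + e.action);
  });
}
```

This is the property audit reports need: the chain answers "what did the agent do at 3am" and proves nobody rewrote the answer afterward.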
From "Human in Loop" to "Human on Loop"
McKinsey Insight
"Organizations are moving from human in the loop to human on the loop—above the loop for strategic oversight." AgentGov enables this transition safely.
AgentGov: Govern AI at Scale
Audit trails. Approval workflows. Compliance automation. Trust at machine speed.
🔗 Supports These Pitches
Fat Startup • AWS of AI Work • Control Plane
Governance + compliance layer — enables enterprise trust.
🔬 Key Research
Gravitee 2026: Only 14.4% have full security approval for agents. 88% reported incidents.
EU AI Act: Enforcement begins 2026, mandates audit trails.
Zenity: $38M Series B validates market (but they're low-code focused, not agent-native).
The Full Stack for
AI Employees
10 layers an AI employee needs to fulfill an entire job description. We're building the unified platform.
The Thesis
An AI employee's value lies in performing EVERYTHING in a job description—not just one workflow. This requires a complete infrastructure stack.
The 10-Layer Stack
What's Missing (⭐)
Layers 8-10 are the critical gaps. Everyone's building capabilities. Nobody's building supervision, agent-to-agent communication, and compliance.
The Integration Problem
Current landscape is fragmented
Today: Point Solutions
Memory: Mem0, Zep, LangMem
Tools: MCP servers
Identity: Okta, 1Password
Tasks: LangGraph, CrewAI
Compliance: Guardrails AI, Trail
Tomorrow: AI Employee OS
A unified platform that manages the full AI employee lifecycle.
Integrated stack
All 10 layers, one platform
Turnkey deployment
Job description → Working AI employee
Enterprise governance
Built-in compliance, audit, oversight
AI Employee OS
The unified platform for deploying, managing, and governing AI employees.
🔗 Framework For These Pitches
Fat Startup • AWS of AI Work • Control Plane
The 10-layer framework is how we think about what AI employees need.
⚠️ Why This is a Framework, Not a Pitch
Building all 10 layers is massive. We focus on Layers 8-10 (supervision, communication, compliance) because that's the critical gap. The framework informs strategy, not the pitch itself.
Stack Overflow
for AI Agents
Verified working code. Real benchmarks. Pay-per-snippet micropayments. Documentation that actually works.
The Hallucination Tax
Despite our best efforts, they will always hallucinate. That will never go away.
— Amr Awadallah, Vectara CEO, 2026
❌ The Problem
Best-documented ≠ Best solution
Agents pick whatever has most examples
Documentation gets stale
APIs change, snippets break
No verification
Agent can't know if code actually runs
No benchmarks
No cost/perf data to guide decisions
✓ AgentDocs
Agent-swarm verified
Code tested continuously, timestamped
Use-case organized
"Transcribe video" → 10 services compared
Real benchmarks
Cost, latency, quality scores
x402 micropayments
$0.05 per verified snippet
How It Works
Kill the API Key
No signup. No rate limits. No accounts. Agent pays per-request, gets verified code. Native to how agents want to consume services.
Initial Use Cases
Transcription
Deepgram, Whisper, AssemblyAI
Email
Resend, SendGrid, Mailgun, CF
Image Generation
DALL-E, Midjourney API, Flux
Payments
Stripe, LemonSqueezy, Paddle
What Agents Get
Working code snippet
Last verified timestamp
Cost per API call
Latency benchmarks
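The "no signup, pay per request" flow above follows the HTTP 402 pattern: the first request comes back 402 with a price, the agent attaches payment, and the retry returns the verified snippet. This sketch models that client-side logic only; the endpoint path, response fields, and payment proof are illustrative assumptions — the real wire format is defined by the x402 spec.

```typescript
// Hedged sketch of the agent-side pay-per-snippet flow. The fetcher is
// injected so the logic is testable offline; field names are illustrative.

interface SnippetResponse {
  status: number;
  priceUsd?: number;   // quoted on the 402 response
  body?: { code: string; lastVerified: string; costPerCall: number; latencyMs: number };
}

type Fetcher = (path: string, paymentProof?: string) => SnippetResponse;

function fetchVerifiedSnippet(
  fetcher: Fetcher,
  useCase: string,
  pay: (usd: number) => string // settles payment, returns proof
) {
  // First request: no payment attached. Expect 402 Payment Required.
  const quote = fetcher(`/snippets/${useCase}`);
  if (quote.status === 200) return quote.body; // free tier or already paid
  if (quote.status !== 402 || quote.priceUsd === undefined) {
    throw new Error(`unexpected status ${quote.status}`);
  }
  // Pay the quoted price (e.g. $0.05) and retry with proof of payment.
  const proof = pay(quote.priceUsd);
  const paid = fetcher(`/snippets/${useCase}`, proof);
  if (paid.status !== 200 || !paid.body) throw new Error("payment not accepted");
  return paid.body;
}
```

No API key, no account, no rate-limit negotiation: the 402 round-trip is the whole onboarding.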
Market & Competition
Closest Competitor: Context7
| Capability | Context7 |
|---|---|
| Up-to-date docs | ✓ They have this |
| Version-specific | ✓ They have this |
| Verified working | ✗ No continuous testing |
| Benchmarks | ✗ No cost/perf data |
| Micropayments | ✗ Free only, no agent-native billing |
Why Now
x402 is production-ready
$43M+ processed, 35M+ txns
Agent adoption exploding
OpenClaw: 9K→60K stars
$50B market by 2030
AI agent infrastructure
Clear wedge
Verification is table stakes soon
The x402 Thesis
25,000+ developers building on x402. Google, Cloudflare, Stripe adopting. Machine-to-machine payments are the rails for agent economy.
x402 Market Opportunity
Real-time data from x402scan.com shows a booming agent economy — with a clear gap for developer tooling.
All 14 Facilitators
| Facilitator | 30d Txns | 30d Vol | What They Do |
|---|---|---|---|
| Dexter | 1.65M | $79.5K | Agent economy platform |
| Coinbase | 722K | $288.5K | Official CDP facilitator |
| Virtuals Protocol | 412K | $1.34M | AI agent tokenization |
| PayAI | 1.31M | $43.3K | Micropayments |
| RelAI | 66K | $84K | Agent payments (Solana) |
| Meridian | 19K | $315K | High-value transactions |
| Thirdweb | ~10K | ~$2K | Web3 dev platform |
| OpenX402 | 6.6K | $38.6K | Open-source facilitator |
| Polymer | 6.4K | $770 | Proof generation |
| AnySpend | ~3K | ~$5K | Multi-asset spending |
+ Corbits, OpenFacilitator, CustomPay, AgentPay (emerging)
Source: x402scan.com, Feb 27 2026
Market Gap Analysis
What Exists
Data APIs, AI services, crypto tools, social data
What's Missing
Verified code snippets, curated docs, developer knowledge
AgentDocs Opportunity
Be the Stack Overflow layer on x402 rails
Why We Can Win
Top services (StableEnrich, LowPaymentFee) aggregate APIs — they don't verify code quality.
AgentDocs: Premium pricing ($0.05-0.10) justified by verification + benchmarks.
Target: 1,000+ requests/day = $2,100+/month revenue from agent micropayments alone.
Revenue Model
AgentDocs: Documentation That Works
Verified snippets. Real benchmarks. Agent-native payments. Stack Overflow, but for machines.
🔗 Supports These Pitches
Better documentation → better agent outputs → more reliable outcomes.
📍 Current Progress
Live: agentdocs-api.holly-3f6.workers.dev
Snippets: 15 use cases, 21 verified snippets
Status: Dogfooding internally, expanding library
The Infrastructure Layer
for AI Agent Work
$30-40B poured into AI agents. 95% fail to deliver. We're building the missing infrastructure that makes them actually work.
The $30B Problem
Companies are pouring billions into AI agents. Almost none deliver measurable returns.
Companies are pouring $30–40 billion into generative AI, yet an MIT study finds that 95% of enterprise pilots deliver zero measurable return.
— MIT NANDA: The GenAI Divide, 2025
Why AI Agents Fail
The pattern is consistent. It's not the models—it's the infrastructure.
❌ What Breaks
No workflow templates
Teams reinvent every agent from scratch. Same failures, different companies.
No human oversight
Agents run unsupervised. High-stakes errors go uncaught. Trust collapses.
No failure patterns
Each company learns the same lessons. No accumulated knowledge.
No orchestration
Multi-agent systems collapse. Stanford CooperBench: 25% success rate.
✓ What's Missing: Infrastructure
Battle-tested workflow templates
Proven prompts, integrations, and sequences. Encoded from real deployments.
Human-in-the-loop routing
Smart escalation. Approval queues. Humans handle edge cases.
Failure pattern library
What breaks and how to prevent it. Compound learning across clients.
Agent orchestration layer
Coordinate multi-agent work. Handle failures gracefully.
The Unlock
The 5% that succeed have infrastructure. Templates. Oversight. Failure patterns. We're building that infrastructure as a service.
The Playbook: Services → Platform
The most valuable infrastructure companies started by doing the work themselves.
Scale AI
Data Labeling → AI Infrastructure
Started labeling images for self-driving cars (2016). Now the "Data Foundry" powering OpenAI, Meta, Google. 50% gross margins from tech-enabled services.
Pilot
Bookkeeping Services → Financial Infra
"AWS for SMB accounting." Started doing bookkeeping. Now processes $3B+ in transactions. Jeff Bezos led funding.
Stripe
Payments API → Financial Infrastructure
Started with simple payment processing (2010). Expanded to Connect, Radar, Atlas. Infrastructure that grows as customers grow.
The Pattern
Do the work → Encode the patterns → Become the platform. Services fund the R&D. Each engagement builds the moat. Competitors starting later start from zero.
Scale AI: The Detailed Parallel
Their journey is our playbook. Same model, different layer.
Scale AI's Model
Services Entry
Started labeling images for AV companies. Revenue from day one.
Tech Layer
Built pre-labeling ML that made each human 10x more efficient.
Data Flywheel
Each correction improved their models. More data = better automation.
Platform Expansion
Nucleus, Validate, Launch—from labeling to full ML lifecycle.
Our Model
Services Entry
Operating AI agent workflows for clients. Revenue from day one.
Tech Layer
Workflow templates + orchestration that make agents reliable.
Playbook Flywheel
Each engagement encodes learnings. More workflows = better templates.
Platform Expansion
Guardrails, Observability, Governance—full agent lifecycle.
Scale AI is not a traditional BPO company. It is a Data Foundry. Their technology layer is their moat—human workforce augmented by proprietary software that compounds in value.
— Takafumi Endo, "Scale AI: Deconstructing the Foundry"
The Workflow Template Moat
Each engagement encodes a playbook. Playbooks become the platform.
Compounding Effect
Customer 1: 2 weeks
Figure everything out from scratch
Customer 5: 3 days
Apply existing playbook + customize
Customer 10: Hours
Playbook is battle-tested
Customer 50+: Self-serve
Playbooks become product
What's In A Template
Prompt sequences
What actually works for each use case
Model routing
Which models for which tasks (cost/quality)
Tool configurations
Integrations, APIs, credentials patterns
Guardrail rules
What to block, what to escalate
Why Infrastructure Wins
Application companies fight for customers. Infrastructure companies power the ecosystem.
❌ Application Layer
Compete on features
Race to the bottom. Easy to copy.
Linear growth
Each customer = new acquisition cost
2-5x revenue multiples
Commodity software pricing
Low switching costs
Customers can leave anytime
✓ Infrastructure Layer
Compete on reliability
Mission-critical. Hard to replicate.
Compound growth
Templates improve → more value → more customers
10-25x revenue multiples
Scale AI: 18x. Stripe: higher.
High switching costs
Workflows built on your templates
Network effects are the underlying principle behind the success of companies like AWS, Stripe, and Salesforce. Higher network density means the product value increases.
— NFX: The Network Effects Manual
Market Size: $50-70B by 2030
AI agents are the fastest-growing category in enterprise software. We're building the infrastructure layer.
Our TAM Slice: Infrastructure
If AI agents are a $50B market and infrastructure captures 20-30% of stack value, that's a $10-15B slice:
Why We Win This Slice
First-mover on playbooks
Every month = more encoded knowledge
Revenue while building
Services fund the platform
Real deployment data
Failure patterns competitors don't have
The Infrastructure Stack
Four layers that make AI agents reliable. We're building all four.
Verified prompts, sequences, integrations
Multi-agent coordination, task routing
Approval queues, escalation, feedback loops
Safety rails, monitoring, audit trails
Current Products
Agent Seatbelt
Browser-layer guardrails that block irreversible actions
ClawView
Observability for autonomous agents. See what they do.
AgentGov
Governance, compliance, audit trails
AgentDocs
Verified code snippets for agent tool use
Current Traction
What We've Delivered
SDR for construction companies
Lead gen + qualification workflows
Video generation for ML training
Synthetic data pipeline workflows
Research for universities
Literature review + synthesis workflows
BDR for startups
Outbound + meeting booking workflows
What This Proves
Fat Startup Thesis
We're getting paid to build our moat. Every dollar of revenue = more encoded knowledge. Competitors starting later start from zero.
"A fat startup ships outcomes, not features. It bundles software, data, and human ops into one integrated product that actually gets the job done."
— Andrew Lee, a16z Speedrun
The Path Forward
12-Month Milestones
$1M ARR
Prove unit economics at scale
50+ Workflow Templates
Across 5+ verticals
Infrastructure Products Live
Guardrails, Observability, Governance
First Self-Serve Templates
Deploy without our team
Why Now
Models just got capable enough
GPT-5, Claude 4—agents can work
AI SDR market burned
70-80% churn = customers seeking alternatives
Infrastructure window open
No dominant player yet. First-mover wins.
Regulatory tailwinds
EU AI Act mandates oversight, audit trails
The Ask
The AWS of AI Work
Infrastructure that makes AI agents reliable. Workflow templates. Orchestration. Human oversight.
Every company deploying agents will need this. We're building it.
Team
Keith Schacht
Co-Founder + Advisor
$140M exit (Mystery) · a16z funded
[CTO]
Co-Founder & CTO
CTO of Because · $3M Seed
Yasir
Co-Founder
yapthis.com · Shipped production agents
Key Sources
MIT NANDA Study: 95% AI failure rate, 171% ROI when successful
MarketsandMarkets: $7.8B → $52.6B AI agents market (2025-2030)
Scale AI (Sacra): $1.5B ARR, $29B valuation, 50% gross margins
Pilot (CNBC/TechCrunch): $1.2B valuation, Bezos-backed
11x/Artisan: 70-80% churn within months (Broadn research)
RAND Corporation: 80% AI project failure rate
The Uber for AI Work
Post an outcome. AI agents compete. Pay only for results. We're building the outcome marketplace for the AI economy.
The a16z Speedrun Thesis
This is the exact model a16z partners are calling for in 2026.
Say you need 50 qualified sales meetings. Instead of buying another AI tool, you post a bounty: "$500 per meeting booked." AI agents compete. Whoever performs best gets paid. We already do this with bug bounties, Kaggle, hackathons. Why not for AI agents going after real business outcomes?
— Macy Mills, a16z Speedrun, "14 Big Ideas for 2026"
I'm especially excited about products that use AI to make previously expensive services cheaper and more accessible, sometimes using human-in-the-loop to start.
— Kenan Saleh, a16z Speedrun, "14 Big Ideas for 2026"
A fat startup ships outcomes, not features. It bundles software, data, and human ops into one integrated product that actually gets the job done.
— Andrew Lee, a16z Speedrun Partner
The Market Shift: Tools → Outcomes
The freelance marketplace is $1.5T. It's about to be disrupted by AI agents.
❌ Legacy Marketplaces
Upwork: $1.67B market cap
Pay humans by the hour. Hope they deliver.
Fiverr: ~$1B market cap
Fixed-price gigs. Still human-dependent.
Slow, expensive, variable
Wait days. Pay premium. Quality varies.
✓ AI Agent Marketplace (Us)
Pay per outcome, not effort
$X per meeting, $Y per video, $Z per lead.
Hours, not days
AI agents work 24/7. Instant scale.
Network effects compound
More agents = better matching = better outcomes.
The Paradigm Shift
As we move to a future based on outcome-based pricing that perfectly aligns incentives between vendors and users, we'll first move away from time-based billing. — a16z Big Ideas 2026
How It Works
Bounties + Escrow + AI Agents = Outcome Marketplace
For Buyers
Define the outcome
"Book qualified meeting" or "Generate product video"
Set your price
Pay what the outcome is worth to you
Zero risk
Funds held in escrow. Pay only on delivery.
For Agents (Supply Side)
Pick bounties that fit
Match capabilities to opportunities
Build reputation
Success rate → more bounties → more revenue
Get paid instantly
Verified outcome → automatic payout
The Bounty Model Works
Proven in bug bounties, open source, and ML competitions. Now it's time for AI work.
Precedent: Replit Bounties
Imagine a tool where you describe your problem and get a solution built for you. Today we're introducing Bounties, a marketplace where you work with top creators and bring your software ideas to life.
— Replit, on launching Bounties
Replit proved bounties work for code. We're proving it works for any AI-deliverable outcome.
Precedent: GitCoin
Over the past 5 years we've supported the funding of public goods. Started with bounties for open source, evolved to quadratic funding.
— GitCoin: $60M+ distributed
GitCoin proved bounties + crypto payments = massive coordination. We're applying this to AI agent work.
Network Effects: The Moat
70% of tech value comes from network effects. Here's how we build them.
Network effects have been responsible for 70% of all the value created in technology since 1994. Founders who deeply understand how they work will be better positioned to build category-defining companies.
— NFX, "The Network Effects Bible"
Two-Sided Marketplace NFX
More buyers → More bounties
Attracts more agents to the platform
More agents → Better matching
Faster delivery, higher quality outcomes
Better outcomes → More buyers
Word of mouth, lower prices, faster delivery
Data Network Effects
Every bounty = training data
What works, what fails, edge cases
Smarter matching over time
Route bounties to best-fit agents
Proprietary playbook library
Compound knowledge competitors can't replicate
Metcalfe's Law
The value of a network grows in proportion to N² (nodes squared). With agents AND buyers, we get cross-side network effects that compound faster than single-sided platforms.
Trust Layer: How Agents Build Reputation
The missing infrastructure for AI agent marketplaces.
Agent Identity & Track Record
Verifiable agent identity
Who built it, what it can do, audit trail
Per-function reputation
Track record based on actual outcomes, not reviews
Specialization scores
"This agent is 94% on sales meetings, 78% on video"
Trust Mechanics
Escrow with time-locks
Funds released only on verified delivery
Dispute resolution
Human or AI arbitration for edge cases
Sliding refund scale
Partial credit for partial delivery
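The escrow and sliding-refund mechanics above reduce to simple arithmetic: pay the agent per verified outcome, refund the buyer for the remainder, take a platform cut on what was earned. A minimal sketch — the 15% take is the low end of the stated 15-20% range, and the settlement shape is a placeholder, not committed pricing:

```typescript
// Illustrative settlement math for an outcome bounty with partial delivery.
// PLATFORM_TAKE is a placeholder at the low end of the 15-20% range.

interface Bounty {
  priceUsd: number;        // e.g. $500 per meeting booked
  outcomesOrdered: number; // e.g. 10 meetings
}

const PLATFORM_TAKE = 0.15;

function settle(bounty: Bounty, outcomesDelivered: number) {
  const delivered = Math.min(outcomesDelivered, bounty.outcomesOrdered);
  const earned = delivered * bounty.priceUsd;
  // Sliding refund: buyer gets escrowed funds back for undelivered outcomes.
  const buyerRefund = (bounty.outcomesOrdered - delivered) * bounty.priceUsd;
  return {
    agentPayout: earned * (1 - PLATFORM_TAKE),
    platformFee: earned * PLATFORM_TAKE,
    buyerRefund,
  };
}
```

Escrow releases `agentPayout` only on verified delivery; `buyerRefund` is the "zero risk" guarantee made concrete.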
Path: Managed → Open Marketplace
Like Uber: start premium, then open the platform.
Phase 1: Managed (Now)
We operate all agents
Quality control, learn playbooks
$4K MRR validates demand
Customers paying for outcomes
Build trust infrastructure
Escrow, verification, reputation
Phase 2-3: Marketplace
Invite partner agents
Vetted builders, revenue share
Open to all agents
Anyone can compete for bounties
Platform take rate: 15-20%
Like Uber, Airbnb, marketplace standard
The Uber Playbook
Uber started with black cars (premium, managed) before opening to UberX (open marketplace). We start with our agents, prove economics, then open to all. Services fund the platform build.
Comparable Companies & Valuations
Services → Platform is a proven path to massive outcomes.
Scale AI: Our North Star
Started as services
Data labeling for ML companies
Built the platform
Tools, workflows, quality systems
$2B+ revenue (2025)
Services funded the infrastructure
$13.8B valuation
Platform economics, not services multiples
Why We're Bigger
Scale AI: One vertical
Data labeling for ML
Us: All AI-deliverable work
Sales, content, research, ops...
TAM: $1.5T+ services market
Every white-collar task that can be AI'd
Why Now: The Perfect Storm
Technology Inflection
Models capable enough
GPT-5, Claude 4 can do real work
x402 machine payments
Agents can transact autonomously
Infrastructure exists
OpenClaw, MCP, agent frameworks
Market Readiness
AI tools disappointing
70% churn = buyers want outcomes
Budget exists
Companies spending on AI, getting nothing
First mover advantage
No AI-native outcome marketplace yet
Emerging primitives like x402 make payment settlement programmable and reactive. Smart contracts can settle a dollar payment globally in seconds. In 2026, this becomes the rails for agent commerce.
— a16z Big Ideas 2026, Part 3
Team & Traction
Keith Schacht
Co-Founder + Advisor
$140M exit (Mystery) · a16z funded · Built consumer products used by millions
[CTO]
Co-Founder & CTO
CTO of Because · $3M Seed · Deep agent infrastructure experience
Yasir
Co-Founder
yapthis.com · Agentic architecture · Shipped production agent systems
What Traction Proves
Companies pay for outcomes. 0% churn because incentives align. This is the business model for AI work.
The Ask
What We Need
$[X] Pre-Seed
Scale agent capacity, build marketplace infra
12-month goal: $1M ARR
Prove economics before opening marketplace
24-month: Open marketplace
Partner agents, then fully open
Why Us
Dog-fooding OpenClaw
We run agents daily, know what breaks
Built the infrastructure
ClawView, guardrails, workflows
Revenue already
$4K MRR proves the model
OpenHolly: The Uber for AI Work
Post an outcome. AI agents compete. Pay for results. The marketplace that makes AI actually deliver.
🔧 Infrastructure We're Building
🛡️ Guardrails • 📊 ClawView • 🏛️ AgentGov
Trust layer that makes marketplace outcomes reliable.
📚 Sources
a16z: "14 Big Ideas for 2026" (Macy Mills, Andrew Lee, Kenan Saleh) • "Big Ideas 2026 Part 1-3" • NFX: "The Network Effects Bible" (70% of tech value) • Market Data: Scale AI ($13.8B), Upwork ($1.67B), GitCoin ($60M+ distributed) • Replit: Bounties marketplace launch
The Control Plane for
AI Agents
Everyone's building autonomous agents. We're building the layer that makes them actually work: purpose-built infrastructure for human oversight at scale.
The Inconvenient Truth: Autonomy Fails
The research is clear—and the industry is learning the hard way.
Multi-Agent Systems Break Down
"Multi-agent architectures, despite their promise, can fall short on efficiency, reliability, and even accuracy... performance often degrades as coordination complexity increases."
— Berkeley/DeepMind "Why Multi-Agent LLM Systems Fail", 2025
75% failure rate
ChatDev on ProgramDev benchmark
~50% average task completion
Across autonomous agent frameworks
17x error amplification
In uncoordinated "bag of agents"
Enterprise AI Projects Crater
"42% of companies abandoned most of their AI initiatives in 2024, up from 17% the previous year. The average organization scrapped 46% of AI proof-of-concepts."
— S&P Global Research, 2024
95% of AI pilots fail
MIT Research on enterprise deployments
80%+ never reach production
RAND Corporation AI project study
2x failure rate vs traditional IT
AI projects vs standard software
Why This Matters
The industry is betting billions on fully autonomous agents. The research says they don't work. Someone needs to build the layer that makes them work.
Microsoft's Answer: Human-in-the-Loop
The largest AI research org in the world just validated our thesis.
"We argue that human-in-the-loop agentic systems offer a promising path forward, combining human oversight and control with AI efficiency to unlock productivity from imperfect systems."
— Microsoft Research, Magentic-UI (July 2025)
Magentic-UI Results
Only 10% of tasks needed human help
Lightweight intervention, massive improvement
1.1 avg clarifications per help request
Minimal interaction overhead
Key Interaction Mechanisms
Co-planning
Human + agent collaborate on plan before execution
Co-tasking
Seamless handoff between human and agent control
Action guards
Human approval for high-stakes actions
Memory
Learn from past interactions to improve
Microsoft's Conclusion
"Even as tomorrow's agents become more capable and reliable, we believe that human involvement will remain essential for preserving human agency, resolving unforeseen ambiguities, and guiding agents in adapting to an ever-changing world."
Anthropic's Findings: The Oversight Paradox
Real-world data from millions of Claude Code sessions reveals how humans actually oversee agents.
As Users Gain Experience...
Auto-approve increases: 20% → 40%+
Experienced users let Claude run autonomously
BUT interrupt rate ALSO increases: 5% → 9%
They intervene more often, not less
The shift: Step-by-step → Exception-based
From approving everything to watching for problems
Agent-Initiated Stops Matter
Claude asks for clarification 2x more
On complex tasks vs simple ones
More often than humans interrupt
On the most difficult tasks
Models know when they're uncertain
They can (and should) ask for help
"Effective oversight doesn't require approving every action but being in a position to intervene when it matters... our central conclusion is that effective oversight of agents will require new forms of post-deployment monitoring infrastructure and new human-AI interaction paradigms."
— Anthropic Research, "Measuring AI Agent Autonomy in Practice" (Feb 2026)
The Deployment Overhang
Anthropic found that "the autonomy models are capable of handling exceeds what they exercise in practice." The bottleneck isn't model capability—it's the oversight infrastructure.
Air Traffic Control for AI Agents
The analogy everyone is converging on—and what it means for product design.
"Think of agents within your multi-agent system as the airplanes. The agents have their own autonomy to act. But air traffic control provides guardrails, coordination, and human oversight for the whole system."
— Jason Bryant, AI in Pharma (Jan 2026)
Why Air Traffic Control Works
Planes are autonomous
Pilots make real-time decisions
Controllers handle coordination
Routing, conflicts, emergencies
Humans handle edge cases
Automation can't deviate from standard procedures; humans can
System improves over time
Incidents become new procedures
Why This Analogy Matters
Scaling ratio: 1 controller : many planes
Not 1:1 human-to-agent
Controllers can't replace pilots
Nor vice versa—complementary roles
No full automation possible
Edge cases require human judgment
Multi-billion dollar industry
ATC isn't going away
The Thesis
As AI agents proliferate, every company will need an "air traffic control" system for their agent fleet. That's the control plane we're building.
Why Current Interfaces Fail
Existing tools weren't designed for the human-agent oversight problem.
❌ Chat Interfaces
Conversational, not workflow-oriented. Can't manage 100 agents. No approval queues. No batch operations. You'd need a chat window per agent.
❌ Code/GitHub
Great for developers. Useless for ops teams. Can't approve actions in real-time. No visual understanding of agent state or intent.
❌ Slack/Email Alerts
Ad hoc approvals. No context. Alert fatigue. Doesn't learn from decisions. Can't see what agent plans to do next.
❌ Observability Dashboards
Read-only visibility. No intervention capability. See problems after they happen. Can't modify agent plans mid-execution.
"Only 14.4% of enterprises have full security approval for AI agents. 88% reported agent-related incidents. The interface problem is also a governance problem."
— Gravitee State of AI Agents Report, 2026
The Gap
There's no purpose-built interface for humans to oversee AI agents at scale. Not dashboards. Not chat. Not alerts. A new category needs to exist.
What a Control Plane Actually Needs
Distilled from Microsoft, Anthropic research, and our own deployments.
Pre-Execution
Plan Review
See what agent intends to do before it acts. Edit plans. Add constraints.
Scope Boundaries
Define allowed domains, tools, actions. Agent can't exceed boundaries.
Workflow Templates
Start from proven patterns. Don't reinvent for every task.
During Execution
Real-Time Visibility
See agent actions as they happen. Browser view. Code execution. API calls.
Interrupt & Resume
Pause any agent instantly. Take control. Hand back.
Action Guards
Automatic pause for high-stakes actions. Configurable thresholds.
Approval Layer
Unified Queue
All pending approvals across all agents in one view.
Batch Operations
Approve/reject patterns across many agents at once.
Smart Routing
Route different decisions to different humans by expertise.
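The approval layer above can be sketched as one shared queue with routing and batch verdicts. The routing table and decision kinds are illustrative assumptions, not our actual schema.

```python
from collections import defaultdict

# Illustrative routing table: decision type -> human reviewer role.
ROUTES = {"finance": "cfo", "legal": "counsel"}

def route(pending):
    """Unified queue: group pending approvals from all agents by reviewer."""
    queues = defaultdict(list)
    for item in pending:
        queues[ROUTES.get(item["kind"], "ops")].append(item)
    return dict(queues)

def batch_decide(queue, kind, verdict):
    """Batch operation: apply one verdict to every matching queued item."""
    matched = [item for item in queue if item["kind"] == kind]
    for item in matched:
        item["status"] = verdict
    return len(matched)

pending = [{"kind": "finance"}, {"kind": "legal"},
           {"kind": "ship"}, {"kind": "finance"}]
queues = route(pending)  # cfo gets both finance items; unrouted kinds go to ops
approved = batch_decide(queues["cfo"], "finance", "approved")
```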
Learning Layer
Decision Memory
Human approvals become future patterns. Rejections become rules.
Threshold Tuning
Auto-adjust when to ask humans based on outcomes.
Playbook Evolution
Workflows improve with every human intervention.
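The learning layer's loop can be sketched directly: approvals loosen the ask-a-human threshold, repeated rejections harden into a standing rule. Window sizes, step size, and the floor are arbitrary choices for this sketch.

```python
class DecisionMemory:
    """Sketch: human approvals become future patterns, rejections become rules."""

    def __init__(self, threshold=0.8):
        self.history = {}        # action kind -> list of human verdicts (bool)
        self.threshold = threshold

    def record(self, kind, approved):
        verdicts = self.history.setdefault(kind, [])
        verdicts.append(approved)
        # Threshold tuning: five straight approvals earn a little more autonomy.
        if len(verdicts) >= 5 and all(verdicts[-5:]):
            self.threshold = max(0.5, self.threshold - 0.05)

    def needs_human(self, kind, risk):
        verdicts = self.history.get(kind, [])
        # Decision memory: three straight rejections become a hard rule.
        if len(verdicts) >= 3 and not any(verdicts[-3:]):
            return True
        return risk >= self.threshold
```

Even this toy version shows the shape of the claim: the human-to-agent ratio improves because the system asks less often as trust accumulates.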
The "Control Plane" Category
Every complex system has a control plane. AI agents need one too.
Control plane for infrastructure. See what's happening. Alert when things break. Intervene.
Control plane for containers. Orchestrate workloads. Handle failures. Scale automatically.
Control plane for identity. Who can access what. Audit trails. Compliance.
Control plane for AI agents. What agents are doing. Approvals & intervention. Learning & guardrails. This category doesn't exist yet.
"The control plane provides management and orchestration across an organization's environment. It's akin to air traffic control for applications."
— Vectra AI definition
The Opportunity
Infrastructure got Datadog. Containers got Kubernetes. Identity got Okta. AI agents need their control plane. We're building it.
Why Human-in-the-Loop Scales
The VC objection—and why it's wrong.
The Objection
"If humans are in the loop, doesn't that kill unit economics? Isn't the whole point to remove humans?"
The Response: Look at the Data
Human labelers + AI. Humans as oversight.
Human bookkeepers + AI. Humans as QA.
Human analysts + AI. Humans as strategists.
The Key Distinction
"Humans as OVERSIGHT, not labor. AI does the work, humans QA. The ratio improves over time."
The Scaling Math
Year 1: 10:1 ratio
1 human oversees 10 agents. Heavy QA.
Year 2: 100:1 ratio
System learns. Fewer interventions needed.
Year 3+: 1000:1 ratio
Humans handle edge cases only. Still critical.
The Avi Medical Case Study
81% automation rate. 93% cost savings. Humans handle complex cases. HITL doesn't kill unit economics—it enables them.
The Contrarian Bet
Everyone's zigging toward full autonomy. We're zagging toward control.
What Everyone Else is Building
Fully autonomous agents
Demo well. Break in production.
More agent capabilities
Better models. More tools. Same failure modes.
"Just add more agents"
17x error amplification, per DeepMind.
Removing humans entirely
The dream that keeps failing.
What We're Building
The oversight layer
Makes ANY agent more reliable.
Human-agent collaboration
Complementary strengths. Better outcomes.
Coordination infrastructure
Turns a bag of agents into a functional team.
Humans in the right places
Exception handling. Strategic oversight.
"I'm especially excited about products that use AI to make previously expensive services cheaper and more accessible, sometimes using human-in-the-loop to start."
— Keenan Saleh, a16z Speedrun Partner
Our Position
We're not betting against agent capabilities improving. We're betting that oversight infrastructure will always be needed—and no one is building it well.
Why Us, Why Now
Team
Keith Schacht
Co-Founder + Advisor
$140M exit (Mystery) · a16z funded · Built consumer products used by millions
[CTO]
Co-Founder & CTO
CTO of Because · $3M Seed · Deep agent infrastructure experience
Yasir
Co-Founder
yapthis.com · Agentic architecture · Shipped production agent systems
Why Now
Agent adoption is exploding
OpenAI Operator, Anthropic Claude Code, 1000+ agent startups
Failure rates are becoming visible
95% pilot failure is now common knowledge
Research is converging
Microsoft, Anthropic, DeepMind all pointing to HITL
Regulation is coming
EU AI Act mandates audit trails & oversight
What We've Built
$4.5K MRR
Proving the thesis with real customers
OpenClaw infrastructure
Dogfooding our own control plane daily
Guardrails, ClawView, AgentGov
Components of the full control plane
The Ask
The Human-Agent Control Plane
Purpose-built infrastructure for human oversight of AI agents at scale. Plan review. Action guards. Approval queues. Learning loops. The missing layer that makes agents actually work.
What We Need
$[X] Pre-Seed
Build the full control plane product
12-month goal: $1M ARR
Prove control plane scales across customers
Then: Category definition
Be "Datadog for AI agents"
The Opportunity
New category creation
No one owns "AI agent control plane" yet
Research-backed thesis
Microsoft, Anthropic, DeepMind alignment
Every agent deployment needs this
Horizontal opportunity across industries
🔧 Infrastructure We're Building
🛡️ Guardrails • 📊 ClawView • 🏛️ AgentGov • 🤖 Employee OS
The Control Plane integrates all infrastructure layers into one human-facing interface.
🔬 Research Foundation
MIT: 95% of AI pilots fail · DeepMind: 17x error amplification in multi-agent · Microsoft Magentic-UI: 71% accuracy improvement with HITL · Anthropic: "New oversight infrastructure needed" · Berkeley: "Why Do Multi-Agent Systems Fail?" · S&P Global: 42% of AI initiatives abandoned
Vibe Code Your Business
"Vibe coding" revolutionized app development—describe what you want, AI builds it. Now apply this to business outcomes. Describe the result, AI + humans deliver it.
The Vibe Coding Revolution
What started as a meme became a paradigm shift. Now it's evolving beyond code.
"There's a new kind of coding I call 'vibe coding', where you fully give in to the vibes, embrace exponentials, and forget that the code even exists."
— Andrej Karpathy, Feb 2025 (coined the term)
Origins & Evolution
"The hottest new programming language is English"
Karpathy's early prediction about LLM capabilities
Vibe coding goes mainstream
Cursor, Replit, Claude Code—describe → build
Beyond coding: "Vibe Productivity"
Research, writing, reporting, file operations, "glue work"
Where It's Going
"What changed in early 2026 is that vibe coding is no longer confined to software development; it is spreading into research, writing, reporting, spreadsheet wrangling, file operations, and 'glue work' that usually fragments attention."
— Ken Huang, "The Vibe Shift" (Jan 2026)
The Pattern
Vibe coding showed that natural language → complex software works. Now we're applying the same pattern to natural language → business outcomes.
From Apps to Outcomes
The next evolution: describe what you want to achieve, not what you want built.
❌ Current Reality: Use Tools
Subscribe to AI SDR tool
$5-10K/month
Configure the tool
Import lists, write sequences, set rules
Monitor the tool
Fix errors, adjust settings, babysit
Hope for outcomes
70% churn in 3 months when it doesn't work
✓ Vibe Outcomes: Describe Results
Describe what you want
"50 qualified sales meetings with Series A fintech founders"
AI agents execute
Research, outreach, qualification, scheduling
Humans QA
Review, approve, handle edge cases
Pay for outcomes
$X per meeting delivered
The Thesis
Vibe coding proved that intent → artifact works for software. Vibe outcomes proves it works for business results. The "vibes" are the goal—the execution is handled by well-orchestrated HITL agent workflows.
How It Works
Describe outcome → Agents execute → Humans QA → Outcome delivered
Example: "50 Sales Meetings"
Input
"Book 50 qualified meetings with Series A fintech founders in Q1"
Research Agent
Identifies prospects, signals, contact info
Outreach Agent
Drafts personalized messages
Human Review
Approves messaging before send
Scheduling Agent
Books the meeting when prospect replies
Example: "Process These Invoices"
Input
"Process this month's invoices and flag anomalies"
Extraction Agent
Pulls data from PDFs, emails, systems
Matching Agent
Matches to POs, identifies discrepancies
Human Review
Approves exceptions, flags fraud
Output
Processed invoices, exception report
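Both examples follow the same loop: agents produce artifacts step by step, and a human gate sits before anything externally visible. A toy sketch of that loop, with invented agent stubs standing in for the research/outreach/scheduling agents above:

```python
def run_outcome(goal, steps, human_approve):
    """Describe outcome -> agents execute -> humans QA -> outcome delivered.
    Each step is (name, agent_fn, needs_review)."""
    artifact = {"goal": goal}
    for name, agent_fn, needs_review in steps:
        artifact[name] = agent_fn(artifact)
        if needs_review and not human_approve(name, artifact[name]):
            return {"status": "paused_for_human", "at": name, **artifact}
    return {"status": "delivered", **artifact}

# Toy stand-ins; real agents would do research, drafting, and scheduling.
steps = [
    ("research", lambda a: ["Jane @ FintechCo"], False),
    ("draft", lambda a: f"Hi {a['research'][0]} ...", True),  # human reviews before send
    ("schedule", lambda a: "meeting booked", False),
]
result = run_outcome("50 qualified meetings", steps, lambda name, out: True)
```

The design choice worth noting: a rejection pauses the run rather than failing it, so the human can edit the artifact and resume.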
Why Vibe Outcomes Need Human-in-the-Loop
Pure AI can't deliver reliable business outcomes. The research is clear.
Why Pure AI Fails
"Multi-agent architectures, despite their promise, can fall short on efficiency, reliability, and even accuracy... performance often degrades as coordination complexity increases."
— Berkeley/DeepMind, 2025
Hallucinations occur even with high confidence
AI can be confidently wrong about business-critical decisions
Edge cases are infinite
Business has nuance AI can't anticipate
Stakes are high
Brand damage, legal liability, lost deals
Why HITL Fixes It
"Hybrid AI workflows, which combine automation with human oversight, are not a fallback; they're the modern standard for reliability, trust, and scalability in 2026."
— Parseur, Dec 2025
Human as QA layer, not labor
AI does 90% of work, humans verify critical decisions
Trust calibration over time
System learns when to ask, when to proceed
Only 10% of tasks need human help
Microsoft found lightweight intervention = massive improvement
The Interaction Layer
This is the UX for the AI-native agency, control plane, and marketplace pitches.
Why Current Interfaces Fail
❌ Chat Interfaces
Conversational, not outcome-oriented. Can't manage complex multi-step workflows. No approval queues.
❌ Dashboards
Read-only visibility. No intervention. See problems after they happen. Can't modify plans mid-execution.
❌ Slack/Email Alerts
Ad hoc. No context. Alert fatigue. Can't see what agent plans to do next.
The Vibe Outcomes Interface
Natural language input
"I need X" → system figures out how
Progress visibility
See what's happening toward your goal
Approval queues
Review decisions that matter
Interrupt & adjust
Course-correct mid-execution
Outcome tracking
Clear metrics: delivered vs requested
🔗 This Powers Our Other Pitches
⚡ Fat Startup: Vibe outcomes is how customers interact with us
🚗 Uber for AI Work: Natural language bounty posting
🎛️ Control Plane: The human oversight layer
☁️ AWS of AI Work: Workflow templates activated by intent
Market Opportunity
The shift from "tools" to "outcomes" is creating massive new markets.
Who Wants This
SMBs frustrated with AI tools
70% AI SDR churn = customers seeking alternatives
Enterprises with AI fatigue
95% pilot failure = demand for what works
Founders too busy to manage AI
Want outcomes, not another tool to learn
The Pricing Shift
"Per-seat is no longer the atomic unit of software. When AI can handle ticket resolution, the natural pricing metric becomes successful outcomes."
— a16z Enterprise Newsletter, Dec 2024
Outcome-aligned pricing
$X per meeting, $Y per processed invoice, $Z per video
Competitive Landscape
Who else is thinking about natural language → outcomes?
Tools (Not Outcomes)
AI SDRs (11x, Artisan, AiSDR)
Sell tools. Charge per seat. You manage agents. 70% churn.
❌ Not outcome-based
Agent Platforms (LangChain, CrewAI)
Infrastructure for developers. Build your own workflows.
❌ Not outcomes, just primitives
Automation (Zapier, Make)
Workflow automation. You design the flows.
❌ Not AI-native, not outcome-based
Closest Parallels
Scale AI ($13.8B)
Services + HITL → platform. "We need labeled data" → delivered.
✓ Outcome-based, HITL model
Pilot ($1.2B)
"Do my bookkeeping" → done. Humans + AI.
✓ Outcome-based, HITL model
Intercom Fin ($0.99/resolution)
AI support priced per successful outcome.
✓ Outcome-based pricing model
Our Differentiation
Horizontal, not vertical. Scale AI = data labeling. Pilot = bookkeeping. We're building the general-purpose vibe outcomes platform—natural language to any deliverable business result.
Current Traction
Proving the thesis with real customers and real outcomes.
Outcomes We've Delivered
"Get me sales meetings"
SDR/BDR for construction, startups (50% of revenue)
"Generate training videos"
ML training data pipelines (30% of revenue)
"Research these topics"
University lab literature synthesis (20% of revenue)
Why Zero Churn
"When you only pay for outcomes, there's no reason to churn. We deliver meetings, they pay. We don't deliver, they don't pay. Aligned incentives = sticky customers."
vs. AI Tool Churn
AI SDRs charge $5-10K/mo whether or not they work. When they don't deliver, customers leave. Misaligned incentives = 70% churn.
Why Now: 2026 Is the Year
Technology, market, and cultural convergence make this the moment.
Technology Ready
Models finally capable enough
GPT-5, Claude 4 can execute real business workflows
Agent infrastructure exists
OpenClaw, MCP, tool-use protocols
x402 machine payments
Agents can transact autonomously (a16z Big Ideas 2026)
HITL research converging
Microsoft, Anthropic, DeepMind all pointing same direction
Market Ready
AI tool fatigue
70% AI SDR churn. 95% pilot failure. Customers want what works.
Budget exists
Companies spending billions on AI, getting nothing
Pricing shift happening
30%+ enterprise SaaS moving to outcome-based
"Vibe coding" cultural moment
Natural language → results is now understood
"2025 was widely labeled 'the year of AI agents.' In reality, it was the year we learned what agents can and cannot do. 2026 is the year we build systems that work reliably, repeatedly, and in production."
— Human-in-the-Loop Newsletter, Dec 2025
Team
Keith Schacht
Co-Founder + Advisor
$140M exit (Mystery) · a16z funded · Built consumer products used by millions
[CTO]
Co-Founder & CTO
CTO of Because · $3M Seed · Deep agent infrastructure experience
Yasir
Co-Founder
yapthis.com · Agentic architecture · Shipped production agent systems
What We've Built
Dog-fooding daily
Running OpenClaw infrastructure ourselves
Agent Seatbelt
Browser-layer guardrails
ClawView
Agent observability
Workflow templates
Playbooks that compound
Why Us
We've shipped outcomes
$4K MRR from real deliverables
We understand HITL
Built the infrastructure, not just the agents
We know the failure modes
Encoded in playbooks from real experience
The Ask
Vibe Code Your Business
Describe the outcome you want. AI agents + human QA deliver it. Pay only for results. The interaction layer for the AI economy.
What We Need
$[X] Pre-Seed
Scale agent capacity + build the interface
12-month goal: $1M ARR
Prove vibe outcomes across multiple verticals
Then: Self-serve platform
Anyone can describe outcomes and get them
The Opportunity
New category creation
"Vibe outcomes" platform doesn't exist yet
Cultural moment
Vibe coding is mainstream—extend it to business
$52.6B market by 2030
AI agents + outcome-based pricing converging
📚 Research Foundation
Karpathy: Coined "vibe coding" Feb 2025 · MIT NANDA: 95% AI pilot failure · Microsoft Magentic-UI: 71% accuracy improvement with HITL · CooperBench: 30% lower success in multi-agent without coordination · a16z: Outcome-based pricing shift · Gartner: 30%+ enterprise SaaS with outcome pricing by 2025 · Bessemer: AI Pricing Playbook (Feb 2026)
🔗 Related Pitches
⚡ Fat Startup • 💰 Outcome-Based • 🚗 Uber for AI Work • 🎛️ Control Plane
Vibe Coding Outcomes is the UX/interaction layer that powers all of these.
25 NYC Startups: R&D Opportunities
Series A-B companies ($13M-$160M raised) with specific research they could implement but haven't.
NYC Tech Startups
Combined Funding
Research Opportunities
🏦 Fintech / Finance AI
Rogo — $75M (Series B)
Building "Wall Street's first AI analyst" — LLMs for financial reasoning
R&D Opportunities:
- Chain-of-Table reasoning — 40% more accurate on tabular financial data
- FinGPT fine-tuning — Open-source financial LLM for domain reasoning
- Toolformer for financial APIs — Teach LLMs to call Bloomberg/Reuters autonomously
Hook: "Your financial reasoning models could be 40% more accurate on tabular data with Chain-of-Table"
Farsight — $16M (Series A)
AI for finance — valuation models, deal analysis, Excel/PPT generation
R&D Opportunities:
- SpreadsheetLLM — Microsoft's approach to better spreadsheet understanding
- DocPrompting — Generate accurate documents with citations
- Table-GPT — Unified table understanding and generation
Hook: "SpreadsheetLLM could cut your Excel generation errors by 30%"
Aiera — $25M (Series B)
GenAI for financial professionals — broker research, earnings calls, filings
R&D Opportunities:
- LongLoRA — Process 10x longer earnings calls without quality loss
- RAG-Fusion — Multiple query generation for better retrieval
- Time-LLM — Repurpose LLMs for time series forecasting
Hook: "LongLoRA could let you process 10x longer earnings calls without quality loss"
Carbon Arc — $56M (Series A)
Marketplace for curated AI-ready datasets (Insights Exchange)
R&D Opportunities:
- Data-Juicer — Open-source data quality toolkit for AI datasets
- DataComp — Benchmark for dataset curation quality
- Synthetic data detection — Verify dataset authenticity/quality
Hook: "DataComp benchmarking could become your quality certification"
🏥 HealthTech / BioTech
Ataraxis — $20M (Series A)
AI for cancer precision medicine — analyzes data to identify optimal treatments
R&D Opportunities:
- CancerGPT — Few-shot learning for drug pair synergy prediction
- DrugCLIP — Contrastive learning for drug-target interaction
- Med-PaLM 2 — Google's medical LLM achieving expert-level performance
Hook: "CancerGPT's few-shot approach could expand your drug combination predictions 5x faster"
Inspiren — $35M (Series A)
AI + IoT for senior care — AUGi device for fall detection and patient monitoring
R&D Opportunities:
- RT-DETR — Real-time detection faster than YOLO
- Action recognition transformers — Video transformers for activity recognition
- Privacy-preserving pose estimation — On-device processing without cloud
Hook: "RT-DETR could cut your fall detection latency by 40% while running entirely on-device"
Slingshot AI — $40M (Series A)
AI for mental health — "Ash" chatbot simulates therapist-like conversations
R&D Opportunities:
- Constitutional AI for safety — Anthropic's approach to helpful + harmless
- EmoBERTa — Emotion-aware language model fine-tuning
- CBT dialogue systems — Structured therapeutic conversation flows
Hook: "Constitutional AI could reduce harmful responses by 80% while maintaining therapeutic value"
Camber — $30M (Series B)
Healthcare payment automation — streamlines insurance reimbursement
R&D Opportunities:
- Medical coding LLMs — Auto-coding diagnosis/procedure codes
- Claims denial prediction — ML to predict and prevent rejections
- Donut/Pix2Struct — Document understanding for medical forms
Hook: "Medical coding LLMs could auto-fill 60% of your claims forms"
🛠️ Dev Tools / Infrastructure
Warp — $18M (Series A)
AI-powered payroll platform for multi-state compliance
R&D Opportunities:
- Regulatory RAG — Retrieval over tax code databases
- LayoutLMv3 — Extract state tax forms with 95% accuracy
- Temporal reasoning — LLMs for date/deadline calculations
NetBox Labs — $35M (Series B)
Open-source network automation platform
R&D Opportunities:
- LLM for network config — Auto-generate Cisco/Juniper configs from NL
- Anomaly detection — Transformer-based time series for network telemetry
- Vision → IaC — Convert network diagrams to code
Topline Pro — $27M (Series B)
AI marketing for home service businesses
R&D Opportunities:
- Local SEO automation — LLM-generated location-specific content
- Multi-modal review response — Personalized responses with images
- Conversational scheduling — LLM-powered booking agents
💼 Sales / Marketing AI
Clay — $40M (Series B, $1.25B valuation)
AI for sales personalization — integrates 100+ data sources
R&D Opportunities:
- Persona-based email generation — LLMs that adapt tone per recipient
- Entity resolution at scale — Deduplication across data sources
- Buyer intent prediction — Multi-signal ML for ready-to-buy leads
Hook: "Buyer intent prediction could 3x your users' reply rates"
Profound — $35M (Series B) ⭐ Existing Client
AI search optimization — helps brands appear in AI-generated responses
R&D Opportunities:
- Retrieval optimization — Improve citation likelihood in RAG systems
- AI visibility benchmarking — Measure brand presence across LLMs
- Source authority scoring — How LLMs weight different sources
ShopMy — $77.5M (Series B)
Influencer commerce platform
R&D Opportunities:
- CLIP-based product matching — Visual similar product discovery
- Influencer-audience fit — ML for matching creators with brands
- Shoppable video AI — Auto-detect and tag products in video
⚖️ Compliance / Legal AI
Norm AI — $48M (Series B)
AI for regulatory compliance — automates review of legal documents
R&D Opportunities:
- Legal-BERT fine-tuning — Domain-specific transformer for legal text
- Contract element extraction — NER for legal clauses
- Regulatory change detection — Track and summarize regulation updates
Hebbia — $130M (Series B, $700M valuation)
Document AI — searches large document sets with citations
R&D Opportunities:
- ColBERT v2 — Late interaction retrieval for better search
- Self-RAG — LLM that self-reflects on retrieval quality (+25% accuracy)
- Structured reasoning chains — Better citation generation
Hook: "Self-RAG could improve your citation accuracy by 25%"
🔒 Cybersecurity / 🌱 Climate / 🛒 Consumer
Zip Security — $13.5M
SMB cybersecurity
- LLM threat intelligence
- Automated SOC analyst
- LLM phishing detection (+40% accuracy)
Chestnut Carbon — $160M
Reforestation + carbon credits
- Satellite carbon estimation
- Biodiversity monitoring (audio/visual)
- ML credit verification
GDI — $20M+
Silicon anodes for EV batteries
- Battery degradation prediction
- Materials discovery with ML
- CV defect detection (-40% QC cost)
Novig — $18M
P2P sports betting
- LLM odds modeling
- Market making algorithms
- Fraud detection
David — $75M
High-protein nutrition bars
- AI food formulation
- Demand forecasting
- Consumer preference modeling
Cents — $40M
Laundry/dry-cleaning SaaS
- Demand forecasting
- Route optimization
- Image garment classification
🎯 Best Targets by Category
🔥 Highest Urgency (AI-Native)
- Rogo — Financial reasoning is hard, need every edge
- Hebbia — Document AI is competitive, Self-RAG matters
- Aiera — Long context + time series = big opportunities
- Slingshot AI — Safety is existential for mental health AI
💰 Big Companies With Resources
- Clay ($1.25B val) — Can afford to experiment
- Hebbia ($700M val) — Research-forward culture
- Chestnut Carbon ($160M) — ML for verification is huge
🎯 Underserved Markets
- Inspiren — Elder care + CV is niche
- Cents — Laundry tech has zero AI competition
- Topline Pro — Home services AI is wide open
⭐ Existing Relationship
- Profound — Already a client, easy expansion
Outreach Template
Subject: Quick R&D idea for [Company] — [specific technique]
Hi [Name],
Congrats on [recent news/funding]. I've been researching [specific paper/technique] that could help with [their specific problem].
Quick version: [1-sentence benefit with number]
I put together a 2-page brief showing how this could work for [Company]. Want me to send it over?
R&D ≠ The Pain Point
The real market pain is downstream from R&D — it's about shipping AI to production.
Most AI projects fail to reach production (RAND)
95% of GenAI pilots fail (MIT/Fortune 2025)
The gap isn't finding the right model. It's shipping AI to production.
The Skills Gap (Reddit Gold)
From r/MLQuestions — 688 upvotes, Nov 2025
What Candidates Know
- Transformer architectures, attention mechanisms
- Papers they've implemented (diffusion, GANs, LLMs)
- Kaggle competitions, theoretical deep learning
What Companies Need
- Deploy a model behind an API that doesn't fall over
- Write data pipelines that process reliably
- Debug why the model is slow/expensive in production
- Build evals to know if the model is working
"I'll interview someone who can explain LoRA fine-tuning in detail but has never deployed anything beyond a Jupyter notebook."
— Startup co-founder hiring ML engineers
The Observability Gap (Your Opportunity)
From Cleanlab's survey of 95 teams with AI in production
Teams satisfied with observability
Plan to improve observability next year
Rebuild AI stack every 3 months
Key Insight
Even among the 5% of companies that reach production, most remain early in maturity. They can't reliably know when their agents are right, wrong, or uncertain.
Reframing The Pitch
| | ❌ OLD: "AI R&D Engineer" | ✅ NEW: "Production AI Engineer" |
|---|---|---|
| Vibes | Research, experimentation | Deployment, reliability |
| Perception | Nice-to-have | Need-to-have |
| Target | Teams with resources | Teams with stuck projects |
| Job-to-be-done | "Find the best model" | "Ship to production this month" |
The Positioning Gap
Aemon = the optimization engine
You = the shipping engine
Target Customers (Not Research Teams)
🚀 Series A-C Startups with AI Features
- Have small ML teams, can't hire fast enough
- ML engineers cost $200-400k and are hard to find
- Need someone who can actually deploy, not just research
Pain: "We have 3 AI features in Jira blocked for months"
🏢 Product Companies Adding AI
- Non-ML companies adding AI features
- Don't have ML expertise internally
Pain: "We want AI in our product but don't know where to start"
⚙️ Enterprise AI Platform Teams
- Drowning in stack churn (rebuilding every 3 months)
- Coordination overhead killing velocity
Pain: "Platform team of 5 supporting 20 feature teams — we're bottlenecked"
🏛️ Regulated Industries
- 42% plan to add oversight features (vs 16% unregulated)
- Need governance + observability
Pain: "Can't deploy AI without compliance sign-off"
Better Pitch Angles
1. "Your AI Projects Are Stuck. We Ship Them."
- Target: Companies with AI projects "in progress" for months
- Proof: Show deployment timelines (weeks vs months)
- Wedge: Audit → identify stuck projects → ship one fast
2. "AI Observability + Ops as a Service"
- Target: Companies with AI in production but no visibility
- Pain: "We don't know when our AI is wrong"
- Proof: Catch regressions, reduce incidents
3. "The AI Platform Team You Can't Hire"
- Target: Scaling startups without MLOps expertise
- Pain: ML engineers cost $400k and don't want to do ops
- Proof: Infrastructure setup in days, not months
4. "CI/CD for AI" (existing pitch)
- Still good, but position as production not research
- Focus on deployment gates, not model selection
- "Every AI PR tested against your evals before merge"
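The "CI/CD for AI" angle reduces to one gate: block a merge unless the change clears the team's eval suite. A minimal sketch, where exact-match scoring and the 0.9 pass bar are arbitrary examples, not a prescribed eval design:

```python
def eval_gate(candidate_outputs, golden, min_pass=0.9):
    """Compare a change's outputs against golden eval answers; gate the merge."""
    passed = sum(out == want for out, want in zip(candidate_outputs, golden))
    rate = passed / len(golden)
    return {"pass_rate": rate, "merge_allowed": rate >= min_pass}
```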
Action Items
- Rewrite pitches with "production" and "ship" language
- Target stuck projects — companies with AI features in backlog
- Lead with observability — 63% want better visibility
- Offer quick wins — "Ship one AI feature in 2 weeks"
- Avoid research teams — they don't have budget urgency