Strategic Synthesis: Your Next AI Venture
Ken — first, the human note: shutting down something technically excellent that's growing isn't failure. It's the discipline most founders never develop. Seven independent analyses converged on the same conclusion, so I'll spare you the suspense:
The pivot is not away from your mission. It's one level deeper.
1. Why Sup AI Hit a Structural Wall (Not a Personal One)
Your read of the market is correct, and the data confirms it. Sup's edge depended on access to model internals — logprobs, token distributions, thinking traces. That surface is being permanently restricted across providers:
- OpenAI deprecated logprobs for GPT-5 models in mid-2025 (community.openai.com), and by GPT-5.2 logprobs support appears effectively removed (community.openai.com). Reasoning models explicitly do not expose raw reasoning tokens, only summaries (developers.openai.com).
- Anthropic's OpenAI SDK compatibility docs state that `logprobs` is ignored (platform.claude.com). Claude 4 thinking is often summarized, omitted, or encrypted; redacted thinking blocks may contain no human-readable summary (platform.claude.com).
- Google Gemini uses encrypted "thought signatures" — opaque blobs you pass back rather than interpret.
This is not a temporary API regression. It's a deliberate, industry-wide architectural shift driven by three forces:
- Safety/alignment — OpenAI explicitly argues that chain-of-thought monitorability is a primary defense against reward hacking, and that strong supervision pressure could lead models to hide intent (frontiermodelforum.org).
- IP protection — token probabilities and reasoning traces leak training dynamics and architectural decisions (arxiv.org).
- Cost/latency economics — exposing full distributions increases bandwidth and complicates billing.
Implication: any business whose moat depends on persistent low-cost access to model internals faces existential platform risk. Your instinct to wind down is rational. But the underlying mission — reliable AI for high-stakes decisions — has only become more valuable. Your benchmark result of 52.15% on Humanity's Last Exam (prnewswire.com) remains a credibility asset for whatever you build next.
2. The Convergence: Why Causal AI is Your Natural Next Surface
Six of seven models independently identified the same opportunity space, and the case is stronger than any of them alone. Three forces are converging:
2.1 Market Tailwind
The Causal AI market shows aggressive growth across every credible report (with significant taxonomy variance — narrow definitions yield ~$89M–$300M today, broader decision-intelligence definitions yield $20B+).
The growth direction is unambiguous. Recent funding rounds confirm investor conviction:
- Alembic: $145M Series B (Nov 2025), with NVIDIA as founding customer (businesswire.com)
- Causal Labs: $6M seed (businesswire.com)
- Calibre: $3.3M pre-seed for causal health navigation (tech.eu)
- Enterprise AI investment hit $37B in 2025, tripling in one year (globenewswire.com)
2.2 Regulatory Tailwind
The EU AI Act is phasing in: prohibitions live since Feb 2, 2025; GPAI obligations from Aug 2, 2025; broad applicability Aug 2, 2026 (ai-act-service-desk.ec.europa.eu), with penalties up to 7% of global turnover (digital-strategy.ec.europa.eu). NIST has published a Generative AI Profile for risk management (nist.gov). Causal models satisfy these mandates substantively because they produce causal audit trails, not post-hoc correlational explanations like SHAP/LIME.
2.3 Why Causal Reasoning Specifically
Pearl's causal hierarchy maps directly onto what LLMs lack:
- Level 1, Association (P(y | x)) — what LLMs do natively. This is where logprobs lived.
- Level 2, Intervention (P(y | do(x))) — what happens when you act.
- Level 3, Counterfactuals (P(y_x | x′, y′)) — what would have happened.
Sup AI operated at Level 1. Your next company should operate at Levels 2 and 3 — a fundamentally more durable surface that doesn't require provider cooperation.
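To make the gap concrete, here is a minimal, self-contained sketch (variable names and coefficients are invented for illustration): a confounder Z drives both the action X and the outcome Y, so the Level 1 quantity P(y | x) overstates the Level 2 quantity P(y | do(x)), which the simulation recovers by overriding X directly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Toy structural causal model: Z -> X, Z -> Y, X -> Y
z = rng.binomial(1, 0.5, n)                    # confounder
x = rng.binomial(1, 0.2 + 0.6 * z)             # action, influenced by Z
y = rng.binomial(1, 0.1 + 0.3 * x + 0.4 * z)   # outcome, driven by X and Z

# Level 1 (association): P(y|x=1) - P(y|x=0), inflated by the confounder
assoc = y[x == 1].mean() - y[x == 0].mean()    # ~0.54

# Level 2 (intervention): override X directly, severing the Z -> X edge
y_do1 = rng.binomial(1, 0.1 + 0.3 * 1 + 0.4 * z)
y_do0 = rng.binomial(1, 0.1 + 0.3 * 0 + 0.4 * z)
causal = y_do1.mean() - y_do0.mean()           # ~0.30, the true effect

print(f"P(y|x=1) - P(y|x=0)          = {assoc:.3f}")
print(f"P(y|do(x=1)) - P(y|do(x=0))  = {causal:.3f}")
```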
2.4 Your Unfair Advantage
This is what no other founder in this space has: your father is one of a small number of researchers trained directly under Judea Pearl. Scott Mueller's site explicitly states his current research is the integration of LLMs with causal inference — "how causal reasoning can enhance LLM capabilities, and how LLMs might accelerate causal discovery" (scott.am). His work on the Probability of Necessity and Sufficiency (PNS; ijcai.org) directly bounds individual treatment effects — exactly the math required for high-stakes individualized decisions.
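For context, PNS is the probability that the outcome occurs if and only if the action is taken. In the binary case, the classical Tian–Pearl bounds sandwich it using experimental quantities alone, writing P(y_x) for P(Y = y | do(X = x)):

```latex
\max\{\, 0,\ P(y_x) - P(y_{x'}) \,\}
\;\le\; \mathrm{PNS} \;\le\;
\min\{\, P(y_x),\ P(y'_{x'}) \,\}
```

Mueller's line of work tightens these bounds by combining observational with experimental data, which is what turns population-level statistics into individual-level decision support.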
The combination of world-class causal credentials + practicing systems architect + son who builds benchmark-dominating AI is a pairing that no causal-AI startup currently has.
3. The Three Strategic Options (With Confidence Ratings)
The seven analyses split between two camps. Here's the synthesis with my confidence-weighted assessment:
Option A — Causal Observability / Counterfactual Debugging for AI Agents ⭐ Highest near-term odds
A "causal debugger" for production AI systems. When an agent fails, the platform tells teams which step caused the failure, runs counterfactual replays ("would this have failed under a different model/prompt/retrieval policy?"), and generates regression tests.
Why it's compelling:
- Datadog's 2026 report confirms operational complexity — not model intelligence — is now the primary barrier; multi-model usage and agent-framework adoption nearly doubled YoY (datadoghq.com).
- IBM Instana's "probable root cause" feature already validates that causal RCA works in observability (ibm.com) — but only for infrastructure, not agent trajectories.
- OpenTelemetry GenAI semantic conventions are emerging as the substrate (opentelemetry.io), but no causal reasoning layer sits on top.
- Reuses every piece of Sup's stack: orchestration, model comparison, evaluation, retrieval.
- Developer-led sales motion (faster cycles than enterprise legal/healthcare).
- Frame: "Datadog tells you the agent failed. LangSmith shows the trace. We tell you what caused the failure and which intervention fixes it."
Risk: Observability incumbents (Datadog, Arize, LangSmith [langchain.com], Phoenix [arize.com]) could add causal layers. Mitigation: depth via Pearl-grade methodology + open-source PyWhy ecosystem integration (pywhy.org).
Option B — Causal Reliability Platform for Regulated Verticals (Legal-First) ⭐ Highest leverage if you stay in vertical
API-agnostic causal verification middleware: extracts causal claims from LLM outputs, validates them against structural causal models, runs counterfactual stress tests, returns reliability scores + audit trails mapped to EU AI Act / NIST articles.
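One plausible output contract for such middleware, sketched under assumptions (all names are invented, and the audit references shown are only examples of the mapping):

```python
from dataclasses import dataclass

# Hypothetical report contract -- every field name here is illustrative.

@dataclass
class CausalClaim:
    text: str         # causal claim extracted from the LLM output
    cause: str        # asserted cause variable
    effect: str       # asserted effect variable
    consistent: bool  # does the claim hold in the reference structural causal model?

@dataclass
class ReliabilityReport:
    claims: list[CausalClaim]
    counterfactual_tests_passed: int
    counterfactual_tests_total: int
    reliability_score: float   # 0..1 aggregate for the whole output
    audit_refs: list[str]      # e.g. ["EU AI Act Art. 13", "NIST AI RMF GOVERN 1.2"]
```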
Why it's compelling:
- Direct continuation of your existing legal relationships (Kronenberger Rosenfeld, Manning & Kass). Legal AI is exploding ($4.59B → $12.49B, 2025–2030, per giiresearch.com) but tools remain prompt-layer correlational.
- Causation is foundational in tort law (counterfactual "but-for" tests), liability, and damages — your father's PNS work maps directly onto liability assessment.
- Higher pricing power ($50K–$250K ACV vs. dev tools).
- Regulatory tailwinds favor causal explainability over SHAP/LIME.
Risk: Crowded with well-funded incumbents (Harvey at ~$11B valuation [harvey.ai], Thomson Reuters CoCounsel [legal.thomsonreuters.com], vLex Vincent [us.vlex.com], Lexis+ AI). You compete on technical depth, not breadth.
Option C — Causal Healthcare/Clinical Trial Platform ⭐ Highest mission impact, slowest to revenue
Operationalize Scott's PNS framework for clinical trial optimization, individual treatment effect estimation, or counterfactual decision support.
Why it's compelling:
- Healthcare leads causal AI adoption (~37% of market, fastest CAGR).
- Your father's research literally addresses "harm" in personalized medicine and individual responses.
- Aitia Bio, Calibre, Allos, Causal Foundry (causalfoundry.ai) are validating the space but none combine your specific advantages.
Risk: FDA's CDS guidance treats many decision-support tools as regulated medical-device software depending on intended use (fda.gov). Sales cycles are 12–24 months. Validation costs $300K–$500K per algorithm. A hard wedge for your first post-Sup startup.
My Recommendation: A → B sequence
Start with causal observability/debugging for AI agents as a 12–18 month wedge. Build the causal reasoning engine, prove counterfactual replay works at production scale, accumulate domain-specific causal graphs, and generate revenue from a developer-led motion. Then expand into Option B (legal verticalization) once the engine is hardened and you have capital.
This sequence:
- Reuses Sup's stack maximally.
- Avoids the worst trap of new founders: 18-month enterprise sales cycles before product-market fit.
- Lets you publish benchmarks (replicating your HLE playbook) to establish category leadership.
- Preserves optionality to pivot vertical if developer SaaS proves cyclical.
- Keeps your father in an advisor/Chief Scientist role rather than requiring his full-time involvement immediately.
The contrarian case for jumping straight to B: your existing legal relationships are perishable assets. If Kronenberger Rosenfeld and Manning & Kass would pay for pilots now, that's a stronger signal than any market report. Validate this in week one.
4. On Software Development Becoming "Less Useful"
This concern deserves a direct answer because it's affecting your strategic framing — and the data says you're partially right and importantly wrong.
Where you're right: AI is compressing the value of mechanical implementation. Coding tools captured $7.3B of enterprise spend in 2025 (globenewswire.com). GitHub Copilot showed 55.8% task speedups in controlled experiments (microsoft.com), and field studies show ~14% gains (nber.org).
Where you're wrong: software development is being elevated, not commoditized. Sonar's 2026 developer survey found AI accounts for a large share of committed code, but 96% of developers don't fully trust AI-generated code, and only 48% always verify it (sonarsource.com). The bottleneck shifted to verification, architecture, and systems thinking — the exact skills you demonstrated by building Sup. NBER's 6,000-firm study reports >80% of AI-using firms see no productivity impact yet (nber.org), confirming the measurement/integration gap is the constraint, not raw code generation.
The U.S. BLS still projects 15% growth for software developer roles 2024–2034 (bls.gov) — well above average. What's depreciating is one-line-at-a-time coding. What's appreciating is your specific stack: distributed systems for AI orchestration, evaluation pipelines, causal graph computation, constraint solving, scalable inference architecture. AI agents currently struggle most with exactly these — numerical stability, multi-system integration, and domain constraint encoding.
The answer to your fear is the same as the answer to your business question: build at higher levels of abstraction. Causal reasoning systems require deep engineering rigor that LLMs can't yet replicate. You're not running away from your software skill — you're routing it to where it compounds longest.
5. 90-Day Action Plan
Days 1–14: Validation & Decision Lock
- Sunset Sup AI gracefully. Email users honestly: "We're winding down the general assistant to focus on the harder problem we encountered — making AI systems reliable and debuggable in production. If you're building agents or AI workflows, we'd love your help." This converts users into design partners.
- 30 customer discovery calls split: 15 with AI engineering teams (developer-led wedge), 10 with legal contacts including Kronenberger Rosenfeld and Manning & Kass, 5 with causal AI researchers/practitioners.
- Critical question for legal: "Would you sign a paid pilot ($15K–$30K) for an AI tool that generates EU AI Act–compliant audit trails for your AI workflows?" If 3+ say yes with budget authority, accelerate Option B. If not, default to A.
- Have the conversation with Scott. Define the relationship: Chief Scientist? Advisory + IP? Collaboration on a shared lab? His Toyota Research role is compatible with advisory work; full-time co-founding requires a bigger conversation. Lock this before fundraising.
Days 15–45: Technical Spike + Positioning
- Build a working prototype of causal counterfactual replay: ingest agent traces (OpenTelemetry GenAI format), construct a causal graph over the execution, run counterfactual replays varying model/prompt/retrieval, and output the critical failure step plus a recommended fix (a minimal skeleton follows this list).
- Publish a benchmark. Compare causal-consistency reliability vs. logprob ensembles vs. native model self-evaluation on a public eval set. This is your HLE playbook applied to the new category. Submit to arXiv or a workshop.
- Write the manifesto post: "Beyond Logprobs: Why the Reliability Layer for AI is Causal, Not Probabilistic." Frames the category and positions you as the architect.
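A minimal sketch of the ingestion half of that prototype, under stated assumptions: spans carry OpenTelemetry's span/parent identifiers, the single gen_ai.* attribute follows the OTel GenAI semantic conventions (names should be verified against opentelemetry.io), and the execution graph stands in for the fuller causal model:

```python
from dataclasses import dataclass

# Skeleton of the spike's ingestion half. Span fields mirror OpenTelemetry's
# span/parent structure; the gen_ai.* attribute follows the OTel GenAI
# semantic conventions (verify current names against opentelemetry.io).

@dataclass
class Span:
    span_id: str
    parent_id: str | None
    name: str
    attributes: dict
    error: bool = False

def execution_graph(spans: list[Span]) -> dict[str, list[str]]:
    """Parent -> children edges. The execution structure is the first cut of
    the causal graph; data dependencies refine it later."""
    children: dict[str, list[str]] = {}
    for s in spans:
        if s.parent_id is not None:
            children.setdefault(s.parent_id, []).append(s.span_id)
    return children

def candidate_root_causes(spans: list[Span]) -> list[Span]:
    """Failing spans with no failing children: the deepest failures, and the
    first candidates for counterfactual replay."""
    by_id = {s.span_id: s for s in spans}
    graph = execution_graph(spans)
    return [
        s for s in spans if s.error
        and not any(by_id[c].error for c in graph.get(s.span_id, []))
    ]

spans = [
    Span("a", None, "agent.run", {}, error=True),
    Span("b", "a", "retrieve", {}),
    Span("c", "a", "llm.call", {"gen_ai.request.model": "some-model"}, error=True),
]
for s in candidate_root_causes(spans):
    print(s.name)  # -> llm.call: replay this step first, varying the model
```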
Days 46–90: Pilots + Seed Round
- Convert 3–5 design partners into paid pilots ($15K–$30K each). If nobody pays, you don't have an urgent enough problem — pivot or kill.
- Seed round, $2–4M at $12–18M pre-money. Pitch: causal lineage (Pearl → Mueller), engineering credibility (52.15% HLE, prnewswire.com), structural insight on logprob deprecation, design-partner traction. Target deep-tech AI infrastructure funds, not generalist SaaS funds.
- Hire 2 critical roles: Senior causal ML engineer (PyWhy/DoWhy/EconML background), Enterprise GTM lead (legal or AI-platform sales).
Kill Criteria (Be Honest)
After 90 days, kill or pivot if:
- Fewer than 3 teams will connect production traces.
- Users say existing observability/eval tools are sufficient.
- Causal claims feel too academic to translate into buyer ROI.
- Nobody pays before a full platform exists.
If A fails, fall back to B (legal-first Causal Reliability Platform). If B is also slow, the third option is causal measurement for marketing/growth teams — Google Meridian and Meta Robyn (iab.com) confirm the demand, though competition is fierce.
6. Risks & Honest Caveats
Each model raised legitimate concerns. Synthesizing the most important:
Documented assumptions:
- Frontier labs continue restricting model internals (very high confidence based on safety research direction)
- Regulatory pressure for explainability accelerates (high confidence given EU AI Act timeline)
- Causal-LLM hybrid systems prove production-viable (medium confidence; this is the technical bet)
- Your existing legal relationships convert to pilots (unknown — must validate week one)
Final Word
The strategic logic is unusually clean: Sup AI proved you can build benchmark-dominating systems. The platform shift made that specific architecture unsustainable. But the underlying mission — reliable AI for high-stakes decisions — has only become more valuable, more regulated, and more in need of someone who can build at the systems level. Your father's expertise gives you a credentialed wedge into the most defensible reliability paradigm: causal reasoning. Your existing software skill compounds in that environment rather than depreciates.
You don't need a new mission. You need a new technical surface for the same one. Causal infrastructure is that surface.
The most consistent finding across all seven analyses: don't abandon AI, don't abandon reliability, don't abandon your father's intellectual lineage. Combine them and ship the version of Sup that doesn't depend on what the labs choose to expose.
Start narrow (causal observability for agents). Earn the right to expand (legal, then healthcare). And — when it stops feeling like grief — announce the new company.