Strategic Synthesis: Your Next AI Venture
Ken — first, the human note: shutting down something technically excellent that's growing isn't failure. It's the discipline most founders never develop. Seven independent analyses converged on the same conclusion, so I'll spare you the suspense:
The pivot is not away from your mission. It's one level deeper.
1. Why Sup AI Hit a Structural Wall (Not a Personal One)
Your read of the market is correct, and the data confirms it. Sup's edge depended on access to model internals — logprobs, token distributions, thinking traces. That surface is being permanently restricted across providers:
- OpenAI deprecated logprobs for GPT-5 models in mid-2025 (community.openai.com), and by GPT-5.2 logprobs support appears effectively removed (community.openai.com). Reasoning models explicitly do not expose raw reasoning tokens, only summaries (developers.openai.com).
- Anthropic's OpenAI SDK compatibility docs state that `logprobs` is ignored (platform.claude.com). Claude 4 thinking is often summarized, omitted, or encrypted; redacted thinking blocks may contain no human-readable summary (platform.claude.com).
- Google Gemini uses encrypted "thought signatures" — opaque blobs you pass back rather than interpret.
This is not a temporary API regression. It's a deliberate, industry-wide architectural shift driven by three forces:
- Safety/alignment — OpenAI explicitly argues that chain-of-thought monitorability is a primary defense against reward hacking, and that strong supervision pressure could lead models to hide intent (frontiermodelforum.org).
- IP protection — token probabilities and reasoning traces leak training dynamics and architectural decisions (arxiv.org).
- Cost/latency economics — exposing full distributions increases bandwidth and complicates billing.
Implication: any business whose moat depends on persistent low-cost access to model internals faces existential platform risk. Your instinct to wind down is rational. But the underlying mission — reliable AI for high-stakes decisions — has only become more valuable. Your benchmark result of 52.15% on Humanity's Last Exam (prnewswire.com) remains a credibility asset for whatever you build next.
2. The Convergence: Why Causal AI is Your Natural Next Surface
Six of seven models independently identified the same opportunity space, and the case is stronger than any of them alone. Three forces are converging:
2.1 Market Tailwind
The Causal AI market shows aggressive growth across every credible report (with significant taxonomy variance — narrow definitions yield ~$89M–$300M today, broader decision-intelligence definitions yield $20B+).
The growth direction is unambiguous. Recent funding rounds confirm investor conviction:
- Alembic: $145M Series B (Nov 2025), with NVIDIA as founding customer (businesswire.com)
- Causal Labs: $6M seed (businesswire.com)
- Calibre: $3.3M pre-seed for causal health navigation (tech.eu)
- Enterprise AI investment hit $37B in 2025, tripling in one year (globenewswire.com)
2.2 Regulatory Tailwind
The EU AI Act is phasing in: prohibitions live since Feb 2, 2025; GPAI obligations from Aug 2, 2025; broad applicability Aug 2, 2026 (ai-act-service-desk.ec.europa.eu), with penalties up to 7% of global turnover (digital-strategy.ec.europa.eu). NIST has published a Generative AI Profile for risk management (nist.gov). Causal models satisfy these mandates substantively because they produce causal audit trails, not post-hoc correlational explanations like SHAP/LIME.
2.3 Why Causal Reasoning Specifically
Pearl's causal hierarchy maps directly onto what LLMs lack:
- Level 1, Association (P(y | x)) — what LLMs do natively. This is where logprobs lived.
- Level 2, Intervention (P(y | do(x))) — what happens when you act.
- Level 3, Counterfactuals (P(y_x | x′, y′)) — what would have happened.
Sup AI operated at Level 1. Your next company should operate at Levels 2 and 3 — a fundamentally more durable surface that doesn't require provider cooperation.
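To make the gap concrete, here is a minimal, self-contained sketch (variable names and coefficients are invented for illustration): a confounder Z drives both the action X and the outcome Y, so the Level 1 quantity P(y | x) overstates the Level 2 quantity P(y | do(x)), which the simulation recovers by overriding X directly.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Toy structural causal model: Z -> X, Z -> Y, X -> Y
z = rng.binomial(1, 0.5, n)                    # confounder
x = rng.binomial(1, 0.2 + 0.6 * z)             # action, influenced by Z
y = rng.binomial(1, 0.1 + 0.3 * x + 0.4 * z)   # outcome, driven by X and Z

# Level 1 (association): P(y|x=1) - P(y|x=0), inflated by the confounder
assoc = y[x == 1].mean() - y[x == 0].mean()    # ~0.54

# Level 2 (intervention): override X directly, severing the Z -> X edge
y_do1 = rng.binomial(1, 0.1 + 0.3 * 1 + 0.4 * z)
y_do0 = rng.binomial(1, 0.1 + 0.3 * 0 + 0.4 * z)
causal = y_do1.mean() - y_do0.mean()           # ~0.30, the true effect

print(f"P(y|x=1) - P(y|x=0)          = {assoc:.3f}")
print(f"P(y|do(x=1)) - P(y|do(x=0))  = {causal:.3f}")
```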
2.4 Your Unfair Advantage
This is what no other founder in this space has: your father is one of a small number of researchers trained directly under Judea Pearl. Scott Mueller's site explicitly states his current research is the integration of LLMs with causal inference — "how causal reasoning can enhance LLM capabilities, and how LLMs might accelerate causal discovery" (scott.am). His work on the Probability of Necessity and Sufficiency (PNS; ijcai.org) directly bounds individual treatment effects — exactly the math required for high-stakes individualized decisions.
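For context, PNS is the probability that the outcome occurs if and only if the action is taken. In the binary case, the classical Tian–Pearl bounds sandwich it using experimental quantities alone, writing P(y_x) for P(Y = y | do(X = x)):

```latex
\max\{\, 0,\ P(y_x) - P(y_{x'}) \,\}
\;\le\; \mathrm{PNS} \;\le\;
\min\{\, P(y_x),\ P(y'_{x'}) \,\}
```

Mueller's line of work tightens these bounds by combining observational with experimental data, which is what turns population-level statistics into individual-level decision support.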
The combination of world-class causal credentials + practicing systems architect + son who builds benchmark-dominating AI is a pairing that no causal-AI startup currently has.
3. The Three Strategic Options (With Confidence Ratings)
The seven analyses split between two camps. Here's the synthesis with my confidence-weighted assessment:
Option A — Causal Observability / Counterfactual Debugging for AI Agents ⭐ Highest near-term odds
A "causal debugger" for production AI systems. When an agent fails, the platform tells teams which step caused the failure, runs counterfactual replays ("would this have failed under a different model/prompt/retrieval policy?"), and generates regression tests.
Why it's compelling:
- Datadog's 2026 report confirms operational complexity — not model intelligence — is now the primary barrier; multi-model usage and agent-framework adoption nearly doubled YoY (datadoghq.com).
- IBM Instana's "probable root cause" feature already validates that causal RCA works in observability (ibm.com) — but only for infrastructure, not agent trajectories.
- OpenTelemetry GenAI semantic conventions are emerging as the substrate (opentelemetry.io), but no causal reasoning layer sits on top.
- Reuses every piece of Sup's stack: orchestration, model comparison, evaluation, retrieval.
- Developer-led sales motion (faster cycles than enterprise legal/healthcare).
- Frame: "Datadog tells you the agent failed. LangSmith shows the trace. We tell you what caused the failure and which intervention fixes it."
Risk: Observability incumbents (Datadog, Arize, LangSmith [langchain.com], Phoenix [arize.com]) could add causal layers. Mitigation: depth via Pearl-grade methodology + open-source PyWhy ecosystem integration (pywhy.org).
Option B — Causal Reliability Platform for Regulated Verticals (Legal-First) ⭐ Highest leverage if you stay in vertical
API-agnostic causal verification middleware: extracts causal claims from LLM outputs, validates them against structural causal models, runs counterfactual stress tests, returns reliability scores + audit trails mapped to EU AI Act / NIST articles.
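One plausible output contract for such middleware, sketched under assumptions (all names are invented, and the audit references shown are only examples of the mapping):

```python
from dataclasses import dataclass

# Hypothetical report contract -- every field name here is illustrative.

@dataclass
class CausalClaim:
    text: str         # causal claim extracted from the LLM output
    cause: str        # asserted cause variable
    effect: str       # asserted effect variable
    consistent: bool  # does the claim hold in the reference structural causal model?

@dataclass
class ReliabilityReport:
    claims: list[CausalClaim]
    counterfactual_tests_passed: int
    counterfactual_tests_total: int
    reliability_score: float   # 0..1 aggregate for the whole output
    audit_refs: list[str]      # e.g. ["EU AI Act Art. 13", "NIST AI RMF GOVERN 1.2"]
```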
Why it's compelling:
- Direct continuation of your existing legal relationships (Kronenberger Rosenfeld, Manning & Kass). Legal AI is exploding ($4.59B → $12.49B, 2025–2030, per giiresearch.com) but tools remain prompt-layer correlational.
- Causation is foundational in tort law (counterfactual "but-for" tests), liability, and damages — your father's PNS work maps directly onto liability assessment.
- Higher pricing power ($50K–$250K ACV vs. dev tools).
- Regulatory tailwinds favor causal explainability over SHAP/LIME.
Risk: Crowded with well-funded incumbents (Harvey at ~$11B valuation [harvey.ai], Thomson Reuters CoCounsel [legal.thomsonreuters.com], vLex Vincent [us.vlex.com], Lexis+ AI). You compete on technical depth, not breadth.
Option C — Causal Healthcare/Clinical Trial Platform ⭐ Highest mission impact, slowest to revenue
Operationalize Scott's PNS framework for clinical trial optimization, individual treatment effect estimation, or counterfactual decision support.
Why it's compelling:
- Healthcare leads causal AI adoption (~37% of market, fastest CAGR).
- Your father's research literally addresses "harm" in personalized medicine and individual responses.
- Aitia Bio, Calibre, Allos, Causal Foundry (causalfoundry.ai) are validating the space but none combine your specific advantages.
Risk: FDA's CDS guidance treats many decision-support tools as regulated medical-device software depending on intended use (fda.gov). Sales cycles are 12–24 months. Validation costs $300K–$500K per algorithm. A hard wedge for your first post-Sup startup.
My Recommendation: A → B sequence
Start with causal observability/debugging for AI agents as a 12–18 month wedge. Build the causal reasoning engine, prove counterfactual replay works at production scale, accumulate domain-specific causal graphs, and generate revenue from a developer-led motion. Then expand into Option B (legal verticalization) once the engine is hardened and you have capital.
This sequence:
- Reuses Sup's stack maximally.
- Avoids the worst trap of new founders: 18-month enterprise sales cycles before product-market fit.
- Lets you publish benchmarks (replicating your HLE playbook) to establish category leadership.
- Preserves optionality to pivot vertical if developer SaaS proves cyclical.
- Keeps your father in an advisor/Chief Scientist role rather than requiring his full-time involvement immediately.
The contrarian case for jumping straight to B: your existing legal relationships are perishable assets. If Kronenberger Rosenfeld and Manning & Kass would pay for pilots now, that's a stronger signal than any market report. Validate this in week one.
4. On Software Development Becoming "Less Useful"
This concern deserves a direct answer because it's affecting your strategic framing — and the data says you're partially right and importantly wrong.
Where you're right: AI is compressing the value of mechanical implementation. Coding tools captured $7.3B of enterprise spend in 2025 (globenewswire.com). GitHub Copilot showed 55.8% task speedups in controlled experiments (microsoft.com), and field studies show ~14% gains (nber.org).
Where you're wrong: software development is being elevated, not commoditized. Sonar's 2026 developer survey found AI accounts for a large share of committed code, but 96% of developers don't fully trust AI-generated code, and only 48% always verify it (sonarsource.com). The bottleneck shifted to verification, architecture, and systems thinking — the exact skills you demonstrated by building Sup. NBER's 6,000-firm study reports >80% of AI-using firms see no productivity impact yet (nber.org), confirming the measurement/integration gap is the constraint, not raw code generation.
The U.S. BLS still projects 15% growth for software developer roles 2024–2034 (bls.gov) — well above average. What's depreciating is one-line-at-a-time coding. What's appreciating is your specific stack: distributed systems for AI orchestration, evaluation pipelines, causal graph computation, constraint solving, scalable inference architecture. AI agents currently struggle most with exactly these — numerical stability, multi-system integration, and domain constraint encoding.
The answer to your fear is the same as the answer to your business question: build at higher levels of abstraction. Causal reasoning systems require deep engineering rigor that LLMs can't yet replicate. You're not running away from your software skill — you're routing it to where it compounds longest.
5. 90-Day Action Plan
Days 1–14: Validation & Decision Lock
- Sunset Sup AI gracefully. Email users honestly: "We're winding down the general assistant to focus on the harder problem we encountered — making AI systems reliable and debuggable in production. If you're building agents or AI workflows, we'd love your help." This converts users into design partners.
- 30 customer discovery calls split: 15 with AI engineering teams (developer-led wedge), 10 with legal contacts including Kronenberger Rosenfeld and Manning & Kass, 5 with causal AI researchers/practitioners.
- Critical question for legal: "Would you sign a paid pilot ($15K–$30K) for an AI tool that generates EU AI Act–compliant audit trails for your AI workflows?" If 3+ say yes with budget authority, accelerate Option B. If not, default to A.
- Have the conversation with Scott. Define the relationship: Chief Scientist? Advisory + IP? Collaboration on a shared lab? His Toyota Research role is compatible with advisory work; full-time co-founding requires a bigger conversation. Lock this before fundraising.
Days 15–45: Technical Spike + Positioning
- Build a working prototype of causal counterfactual replay: ingest agent traces (OpenTelemetry GenAI format), construct a causal graph over the execution, run counterfactual replays varying model/prompt/retrieval, and output the critical failure step plus a recommended fix (a minimal skeleton follows this list).
- Publish a benchmark. Compare causal-consistency reliability vs. logprob ensembles vs. native model self-evaluation on a public eval set. This is your HLE playbook applied to the new category. Submit to arXiv or a workshop.
- Write the manifesto post: "Beyond Logprobs: Why the Reliability Layer for AI is Causal, Not Probabilistic." Frames the category and positions you as the architect.
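A minimal sketch of the ingestion half of that prototype, under stated assumptions: spans carry OpenTelemetry's span/parent identifiers, the single gen_ai.* attribute follows the OTel GenAI semantic conventions (names should be verified against opentelemetry.io), and the execution graph stands in for the fuller causal model:

```python
from dataclasses import dataclass

# Skeleton of the spike's ingestion half. Span fields mirror OpenTelemetry's
# span/parent structure; the gen_ai.* attribute follows the OTel GenAI
# semantic conventions (verify current names against opentelemetry.io).

@dataclass
class Span:
    span_id: str
    parent_id: str | None
    name: str
    attributes: dict
    error: bool = False

def execution_graph(spans: list[Span]) -> dict[str, list[str]]:
    """Parent -> children edges. The execution structure is the first cut of
    the causal graph; data dependencies refine it later."""
    children: dict[str, list[str]] = {}
    for s in spans:
        if s.parent_id is not None:
            children.setdefault(s.parent_id, []).append(s.span_id)
    return children

def candidate_root_causes(spans: list[Span]) -> list[Span]:
    """Failing spans with no failing children: the deepest failures, and the
    first candidates for counterfactual replay."""
    by_id = {s.span_id: s for s in spans}
    graph = execution_graph(spans)
    return [
        s for s in spans if s.error
        and not any(by_id[c].error for c in graph.get(s.span_id, []))
    ]

spans = [
    Span("a", None, "agent.run", {}, error=True),
    Span("b", "a", "retrieve", {}),
    Span("c", "a", "llm.call", {"gen_ai.request.model": "some-model"}, error=True),
]
for s in candidate_root_causes(spans):
    print(s.name)  # -> llm.call: replay this step first, varying the model
```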
Days 46–90: Pilots + Seed Round
- Convert 3–5 design partners into paid pilots ($15K–$30K each). If nobody pays, you don't have an urgent enough problem — pivot or kill.
- Seed round, $2–4M at $12–18M pre-money. Pitch: causal lineage (Pearl → Mueller), engineering credibility (52.15% HLE, prnewswire.com), structural insight on logprob deprecation, design-partner traction. Target deep-tech AI infrastructure funds, not generalist SaaS funds.
- Hire 2 critical roles: Senior causal ML engineer (PyWhy/DoWhy/EconML background), Enterprise GTM lead (legal or AI-platform sales).
Kill Criteria (Be Honest)
After 90 days, kill or pivot if:
- Fewer than 3 teams will connect production traces.
- Users say existing observability/eval tools are sufficient.
- Causal claims feel too academic to translate into buyer ROI.
- Nobody pays before a full platform exists.
If A fails, fall back to B (legal-first Causal Reliability Platform). If B is also slow, the third option is causal measurement for marketing/growth teams — Google Meridian and Meta Robyn (iab.com) confirm the demand, though competition is fierce.
6. Risks & Honest Caveats
Each model raised legitimate concerns. Synthesizing the most important:
Documented assumptions:
- Frontier labs continue restricting model internals (very high confidence based on safety research direction)
- Regulatory pressure for explainability accelerates (high confidence given EU AI Act timeline)
- Causal-LLM hybrid systems prove production-viable (medium confidence; this is the technical bet)
- Your existing legal relationships convert to pilots (unknown — must validate week one)
Final Word
The strategic logic is unusually clean: Sup AI proved you can build benchmark-dominating systems. The platform shift made that specific architecture unsustainable. But the underlying mission — reliable AI for high-stakes decisions — has only become more valuable, more regulated, and more in need of someone who can build at the systems level. Your father's expertise gives you a credentialed wedge into the most defensible reliability paradigm: causal reasoning. Your existing software skill compounds in that environment rather than depreciates.
You don't need a new mission. You need a new technical surface for the same one. Causal infrastructure is that surface.
The most consistent finding across all seven analyses: don't abandon AI, don't abandon reliability, don't abandon your father's intellectual lineage. Combine them and ship the version of Sup that doesn't depend on what the labs choose to expose.
Start narrow (causal observability for agents). Earn the right to expand (legal, then healthcare). And — when it stops feeling like grief — announce the new company.