A letter from Ken.

Sup AI as a company is winding down. Sup AI as a mission is not. I want to walk you through what changed, what I got right, and what I'm going to spend the next several years actually doing.

Sup AI was built around a thesis: that orchestrating an ensemble of frontier models, confidence-scoring their tokens with logprobs, and synthesizing the result, would be more accurate than any single model in the ensemble.

That thesis held up, and it was absolutely incredible to watch. On almost no budget, I got consistently better answers and behavior out of the ensemble than out of any single model inside it. Sometimes I could drop the frontier models entirely, run the whole thing on cheap open-weight models, and still beat any single frontier model running on its own. The dominance in vibes and benchmarks came from three strategies: selecting models whose errors didn't correlate, querying each one optimally, and fusing their outputs segment by segment based on where each was most confident. That part still feels like magic to me. It is, I think, an underrated result, and I hope someone keeps pushing on it.
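For readers who want something concrete, here is a minimal sketch of the fusion step described above. It is an illustration under stated assumptions, not the production Sup AI code: it assumes each model's answer has already been split into aligned segments with per-token logprobs, and it scores a segment by its geometric-mean token probability. The names `Segment` and `fuse` are hypothetical.

```python
from dataclasses import dataclass
from math import exp


@dataclass
class Segment:
    """One aligned chunk of a model's answer (hypothetical structure)."""
    text: str
    token_logprobs: list[float]

    def confidence(self) -> float:
        # Geometric-mean token probability: exp(mean logprob).
        # An empty segment gets zero confidence so it is never preferred.
        if not self.token_logprobs:
            return 0.0
        return exp(sum(self.token_logprobs) / len(self.token_logprobs))


def fuse(candidates: dict[str, list[Segment]]) -> list[tuple[str, str]]:
    """For each segment position, keep the text from the most confident model.

    Assumes every model supplies the same number of aligned segments.
    Returns a list of (model_name, segment_text) pairs.
    """
    lengths = {len(segments) for segments in candidates.values()}
    if len(lengths) != 1:
        raise ValueError("all models must provide the same number of aligned segments")
    n_segments = lengths.pop()

    fused = []
    for i in range(n_segments):
        best = max(candidates, key=lambda name: candidates[name][i].confidence())
        fused.append((best, candidates[best][i].text))
    return fused


# Made-up inputs for illustration only.
answers = {
    "model_a": [Segment("Paris is the capital.", [-0.1, -0.2]), Segment("Founded in 1900.", [-2.0])],
    "model_b": [Segment("Paris, probably.", [-1.0]), Segment("Founded much earlier.", [-0.05, -0.1])],
}
result = fuse(answers)
```

In this toy input the first segment comes from `model_a` and the second from `model_b`, because each is more confident there. Aligning segments across models and calibrating logprobs between vendors are the hard parts, and this sketch leaves both out.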

I want to be clear about something before going further. Sup AI was growing. Usage was up, paid conversion was up, the product was getting better and more accurate every month, and the people using it were the kind of users you build a company to serve: researchers, lawyers, doctors, analysts, people who really care whether the answer is right and complete. This letter is not a postmortem; it is a redirection. I am closing this chapter because I think the next problem is more important than continuing to compound on this one.

What got harder was implementing the algorithms on top of these APIs, both the latest ideas out of ML research and my own. Many vendors started limiting the two signals the ensemble leaned on most: thinking traces and logprobs. For example, Anthropic and OpenAI both moved to responding with a summary of a model's reasoning instead of the reasoning itself, while still charging for the full hidden traces underneath, tokens our customers paid for and never got to see or utilize. I think the intent is that raw reasoning chains make distillation attacks easier, where a competitor trains on your model's reasoning. The side effect is that anyone trying to reason across multiple models loses a huge signal. Logprobs went the same way. OpenAI, for instance, stopped exposing them on its reasoning models, which are exactly the models an accuracy system most wants to interrogate. I found workarounds for most of this, and I'm proud of how clever some of them were, but the underlying truth is unavoidable. The frontier labs are making it structurally harder to build accuracy systems on top of their APIs from the outside, even when you are paying retail.

Then there's the economics. The frontier labs run high margins on inference. SemiAnalysis puts Anthropic's above 70 percent, up from 38 percent a year earlier, and OpenAI's is reportedly in the same range. Several of them use that API revenue to subsidize their own consumer subscriptions like ChatGPT Plus, Claude Pro, and Gemini Advanced. Anyone paying retail on the input side cannot match the per-message price of a product whose model maker is also its sponsor. No amount of engineering on my side changes who controls the input price.

But the deeper problem is on the demand side. Consumer willingness to pay for accuracy is bimodal. A small minority of users would happily pay more for an answer they can trust: doctors checking dosing, lawyers writing briefs with real citations, researchers checking facts, the people Sup AI was actually built for. The vast majority will not, in the sense that "twenty dollars and good enough" beats "two hundred dollars and right" almost every time, even when the stakes look like they should matter. There is a small total addressable market for premium-priced, accuracy-first consumer AI. It is not zero, but it is too thin, and too hard to reach through a repeatable sales channel, to build a company on at consumer price points. That is the reason accuracy-focused consumer startups eventually end up pivoting to B2B.

B2B is a legitimate path. I had design partners, signed NDAs with companies that wanted to deploy Sup AI internally, and there was real opportunity. I could go build that company. But it's a different business where I would take different agentic approaches to AI. Every time I asked myself the difficult question, "is this actually what I want to spend the next X years on," the answer was always no. Selling enterprise accuracy software is a good business, but it is not the reason I started this, so I stopped pretending it was.

What I'm doing next

I'm a sophomore at Stanford. I'm starting a research lab focused on a single problem: causal reasoning in language models.

Causal reasoning is the ability to answer questions about cause, effect, intervention, and counterfactuals. Not just "what tends to happen when X is true," but "what would happen if I changed X," and "what would have happened if X had been different." Correlations are everywhere in the training data. LLMs are excellent at incorporating these into their weights. We're told that correlation is not causation. Yet, the behavior and decisions of LLMs are based on correlations when they should be based on causation. We accept so much of the inevitable poor behavior and decisions from these LLMs because they sound so convincing.

Today's models look like they reason causally because they reason fluently. When a frontier model walks you through why a stock dropped, why a patient got sicker, why a policy backfired, it is mostly retrieving plausible-sounding chains of explanation from text where humans had already explained similar things. That is an extraordinary capability and it covers an enormous fraction of useful work. It also breaks in characteristic ways. Ask a frontier model a counterfactual ("would the patient have died if we had given drug A instead of the drug B we actually gave"), or to reason about a rare intervention with no analog in its training data, and the failure is consistently confident, fluent, and wrong. That confident-and-wrong mode is the thing I care about most. It is also, coincidentally, the thing logprob confidence scoring at Sup AI was designed to expose.

The frontier labs treat causal reasoning as an emergent property. The implicit bet is that if you keep scaling parameters, data, and compute, true causal models fall out of training for free. I do not believe that. I think causality is structural. It requires representing variables, interventions, and counterfactuals as first-class objects inside the model, not as patterns over tokens that happen to look causal in training. I think this is the gap between current systems and AGI, and I think it is the reason every frontier model still hallucinates on the same kinds of questions, no matter how big it gets.

The first two steps of the plan:

1. Build benchmarks that show, rigorously, where current frontier models fail at causal reasoning. Counterfactuals, interventions, confounding, mediation. Make the failures undeniable and reproducible.
2. Take an existing open-weight model and give it real causal reasoning end to end, by training a representation of causal structure directly into it rather than hoping it emerges from next-token prediction.

Those are the first steps, not the whole roadmap. After that comes architecture work to make causal reasoning scale, alignment work to make it controllable, and applied work in the domains where causal reasoning is the bottleneck: medicine, law, science, public policy.

If I'm wrong about all of this, I want to be the person who proved it wrong rigorously. The thing I am unwilling to do is keep optimizing a local maximum because it pays. The causal frontier is the only place where the questions worth my next ten years live.

What this means for you

If you had an active subscription, it has already been paused. You will not be billed again.

Email support@sup.ai for either a refund of your remaining balance or a copy of your data. Balances rolled over month to month, so refunds cover everything you've accumulated.

If you just want to talk, about the new direction or the old one or anything else, write me directly at ken@sup.ai. I read everything that comes in, and I answer.

Thank you for using Sup AI. The product got better because of you, especially the people who told me when it broke. None of it was wasted. The next thing is built directly on what you taught me.

Ken Mueller

Founder, Sup AI