Jeffrey Lee Cooper

My Wife and I Pay the Same for AI ...but Should We?

Wed, 10 Jun 2026 00:00:00 GMT

My wife opens ChatGPT maybe twice a day. Usually it’s a simple search while we are driving around, some question one of us swears we’re right about, settled in thirty seconds instead of a stalemate. Then she closes the app until the next mystery rears its head.

Meanwhile I’ve got agents running while I sleep. I burn tokens in a way that would make eco-activists cry. These tools have become a constant engine running under the hood of both my professional and personal life.

The fun part? We pay roughly the same.

I have less than $200 a month in subscriptions - she has roughly $40 a month. Yet our usage differs by orders of magnitude. Have I invented free money somehow?!

Someone else’s balance sheet

The people who’ve done the math (there’s a Hacker News thread where practitioners work it out) land on the thesis that marginal token use is most likely carrying a positive margin once the model’s trained.

But how did we get to the point where the marginal cost was manageable? Training runs, R&D, the free users, the data centers, the whole apparatus that led to this model - that’s a big fucking check to write.

OpenAI reportedly lost around $9 billion in 2025 and, by its own financial documents, isn’t projected to be cash-flow-positive until somewhere around 2030 (Fortune, November 2025). And the balance sheet absorbing all this is enormous: combined hyperscaler capex across Alphabet, Amazon, Meta, Microsoft, and Oracle neared half a trillion dollars in 2025 (Epoch AI).

So the subsidy is real. It lives in the gap between what it costs to build and run all of this and what any of us pay to use it.

Many people are patiently doing this arithmetic. David Cahn’s “AI’s $600B Question” at Sequoia estimated the required payback, and Ed Zitron keeps hammering the sharper, angrier question “where’s the money?”. I’m standing in the room they’ve been measuring, holding my absurdly cheap miracle, and trying to milk it as much as I can before someone wakes up and charges me more.

The place where the flat price breaks

Sam Altman, back in January 2025: “we are currently losing money on openai pro subscriptions! people use it much more than we expected” (TechCrunch).

That was the $200-a-month Pro tier, not the $20 one, which is worth saying out loud, because the dollar amounts aren’t the point. The mechanism is. Flat pricing assumes you’ll be roughly average. Heavy use is exactly where a flat price stops making sense, and the company building it admits they didn’t see it coming (maybe if they had better AI…).

My wife is the dream customer. Twice a day, profitable on any plan, basically subsidizing the rest of us with her restraint.

I’m not the dream customer. Apparently, they didn’t price in my overnight fever-dream coding sessions.

Waiting for the meter

I’m loving abusing my subscriptions, but I can’t enjoy them cleanly.

I’m not concerned that everything will crash down, but I am concerned that my costs will go up …way up. I worry I’ve developed a habit I can’t afford.

I don’t know when the meter turns on, or what it’ll look like. Sam Altman has already painted a vision of AI being globally accessible and metered like electricity or running water.

But not yet… get it while it’s hot!

We Are All Just Predicting Tokens

Fri, 15 May 2026 00:00:00 GMT

Late last year, I was at dinner with an engineer I used to work with. I was trying to explain my complete amazement with the productivity gains (and possibilities) of using LLMs for software development. After my gleeful rant, he paused and replied that LLMs are “just predicting tokens.” Glorified autocomplete. Copy-paste on steroids. He said it the way people say things when they’ve decided the conversation is over.

It certainly sounded like a reasonable technical rebuttal …but it also quietly assumes a kind of “real reasoning” exists that is unique to humans and somehow mechanistically different (with no description of how it is different or what the mechanisms are).

“Just predicting tokens” is a category error, not an argument. The system we’re dismissing is doing something we can’t cleanly describe ourselves doing, and the dismissal survives because almost nobody pressure-tests the contrast.

First, I find myself in awe of the extent to which language serves as a compression algorithm for describing our reality and interactions within it. In my view, the ability to speak or write comes ‘pre-loaded’ with concepts like objects, interactions, and basic reasoning. Second, the “real human reasoning” that the “token prediction” dismissal gets at is something we’ve yet to define or locate on the human-side.

The dismissal everyone nods along to

The lineage goes back to Emily Bender and Timnit Gebru’s 2021 “Stochastic Parrots” paper, and it has hardened since. Bender and Alex Hanna’s 2025 book The AI Con argues it is “fundamentally confused” to use any human-like term (understanding, reasoning, belief) for what LLMs do (Bender & Hanna, The AI Con, 2025). That’s the purest version of the position. It isn’t an empirical claim. It’s a definitional one: whatever LLMs are doing, it categorically isn’t cognition, because cognition is (implicitly) the other thing.

I want to be fair to it. I also want to be honest about the social shape it’s taken on.

I often find that to the wise, career-weathered software developer or CTO, the dismissal is a status move. It signals you aren’t hype-pilled. It lets you stay in the “I’m not impressed” club while the rest of the room squirms about Claude one-shotting the user story that was originally going to be three engineers’ stand-up updates for the next two weeks. A 2024 LessWrong catalog called “Hunting Undead Stochastic Parrots” documents the frame persisting as vibe rather than argument, with people invoking the phrase without engaging the last two years of empirical work.

The phrase is doing social work; the technical work has moved on without it.

In order to truly defend against this stance, one must show (1) that predicting tokens at frontier scale produces reasoning-like internals, and (2) that the thing we keep gesturing at when we say “real reasoning” isn’t a thing humans cleanly do either.

Language is reasoning, compressed

The core intuition came from Ilya Sutskever before anyone else: if you predict the next token well enough, you have to model the reality that produced the token. Statistics, when done really well, essentially bleed into world-modeling. You can’t keep winning the prediction game on hard text without reconstructing the causal and inferential structure of the thought that wrote it (Sutskever on next-token prediction, 2023). Despite being three years old at this point, it’s still the cleanest articulation of the point.

You could call it an argument about semantics - the good old Chinese Room argument. However, as we dig into the complexities of these language models, we see shadows of understanding emerge.

Anthropic’s “Mapping the Mind of a Large Language Model” paper (May 2024) found features inside Claude 3 Sonnet for abstract concepts (inner conflict, catch-22, code bugs, deception), organized by conceptual similarity and causally shaping behavior when amplified. Turn the “deception” feature up, the model behaves more deceptively. These aren’t token-level artifacts. They’re features that look like concepts, sitting in the spots where concepts would have to live if the system were reasoning with them.

The 2025 follow-on pushed further. Anthropic’s April 2025 circuits update and “On the Biology of a Large Language Model” traces induction heads and multi-step circuits that implement abstract operations across layers: entity resolution, arithmetic carry chains, multi-hop lookups. If you’ve read this literature and still want to call the internal mechanism “just prediction,” you’re describing the training objective, not the reality of the machine once trained.

If you trained me to win a dunk contest - with a loss function designed to promote sick jams - I might grow a plethora of features that support that objective, but aren’t strictly limited to that application (I’m stronger, better cardio, worked on my depth perception, learned a bit of physics, etc). I believe I can fly…

Marcus’s knockout blow, taken seriously

The strongest live version of the skeptical case in 2025 belongs to Gary Marcus, not Bender. Bender hardens into definition, which is easy to dispatch. Marcus makes an empirical prediction, which is much harder. His June 2025 substack “A knockout blow for LLMs?” is the argument at full strength.

LLMs, Marcus writes, “can generalize within a training distribution of data they are exposed to, but their generalizations tend to break down outside that distribution.” The receipt: a seven-year-old can solve Tower of Hanoi. Claude, at the time of writing, scored under 80% at 7 discs and basically zero at 8. The internal process, on this read, “is not logical and intelligent.” It approximates patterns, and it fails when the problem shifts beyond what it saw during training.

If you take that seriously (and you should), the shape of the argument is quite reasonable, empirical, and falsifiable. It says: if the system really reasoned, there would be no distribution cliff.

François Chollet makes the technical sibling version of this argument. LLMs are “big interpolative memory”; scaling increases skill but not intelligence; without discrete program search on genuinely novel problems, what looks like reasoning is memorization of solver templates. The May 2025 ARC-AGI-2 paper is the benchmark built to operationalize the claim, and pure LLMs scored roughly 0% at launch. Not a rounding error. A cliff.

If my belief that “language encodes reasoning” was qualified, Tower of Hanoi at 8 discs is exactly where it should not fail. The cliff is real data. An honest defender has to contend with this.

The cliff is real. But seems to be moving…

The benchmark designed to falsify “LLMs can reason” is being climbed by LLM-based systems. ARC Prize’s December 2025 results and analysis put Gemini 3 Deep Think at 84.6% on the ARC-AGI-1 public leaderboard, with refinement-loop LLM systems hitting 54% on ARC-AGI-2, the specifically-harder benchmark Chollet built to rule out memorization. Fucking expensive and slow - but genuinely making real progress against these tests. The cliff that Chollet said was structural is being walked up by systems whose substrate is the thing he said couldn’t do it.

In Andrej Karpathy’s end-of-2025 post-mortem - “2025 LLM Year in Review” (December 19, 2025) - he coins “summoned ghosts” and “jagged intelligence,” a shape that’s neither human reasoning nor mere lookup, and argues RLVR produces genuine problem-decomposition strategies, not pattern memorization. In a way, he offers new vocabulary that helps sharpen the discussion beyond the blunt comparison to human thinking.

And humans are better right?

Now the uncomfortable part. Humans have distribution cliffs too.

Ask any adult to do long division on 9-digit numbers without paper. Ask them to run a 14-step modus tollens chain in their head. Ask a bright 35-year-old to solve Tower of Hanoi at 12 discs without tools (which is 4,095 moves).

Failure outside distribution isn’t evidence against cognition. It’s what cognition looks like when you push it past its harness or memory capabilities. What varies between systems is where the cliff sits and what tricks move it.

In my view, the Marcus argument proves a weaker thing than it claims. Current LLMs have specific cliffs in specific places. It doesn’t follow that the process “is not logical and intelligent.” It follows that the process has limits, which is true of every cognitive system we know (even you, sorry).

The standard no one has ever met

Suppose, steelman-to-steelman, that LLMs are exactly what Marcus says: interpolative engines that fail predictably at the training distribution’s edge. Here’s the question no one in this camp seems willing to ask out loud. What would “real reasoning” look like, such that humans would pass and LLMs would not, and where has that thing ever been observed?

The best current neuroscience doesn’t give you one. Predictive processing (the Karl Friston and Andy Clark framework) treats the brain as a hierarchical prediction machine, minimizing prediction error across levels from retina to frontal cortex. A 2024 meta-analysis of predictive-processing fMRI studies confirms it as current consensus, not fringe. If the leading account of the brain describes it as a prediction system, the intuition that “prediction” and “understanding” are obviously different categories is doing a lot of philosophical work on zero budget.

Geoffrey Hinton has been unusually blunt about this. In his 2024 and 2025 interviews (Mindplex, October 2024), he says GPT-4 “definitely understands,” and the old-fashioned AI claim that neural networks can’t reason without symbolic scaffolds was “just utterly wrong.” His symmetry point is the one I want to sit with: human understanding is the same kind of distributed-feature computation. Calling it something else in the biological case, and something less in the silicon case, is a move the evidence doesn’t license.

So here’s the category error, fully stated. “Just predicting tokens” only reads as a dismissal if you quietly import a contrast class of prediction-free, symbolic, grounded, non-statistical “real” reasoning. Neuroscience doesn’t have one. Introspection doesn’t have one. Functionalism has covered this ground for half a century. The standard does no work on LLMs because it does no work on us.

Which is why the frontier-lab practitioners who’ve actually built these systems have mostly stopped making the dismissive argument. Karpathy’s “summoned ghosts” isn’t a hype line. It’s a practitioner giving up on vocabulary that didn’t survive contact with the thing.

So back to my engineering coworker…

He’s not wrong that transformers predict the next token. He’s wrong that this tells us anything useful about whether they reason, and even less about whether they have meaningful utility.

Whelp, time for me to go stare at a blank wall …am I just predicting tokens?

The Three Pillars That Enable My Long-Running Agents

Wed, 15 Apr 2026 00:00:00 GMT

EDIT NOTE 6/10/26: This article is left untouched - but I’ll admit, much of my process has evolved since tools have evolved. I still like the fundamentals of this article - but you can be a little more sloppy now.

The thing that got me seriously utilizing long-running AI agents was Geoff Huntley’s “ralph loop” post (the deceptively simple idea of running the same PROMPT.md through a coding agent in a while true until the work is done). This approach used brute force to ensure clean context and required thoughtfulness to be put into the ‘plan’ and ‘specs’ that allowed for coherency between ‘memoryless’ agent sessions.

This scaffolding got me thinking deeply about optimizing agents for long-time-horizon tasks (hours, or sometimes just 10+ minutes). As I built more agents doing this type of work (not just for coding), I kept encountering a consistent trend in issues that required deep thinking in a few critical areas.

So here’s where I’ve landed (and it’s NOT a takedown of Huntley’s ralph loop):

What turns long-horizon agent systems from impressive demos into reliable workhorses isn’t the loop pattern, and it isn’t a smarter model. It’s three quieter disciplines underneath whatever loop you’re running: blueprint discipline, context hygiene, and back-pressure. When my agents failed, the diagnosis was almost always one of those three.

The failures better models don’t fix

Long-horizon agent runs collapse in four characteristic ways: drift, context loss, confidently broken work, compounding step error. Chip Huyen makes the math vivid in her Agents chapter: 95% per-step accuracy becomes 60% over ten steps and basically zero over a hundred. She also names the meta-failure, “errors in reflection,” where the agent confidently claims completion while the goal sits unmet.

If this were a model problem, Claude 4.5 and GPT-5 would have closed it. They haven’t. Chroma’s Context Rot report (July 2025) ran 18 frontier models on tasks held at constant difficulty while input length grew, and every single one of them degraded, across every family tested. The unlock lives somewhere else.

Pillar 1: Blueprint discipline

The unglamorous version: write the spec, write the per-task instructions, write the orchestration plan, then run the loop. The loop is the delivery vehicle. The blueprint is the rails. Hell, skip the loop altogether and just use subagents. Doesn’t matter if the rails are solid.

I learned this from the challenge of working with an intern, wrestling with the poor functioning code he wrote using Claude (he explained it to me by having Claude give him a script of what to say to me …the humans are becoming email for bots).

He was building a contract-comparison system and kept asking a single LLM call to score multiple documents across 50+ dimensions at once. Outputs were unreliable and non-improvable. The fix wasn’t a smarter prompt - it was breaking the task into multiple prompts, i.e. sub-tasks, with excruciating detail on doing each step. If the intern couldn’t do it by hand with these instructions, the bot was going to fail too.

The research has hardened around this. The 2025 Plan-and-Act paper shows separated planner/executor architectures beat reactive ReAct-style loops on long-horizon tasks. Anthropic’s multi-agent research post-mortem is unusually honest: “prompt engineering was the primary lever for improving behaviors.” Their early failures were instruction failures, including spawning 50 subagents for a trivial query because nothing told the orchestrator not to. Huntley’s PROMPT.md is itself a form of blueprint discipline with its reliance on strong spec and implementation plan files.

Every agent I’m building usually has a slim CLAUDE.md/AGENTS.md file (just references to orchestration files, a map of the repo, and basic info on how everything works), a detailed orchestration.md file, and highly detailed specs for each step of any pipeline outlined in the orchestration file.

Pillar 2: Context hygiene

Context windows have historically been the limit of a lot of the practical use of LLMs. In 2023, we were having to break up tasks to work around these limitations. Now the windows are bigger - but that can enable us to stuff a shitload of unnecessary information into them.

A 200K-token window full of stale tool output is mostly noise pulling the model toward the wrong answer. Anthropic’s context-engineering post (September 2025) puts it cleanly: “the challenge of maintaining coherence across extended interactions will remain central.” (The “lost in the middle” finding from Liu et al., 2023 is where the empirical thread starts.)

The easiest move to fight the bloat is utilizing subagents. Each one gets its own system prompt, window, and tool list… verbose output stays in the child and only a summary returns. That’s what “subagents in Claude Code” is actually doing: using the process boundary as a context boundary. Once you think that way, segmenting instructions across files (so the writing agent never has to worry about the editing instructions) becomes the default.

Pillar 3: Back-pressure

This is the one most people are doing manually. You are sitting at your computer, reviewing some output, and thinking “this is totally wrong” …then re-instructing the LLM to fix its errors or go back and try again. (repeat 27 times, and tadah!)

Back-pressure is a runnable check the agent can call against its own output and the wiring that forces it to do that. For code this could be tests, types, linters, builds. For non-code, a graded rubric or an LLM-as-judge with veto rights and the ability to send work back for more revisions. It’s the move that bridges the gap between “looks done but kinda sucks” to “actually done.”

Spotify’s Honk Part 3 post (December 2025) is the strongest production evidence I’ve seen. Independent verifiers veto roughly 25% of agent sessions, and the agents self-correct about half the time after. The line that stuck with me: “the agent doesn’t know what the verification does and how, it just knows that it can (and in certain cases must) call it.” Hamel Husain’s LLM Evals FAQ (January 2026) makes the discipline argument plainly: evals are “part of the development process, similar to how debugging is part of software development.”

Anthropic’s November 2025 “effective harnesses” post is interesting precisely because the same lab that pushed context engineering hardest is now saying “compaction isn’t sufficient” past a certain horizon. Their answer (initializer plus coding agent plus immutable tests plus a progress file) is essentially back-pressure. The tests are immutable so the agent can’t quietly delete the very mechanism that tells it its work sucks. It shows the importance of making sure these back-pressure mechanisms are available to the agent, but safe from tinkering. Treat them like a college kid that just might cheat if the adderall has worn off at 3a.

But could this all just be a runtime problem?

The strongest counter comes from LangChain. In “Building LangGraph from first principles” (September 2025) they argue agent reliability is fundamentally a runtime problem, not an operator-discipline one. Agents are flaky and non-deterministic in ways ordinary code isn’t, so you need durable state with checkpointing, task queues for retries, first-class human-in-the-loop interruption, tracing. Get those right and the instruction and verification questions become tractable.

They’re right about a real thing. Durable state genuinely solves a class of problems prompt discipline can’t touch: crashes, restarts, review windows measured in days. Great runtime is certainly necessary - just not sufficient. With perfect durability but zero blueprint, context, or back-pressure discipline - the agent can still drift, lose the thread, or confidently ship broken work. It can just do all that across multiple sessions and interruptions.

Don’t get frustrated with failure, use it to tune the pillars.

When a long-horizon agent you are working on starts going haywire - I always suggest interrogating these three pillars and adjusting them based on the flavor of “this didn’t work” that you encounter: Did I write the blueprint, detailed instructions, sub-prompts well? Did I keep the context clean and void of bloat or irrelevant instructions for each subagent? Did I give it the tools to actually test and validate its outputs or start over at different checkpoints if certain conditions aren’t met?

If you build really great rails, the train will run much better.