<?xml version="1.0" encoding="UTF-8"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Jeffrey Lee Cooper</title><description>Essays on AI agents, LLMs, and building software.</description><link>https://jeffreyleecooper.com/</link><item><title>My Wife and I Pay the Same for AI ...but Should We?</title><link>https://jeffreyleecooper.com/writing/subsidized-cognition/</link><guid isPermaLink="true">https://jeffreyleecooper.com/writing/subsidized-cognition/</guid><description>She opens ChatGPT twice a day. I burn tokens through agents all day. We pay roughly the same, and I keep wondering who is covering the bill, and what it will feel like when they stop.</description><pubDate>Wed, 10 Jun 2026 00:00:00 GMT</pubDate><content:encoded>&lt;img src=&quot;https://jeffreyleecooper.com/_astro/cover.x6VD6P6c.png&quot; alt=&quot;A round wall-mounted utility meter with its needle near zero, lit by a lamp left switched on above it.&quot; /&gt;&lt;p&gt;My wife opens ChatGPT maybe twice a day. Usually it’s a simple search while we are driving around, some question one of us swears we’re right about, settled in thirty seconds instead of a stalemate. Then she closes the app until the next mystery rears its head.&lt;/p&gt;
&lt;p&gt;Meanwhile I’ve got agents running while I sleep. I burn tokens in a way that would make eco-activists cry. These tools have become a constant engine running under the hood of both my professional and personal life.&lt;/p&gt;
&lt;p&gt;The fun part? We pay roughly the same.&lt;/p&gt;
&lt;p&gt;I have less than $200 a month in subscriptions - she has roughly $40 a month. Yet our usage differs by orders of magnitude. Have I invented free money somehow?!&lt;/p&gt;
&lt;h2 id=&quot;someone-elses-balance-sheet&quot;&gt;Someone else’s balance sheet&lt;/h2&gt;
&lt;p&gt;The people who’ve done the math (there’s a &lt;a href=&quot;https://news.ycombinator.com/item?id=41878719&quot; rel=&quot;noopener noreferrer&quot; target=&quot;_blank&quot;&gt;Hacker News thread&lt;/a&gt; where practitioners work it out) land on the thesis that marginal token use is most likely carrying a positive margin once the model’s trained.&lt;/p&gt;
&lt;p&gt;But how did we get to the point where the marginal cost was manageable? Training runs, R&amp;#x26;D, the free users, the data centers, the whole apparatus that led to this model - that’s a big fucking check to write.&lt;/p&gt;
&lt;p&gt;OpenAI reportedly lost around $9 billion in 2025 and, by its own financial documents, isn’t projected to be cash-flow-positive until somewhere around 2030 (&lt;a href=&quot;https://fortune.com/2025/11/12/openai-cash-burn-rate-annual-losses-2028-profitable-2030-financial-documents/&quot; rel=&quot;noopener noreferrer&quot; target=&quot;_blank&quot;&gt;Fortune, November 2025&lt;/a&gt;). And the balance sheet absorbing all this is enormous: combined hyperscaler capex across Alphabet, Amazon, Meta, Microsoft, and Oracle neared half a trillion dollars in 2025 (&lt;a href=&quot;https://epoch.ai/data-insights/hyperscaler-capex-trend&quot; rel=&quot;noopener noreferrer&quot; target=&quot;_blank&quot;&gt;Epoch AI&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;So the subsidy is real. It lives in the gap between what it costs to &lt;em&gt;build and run all of this&lt;/em&gt; and what any of us pay to use it.&lt;/p&gt;
&lt;p&gt;Many people are patiently doing this arithmetic. David Cahn’s “&lt;a href=&quot;https://sequoiacap.com/article/ais-600b-question/&quot; rel=&quot;noopener noreferrer&quot; target=&quot;_blank&quot;&gt;AI’s $600B Question&lt;/a&gt;” at Sequoia estimated the required payback, and Ed Zitron keeps hammering the &lt;a href=&quot;https://www.wheresyoured.at/wheres-the-money/&quot; rel=&quot;noopener noreferrer&quot; target=&quot;_blank&quot;&gt;sharper, angrier question “where’s the money?”&lt;/a&gt;. I’m standing in the room they’ve been measuring, holding my absurdly cheap miracle, and trying to milk it as much as I can before someone wakes up and charges me more.&lt;/p&gt;
&lt;p&gt;&lt;img alt=&quot;An enormous open ledger standing upright like a wall, dwarfing a single tiny coin on a table where a small humanoid robot looks up at it.&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot;  width=&quot;1536&quot; height=&quot;1024&quot; src=&quot;https://jeffreyleecooper.com/_astro/fig-1.CY8bBlzu_11sVfm.webp&quot; &gt;&lt;/p&gt;
&lt;h2 id=&quot;the-place-where-the-flat-price-breaks&quot;&gt;The place where the flat price breaks&lt;/h2&gt;
&lt;p&gt;Sam Altman, back in January 2025: “we are currently losing money on openai pro subscriptions! people use it much more than we expected” (&lt;a href=&quot;https://techcrunch.com/2025/01/05/openai-is-losing-money-on-its-pricey-chatgpt-pro-plan-ceo-sam-altman-says/&quot; rel=&quot;noopener noreferrer&quot; target=&quot;_blank&quot;&gt;TechCrunch&lt;/a&gt;).&lt;/p&gt;
&lt;p&gt;That was the $200-a-month Pro tier, not the $20 one, which is worth saying out loud, because the dollar amounts aren’t the point. The mechanism is. Flat pricing assumes you’ll be roughly average. Heavy use is exactly where a flat price stops making sense, and the company building it admits they didn’t see it coming &lt;em&gt;(maybe if they had better AI…)&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;My wife is the dream customer. Twice a day, profitable on any plan, basically subsidizing the rest of us with her restraint.&lt;/p&gt;
&lt;p&gt;I’m not the dream customer. Apparently, they didn’t price in my overnight fever-dream coding sessions.&lt;/p&gt;
&lt;p&gt;&lt;img alt=&quot;A nearly empty plate beside a table piled high with stacks of plates, both sharing one identical small price tag while the bill is nowhere in sight.&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot;  width=&quot;1536&quot; height=&quot;1024&quot; src=&quot;https://jeffreyleecooper.com/_astro/fig-2.zFVsATh0_ZIRYTu.webp&quot; &gt;&lt;/p&gt;
&lt;h2 id=&quot;waiting-for-the-meter&quot;&gt;Waiting for the meter&lt;/h2&gt;
&lt;p&gt;I’m loving abusing my subscriptions, but I can’t enjoy them cleanly.&lt;/p&gt;
&lt;p&gt;I’m not concerned that everything will crash down, but I am concerned that my costs will go up …way up. I worry I’ve developed a habit I can’t afford.&lt;/p&gt;
&lt;p&gt;I don’t know when the meter turns on, or what it’ll look like. Sam Altman has already painted a vision of AI being &lt;a href=&quot;https://www.businessinsider.com/sam-altman-ai-utility-electricity-water-openai-2026-3&quot; rel=&quot;noopener noreferrer&quot; target=&quot;_blank&quot;&gt;globally accessible and metered&lt;/a&gt; like electricity or running water.&lt;/p&gt;
&lt;p&gt;But not yet… get it while it’s hot!&lt;/p&gt;</content:encoded></item><item><title>We Are All Just Predicting Tokens</title><link>https://jeffreyleecooper.com/writing/just-predicting-tokens/</link><guid isPermaLink="true">https://jeffreyleecooper.com/writing/just-predicting-tokens/</guid><description>The sage-engineer dismissal of LLMs as &apos;just autocomplete&apos; looks like a technical claim, but it&apos;s a definitional and social one that fails against humans too.</description><pubDate>Fri, 15 May 2026 00:00:00 GMT</pubDate><content:encoded>&lt;img src=&quot;https://jeffreyleecooper.com/_astro/cover.ChgT7b7f.png&quot; alt=&quot;A small humanoid robot and a grey-haired man sit across a dinner table with wine glasses between them; the robot leans in gesturing, the man sits back with arms crossed.&quot; /&gt;&lt;p&gt;Late last year, I was at dinner with an engineer I used to work with. I was trying to explain my complete amazement with the productivity gains (and possibilities) of using LLMs for software development. After my gleeful rant, he paused and replied that LLMs are “just predicting tokens.” Glorified autocomplete. Copy-paste on steroids. He said it the way people say things when they’ve decided the conversation is over.&lt;/p&gt;
&lt;p&gt;It certainly sounded like a reasonable technical rebuttal …but it also quietly assumes a kind of “real reasoning” exists that is unique to humans and somehow mechanistically different (with no description of how it is different or what the mechanisms are).&lt;/p&gt;
&lt;p&gt;“Just predicting tokens” is a category error, not an argument. The system we’re dismissing is doing something we can’t cleanly describe ourselves doing, and the dismissal survives because almost nobody pressure-tests the contrast.&lt;/p&gt;
&lt;p&gt;First, I’m in awe of how well language works as a compression algorithm for describing our reality and our interactions within it. In my view, the ability to speak or write comes ‘pre-loaded’ with concepts like objects, interactions, and basic reasoning. Second, the “real human reasoning” that the “token prediction” dismissal appeals to is something we’ve yet to define or locate on the human side.&lt;/p&gt;
&lt;h2 id=&quot;the-dismissal-everyone-nods-along-to&quot;&gt;The dismissal everyone nods along to&lt;/h2&gt;
&lt;p&gt;The lineage goes back to Emily Bender and Timnit Gebru’s 2021 “Stochastic Parrots” paper, and it has hardened since. Bender and Alex Hanna’s 2025 book &lt;em&gt;The AI Con&lt;/em&gt; argues it is “fundamentally confused” to use any human-like term (understanding, reasoning, belief) for what LLMs do (&lt;a href=&quot;https://thecon.ai/&quot; rel=&quot;noopener noreferrer&quot; target=&quot;_blank&quot;&gt;Bender &amp;#x26; Hanna, &lt;em&gt;The AI Con&lt;/em&gt;, 2025&lt;/a&gt;). That’s the purest version of the position. It isn’t an empirical claim. It’s a definitional one: whatever LLMs are doing, it categorically isn’t cognition, because cognition is (implicitly) the other thing.&lt;/p&gt;
&lt;p&gt;I want to be fair to it. I also want to be honest about the social shape it’s taken on.&lt;/p&gt;
&lt;p&gt;I often find that to the wise, career-weathered software developer or CTO, the dismissal is a status move. It signals you aren’t hype-pilled. It lets you stay in the “I’m not impressed” club while the rest of the room squirms about Claude one-shotting the user story that was originally going to be three engineers’ stand-up updates for the next two weeks. A 2024 LessWrong catalog called “&lt;a href=&quot;https://www.lesswrong.com/posts/KWHeBG978uZuqNK6Q/hunting-undead-stochastic-parrots-finding-and-killing-the&quot; rel=&quot;noopener noreferrer&quot; target=&quot;_blank&quot;&gt;Hunting Undead Stochastic Parrots&lt;/a&gt;” documents the frame persisting as vibe rather than argument, with people invoking the phrase without engaging the last two years of empirical work.&lt;/p&gt;
&lt;p&gt;The phrase is doing social work; the technical work has moved on without it.&lt;/p&gt;
&lt;p&gt;In order to truly defend against this stance, one must show (1) that predicting tokens at frontier scale produces reasoning-like internals, and (2) that the thing we keep gesturing at when we say “real reasoning” isn’t a thing humans cleanly do either.&lt;/p&gt;
&lt;h2 id=&quot;language-is-reasoning-compressed&quot;&gt;Language is reasoning, compressed&lt;/h2&gt;
&lt;p&gt;The core intuition came from Ilya Sutskever before anyone else: if you predict the next token well enough, you have to model the reality that produced the token. Statistics, when done really well, essentially bleed into world-modeling. You can’t keep winning the prediction game on hard text without reconstructing the causal and inferential structure of the thought that wrote it (&lt;a href=&quot;https://www.youtube.com/watch?v=YEUclZdj_Sc&quot; rel=&quot;noopener noreferrer&quot; target=&quot;_blank&quot;&gt;Sutskever on next-token prediction, 2023&lt;/a&gt;). Despite being three years old at this point, it’s still the cleanest articulation of the point.&lt;/p&gt;
&lt;p&gt;You could call it an argument about semantics - the good old Chinese Room argument. However, as we dig into the complexities of these language models, we see shadows of understanding emerge.&lt;/p&gt;
&lt;p&gt;Anthropic’s “&lt;a href=&quot;https://www.anthropic.com/research/mapping-mind-language-model&quot; rel=&quot;noopener noreferrer&quot; target=&quot;_blank&quot;&gt;Mapping the Mind of a Large Language Model&lt;/a&gt;” paper (May 2024) found features inside Claude 3 Sonnet for abstract concepts (inner conflict, catch-22, code bugs, deception), organized by conceptual similarity and causally shaping behavior when amplified. Turn the “deception” feature up, the model behaves more deceptively. These aren’t token-level artifacts. They’re features that look like concepts, sitting in the spots where concepts would have to live if the system were reasoning with them.&lt;/p&gt;
&lt;p&gt;The 2025 follow-on work pushed further. Anthropic’s &lt;a href=&quot;https://transformer-circuits.pub/2025/april-update/index.html&quot; rel=&quot;noopener noreferrer&quot; target=&quot;_blank&quot;&gt;April 2025 circuits update and “On the Biology of a Large Language Model”&lt;/a&gt; trace induction heads and multi-step circuits that implement abstract operations across layers: entity resolution, arithmetic carry chains, multi-hop lookups. If you’ve read this literature and still want to call the internal mechanism “just prediction,” you’re describing the training objective, not the reality of the machine once trained.&lt;/p&gt;
&lt;p&gt;If you trained me to win a dunk contest - with a loss function designed to promote sick jams - I might grow a plethora of features that support that objective, but aren’t strictly limited to that application (I’m stronger, better cardio, worked on my depth perception, learned a bit of physics, etc). &lt;em&gt;I believe I can fly…&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;&lt;img alt=&quot;A muscular humanoid robot suspended mid-dunk at a basketball hoop, with a physics textbook, stopwatch, eye chart, and dumbbell scattered on the floor beneath it.&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot;  width=&quot;1536&quot; height=&quot;1024&quot; src=&quot;https://jeffreyleecooper.com/_astro/fig-1.C7vXiECl_1xu5OP.webp&quot; &gt;&lt;/p&gt;
&lt;h2 id=&quot;marcuss-knockout-blow-taken-seriously&quot;&gt;Marcus’s knockout blow, taken seriously&lt;/h2&gt;
&lt;p&gt;The strongest live version of the skeptical case in 2025 belongs to Gary Marcus, not Bender. Bender hardens into definition, which is easy to dispatch. Marcus makes an empirical prediction, which is much harder. His June 2025 substack “&lt;a href=&quot;https://garymarcus.substack.com/p/a-knockout-blow-for-llms&quot; rel=&quot;noopener noreferrer&quot; target=&quot;_blank&quot;&gt;A knockout blow for LLMs?&lt;/a&gt;” is the argument at full strength.&lt;/p&gt;
&lt;p&gt;LLMs, Marcus writes, “can generalize within a training distribution of data they are exposed to, but their generalizations tend to break down outside that distribution.” The receipt: a seven-year-old can solve &lt;a href=&quot;https://en.wikipedia.org/wiki/Tower_of_Hanoi&quot; rel=&quot;noopener noreferrer&quot; target=&quot;_blank&quot;&gt;Tower of Hanoi&lt;/a&gt;. Claude, at the time of writing, scored under 80% at 7 discs and basically zero at 8. The internal process, on this read, “is not logical and intelligent.” It approximates patterns, and it fails when the problem shifts beyond what it saw during training.&lt;/p&gt;
&lt;p&gt;If you take that seriously (and you should), the shape of the argument is quite reasonable, empirical, and falsifiable. It says: if the system really reasoned, there would be no distribution cliff.&lt;/p&gt;
&lt;p&gt;François Chollet makes the technical sibling version of this argument. LLMs are “big interpolative memory”; scaling increases skill but not intelligence; without discrete program search on genuinely novel problems, what looks like reasoning is memorization of solver templates. The May 2025 &lt;a href=&quot;https://arxiv.org/pdf/2505.11831&quot; rel=&quot;noopener noreferrer&quot; target=&quot;_blank&quot;&gt;ARC-AGI-2 paper&lt;/a&gt; is the benchmark built to operationalize the claim, and pure LLMs scored roughly 0% at launch. Not a rounding error. A cliff.&lt;/p&gt;
&lt;p&gt;If my belief that “language encodes reasoning” were sound, Tower of Hanoi at 8 discs is exactly where the model should not fail. The cliff is real data. An honest defender has to contend with this.&lt;/p&gt;
&lt;p&gt;&lt;img alt=&quot;A Tower of Hanoi puzzle tilting at the edge of a steep cliff, one disc already tumbling over the drop into empty space below.&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot;  width=&quot;1536&quot; height=&quot;1024&quot; src=&quot;https://jeffreyleecooper.com/_astro/fig-2.BBMKJE-W_Z1XoOlm.webp&quot; &gt;&lt;/p&gt;
&lt;h2 id=&quot;the-cliff-is-real-but-seems-to-be-moving&quot;&gt;The cliff is real. But it seems to be moving…&lt;/h2&gt;
&lt;p&gt;The benchmark designed to falsify “LLMs can reason” is being climbed by LLM-based systems. &lt;a href=&quot;https://arcprize.org/blog/arc-prize-2025-results-analysis&quot; rel=&quot;noopener noreferrer&quot; target=&quot;_blank&quot;&gt;ARC Prize’s December 2025 results and analysis&lt;/a&gt; put Gemini 3 Deep Think at 84.6% on the ARC-AGI-1 public leaderboard, with refinement-loop LLM systems hitting 54% on ARC-AGI-2, the specifically-harder benchmark Chollet built to rule out memorization. Fucking expensive and slow, but making real progress against these tests. The cliff that Chollet said was structural is being walked up by systems whose substrate is the thing he said couldn’t do it.&lt;/p&gt;
&lt;p&gt;In Andrej Karpathy’s end-of-2025 post-mortem - “&lt;a href=&quot;https://karpathy.bearblog.dev/year-in-review-2025/&quot; rel=&quot;noopener noreferrer&quot; target=&quot;_blank&quot;&gt;2025 LLM Year in Review&lt;/a&gt;” (December 19, 2025) - he coins “summoned ghosts” and “jagged intelligence,” a shape that’s neither human reasoning nor mere lookup, and argues RLVR produces genuine problem-decomposition strategies, not pattern memorization. In a way, he offers new vocabulary that helps sharpen the discussion beyond the blunt comparison to human thinking.&lt;/p&gt;
&lt;h2 id=&quot;and-humans-are-better-right&quot;&gt;And humans are better, right?&lt;/h2&gt;
&lt;p&gt;Now the uncomfortable part. Humans have distribution cliffs too.&lt;/p&gt;
&lt;p&gt;Ask any adult to do long division on 9-digit numbers without paper. Ask them to run a 14-step modus tollens chain in their head. Ask a bright 35-year-old to solve Tower of Hanoi at 12 discs without tools (which is 4,095 moves).&lt;/p&gt;
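&lt;p&gt;To make the 4,095 figure concrete, here is a minimal sketch of the standard recursive solution that counts its own moves. The function name &lt;code&gt;hanoi&lt;/code&gt; and the peg labels are my own illustrative choices, not anything from Marcus’s post.&lt;/p&gt;

```python
def hanoi(n, src="A", dst="C", via="B"):
    """Yield the moves of the standard recursive Tower of Hanoi solution."""
    if n == 0:
        return
    # Move the top n-1 discs out of the way, move the largest, then restack.
    yield from hanoi(n - 1, src, via, dst)
    yield (src, dst)
    yield from hanoi(n - 1, via, dst, src)


for discs in (7, 8, 12):
    moves = sum(1 for _ in hanoi(discs))
    # The count should match the closed form 2**n - 1.
    print(discs, moves, moves == 2 ** discs - 1)
```

&lt;p&gt;The move count doubles (plus one) with every added disc, which is why the cliff arrives quickly for anyone, or anything, working without external tools.&lt;/p&gt;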
&lt;p&gt;Failure outside distribution isn’t evidence against cognition. It’s what cognition looks like when you push it past its harness or memory capabilities. What varies between systems is where the cliff sits and what tricks move it.&lt;/p&gt;
&lt;p&gt;In my view, the Marcus argument proves a weaker thing than it claims. Current LLMs have specific cliffs in specific places. It doesn’t follow that the process “is not logical and intelligent.” It follows that the process has limits, which is true of every cognitive system we know (even you, sorry).&lt;/p&gt;
&lt;h2 id=&quot;the-standard-no-one-has-ever-met&quot;&gt;The standard no one has ever met&lt;/h2&gt;
&lt;p&gt;Suppose, steelman-to-steelman, that LLMs are &lt;em&gt;exactly&lt;/em&gt; what Marcus says: interpolative engines that fail predictably at the training distribution’s edge. Here’s the question no one in this camp seems willing to ask out loud. What would “real reasoning” look like, such that humans would pass and LLMs would not, and where has that thing ever been observed?&lt;/p&gt;
&lt;p&gt;The best current neuroscience doesn’t give you one. Predictive processing (the Karl Friston and Andy Clark framework) treats the brain as a hierarchical prediction machine, minimizing prediction error across levels from retina to frontal cortex. A &lt;a href=&quot;https://pmc.ncbi.nlm.nih.gov/articles/PMC11339134/&quot; rel=&quot;noopener noreferrer&quot; target=&quot;_blank&quot;&gt;2024 meta-analysis of predictive-processing fMRI studies&lt;/a&gt; confirms it as current consensus, not fringe. If the leading account of the brain describes it as a prediction system, the intuition that “prediction” and “understanding” are obviously different categories is doing a lot of philosophical work on zero budget.&lt;/p&gt;
&lt;p&gt;Geoffrey Hinton has been unusually blunt about this. In his 2024 and 2025 interviews (&lt;a href=&quot;https://magazine.mindplex.ai/post/geoffrey-hinton-on-ai-intelligence-and-superintelligence&quot; rel=&quot;noopener noreferrer&quot; target=&quot;_blank&quot;&gt;Mindplex, October 2024&lt;/a&gt;), he says GPT-4 “definitely understands,” and the old-fashioned AI claim that neural networks can’t reason without symbolic scaffolds was “just utterly wrong.” His symmetry point is the one I want to sit with: human understanding is the same kind of distributed-feature computation. Calling it something else in the biological case, and something less in the silicon case, is a move the evidence doesn’t license.&lt;/p&gt;
&lt;p&gt;So here’s the category error, fully stated. “Just predicting tokens” only reads as a dismissal if you quietly import a contrast class of prediction-free, symbolic, grounded, non-statistical “real” reasoning. Neuroscience doesn’t have one. Introspection doesn’t have one. Functionalism has covered this ground for half a century. The standard does no work on LLMs because it does no work on us.&lt;/p&gt;
&lt;p&gt;Which is why the frontier-lab practitioners who’ve actually built these systems have mostly stopped making the dismissive argument. Karpathy’s “summoned ghosts” isn’t a hype line. It’s a practitioner giving up on vocabulary that didn’t survive contact with the thing.&lt;/p&gt;
&lt;p&gt;&lt;img alt=&quot;A humanoid robot and a human standing side by side before an oval mirror; the reflection shows an identical glowing lattice of nodes and threads inside each figure.&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot;  width=&quot;1536&quot; height=&quot;1024&quot; src=&quot;https://jeffreyleecooper.com/_astro/fig-3.BuQuy1If_Y4ahU.webp&quot; &gt;&lt;/p&gt;
&lt;h2 id=&quot;so-back-to-my-engineering-coworker&quot;&gt;So back to my engineering coworker…&lt;/h2&gt;
&lt;p&gt;He’s not wrong that transformers predict the next token. He’s wrong that this tells us anything useful about whether they reason, and even less about whether they have meaningful utility.&lt;/p&gt;
&lt;p&gt;Whelp, time for me to go stare at a blank wall &lt;em&gt;…am I just predicting tokens?&lt;/em&gt;&lt;/p&gt;</content:encoded></item><item><title>The Three Pillars That Enable My Long-Running Agents</title><link>https://jeffreyleecooper.com/writing/three-pillars-long-horizon-agents/</link><guid isPermaLink="true">https://jeffreyleecooper.com/writing/three-pillars-long-horizon-agents/</guid><description>Long-horizon agents fail at three quiet disciplines underneath them that you must master: blueprint, context hygiene, and back-pressure.</description><pubDate>Wed, 15 Apr 2026 00:00:00 GMT</pubDate><content:encoded>&lt;img src=&quot;https://jeffreyleecooper.com/_astro/cover.DPCjWX7p.png&quot; alt=&quot;Three parallel metal rails stretch toward a bright horizon point; a small humanoid robot with a clipboard stands on the middle rail.&quot; /&gt;&lt;p&gt;&lt;em&gt;EDIT NOTE 6/10/26: This article is left untouched - but I’ll admit, much of my process has evolved since tools have evolved.  I still like the fundamentals of this article - but you can be a little more sloppy now.&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;The thing that got me seriously utilizing long-running AI agents was Geoff Huntley’s “&lt;a href=&quot;https://ghuntley.com/ralph/&quot; rel=&quot;noopener noreferrer&quot; target=&quot;_blank&quot;&gt;ralph loop&lt;/a&gt;” post (the deceptively simple idea of running the same &lt;code&gt;PROMPT.md&lt;/code&gt; through a coding agent in a &lt;code&gt;while true&lt;/code&gt; until the work is done). This approach used brute force to keep context clean, and it demanded real thought about the ‘plan’ and ‘specs’ that keep ‘memoryless’ agent sessions coherent with one another.&lt;/p&gt;
&lt;p&gt;This scaffolding got me thinking deeply about optimizing agents for long-time-horizon tasks (hours, or sometimes just 10+ minutes). As I built more agents doing this type of work (not just for coding), I kept hitting the same kinds of issues, and they clustered in a few critical areas.&lt;/p&gt;
&lt;p&gt;So here’s where I’ve landed (and it’s NOT a takedown of Huntley’s ralph loop):&lt;/p&gt;
&lt;p&gt;What turns long-horizon agent systems from impressive demos into reliable workhorses isn’t the loop pattern, and it isn’t a smarter model. It’s three quieter disciplines underneath whatever loop you’re running: &lt;strong&gt;blueprint discipline, context hygiene, and back-pressure.&lt;/strong&gt; When my agents failed, the diagnosis was almost always one of those three.&lt;/p&gt;
&lt;h2 id=&quot;the-failures-better-models-dont-fix&quot;&gt;The failures better models don’t fix&lt;/h2&gt;
&lt;p&gt;Long-horizon agent runs collapse in four characteristic ways: drift, context loss, confidently broken work, compounding step error. Chip Huyen makes the math vivid in her &lt;a href=&quot;https://huyenchip.com/2025/01/07/agents.html&quot; rel=&quot;noopener noreferrer&quot; target=&quot;_blank&quot;&gt;&lt;em&gt;Agents&lt;/em&gt; chapter&lt;/a&gt;: 95% per-step accuracy becomes 60% over ten steps and basically zero over a hundred. She also names the meta-failure, “errors in reflection,” where the agent confidently claims completion while the goal sits unmet.&lt;/p&gt;
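&lt;p&gt;A quick sketch of that arithmetic, assuming every step succeeds independently with the same probability (a simplification on my part). The helper name &lt;code&gt;chain_success&lt;/code&gt; is hypothetical and not taken from Huyen’s chapter.&lt;/p&gt;

```python
def chain_success(per_step: float, steps: int) -> float:
    """Probability that every step in a chain succeeds, assuming independence."""
    return per_step ** steps


for steps in (1, 10, 100):
    rate = chain_success(0.95, steps)
    print(f"{steps:3d} steps at 95% per step: {rate:.1%} end-to-end")
```

&lt;p&gt;Real agent steps are not independent, so treat this as an intuition pump rather than a forecast. The decay shape is the point.&lt;/p&gt;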
&lt;p&gt;If this were a model problem, Claude 4.5 and GPT-5 would have closed it. They haven’t. Chroma’s &lt;a href=&quot;https://www.trychroma.com/research/context-rot&quot; rel=&quot;noopener noreferrer&quot; target=&quot;_blank&quot;&gt;&lt;em&gt;Context Rot&lt;/em&gt;&lt;/a&gt; report (July 2025) ran 18 frontier models on tasks held at constant difficulty while input length grew, and every single one of them degraded, across every family tested. The unlock lives somewhere else.&lt;/p&gt;
&lt;p&gt;&lt;img alt=&quot;A humanoid robot kneels over a large blueprint unrolled across a workshop floor, with drafting tools arrayed around it.&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot;  width=&quot;1536&quot; height=&quot;1024&quot; src=&quot;https://jeffreyleecooper.com/_astro/fig-1.Biy5SWuo_Z1cekrC.webp&quot; &gt;&lt;/p&gt;
&lt;h2 id=&quot;pillar-1-blueprint-discipline&quot;&gt;Pillar 1: Blueprint discipline&lt;/h2&gt;
&lt;p&gt;The unglamorous version: write the spec, write the per-task instructions, write the orchestration plan, &lt;em&gt;then&lt;/em&gt; run the loop. The loop is the delivery vehicle. The blueprint is the rails. Hell, skip the loop altogether and just use subagents. Doesn’t matter if the rails are solid.&lt;/p&gt;
&lt;p&gt;I learned this while working with an intern and wrestling with the poorly functioning code he wrote using Claude &lt;em&gt;(he explained it to me by having Claude give him a script of what to say to me …the humans are becoming email for bots)&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;He was building a contract-comparison system and kept asking a single LLM call to score multiple documents across 50+ dimensions at once. Outputs were unreliable and non-improvable. The fix wasn’t a smarter prompt - it was breaking the task into multiple prompts, i.e. sub-tasks, with excruciating detail on doing each step. If the intern couldn’t do it by hand with these instructions, the bot was going to fail too.&lt;/p&gt;
&lt;p&gt;The research has hardened around this. The 2025 &lt;a href=&quot;https://arxiv.org/html/2503.09572v3&quot; rel=&quot;noopener noreferrer&quot; target=&quot;_blank&quot;&gt;Plan-and-Act paper&lt;/a&gt; shows separated planner/executor architectures beat reactive ReAct-style loops on long-horizon tasks. Anthropic’s &lt;a href=&quot;https://www.anthropic.com/engineering/multi-agent-research-system&quot; rel=&quot;noopener noreferrer&quot; target=&quot;_blank&quot;&gt;multi-agent research post-mortem&lt;/a&gt; is unusually honest: “prompt engineering was the primary lever for improving behaviors.” Their early failures were instruction failures, including spawning 50 subagents for a trivial query because nothing told the orchestrator not to. Huntley’s &lt;code&gt;PROMPT.md&lt;/code&gt; is itself a form of blueprint discipline with its reliance on strong spec and implementation plan files.&lt;/p&gt;
&lt;p&gt;Nearly every agent I build has a slim &lt;code&gt;CLAUDE.md&lt;/code&gt;/&lt;code&gt;AGENTS.md&lt;/code&gt; file (just references to orchestration files, a map of the repo, and basic info on how everything works), a detailed &lt;code&gt;orchestration.md&lt;/code&gt; file, and highly detailed specs for each step of any pipeline outlined in the orchestration file.&lt;/p&gt;
&lt;p&gt;&lt;img alt=&quot;A humanoid robot sweeps scattered scrolls and tangled cables into a bin while a single tidy folder rests on a pedestal nearby.&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot;  width=&quot;1536&quot; height=&quot;1024&quot; src=&quot;https://jeffreyleecooper.com/_astro/fig-2.Cpwertt__2pBpL2.webp&quot; &gt;&lt;/p&gt;
&lt;h2 id=&quot;pillar-2-context-hygiene&quot;&gt;Pillar 2: Context hygiene&lt;/h2&gt;
&lt;p&gt;Context windows have historically been the limit of a lot of the practical use of LLMs. In 2023, we were having to break up tasks to work around these limitations. Now the windows are bigger - but that can enable us to stuff a shitload of unnecessary information into them.&lt;/p&gt;
&lt;p&gt;A 200K-token window full of stale tool output is mostly noise pulling the model toward the wrong answer. Anthropic’s &lt;a href=&quot;https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents&quot; rel=&quot;noopener noreferrer&quot; target=&quot;_blank&quot;&gt;context-engineering post&lt;/a&gt; (September 2025) puts it cleanly: “the challenge of maintaining coherence across extended interactions will remain central.” (The “lost in the middle” finding from &lt;a href=&quot;https://arxiv.org/abs/2307.03172&quot; rel=&quot;noopener noreferrer&quot; target=&quot;_blank&quot;&gt;Liu et al., 2023&lt;/a&gt; is where the empirical thread starts.)&lt;/p&gt;
&lt;p&gt;The easiest move to fight the bloat is utilizing subagents. Each one gets its own system prompt, window, and tool list… verbose output stays in the child and only a summary returns. That’s what “&lt;a href=&quot;https://claude.com/blog/subagents-in-claude-code&quot; rel=&quot;noopener noreferrer&quot; target=&quot;_blank&quot;&gt;subagents in Claude Code&lt;/a&gt;” is actually doing: using the process boundary as a context boundary. Once you think that way, segmenting instructions across files (so the writing agent never has to worry about the editing instructions) becomes the default.&lt;/p&gt;
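&lt;p&gt;Here is a toy sketch of the “process boundary as context boundary” idea. It does not use Claude Code or any real agent API; &lt;code&gt;Context&lt;/code&gt;, &lt;code&gt;run_subagent&lt;/code&gt;, and &lt;code&gt;noisy_search&lt;/code&gt; are hypothetical stand-ins that only illustrate where the verbose output ends up.&lt;/p&gt;

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Context:
    """Stand-in for one agent's context window: just a list of messages."""
    messages: List[str] = field(default_factory=list)


def run_subagent(task: str, work: Callable[[Context], str]) -> str:
    """Run a task in a fresh child context and return only its summary.

    `work` is a hypothetical stand-in for the child agent. Whatever it
    appends to its own context is dropped when this function returns.
    """
    child = Context(messages=[f"system: you only handle: {task}"])
    return work(child)


def noisy_search(ctx: Context) -> str:
    # Simulate verbose tool output piling up inside the child only.
    ctx.messages.extend(f"tool output line {i}" for i in range(500))
    return "summary: 3 relevant files found"


parent = Context(messages=["system: orchestrator"])
parent.messages.append(run_subagent("search the repo", noisy_search))
print(len(parent.messages))  # the parent holds its system prompt plus one summary
```

&lt;p&gt;The child saw hundreds of lines; the parent’s window grew by exactly one message.&lt;/p&gt;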
&lt;p&gt;&lt;img alt=&quot;A humanoid robot at a workbench holds up a gear to a tall mechanical judge on a stool, who stamps a verdict onto a ticket over a looping conveyor.&quot; loading=&quot;lazy&quot; decoding=&quot;async&quot;  width=&quot;1536&quot; height=&quot;1024&quot; src=&quot;https://jeffreyleecooper.com/_astro/fig-3.CsLQLSrR_Z9kJ3U.webp&quot; &gt;&lt;/p&gt;
&lt;h2 id=&quot;pillar-3-back-pressure&quot;&gt;Pillar 3: Back-pressure&lt;/h2&gt;
&lt;p&gt;This is the one most people are doing manually. You are sitting at your computer, reviewing some output, and thinking “this is totally wrong” …then re-instructing the LLM to fix its errors or go back and try again. &lt;em&gt;(repeat 27 times, and tadah!)&lt;/em&gt;&lt;/p&gt;
&lt;p&gt;Back-pressure is a runnable check the agent can call against its own output, plus the wiring that &lt;em&gt;forces&lt;/em&gt; it to make that call. For code, this could be tests, types, linters, or builds. For non-code work, it could be a graded rubric or an LLM-as-judge with veto rights and the ability to send work back for more revisions. It’s the move that bridges the gap between “looks done but kinda sucks” and “actually done.”&lt;/p&gt;
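&lt;p&gt;A minimal sketch of that wiring, under stated assumptions: &lt;code&gt;generate&lt;/code&gt; and &lt;code&gt;check&lt;/code&gt; are hypothetical callables standing in for an agent call and a runnable verifier (tests, a linter, a rubric judge). Neither comes from a real library, and the toy functions exist only so the loop runs end to end.&lt;/p&gt;

```python
from typing import Callable, Optional, Tuple


def run_with_back_pressure(
    generate: Callable[[Optional[str]], str],
    check: Callable[[str], Tuple[bool, str]],
    max_attempts: int = 5,
) -> str:
    """Loop until the verifier accepts the output or attempts run out.

    generate(feedback) stands in for an agent call; check(output) stands in
    for a runnable verifier returning (passed, feedback). Both are hypothetical.
    """
    feedback: Optional[str] = None
    for _ in range(max_attempts):
        output = generate(feedback)
        passed, feedback = check(output)
        if passed:
            return output  # only verified work leaves the loop
    raise RuntimeError(
        f"verifier still failing after {max_attempts} attempts: {feedback}"
    )


# Toy stand-ins so the sketch runs end to end.
def toy_generate(feedback: Optional[str]) -> str:
    return "draft" if feedback is None else "draft with tests passing"


def toy_check(output: str) -> Tuple[bool, str]:
    ok = "tests passing" in output
    return ok, "" if ok else "tests are failing; fix and resubmit"


result = run_with_back_pressure(toy_generate, toy_check)
print(result)
```

&lt;p&gt;The design choice worth noticing is that the loop, not the agent, decides when work is done: the agent never gets to return unverified output.&lt;/p&gt;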
&lt;p&gt;Spotify’s &lt;a href=&quot;https://engineering.atspotify.com/2025/12/feedback-loops-background-coding-agents-part-3&quot; rel=&quot;noopener noreferrer&quot; target=&quot;_blank&quot;&gt;Honk Part 3 post&lt;/a&gt; (December 2025) is the strongest production evidence I’ve seen. Independent verifiers veto roughly 25% of agent sessions, and the agents self-correct about half the time after. The line that stuck with me: “the agent doesn’t know what the verification does and how, it just knows that it can (and in certain cases must) call it.” Hamel Husain’s &lt;a href=&quot;https://hamel.dev/blog/posts/evals-faq/&quot; rel=&quot;noopener noreferrer&quot; target=&quot;_blank&quot;&gt;&lt;em&gt;LLM Evals FAQ&lt;/em&gt;&lt;/a&gt; (January 2026) makes the discipline argument plainly: evals are “part of the development process, similar to how debugging is part of software development.”&lt;/p&gt;
&lt;p&gt;Anthropic’s November 2025 &lt;a href=&quot;https://www.anthropic.com/engineering/effective-harnesses-for-long-running-agents&quot; rel=&quot;noopener noreferrer&quot; target=&quot;_blank&quot;&gt;“effective harnesses” post&lt;/a&gt; is interesting precisely because the same lab that pushed context engineering hardest is now saying “compaction isn’t sufficient” past a certain horizon. Their answer (initializer plus coding agent plus immutable tests plus a progress file) is essentially back-pressure. The tests are immutable so the agent can’t quietly delete the very mechanism that tells it its work sucks. The lesson: back-pressure mechanisms must be available to the agent but safe from its tinkering. Treat the agent like a college kid who just might cheat once the Adderall has worn off at 3 a.m.&lt;/p&gt;
&lt;h2 id=&quot;but-could-this-all-just-be-a-runtime-problem&quot;&gt;But could this all just be a runtime problem?&lt;/h2&gt;
&lt;p&gt;The strongest counter comes from LangChain. In &lt;a href=&quot;https://www.langchain.com/blog/building-langgraph&quot; rel=&quot;noopener noreferrer&quot; target=&quot;_blank&quot;&gt;“Building LangGraph from first principles”&lt;/a&gt; (September 2025) they argue agent reliability is fundamentally a &lt;em&gt;runtime&lt;/em&gt; problem, not an operator-discipline one. Agents are flaky and non-deterministic in ways ordinary code isn’t, so you need durable state with checkpointing, task queues for retries, first-class human-in-the-loop interruption, tracing. Get those right and the instruction and verification questions become tractable.&lt;/p&gt;
&lt;p&gt;They’re right about a real thing. Durable state genuinely solves a class of problems prompt discipline can’t touch: crashes, restarts, review windows measured in days. A great runtime is certainly necessary, just not sufficient. With perfect durability but zero blueprint, context, or back-pressure discipline, the agent can still drift, lose the thread, or confidently ship broken work. It just gets to do all of that across multiple sessions and interruptions.&lt;/p&gt;
&lt;h2 id=&quot;dont-get-frustrated-with-failure-use-it-to-tune-the-pillars&quot;&gt;Don’t get frustrated with failure; use it to tune the pillars.&lt;/h2&gt;
&lt;p&gt;When a long-horizon agent you’re working on starts going haywire, I suggest interrogating these three pillars and adjusting them based on the flavor of “this didn’t work” you encounter:
Did I write the blueprint, detailed instructions, and sub-prompts well?
Did I keep the context clean and free of bloat or irrelevant instructions for each subagent?
Did I give it the tools to actually test and validate its outputs or start over at different checkpoints if certain conditions aren’t met?&lt;/p&gt;
&lt;p&gt;If you build really great rails, the train will run much better.&lt;/p&gt;</content:encoded></item></channel></rss>