We Should Take Text Optimization More Seriously

There is a common negative sentiment I observe among ML researchers toward prompting, or more broadly, text optimization. The underlying view seems to be something like “real learning happens in the weights.” By text optimization, I broadly mean methods that modify the mutable text layer around a model: prompts, context, filesystem state, memory, retrieval databases, and model harnesses.¹ I think this layer should be taken more seriously by the broader research community. I’ll argue for text optimization on three counts:

1. Text optimization is a legitimate update mechanism. It holds the same functional role as gradient-based weight optimization: changing future behavior in response to new information.
2. Text optimization is much more sample-efficient than weight optimization, particularly in the low-data regime. Relatively short, high-likelihood text has low description length, giving text optimization a favorable inductive bias.
3. Text optimization enables a new scaling axis: update-time compute. Reflective text optimization lets a system spend more compute learning from a single experience, the way inference-time scaling lets a model spend more on a single input.

Learning Outside the Weights

Deployed AI systems are no longer just a parameter vector queried in isolation; they are complex, stateful machines with many moving parts, the weights being just one of them. Once this whole system is the object of study, learning can mean changing any behavior-conditioning state. Weights are one state, typically updated through gradient-based optimization. Prompts, memories, retrieval indices, and harness code are others, with different costs, capacities, and failure modes. The important question is which update target is the most appropriate for a given piece of information.

Text artifacts have a useful inductive bias. The usual Kolmogorov-style compression intuition applies: short specifications that explain many cases are more likely to capture real structure than long lists of exceptions. In this sense, good text updates are compact patches to a pretrained world prior. Empirically, text optimization is orders of magnitude more sample-efficient in the low-data regime (1, 2, 3). Because of this, a recurring pattern at scale is to use the text layer to elicit and compose existing capabilities in the model, and then distill this into the weights over time (Anthropic, OpenAI, Cursor, Letta, Hippocratic AI, Harvey).
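One way to make the compression intuition concrete is the classical Occam's-razor generalization bound: for a loss in [0, 1] and a prior P fixed before seeing the data, with probability at least 1 − δ over n i.i.d. samples, every hypothesis h has a gap between true and empirical risk of at most sqrt((ln(1/P(h)) + ln(1/δ)) / (2n)). The sketch below simply evaluates that formula. Treating a text artifact as the hypothesis and its description length as −log2 P(h) is an assumption made here for illustration, and the two description lengths are made-up numbers, not measurements.

```python
import math


def occam_gap(description_bits: float, n_samples: int, delta: float = 0.05) -> float:
    """Occam bound on (true risk - empirical risk) for a loss in [0, 1].

    The hypothesis is assumed to have prior probability 2 ** (-description_bits).
    """
    if n_samples <= 0:
        raise ValueError("n_samples must be positive")
    log_inv_prior = description_bits * math.log(2)  # ln(1 / P(h))
    return math.sqrt((log_inv_prior + math.log(1 / delta)) / (2 * n_samples))


# Hypothetical description lengths, chosen only for illustration.
SHORT_TEXT_BITS = 200          # a few sentences of high-likelihood text
WEIGHT_DELTA_BITS = 1_000_000  # a small weight update

if __name__ == "__main__":
    for n in (10, 100, 1000):
        text_gap = occam_gap(SHORT_TEXT_BITS, n)
        weight_gap = occam_gap(WEIGHT_DELTA_BITS, n)
        print(f"n={n}: text gap <= {text_gap:.3f}, weight gap <= {weight_gap:.3f}")
```

A value above 1 means the bound is vacuous for that sample size. This says nothing about what weights can learn in practice; it only shows why a short, likely artifact is easier to certify from a handful of examples.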

Update-Time Compute: A New Scaling Axis

The text layer enables reflective learning (Reflexion, Trace, GEPA, Meta-Harness): an optimization loop grounded in text can externalize its own hypotheses about how it should change. This makes hypothesis testing at update time something that scales with compute: systems can propose multiple ideas in text and test them against new evidence before accepting or rejecting them, the way a scientist might propose and test multiple theories before settling on one. See e.g. Appendix A.2 of Meta-Harness for a real example of such hypothesis-testing behavior. SGD can’t cheaply do this: it maintains a single running parameter vector and commits each update to it, with no easy way to fork and compare.

I think the core promise of text optimization is that we can scale “update-time compute”: just as inference-time scaling lets a model spend more compute to solve a single instance, reflective text optimization lets a system spend more compute learning from a single experience. A failed trajectory can be reread, diagnosed, abstracted, tested against candidate revisions, and then converted into a proposed update. Text-space learning is therefore especially useful when (1) failures are expensive, (2) the desired behavior is hard to specify, or (3) there is abundant offline trace data that is hard to exploit with other methods such as SFT or offline RL.
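The propose-and-test loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the procedure of any cited method: `reflective_update`, `toy_run`, `toy_score`, and `toy_propose` are hypothetical names, the "model" is a lookup over rules written in the artifact, and the proposer is a fixed list standing in for an LLM that reads the failure trace. The knob for update-time compute is `n_candidates`.

```python
from typing import Callable, Sequence

Artifact = str
Case = tuple[str, str]  # (input, expected output)


def reflective_update(
    current: Artifact,
    failure_trace: str,
    propose: Callable[[Artifact, str, int], Sequence[Artifact]],
    score: Callable[[Artifact, Sequence[Case]], float],
    validation: Sequence[Case],
    n_candidates: int,
) -> Artifact:
    """Return the best artifact among the current one and n_candidates revisions."""
    best, best_score = current, score(current, validation)
    for candidate in propose(current, failure_trace, n_candidates):
        candidate_score = score(candidate, validation)
        if candidate_score > best_score:  # accept only strict improvements
            best, best_score = candidate, candidate_score
    return best


def toy_run(artifact: Artifact, x: str) -> str:
    # Toy "model": answers by looking up "key => value" rules in the artifact.
    rules = dict(line.split(" => ") for line in artifact.splitlines() if " => " in line)
    return rules.get(x, "unknown")


def toy_score(artifact: Artifact, cases: Sequence[Case]) -> float:
    return sum(toy_run(artifact, x) == y for x, y in cases) / len(cases)


def toy_propose(artifact: Artifact, trace: str, k: int) -> list[Artifact]:
    # Stand-in for an LLM that reads the trace and writes k hypotheses.
    hypotheses = ["ping => pang", "ping => pong", "pong => ping"]
    return [artifact + "\n" + h for h in hypotheses[:k]]


if __name__ == "__main__":
    current = "hello => world"
    validation = [("ping", "pong"), ("hello", "world")]
    trace = "input=ping output=unknown"
    for k in (1, 3):
        updated = reflective_update(current, trace, toy_propose, toy_score, validation, k)
        print(k, toy_score(updated, validation))
```

With one candidate the only hypothesis tested is wrong and the artifact is left unchanged; with three, a correct rule is found and accepted. Rejected hypotheses cost compute but never alter the deployed artifact.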

The Strongest Case for Weights, and My Counterpoints

There are some compelling arguments for keeping learning in the weights. For each, I will state my strongest interpretation of the argument, and then respond in rebuttal style.

Weights give amortization. Once a behavior is trained into the model, the system no longer has to carry the full specification of that behavior in every context window. The context window, in contrast, is a finite resource.

This is a strong argument, and I agree that many types of information ultimately belong in weights. For example, LLMs should not need a long prompt explaining basic arithmetic on every request. Even here, though, many pieces of useful information are not stable or general enough to be worth the cost of amortization, as with search agents that gather dynamic internet context or personalized agents that depend on changing user history, preferences, and private state. I think the right framing is a routing problem: weights are where stable, repeatedly useful information belongs, while text is where information stays while it is volatile, local, auditable, or not yet trusted enough to amortize.

Additionally, good text-layer systems do not dump all available information into the context window. They implement progressive disclosure of information, where the system retrieves and conditions on relevant information as needed (RAG, MemGPT, RLM, Anthropic dynamic workflows, Meta-Harness). With the right organization, it’s fairly straightforward to implicitly condition on a much larger context than the model’s input window. When you know what to include, you can pack a surprising amount of information into context. 1% of a 1M-token context is 10K tokens, which is more than three copies of this post; hopefully enough that reading it meaningfully shifts a reader’s mental model.
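As a rough sketch of progressive disclosure, the snippet below ranks stored notes by word overlap with the query and packs the relevant ones into a fixed token budget. Everything here is an assumption for illustration: `select_context` and `approx_tokens` are hypothetical helpers, word overlap stands in for a real retriever, and one token per whitespace-separated word is a crude proxy for a tokenizer.

```python
def approx_tokens(text: str) -> int:
    # Crude assumption: one token per whitespace-separated word.
    return len(text.split())


def select_context(query: str, notes: list[str], budget_tokens: int) -> list[str]:
    """Greedily pack the notes most relevant to the query into the budget."""
    query_words = set(query.lower().split())

    def overlap(note: str) -> int:
        return len(query_words & set(note.lower().split()))

    chosen: list[str] = []
    used = 0
    # sorted() is stable, so equally relevant notes keep their original order.
    for note in sorted(notes, key=overlap, reverse=True):
        cost = approx_tokens(note)
        if overlap(note) == 0 or used + cost > budget_tokens:
            continue  # skip irrelevant notes and notes that do not fit
        chosen.append(note)
        used += cost
    return chosen


if __name__ == "__main__":
    context_window = 1_000_000
    budget = context_window // 100  # 1% of the window, i.e. 10,000 tokens
    notes = [
        "refund policy: refunds are issued within 14 days",
        "shipping takes 3 to 5 business days",
        "the office is closed on public holidays",
    ]
    print(budget, select_context("how long do refunds take", notes, budget))
```

The store can be far larger than the window, since only the selected notes are ever placed in context.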

Even if some information is worth amortizing into weights, it need not be amortized immediately. I’ve come to view the text layer as a kind of flexible “staging ground” for information that may eventually be distilled into weights. This layer makes it very easy to test and refine behavioral hypotheses before committing them to the model. How to evolve the text layer and use it to improve the weights over time is an interesting open research question, whether through direct distillation (1, 2, 3, 4), synthetic data generation (5, 6, 7, 8), modifying the training loop itself (9, 10), or fast-slow learning frameworks (11).

Training the weights creates new neural circuits. Text optimization only ever elicits existing behavior from a fixed set of weights, and given those weights, there is a ceiling on what the text layer can reach.

Agreed that a weak model gives text optimization very little to work with. However, such a ceiling is not unique to text optimization, and this argument has even been made against RL.² Text optimization does not need to create completely new latent capabilities to be useful. Many deployed systems are bottlenecked not by whether the model could in principle perform a certain behavior primitive, but by whether the system can elicit and compose that behavior reliably (mgh). The practical question is therefore how much useful headroom remains between the model’s latent capabilities and the behavior the deployed system actually exhibits.

Empirically, the headroom for improving the text layer is significant. It shows up across retrieval-augmented QA, test-time scaling, and tool-use agents: fixed-model behavior improves when we change the context or execution environment rather than the weights (1, 2, 3, 4, 5). Scale also appears to increase the value of text conditioning: larger models become better at using information supplied at inference time, and some context-conditioned abilities appear only at larger scale (1, 2, 3).

The “existence argument”: the human brain is clearly intelligent. It must be possible to learn by changing weights alone.

I’d actually make a similar existence argument for text optimization. Look at the collection of all written text (books, papers, code, webpages, etc.): good external representations greatly amplify human intelligence. How much would the quality of our work suffer if we were suddenly cut off from all external text?

Anyone can change a text artifact and get a seemingly better-looking output. Text optimization is unusually vulnerable to benchmark leakage and folk theories about model psychology.

First, text optimization has been poorly marketed by its early successes. The most visible examples were amusing model quirks like “let’s think step by step”, “take a deep breath”, “this is very important to my career”, personas, and threats and tipping. It’s perhaps tempting to conclude that text optimization itself will disappear as newer models become more robust to such tricks. But this confuses a weak early framing of the field with the underlying research problem.

It’s very easy to tinker on the text layer: anyone can edit an instruction and declare victory based on cherry-picked outputs.³ This low barrier to entry makes bad science here common. If anything, I view such immature methodological norms as a strong argument for studying text optimization more rigorously, especially given its practical importance.
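One simple guard against declaring victory on cherry-picked outputs is to compare the baseline and edited artifacts on the same held-out examples and test whether the difference could be chance. The sketch below uses an exact two-sided sign test on the examples where the two artifacts disagree. The helper name `sign_test_p_value` and the outcome lists are illustrative assumptions, not results from any experiment.

```python
import math


def sign_test_p_value(baseline_correct: list[bool], edited_correct: list[bool]) -> float:
    """Exact two-sided sign test on paired per-example outcomes.

    Only examples where the two artifacts disagree carry information.
    """
    if len(baseline_correct) != len(edited_correct):
        raise ValueError("outcome lists must be paired, one entry per example")
    pairs = list(zip(baseline_correct, edited_correct))
    wins = sum(1 for b, e in pairs if e and not b)    # edit fixed the example
    losses = sum(1 for b, e in pairs if b and not e)  # edit broke the example
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    # Probability of a split at least this lopsided under a fair coin.
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2**n
    return min(1.0, 2 * tail)


if __name__ == "__main__":
    # Made-up outcomes: the edit fixes 3 examples and breaks none.
    baseline = [False, False, False, True, True, True, True, True]
    edited = [True] * 8
    print(sign_test_p_value(baseline, edited))
```

Three fixed examples with no regressions give a p-value of 0.25, which is weak evidence on its own. Ten fixes with no regressions would give roughly 0.002.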

Gradient descent is a real optimizer. You can lean on the large literature on optimization, generalization, and convergence to understand how it works. Text optimization is heuristic hill-climbing.

Convergence theory only guarantees that you will minimize the proxy loss, not that the proxy matches what you actually care about. A stronger optimizer just exploits this gap. Moreover, the field has largely moved on from theoretical analysis of generalization dynamics to empirical scaling laws and best practices. RL post-training in particular is notoriously finicky and prone to this kind of overfitting (1, 2, 3, 4). In contrast, text-layer edits apply weaker optimization pressure while remaining highly auditable, and in many cases also composable.

Neural networks are universal function approximators and can represent anything.

Representational capacity is not the right thing to look at. Even a two-layer MLP can in principle approximate any continuous function on a bounded domain given enough hidden units, but that doesn’t mean it can learn to do so efficiently or reliably. We should be looking at reachable behavior, i.e., which behaviors are sufficiently high-likelihood under the implicit prior. Harnesses can demonstrably execute behaviors that we wouldn’t expect a frozen model to produce in a single forward pass.

Text artifacts are not portable. They are overfit to one model’s quirks and often break on the next checkpoint.

The relevant comparison is with other update artifacts. A text artifact written for one model may fail on another, but a weight delta trained for one architecture is usually not portable at all. Text artifacts are slightly more portable since text still carries meaning across models.

Perhaps the Pendulum Has Swung Too Far

The “weights are the real learning” view is partly a reaction to early AI, when researchers were focused on building systems out of hand-written symbols and rules rather than learned parameters. For decades, the dominant picture treated intelligence as explicit symbol manipulation. Newell and Simon’s physical symbol system hypothesis and Haugeland’s GOFAI are canonical examples of this mindset. Neural networks showed that this was too narrow: useful information can clearly live in weights, and modern LLMs are the strongest evidence for that claim.

We seem to have overcorrected towards viewing weights as the only serious home for knowledge. This is strange when zoomed out because human cognition routinely depends on external artifacts. In Cognition in the Wild, Edwin Hutchins analyzes ship navigation as a cognitive system made of people, instruments, procedures, and external representations. Clark and Chalmers make a related point in The Extended Mind: the boundary of a cognitive system can extend beyond the internal state of a single component. The computer-science version of this lineage runs at least back to Vannevar Bush’s Memex: an external memory organized around associative trails through a personal archive. Modern tools-for-thought systems like Notion and Obsidian are concrete attempts to make external memory part of everyday knowledge work.

Scientific practice is a useful comparison. One of the core goals of science is to construct compact representations of the world, which is aided by private intuitions inside scientists’ heads but not reducible to them. The usual products are crystallized: an abstraction, a theorem, or a causal model, which can be written down and shared. Their value comes in large part from externalization: they can be criticized, compared against new evidence, revised, and applied to new cases. Text artifacts occupy a similar functional role in model systems: they are external representations that encode behavior-relevant abstractions. Updating them is “learning” in the same sense that revising a scientific theory in light of new evidence is learning.

A Call for Good Research on the Text Layer

I think text optimization deserves the same kind of community we built around weight optimization, and I wish there were more high-quality research here. Several directions seem ripe for foundational work in the very near future:

Theoretical analysis of the text layer. Generally, text space gives a much better prior than weight space, and cleanly formalizing this observation could be very useful for guiding practice. This old-ish paper is a promising start applying PAC-Bayes to prompts in 2023-level models, which seems very much worth revisiting with the latest generation of models and text artifacts.
Better evals. CL-bench is an initial attempt at a proper eval for context learning, and agentic benchmarks like TerminalBench-2 have partly become a battleground for harnesses. Still, we need more benchmarks that isolate useful properties of the text layer, controlling for weight capability while flagging the weird new classes of overfitting and cheating that the text layer enables.
“Architecture research”, i.e., understanding the design space. There are so many proposed designs for the text layer, from the instruction hierarchy and DSPy programs to agent skills, OpenClaw-style agents, and the massive number of memory system designs. There is a sense in which these are all points in one huge design space, but we don’t have a good way to talk about that space, let alone compare different points in it.
HCI research on how to elicit input from humans to optimize the text layer, and how to present the system’s internal state back to users for inspection and revision. I think figuring out the right ways to interact with the text layer can make it economically viable to routinely have top domain experts sit down for “verbal fine-tuning” sessions with AI systems. I don’t know of a good example of work in this direction, but this paper of mine had essentially this motivation, though it worked in a very limited domain.
Seriously scaling up text optimization, including establishing scaling laws. The compute budgets currently allocated to text optimization are orders of magnitude smaller than weight post-training scale. For example, a scaled-up artifact might look like a Wikipedia-scale knowledge/harness layer, optimized from the ground up against measurable model-system performance⁴.

Thanks to Omar Khattab, Allen Nie, Chelsea Finn, Alex Zhang, Ahmad Beirami, and Qizheng Zhang for excellent feedback on an earlier draft. This post is a distillation of many conversations with researchers over the past year or so, which I won’t attempt to list here in full.

Footnotes

1. I use “text” because language is the clearest and most common case, but the argument should apply more broadly to external artifacts that can condition a model’s future behavior, including images, audio, video, and other tokenized state.
2. This is contested. ProRL and The Art of Scaling RL Compute argue that with the right training recipe, RL can expand reasoning capacity beyond the base model. Personally, I think the truth is somewhere in the middle: RL by design should be able to discover new behaviors, but there’s definitely a strong empirical dependence on the quality of the base model. Either way, the details here don’t matter for the argument I’m making in the post.
3. I do see this happen somewhat often, at least more than in weight-space research. Everyone seems to have a strong opinion on what the best system prompt or skill is. This probably has more to do with social media dynamics: “one weird trick to make your model 10x smarter” is much more actionable than any weight-space intervention.
4. I’ve started to think this may be a good startup.