Yoonho Lee

We Should Take Text Optimization More Seriously

2026-06-08T00:00:00+00:00

There is a common negative sentiment I observe among ML researchers toward prompting, or more broadly, text optimization. The underlying view seems to be something like “real learning happens in the weights.” By text optimization, I broadly mean methods that modify the mutable text layer around a model: prompts, context, filesystem state, memory, retrieval databases, and model harnesses.¹ I think this layer should be taken more seriously by the broader research community. I’ll argue for text optimization on three counts:

Text optimization is a legitimate update mechanism. It holds the same functional role as gradient-based weight optimization: changing future behavior in response to new information.
Text optimization is much more sample-efficient than weight optimization, particularly in the low-data regime. Relatively short, high-likelihood text has low description length, giving text optimization a favorable inductive bias.
Text optimization enables a new scaling axis: update-time compute. Reflective text optimization lets a system spend more compute learning from a single experience, the way inference-time scaling lets a model spend more on a single input.

Learning Outside the Weights

Deployed AI systems are no longer just a parameter vector queried in isolation; they are complex, stateful machines with many moving parts, the weights being just one of them. Once this whole system is the object of study, learning can mean changing any behavior-conditioning state. Weights are one state, typically updated through gradient-based optimization. Prompts, memories, retrieval indices, and harness code are others, with different costs, capacities, and failure modes. The important question is which update target is the most appropriate for a given piece of information.

Text artifacts have a useful inductive bias. The usual Kolmogorov-style compression intuition applies: short specifications that explain many cases are more likely to capture real structure than long lists of exceptions. In this sense, good text updates are compact patches to a pretrained world prior. Empirically, text optimization is orders of magnitude more sample-efficient in the low-data regime (1, 2, 3). Because of this, a recurring pattern at scale is to use the text layer to elicit and compose existing capabilities in the model, and then distill this into the weights over time (Anthropic, OpenAI, Cursor, Letta, Hippocratic AI, Harvey).
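One standard way to make the compression intuition precise is the Occam bound from learning theory. This is a textbook result, not something derived in this post, and applying it here assumes that each text artifact $h$ is treated as a hypothesis with a prefix-free encoding of $|h|$ bits, that the $n$ examples are i.i.d., and that the loss is bounded in $[0, 1]$.

```latex
% Occam bound with prior P(h) = 2^{-|h|} over text artifacts h.
% With probability at least 1 - \delta over the n examples,
% simultaneously for every h:
L(h) \;\le\; \hat{L}_n(h) + \sqrt{\frac{|h| \ln 2 + \ln(1/\delta)}{2n}}
```

Here $L(h)$ is the true loss and $\hat{L}_n(h)$ the loss on the $n$ observed examples. When $n$ is small, the second term stays small only if $|h|$ is small, which is the sense in which short text updates have a favorable prior. Replacing $|h| \ln 2$ with the negative log-likelihood of $h$ under a pretrained model gives the "high-likelihood text" version of the same statement.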

Update-Time Compute: A New Scaling Axis

The text layer enables reflective learning (Reflexion, Trace, GEPA, Meta-Harness): an optimization loop grounded in text can externalize its own hypotheses about how it should change. This makes hypothesis testing scalably useful at update time: systems can propose multiple ideas in text and test them against new evidence before accepting or rejecting them, the way a scientist might propose and test multiple theories before settling on one. See e.g. Appendix A.2 of Meta-Harness for a real example of such hypothesis-testing behavior. SGD can’t cheaply do this; its single running parameter vector commits each update, with no easy way to fork and compare.

I think the core promise of text optimization is that we can scale “update-time compute”: just as inference-time scaling lets a model spend more compute to solve a single instance, reflective text optimization lets a system spend more compute learning from a single experience. A failed trajectory can be reread, diagnosed, abstracted, tested against candidate revisions, and then converted into a proposed update. Text-space learning is therefore especially useful when (1) failures are expensive, (2) the desired behavior is hard to specify, or (3) there is abundant offline trace data that does not work well otherwise (SFT or offline RL).
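The fork-and-compare step can be sketched in a few lines. This is a minimal illustration, not the procedure of any cited system: `select_update`, `toy_policy`, and the exact-match accuracy criterion are all hypothetical names and choices made for the example, and `toy_policy` stands in for a frozen model.

```python
from typing import Callable, Sequence

# Hypothetical interface: a policy maps (prompt, input) -> output.
Policy = Callable[[str, str], str]


def select_update(
    current: str,
    candidates: Sequence[str],
    evidence: Sequence[tuple[str, str]],
    policy: Policy,
) -> str:
    """Return the prompt (current or a candidate) with the best accuracy.

    The current prompt is kept on ties, so a candidate is accepted only if
    it strictly improves on the evidence.
    """

    def accuracy(prompt: str) -> float:
        hits = sum(policy(prompt, x) == y for x, y in evidence)
        return hits / len(evidence)

    best, best_score = current, accuracy(current)
    for cand in candidates:
        score = accuracy(cand)
        if score > best_score:
            best, best_score = cand, score
    return best


# Toy stand-in for a frozen model: it strips whitespace only when told to.
def toy_policy(prompt: str, x: str) -> str:
    return x.strip() if "strip whitespace" in prompt else x


evidence = [(" a ", "a"), ("b ", "b"), ("c", "c")]
chosen = select_update(
    current="Echo the input.",
    candidates=[
        "Echo the input in uppercase.",
        "Echo the input and strip whitespace.",
    ],
    evidence=evidence,
    policy=toy_policy,
)
```

Spending more update-time compute here means proposing more candidates or testing each one against more evidence before committing; the weights never change.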

The Strongest Case for Weights, and My Counterpoints

There are some compelling arguments for keeping learning in the weights. For each, I will state my strongest interpretation of the argument, and then respond in rebuttal style.

Weights give amortization. Once a behavior is trained into the model, the system no longer has to carry the full specification of that behavior in every context window. The context window, in contrast, is a finite resource.

I agree that this is a strong argument for many types of information to ultimately belong in weights; for example, LLMs should not need a long prompt to explain basic arithmetic for every request. Even here, though, many pieces of useful information are not stable or general enough to be worth the cost of amortization, as with search agents that gather dynamic internet context or personalized agents that depend on changing user history, preferences, and private state. I think the right framing is as a routing problem: weights are where stable, repeatedly useful information belongs, while text is where information stays while it is volatile, local, auditable, or not yet trusted enough to amortize.

Additionally, good text-layer systems do not dump all available information into the context window. They implement progressive disclosure of information, where the system retrieves and conditions on relevant information as needed (RAG, MemGPT, RLM, Anthropic dynamic workflows, Meta-Harness). With the right organization, it’s fairly straightforward to implicitly condition on a much larger context than the model’s input window. When you know what to include, you can pack a surprising amount of information into context. 1% of a 1M-token context is 10K tokens, which is more than three copies of this post; hopefully enough that reading it meaningfully shifts a reader’s mental model.
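As a rough sketch of progressive disclosure, the snippet below loads only the notes that look relevant to a query, within a token budget. It is not how any of the cited systems work: `disclose`, the word-overlap relevance score, and whitespace token counting are all placeholder assumptions standing in for a real retriever and tokenizer.

```python
def disclose(query: str, notes: dict[str, str], budget_tokens: int) -> list[str]:
    """Pick note titles to load, most query-relevant first, within a budget.

    Relevance is word overlap with the query and token counts are whitespace
    word counts; both are crude placeholders.
    """
    query_words = set(query.lower().split())

    def overlap(title: str) -> int:
        return len(query_words & set(notes[title].lower().split()))

    chosen, used = [], 0
    for title in sorted(notes, key=overlap, reverse=True):
        cost = len(notes[title].split())
        # Skip irrelevant notes and notes that would exceed the budget.
        if overlap(title) > 0 and used + cost <= budget_tokens:
            chosen.append(title)
            used += cost
    return chosen


notes = {
    "deploy": "run the deploy script after tests pass on the release branch",
    "style": "prefer short functions and descriptive names",
    "oncall": "page the oncall engineer if the deploy script fails twice",
}
query = "deploy script fails on release branch"
selected = disclose(query, notes, budget_tokens=12)

# The budget arithmetic from the text: 1% of a 1M-token window.
budget = 1_000_000 // 100
```

With a 12-token budget only the most relevant note is loaded; raising the budget lets the second-most relevant note in as well, while the unrelated note is never loaded.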

Even if some information is worth amortizing into weights, it need not be amortized immediately. I’ve come to view the text layer as a kind of flexible “staging ground” for information that may eventually be distilled into weights. This layer makes it very easy to test and refine behavioral hypotheses before committing them to the model. The mechanics of how to evolve the text layer and use it to improve the weights over time is an interesting open research question, whether through direct distillation (1, 2, 3, 4), synthetic data generation (5, 6, 7, 8), modifying the training loop itself (9, 10), or fast-slow learning frameworks (11).

Training the weights creates new neural circuits. Text optimization only ever elicits existing behavior from a fixed set of weights, and given those weights, there is a ceiling on what the text layer can reach.

Agreed that a weak model gives text optimization very little to work with. However, such a ceiling is not unique to text optimization, and this argument has even been made against RL.² Text optimization does not need to create completely new latent capabilities to be useful. Many deployed systems are bottlenecked not by whether the model could in principle perform a certain behavior primitive, but by whether the system can elicit and compose that behavior reliably (mgh). The practical question is therefore how much useful headroom remains between the model’s latent capabilities and the behavior the deployed system actually exhibits.

Empirically, the headroom for improving the text layer is significant. It shows up across retrieval-augmented QA, test-time scaling, and tool-use agents: fixed-model behavior improves when we change the context or execution environment rather than the weights (1, 2, 3, 4, 5). Scale also appears to increase the value of text conditioning: larger models become better at using information supplied at inference time, and some context-conditioned abilities appear only at larger scale (1, 2, 3).

The “existence argument”: the human brain is clearly intelligent. It must be possible to learn by changing weights alone.

I’d actually make a similar existence argument for text optimization. Look at the collection of all written text (books, papers, code, webpages, etc.): good external representations greatly amplify human intelligence. How much would the quality of our work suffer if we were suddenly cut off from all external text?

Anyone can change a text artifact and get a seemingly better-looking output. Text optimization is unusually vulnerable to benchmark leakage and folk theories about model psychology.

First, text optimization has been poorly marketed by its early successes. The most visible examples were amusing model quirks like “let’s think step by step”, “take a deep breath”, “this is very important to my career”, personas, and threats and tipping. It’s perhaps tempting to conclude that text optimization itself will disappear as newer models become more robust to such tricks. But this confuses a weak early framing of the field with the underlying research problem.

It’s very easy to tinker on the text layer: anyone can edit an instruction and declare victory based on cherry-picked outputs.³ This low barrier to entry makes bad science here common. If anything, I view such immature methodological norms as a strong argument for studying text optimization more rigorously, especially given its practical importance.

Gradient descent is a real optimizer. You can lean on the large literature on optimization, generalization, and convergence to understand how it works. Text optimization is heuristic hill-climbing.

Convergence theory only guarantees that you will minimize the proxy loss, not that the proxy matches what you actually care about. A stronger optimizer just exploits this gap; the field has largely moved on from theoretical analysis of generalization dynamics to empirical scaling laws and best practices. RL post-training in particular is notoriously finicky and prone to this kind of overfitting (1, 2, 3, 4). In contrast, text-layer edits apply weaker optimization pressure while remaining highly auditable, and in many cases also composable.

Neural networks are universal function approximators and can represent anything.

Representational capacity is not the right thing to look at; even a sufficiently wide two-layer MLP can in principle approximate any continuous function, but that doesn’t mean it can learn to do so efficiently or reliably. We should be looking at reachable behavior, i.e., what behaviors are sufficiently high-likelihood under the implicit prior. Harnesses can demonstrably execute behaviors that we wouldn’t expect a frozen model to produce in a single forward pass.

Text artifacts are not portable. They are overfit to one model’s quirks and often break on the next checkpoint.

The relevant comparison is with other update artifacts. A text artifact written for one model may fail on another, but a weight delta trained for one architecture is usually not portable at all. Text artifacts are slightly more portable since text still carries meaning across models.

Perhaps the Pendulum Has Swung Too Far

The “weights are the real learning” view is partly a reaction to early AI, when researchers were focused on hand-building symbolic systems rather than on systems that learn by changing their internal parameters. For decades, the dominant picture treated intelligence as explicit symbol manipulation. Newell and Simon’s physical symbol system hypothesis and Haugeland’s GOFAI are canonical examples of this mindset. Neural networks showed that this was too narrow: useful information can clearly live in weights; modern LLMs are the strongest evidence for that claim.

We seem to have overcorrected towards viewing weights as the only serious home for knowledge. This is strange when zoomed out because human cognition routinely depends on external artifacts. In Cognition in the Wild, Edwin Hutchins analyzes ship navigation as a cognitive system made of people, instruments, procedures, and external representations. Clark and Chalmers make a related point in The Extended Mind: the boundary of a cognitive system can extend beyond the internal state of a single component. The computer-science version of this lineage runs at least back to Vannevar Bush’s Memex: an external memory organized around associative trails through a personal archive. Modern tools-for-thought systems like Notion and Obsidian are concrete attempts to make external memory part of everyday knowledge work.

Scientific practice is a useful comparison. One of the core goals of science is to construct compact representations of the world, which is aided by private intuitions inside scientists’ heads but not reducible to them. The usual products are crystallized: an abstraction, a theorem, or a causal model, which can be written down and shared. Their value comes in large part from externalization: they can be criticized, compared against new evidence, revised, and applied to new cases. Text artifacts occupy a similar functional role in model systems: they are external representations that encode behavior-relevant abstractions. Updating them is “learning” in the same sense that revising a scientific theory in light of new evidence is learning.

A Call for Good Research on the Text Layer

I think text optimization deserves the same kind of community we built around weight optimization, and I wish there were more high-quality research here. Several directions seem ripe for foundational work in the very near future:

Theoretical analysis of the text layer. Generally, text space gives a much better prior than weight space, and cleanly formalizing this observation could be very useful for guiding practice. This old-ish paper is a promising start applying PAC-Bayes to prompts in 2023-level models, which seems very much worth revisiting with the latest generation of models and text artifacts.
Better evals. CL-bench is an initial attempt at a proper eval for context learning, and agentic benchmarks like TerminalBench-2 have partly become a battleground for harnesses. Still, we need more benchmarks that isolate useful properties of the text layer, controlling for weight capability while flagging the weird new classes of overfitting and cheating that the text layer enables.
“Architecture research”, i.e., understanding the design space. There are so many proposed designs for the text layer, from the instruction hierarchy, DSPy programs, agent skills, and OpenClaw-style agents to the massive number of memory system designs. There is a sense in which these are all points in one huge design space, but we don’t have a good way to talk about that space, let alone compare different points in it.
HCI research on how to elicit input from humans to optimize the text layer, and how to present the system’s internal state back to users for inspection and revision. I think figuring out the right ways to interact with the text layer can make it economically viable to routinely have top domain experts sit down for “verbal fine-tuning” sessions with AI systems. I don’t know of a good example of work in this direction, but this paper of mine had essentially this motivation, though it worked in a very limited domain.
Seriously scaling up text optimization, including establishing scaling laws. The compute budgets currently allocated to text optimization are orders of magnitude smaller than weight post-training scale. For example, a scaled-up artifact might look like a Wikipedia-scale knowledge/harness layer, optimized from the ground up against measurable model-system performance⁴.

Thanks to Omar Khattab, Allen Nie, Chelsea Finn, Alex Zhang, Ahmad Beirami, and Qizheng Zhang for excellent feedback on an earlier draft. This post is a distillation of many conversations with researchers over the past year or so, which I won’t attempt to list here in full.

Footnotes

¹ I use “text” because language is the clearest and most common case, but the argument should apply more broadly to external artifacts that can condition a model’s future behavior, including images, audio, video, and other tokenized state.
² This is contested. ProRL and The Art of Scaling RL Compute argue that with the right training recipe, RL can expand reasoning capacity beyond the base model. Personally, I think the truth is somewhere in the middle: RL by design should be able to discover new behaviors, but there’s definitely a strong empirical dependence on the quality of the base model. Either way, the details here don’t matter for the argument I’m making in the post.
³ I do see this happen somewhat often, at least more than in weight-space research. Everyone seems to have a strong opinion on what the best system prompt or skill is. This probably has more to do with social media dynamics: “one weird trick to make your model 10x smarter” is much more actionable than any weight-space intervention.
⁴ I’ve started to think this may be a good startup.

Meta-Harness: End-to-End Optimization of Model Harnesses

2026-03-29T00:00:00+00:00

Project page for Meta-Harness. Interactive demo, results on text classification, math reasoning, and agentic coding.

What Is Taste?

2026-03-22T00:00:00+00:00

A recent paper trains a model to predict which of two papers will receive more citations, and calls this learning scientific taste. The results look solid, but the paper left me unsatisfied. Citations are a byproduct of taste; predicting them is sort of like training a film critic by predicting box office revenue. It measures something real but misses the thing itself. This got me thinking: what is taste, exactly?

What Taste Is (and Isn’t)

It seems that taste is one of those things that people feel viscerally but resist precise definition. Here I’ve collected some of the attempts at defining it, each of which I think captures a different aspect of the same thing.

Ira Glass describes the taste gap: you get into creative work because you have good taste, but early on, your ability can’t match it. You can tell your work falls short, and that gap is painful enough that many people quit. The ones who don’t quit close the gap through sheer volume of work. In ML language, the verifier (taste) runs ahead of your generator (execution).

Chris Olah distinguishes research intimacy from research taste. Intimacy is internalizing raw, undigested knowledge about your domain: e.g., memorizing hundreds of neurons in InceptionV1 and knowing how they behave. Taste is different, but intimacy feeds it. Olah suspects that many “brilliant insights” are natural next steps for someone deeply intimate with a topic, and that deep intimacy is “one of the key ingredients in beating the research taste market.”

Michael Nielsen identifies two researcher archetypes: the problem-solver and the problem-creator. Problem-solvers attack well-posed challenges, and problem-creators ask new questions or find simple connections no one noticed. Arguably, the problem-creator’s core skill is taste: knowing which questions to ask, which areas will thrive, which promising ideas won’t pan out. Richard Hamming makes the same point more bluntly: “What are the important problems of your field? Why aren’t you working on them?” The ability to answer the first question is taste. Hamming also noticed that people who worked with their office doors open, despite constant interruptions, ended up working on more important problems than those who kept their doors closed. The closed-door people were more productive day to day, but “somehow they seem to work on slightly the wrong thing.” Taste, it seems, is sharpened by ambient exposure to peers.

Harriet Zuckerman interviewed dozens of American Nobel laureates for the book Scientific Elite. She found that the primary benefit of apprenticeship under great scientists was adopting their research style and standards rather than access to resources. Many laureates identified the simplicity of solutions as a mark of taste.

Michael Polanyi discusses something related to taste in Personal Knowledge. His central concept is tacit knowledge: we always know more than we can tell. For example, a cyclist can’t articulate the physics of balance. Polanyi argues that all explicit knowledge rests on a tacit substrate, and that scientific discovery depends on trained intuition and aesthetic judgment.

My short synthesis of these descriptions is that taste is an emergent felt sense, acquired bottom-up through practice and proximity. It operates as a fast, pre-verbal filter on an enormous space of possibilities. You recognize it in others but can’t directly transfer it, aside from osmosis over consistent interactions.

Why Defining Taste Matters Now

I believe that taste is the core skill of research. It’s what tells you which question to ask, which results are surprising, which directions are worth your time. Precisely because it’s so difficult to define, it’s a skill that current models are far from having.

Can models eventually develop taste? I think so. The “human existence proof” shows that some configuration of neurons “has taste” in the sense that it can reliably produce good research, and I see no first-principles reason to believe artificial networks can’t. But if the best human researchers can’t articulate their own taste, it’s not obvious what loss or reward would elicit it from a model. The research community will likely push beyond citation counts toward higher-bandwidth proxies: test-of-time awards, reviewer discourse, survey coverage, and the broader written conversation around papers. There is dense signal here. These are all lagging indicators, but the lag gets easier to bridge if you know what they’re proxying for.

Thanks to Kangwook Lee for a conversation that motivated this post and feedback on an early draft.

Are We Managers Now?

2026-03-12T00:00:00+00:00

Paul Graham writes in “Maker’s Schedule, Manager’s Schedule”:

There are two types of schedule, which I’ll call the manager’s schedule and the maker’s schedule. The manager’s schedule is for bosses. It’s embodied in the traditional appointment book, with each day cut into one hour intervals. You can block off several hours for a single task if you need to, but by default you change what you’re doing every hour. […]

For someone on the maker’s schedule, having a meeting is like throwing an exception. It doesn’t merely cause you to switch from one task to another; it changes the mode in which you work.

He makes the case that the maker’s schedule and the manager’s schedule are fundamentally incompatible, and I’ve always worked on the maker’s schedule. But I noticed that heavy use of Claude Code has shifted my role. Instead of doing the work myself, I mostly specify tasks and evaluate outputs. That is managerial work, and I’m not very good at it yet.

Learning to Manage

AI agents execute tasks quickly, which creates effectively infinite demand for review. The pull to check outputs immediately is strong, but humans are not built to process a constant stream of results. I think makers will benefit from learning and adopting some structure for sustainable (human) management. I’m experimenting with a few ways of working with AI agents.

Viewing my job as defining the goal and plan, and treating my evaluation as a scarce resource to allocate.
The “meeting/briefing” model: one dense sync session followed by long autonomous AI execution. This might look like spending an hour writing a detailed plan, pre-empting any blockers, and dispatching the agent. Reviewing only after everything finishes.
Batching. Grouping agent tasks by the type of thinking they require from me.
Boundaries. Just because an agent finishes does not mean I have to look immediately.

I think this will be made even better by improvements in test-time optimization loops, which allow LLMs to autonomously perform meaningful work over long periods of time. In these systems, the human defines an objective and the model iterates against it, replacing many small supervision decisions with a small number of objective definitions. This role seems far more sustainable than managing a stream of agent outputs in real time.

Two Modes of AI Work

That said, not all AI work feels managerial. AI seems to create two different modes of work. I play a manager role when delegating tasks to agents, but when interacting with a model in real time (iterating, testing ideas, building something together) the maker loop is still there, and in fact it’s become even better in my experience. That mode, where each response pushes the work forward in real time, has produced some of the strongest creative flow states I’ve experienced.

Following the Text Gradient at Scale

2025-12-01T00:00:00+00:00

Cross-posted from SAIL blog.

RL Throws Away Almost Everything Evaluators Have to Say

When you get feedback on your work, it usually tells you what went wrong and how to fix it. But existing reinforcement learning (RL) algorithms throw most of that information away: they compress potentially rich feedback into a single number, a reward¹, then try to learn by correlating rewards with actions across hundreds or thousands of attempts. We do this because our algorithms were designed for scalar supervision, not because of a fundamental constraint in learning from experience².

To illustrate this, let’s consider a simple example. Suppose you’re judging cakes. You take a bite, you like the shavings on top, the ganache is perfectly tempered, but you want way more cherries throughout. Yet you only record: “4/5.” The baker learns nothing about the cherry distribution or what else worked well, only that this cake scored higher than a 3. If this is the only information you provide, the baker will likely have to do a lot more baking to figure out what you actually want.³

More generally, an expert given a candidate solution can articulate specific failure modes, causal mechanisms, and concrete fixes. Full verbal feedback for the cake above would contain far more actionable information than the numerical score “4/5”. The baker can confidently keep the parts of the cake that worked well, rather than blindly exploring recipe variations. This is the core insight: rich feedback enables targeted improvements rather than random exploration, resulting in fewer trials to achieve a better outcome.

This striking mismatch between the information available and the information used by RL has been aptly described as sucking supervision through a straw: you run minutes of rollout and compress it all into a final reward signal broadcast across the entire trajectory. This scalar bottleneck is becoming increasingly costly in the tasks we’re deploying LLMs on: for example, research agents run 5-30 minutes per task. Each run produces rich diagnostic logs—tool calls, intermediate reasoning, error traces—all of which are collapsed into a single scalar that discards the causal signal of where and why things failed. While rich feedback requires a bit more work from the evaluator, they’ve already done the reasoning; we’re just asking them to write it down. When rollouts themselves are expensive, the marginal annotation cost is small relative to the sample-efficiency gains we can achieve with richer feedback.

In this post, we survey an emerging learning paradigm that fully embraces all the feedback an environment has to offer—avoiding the scalar bottleneck of RL—and discuss our recent work called Feedback Descent (check out the paper here), which outperforms specialized RL methods in challenging optimization domains such as molecular design and prompt optimization.

From Scalar Rewards to Text-Based Optimization

A growing body of work hints at an alternative to reward-based learning: directly using rich feedback to guide model improvement. Given a textual artifact (e.g., a prompt, source code, molecule specs), we can often provide natural-language explanations of how to improve it. That explanation is already a form of supervision. Rather than compressing it into a single number, we can feed it back into the system during the update.

In recent work, two broad patterns have emerged around this principle:

Critique-based or “text gradient” methods. The model proposes an artifact and receives a natural-language critique of its errors or omissions. The critique explicitly suggests a direction of improvement: adjust the retrieval query, remove this redundancy, change the control flow, etc. A revised artifact is then produced by editing the original in line with the critique. This pattern appears in systems such as Self-Refine, APO, Trace, and TextGrad.
Evolutionary methods. Instead of iteratively editing a single artifact, these methods maintain a population of artifacts. Language models generate mutations and recombinations conditioned on the current population, and evaluators select the better ones. Iterating this variation-selection loop gradually shifts the population toward higher-performing algorithms or designs, as in EvoPrompt, GEPA, and AlphaEvolve/OpenEvolve, and has driven novel mathematical discoveries.

Both lines demonstrate the same underlying principle: textual feedback can serve as structured supervision, often far more informative than scalar rewards. In the remainder of this post, we build a single, domain-agnostic loop around the two primitives these approaches rely on: an evaluator that produces structured feedback, and an editor that turns accumulated feedback on the current best candidates into concrete revisions. We will demonstrate how this loop can sustain meaningful improvement for up to 1000 iterations, far beyond the stability range of standard self-refinement methods.

Example Domain: Drug Discovery

Let’s make this concrete with a problem where the stakes are high and real-world evaluation is expensive: computational drug discovery. The goal is to find small molecules that bind strongly to a target protein, a critical first step in developing new therapeutics. We can navigate the (huge) space of possible molecules using a standard text representation called SMILES: for example, COCCc1ccc(OCC(O)CNC(C)C)cc1 is metoprolol, one of the most prescribed blockers of ADRB1 (one of our target proteins). Given a target protein, docking simulators can give us a proxy score for binding affinity; if we treat this as a standard RL-style optimization problem, the environment returns only a single scalar reward for each SMILES:

Molecule 1 (O=C(O)C1=CC=CC=C2C=CCCCCCN1C(=O)c1cccc(c1)C2): Reward = 5.037
Molecule 2 (COCCc1ccc(OCC(O)CNC(C)C)cc1): Reward = 4.236

A scalar reward like this hides almost everything about why one molecule is better than the other. But nothing stops us from designing evaluators that expose a much richer structure. For each candidate, the evaluator can report a detailed breakdown of its molecular properties. Below is a small subset of RDKit-computed features for these two molecules⁴:

Property	Molecule 1	Molecule 2	Implication
Core scaffold	macrocycle	benzene	Rigid fused system vs. flexible benzene
Docking score	−6.8	−7.1	Molecule 1 binds weaker
Drug-likeness (QED)	0.824	0.714	Molecule 1 is more drug-like
Basic amines	0	1	No salt bridge with Asp138—explains weak binding
Rotatable bonds	1	9	Rigidity boosts QED but pre-organizes wrong pose
LogP	4.3	1.6	Too lipophilic, solubility risk

A medicinal chemist would be able to look through this table and reason almost immediately:

Molecule 1 has a better overall drug-likeness, but a worse docking score.
Molecule 1’s lack of basic amines explains the weak docking score. Molecule 2 binds better because it forms a salt bridge.
A promising candidate would merge the strengths of both: keep the favorable macrocycle scaffold of Molecule 1, but introduce a basic amine.

The scalar reward reveals none of this structure, but the rich feedback exposes a clear path forward. This is precisely the kind of targeted, interpretable guidance that feedback-driven optimization can exploit (and what traditional scalar-based RL discards).
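To make the contrast concrete, here is a small sketch of an evaluator that writes its reasoning down instead of returning one number. The property values are copied from the table above; the function name `textual_feedback`, the thresholds, and the wording of each note are illustrative assumptions, not the evaluator used in the paper.

```python
# Property values copied from the table above.
molecule_1 = {
    "docking_score": -6.8,
    "qed": 0.824,
    "basic_amines": 0,
    "rotatable_bonds": 1,
    "logp": 4.3,
}
molecule_2 = {
    "docking_score": -7.1,
    "qed": 0.714,
    "basic_amines": 1,
    "rotatable_bonds": 9,
    "logp": 1.6,
}


def textual_feedback(props: dict[str, float]) -> list[str]:
    """Turn a property breakdown into notes an editor can act on.

    The thresholds below are assumptions chosen for illustration only.
    """
    notes = [f"Docking score {props['docking_score']}, QED {props['qed']}."]
    if props["basic_amines"] == 0:
        notes.append("No basic amine: cannot form a salt bridge with the target.")
    if props["logp"] > 4.0:
        notes.append(f"LogP {props['logp']} is high: solubility risk.")
    if props["rotatable_bonds"] <= 2:
        notes.append("Very rigid: check that the pre-organized pose fits the pocket.")
    return notes


feedback_1 = textual_feedback(molecule_1)
feedback_2 = textual_feedback(molecule_2)
```

Under these assumed rules, Molecule 1 gets three specific warnings in addition to its scores, while Molecule 2 gets only the score line. A scalar reward would hide that difference.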

The Feedback Descent Algorithm

Having seen how rich feedback can reveal an actionable structure beyond what a scalar reward can, we now describe the general framework that turns this idea into a scalable optimization procedure.

Feedback Descent is a domain-agnostic loop built from two components:

Evaluators: rich feedback instead of scalars. An evaluator (an LLM judge, a programmatic tool, or even a human) provides natural-language feedback describing what worked and what didn’t. For different domains, the feedback may include chemical properties and nearest neighbors in a database (molecules), missing structure or aesthetic flaws (SVG images), or reasoning errors and unmet conditions (prompts). The evaluator exposes why an artifact performs the way it does, rather than just whether it did well or poorly.
Editors: revisions guided by accumulated feedback. The editor is an LLM that takes the top candidates and the evaluator’s accumulated feedback and outputs a revised version. This is the descent step: the LLM implicitly incorporates the strongest signals from prior feedback into its next proposal.

Feedback Descent alternates these two steps while maintaining a small frontier of top-performing candidates. For each newly proposed candidate, the evaluator provides feedback. We aggregate all prior feedback and pass it to the editor, which then proposes a new candidate.

Over many iterations, the candidate population continually improves as useful feedback accumulates and unproductive directions are discarded. Since both evaluation and editing occur entirely through text, the same loop transfers cleanly across domains, with the only domain-specific component being the evaluator that supplies feedback.
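The loop described above can be sketched in a few lines. This is an illustrative toy under stated assumptions, not the implementation from the paper: `feedback_descent`, `toy_evaluate`, and `toy_edit` are hypothetical names, the "artifact" is a five-letter string, and a rule-based function stands in for the editor LLM so the example runs without any model.

```python
import re
from typing import Callable, NamedTuple


class Evaluation(NamedTuple):
    score: float
    feedback: str  # natural-language explanation, not just a number


def feedback_descent(
    initial: str,
    evaluate: Callable[[str], Evaluation],
    edit: Callable[[list, list], str],
    iterations: int = 10,
    frontier_size: int = 3,
) -> str:
    """Alternate evaluator and editor steps while keeping a small frontier."""
    first = evaluate(initial)
    frontier = [(first, initial)]          # top-performing candidates
    history = [(initial, first.feedback)]  # all feedback gathered so far
    for _ in range(iterations):
        top = [candidate for _, candidate in frontier]
        candidate = edit(top, history)     # editor proposes a revision
        result = evaluate(candidate)       # evaluator explains how it did
        history.append((candidate, result.feedback))
        frontier.append((result, candidate))
        frontier.sort(key=lambda pair: pair[0].score, reverse=True)
        del frontier[frontier_size:]       # discard weaker candidates
    return frontier[0][1]


TARGET = "hello"  # toy objective; assumes candidates have the same length


def toy_evaluate(candidate: str) -> Evaluation:
    """Toy evaluator: score is the match count, feedback names one fix."""
    wrong = [i for i, (a, b) in enumerate(zip(candidate, TARGET)) if a != b]
    if not wrong:
        return Evaluation(len(TARGET), "all positions correct")
    i = wrong[0]
    return Evaluation(len(TARGET) - len(wrong),
                      f"position {i} should be {TARGET[i]!r}")


def toy_edit(top: list, history: list) -> str:
    """Stand-in for the editor LLM: apply the latest feedback on the best candidate."""
    best = top[0]
    feedback = next(fb for cand, fb in reversed(history) if cand == best)
    match = re.search(r"position (\d+) should be '(.)'", feedback)
    if match is None:
        return best
    i, ch = int(match.group(1)), match.group(2)
    return best[:i] + ch + best[i + 1:]


print(feedback_descent("xxxxx", toy_evaluate, toy_edit))
```

Only `toy_evaluate` knows anything about the domain. Swapping in a different evaluator (and a real LLM as the editor) would leave `feedback_descent` itself unchanged, which is the sense in which the loop is domain-agnostic.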

Does Feedback Descent Work?

We applied the same Feedback Descent framework to three fundamentally different domains: molecular design, SVG image optimization, and prompt optimization.

Molecular Design. We compared Feedback Descent with specialized graph-based molecular optimizers that explicitly encode chemical structures, as well as REINVENT, a reinforcement learning method specifically designed for molecular optimization. Feedback Descent, operating purely on text representations (SMILES strings), matched or exceeded these specialized methods. On multiple targets, our text-based approach identified molecules surpassing the 99.9th percentile of DOCKSTRING’s 260,000-compound database (that is, scoring better than all but roughly 260 of its compounds). In several cases, we matched or exceeded the best molecule in the entire database. In this domain, Feedback Descent achieved an average 3.8x reduction in docking calls relative to reinforcement learning (REINVENT).

SVG Optimization. Starting from basic SVG drawings, Feedback Descent consistently improved designs through iterative visual critique. After just five iterations, designs reliably outperformed a baseline that conditioned on the judge prompt verbatim, demonstrating a generator-verifier gap where iterative feedback elicits better outputs from the same model.

Prompt Optimization. On four diverse tasks (multi-hop reasoning, instruction following, privacy-aware delegation, and retrieval verification), Feedback Descent performed competitively with GEPA, the state-of-the-art prompt optimization method, while outperforming GRPO, a reinforcement learning baseline.

Together, these results demonstrate that the same evaluator-editor loop can drive continual improvement in domains that differ in representation, evaluation, and failure modes. The only requirement is that we can obtain and express informative feedback in text; no task-specific optimizers, mutation rules, or architectural changes are needed. The informative signal is carried by the textual feedback itself, and the editor LLM uses this feedback to guide its next revisions through in-context learning.

Is Text a Viable Medium for Learning?

In conventional gradient-based learning, progress accumulates in the model’s weights. These parameters absorb broad statistical structures and give models the general competence we rely on. But weight updates are not the only place where learning can happen.

Text-based optimization suggests a complementary substrate: semantic space.

This is especially promising for continual learning, where parameter updates often struggle. Because the knowledge stored inside the weights is highly entangled, new updates risk catastrophic forgetting and require careful regularization or access to past data. In contrast, textual artifacts persist. They accumulate naturally as the system operates and grow in a form that LLMs can readily condition on. New feedback can be integrated immediately without retraining the underlying model.
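As a rough illustration of what "integrated immediately" can mean, here is a minimal sketch of a text-level memory. `TextMemory` and `build_prompt` are hypothetical names, and the prompt layout is an assumption; the only point is that adding a lesson is an append to a list, with no change to model weights.

```python
from dataclasses import dataclass, field


@dataclass
class TextMemory:
    """Append-only store of natural-language lessons (hypothetical helper)."""
    notes: list = field(default_factory=list)

    def add(self, note: str) -> None:
        # Integrating new feedback is just an append; older notes are untouched.
        self.notes.append(note)

    def as_context(self, max_notes: int = 5) -> str:
        # Keep only the most recent lessons so the prompt stays bounded.
        return "\n".join(f"- {note}" for note in self.notes[-max_notes:])


def build_prompt(task: str, memory: TextMemory) -> str:
    """Assemble the text an LLM would condition on for the next attempt."""
    return f"Lessons so far:\n{memory.as_context()}\n\nTask: {task}"


memory = TextMemory()
memory.add("Candidates without a basic amine docked poorly.")
memory.add("High LogP raised a solubility concern.")
print(build_prompt("Propose the next molecule.", memory))
```

The next call to `build_prompt` already reflects both lessons, and the first lesson is still present after the second is added.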

This is early territory. We don’t yet know the full limits of what can be stored or refined in semantic space. But the evidence so far suggests that text-level artifacts can absorb detailed feedback from the environment and unlock forms of improvement that are difficult or inefficient to achieve through weight updates alone. Understanding how to organize and scale this semantic layer, and how to integrate it cleanly with parameter learning, is an exciting direction for future work.

This post is based on our recent work “Feedback Descent: Open-Ended Text Optimization via Pairwise Comparison.”

Thanks to Anikait Singh, Henrik Marklund, Mert Yuksekgonul, Jubayer Ibn Hamid, Allen Nie, Omar Khattab, Sergey Levine, SAIL blog editors (James Burgess, Megha Srivastava), and anonymous ICLR reviewers for their helpful feedback on earlier drafts.

Footnotes

1. Dense rewards can help with temporal credit assignment, but don’t address the information bottleneck; a scalar at every step still doesn’t tell you what went wrong or how to fix it. Even when dense rewards are available, they’re notoriously hard to design well and prone to reward hacking. In practice, rewards for LLM post-training are usually sparse (outcome-based verification, human preferences), making this limitation especially acute.
2. Here, we’re primarily talking about policy gradients since that is the dominant paradigm for LLM post-training. Value-based methods are more sample-efficient because they propagate credit across time. However, this addresses temporal credit assignment while leaving the information bottleneck intact. This gap is exponential: Du et al. show that even in settings where value functions are perfectly representable, RL requires exponentially more samples than richer supervision (i.e., imitation learning).
3. This is, of course, a reference to Yann LeCun’s cake analogy. One “cherry on top” is too little for some appetites 🙁
4. For clarity, we only show condensed feedback from two molecules in this table. In practice, the Feedback Descent system is shown full feedback on all top-k molecules proposed so far.