Feedback Models
Training small LMs to give actionable natural-language feedback via RL
Most LLM refinement pipelines either ask the model to critique itself (which amplifies blind spots) or use a scalar reward model that says “7/10” without explaining what went wrong. We’re training dedicated feedback models—small LMs whose only job is to produce natural-language critiques that actually help a generator improve its output.
How it works
- A frozen generator writes a draft.
- A separate feedback model samples several critiques of that draft.
- The generator revises using each critique.
- We score the revisions with a downstream reward signal and use GRPO to update the feedback model. The generator is never touched (a sketch of one such rollout follows this list).
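
A minimal sketch of one such rollout, written against placeholder interfaces (`generator.generate`, `generator.revise`, `feedback_model.sample`, and `reward_fn` are illustrative names, not a real API):

```python
# Illustrative only: the helper objects stand in for whatever generator,
# feedback-model, and reward interfaces you actually have.

def collect_rollout(generator, feedback_model, prompt, reward_fn, k=4):
    """One training example: a frozen generator drafts, the feedback model
    samples k critiques, and each critique is scored by how much the
    revision it induces is worth under the downstream reward."""
    draft = generator.generate(prompt)                        # 1. frozen generator writes a draft
    group = []
    for _ in range(k):
        critique = feedback_model.sample(prompt, draft)       # 2. sample a critique
        revision = generator.revise(prompt, draft, critique)  # 3. generator revises using it
        reward = reward_fn(prompt, revision)                  # 4. score the revision
        group.append({"critique": critique, "reward": reward})
    return draft, group                                       # generator weights are never updated
```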
The generator stays frozen so we’re optimizing purely for “critiques that lead to better revisions.” We train across multiple generators to prevent the feedback model from overfitting to one system’s quirks, and average over multiple critique samples to reduce variance.
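
The update itself follows the standard GRPO recipe: each critique's reward is compared against the other critiques sampled for the same draft. The pooling step below, which averages a critique's reward across several frozen generators, is one plausible reading of the variance-reduction step above, not a precise specification:

```python
import numpy as np

def pooled_rewards(per_generator_rewards):
    """Average each critique's reward over several frozen generators
    (an assumed way to keep critiques generator-agnostic)."""
    # per_generator_rewards: shape (num_generators, k critiques)
    return np.mean(np.asarray(per_generator_rewards, dtype=np.float32), axis=0)

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: rewards are mean-centered and std-scaled
    within the group of critiques for one draft, so no value model is needed."""
    r = np.asarray(rewards, dtype=np.float32)
    return (r - r.mean()) / (r.std() + eps)

# Example: two generators, three critiques each.
adv = grpo_advantages(pooled_rewards([[0.2, 0.7, 0.5], [0.1, 0.9, 0.4]]))
```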
Results
Summarization (head-to-head win rates, 20 trials):
| Comparison | Wins | Losses |
|---|---|---|
| Trained feedback vs. prompted feedback | 19 | 1 |
| 3-step iterative vs. 1-step | 15 | 5 |
| Trained 4B model vs. GPT-5.1 (prompted) | 19 | 1 |
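
One way to produce a tally like this is a simple paired-comparison loop; the `prefers_a` judge below is an assumed placeholder (a human rater or LLM judge), not the protocol behind these numbers, and ties are not modeled:

```python
def head_to_head(outputs_a, outputs_b, prefers_a):
    """Count paired wins/losses; `prefers_a(a, b)` is an assumed judge
    returning True when output A is preferred over output B."""
    pairs = list(zip(outputs_a, outputs_b))
    wins = sum(1 for a, b in pairs if prefers_a(a, b))
    return wins, len(pairs) - wins
```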
The trained feedback model generalizes across domains—summarization, creative writing, and research plan generation—without domain-specific tuning.
Creative writing (stylistic shifts from feedback-guided revision):
| Metric | Change |
|---|---|
| Clause density | -44% |
| Flesch-Kincaid readability | +10% |
| Gunning Fog grade level | -23% |
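
Roughly comparable numbers can be reproduced with off-the-shelf readability tooling. The sketch below uses the `textstat` package and its grade-level scores, which may not match the exact metric definitions behind the table; clause density is omitted because it needs a syntactic parser:

```python
import textstat  # pip install textstat

def readability_shift(before: str, after: str) -> dict:
    """Percent change in two standard readability scores between a draft
    and its feedback-guided revision."""
    def pct(old, new):
        return 100.0 * (new - old) / old
    return {
        "flesch_kincaid_grade": pct(textstat.flesch_kincaid_grade(before),
                                    textstat.flesch_kincaid_grade(after)),
        "gunning_fog_grade": pct(textstat.gunning_fog(before),
                                 textstat.gunning_fog(after)),
    }
```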
Example: summarization critique
Input (CNN/DailyMail article excerpt)
The city of Detroit filed for bankruptcy Thursday, becoming the largest U.S. city ever to do so. Emergency manager Kevyn Orr submitted the filing in federal court. The city owes between $18 billion and $20 billion in debt...
Generator draft
Detroit declared bankruptcy, the biggest city to do so. The city has $18-20 billion in debt.
Trained feedback model critique
"The summary omits two key actors: Emergency manager Kevyn Orr, who submitted the filing, and the federal court where it was filed. Including 'filed for' rather than 'declared' would also be more precise---bankruptcy is a legal process, not a declaration. Consider: 'Emergency manager Kevyn Orr filed for Chapter 9 bankruptcy for the city of Detroit in federal court Thursday, making it the largest U.S. city to seek bankruptcy protection. The city carries $18-20 billion in debt.'"
GPT-5.1 prompted critique
"The summary captures the main point but could include more details about the emergency manager and the significance of the event."
Why this matters
A 4-billion-parameter model, trained with the right objective, produces more useful feedback than a frontier model asked to critique via prompting. This suggests that what you optimize for matters more than raw scale when it comes to critique quality. It also opens up a modular design: swap in any generator, keep the same feedback model.
Links
- Paper (coming soon)
- Code (placeholder)