Agentic Value Functions

Can retrieval-augmented verifiers benefit from past trajectories for math reasoning?

Updated 2026-02-14

Process reward models compress all training signal into weights, discarding the raw trajectories after training. But those trajectories contain exactly the evidence a value function needs: among past attempts at similar problems, what fraction succeeded after this type of step? We test whether giving verifiers retrieval access to a database of scored trajectories improves step-level error detection and solution reranking.

Setup

We evaluate verification strategies on ProcessBench (step-level error detection, 3,400 competition-level problems) and Math-Shepherd (best-of-N reranking). The judge is gpt-oss-120b via vLLM with no fine-tuning.

  • Prefix-VF: Score each partial solution prefix with \(P(\text{correct})\), threshold to find the first error step (sketched after this list).
  • Retrieval-augmented VF: Retrieve similar solved trajectories via FAISS for base-rate calibration.
  • Critic: Show the full solution with tagged paragraphs, ask the LLM to critique step-by-step (standard ProcessBench protocol).
  • Few-shot critic: Prepend retrieved similar trajectories as labeled examples to the critic prompt.
  • REPL critic: Critic with Python code execution – the model writes verification code, runs it, then decides.
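For concreteness, here is a minimal sketch of the Prefix-VF loop. It assumes an OpenAI-compatible vLLM endpoint serving the judge; the endpoint URL, prompt wording, yes/no logprob trick, and the 0.5 threshold are illustrative assumptions, not the exact experimental configuration.

```python
# Prefix-VF sketch: score each growing prefix with the judge and flag the first
# step whose P(correct) falls below a threshold. Endpoint, prompt, and
# threshold are assumptions for illustration.
import math
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # assumed vLLM server

def score_prefix(problem: str, steps: list[str]) -> float:
    """Estimate P(correct) for the partial solution steps[0..i]."""
    prefix = "\n".join(steps)
    resp = client.chat.completions.create(
        model="gpt-oss-120b",
        messages=[{
            "role": "user",
            "content": (f"Problem:\n{problem}\n\nPartial solution:\n{prefix}\n\n"
                        "Is every step so far correct? Answer only 'yes' or 'no'."),
        }],
        max_tokens=1,
        logprobs=True,
        top_logprobs=5,
    )
    top = resp.choices[0].logprobs.content[0].top_logprobs
    # P(yes) from the top-token logprobs; 0.0 if 'yes' is not among them.
    return sum(math.exp(t.logprob) for t in top if t.token.strip().lower() == "yes")

def first_error_step(problem: str, steps: list[str], threshold: float = 0.5) -> int:
    """Index of the first step whose prefix score drops below the threshold, or -1."""
    for i in range(len(steps)):
        if score_prefix(problem, steps[: i + 1]) < threshold:
            return i
    return -1  # no error detected
```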

Results

Step-level error detection (ProcessBench F1):

\[\begin{array}{lccccc} \hline \textbf{Method} & \textbf{GSM8K} & \textbf{MATH} & \textbf{OlymBench} & \textbf{OmniMath} & \textbf{Avg} \\ \hline \text{Qwen2.5-72B-Instruct} & 76.2 & 61.8 & 54.6 & 52.2 & 61.2 \\ \text{QwQ-32B-Preview} & 88.0 & 78.7 & 57.8 & 61.3 & 71.5 \\ \text{o1-mini} & 93.2 & 88.9 & 87.2 & 82.4 & 87.9 \\ \hline \textbf{Prefix-VF} & 83.6 & 62.0 & 37.5 & 31.2 & 53.6 \\ \textbf{Prompted Critic} & 95.2 & 88.7 & 81.4 & 74.1 & 84.9 \\ \textbf{Few-shot Critic (n=2)} & \textbf{100.0} & 85.7 & 83.7 & 74.2 & \textbf{85.9} \\ \textbf{REPL Critic} & 84.2 & 77.5 & 82.2 & 69.5 & 78.4 \\ \hline \end{array}\]

The zero-shot prompted critic (F1 = 84.9) outperforms all trained PRMs and comes within 3 points of o1-mini (87.9) without any training. Adding 2 retrieved examples pushes it to 85.9.

Best-of-N reranking (Math-Shepherd, 200 problems):

\[\begin{array}{ccccc} \hline N & \text{Random} & \text{Majority} & \text{VF pass@1} & \text{Oracle} \\ \hline 2 & 0.357 & 0.203 & 0.490 & 0.510 \\ 4 & 0.356 & 0.252 & 0.610 & 0.662 \\ 8 & 0.357 & 0.287 & 0.712 & 0.812 \\ \hline \end{array}\]
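The selection protocol itself is simple; the sketch below shows VF reranking next to the majority-vote baseline, assuming a hypothetical `vf_score` hook that returns the verifier's P(correct) for a full candidate solution (the scoring prompt is not shown).

```python
# Best-of-N selection sketch: pick the candidate the value function scores
# highest, vs. a majority vote over final answers. `vf_score` is a hypothetical
# hook, not part of any released API.
from collections import Counter

def rerank_best_of_n(candidates: list[dict], vf_score) -> dict:
    """candidates: [{'solution': str, 'answer': str}, ...]; return the VF's top pick."""
    return max(candidates, key=lambda c: vf_score(c["solution"]))

def majority_vote(candidates: list[dict]) -> str:
    """Baseline: the most common final answer among the N samples."""
    return Counter(c["answer"] for c in candidates).most_common(1)[0][0]
```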

Retrieval doesn’t help

Across all settings, retrieval-augmented methods provide at best small, statistically insignificant improvements: the best retrieval mode (outcome-only) yields +0.021 AUROC, and most modes hurt performance.
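For reference, a minimal sketch of what the outcome-only retrieval mode computes: embed the current problem-plus-prefix, pull the k nearest past trajectories from the FAISS index, and use their empirical success rate as the calibration signal. The embedding model, k, and the placeholder data are assumptions for illustration.

```python
# Outcome-only retrieval calibration sketch (embedder, k, and placeholder data
# are assumptions, not the exact experimental setup).
import faiss
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

# Past trajectories scored offline: (problem + solution prefix text, succeeded?)
trajectories = [("... problem and partial solution ...", 1), ("...", 0)]
embs = encoder.encode([t for t, _ in trajectories], normalize_embeddings=True)
index = faiss.IndexFlatIP(embs.shape[1])            # inner product = cosine on unit vectors
index.add(np.asarray(embs, dtype=np.float32))
outcomes = np.array([y for _, y in trajectories], dtype=np.float32)

def retrieved_base_rate(query: str, k: int = 8) -> float:
    """Fraction of the k most similar past attempts that ultimately succeeded."""
    k = min(k, index.ntotal)
    q = encoder.encode([query], normalize_embeddings=True).astype(np.float32)
    _, idx = index.search(q, k)
    return float(outcomes[idx[0]].mean())
```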

The bottleneck on hard problems is false negatives: the critic rejects valid solutions it cannot verify (correct-solution accuracy drops from 1.0 on GSM8K to 0.692 on OmniMath). Retrieved trajectories don’t help with this—the model’s ability to follow complex reasoning is the limiting factor, not lack of calibration signal.
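If the F1 here follows the usual ProcessBench convention (an assumption on our part: the harmonic mean of accuracy on erroneous samples and accuracy on error-free samples), the drop in correct-solution accuracy caps the score directly:

\[\text{F1} = \frac{2\,\text{acc}_{\text{error}}\cdot\text{acc}_{\text{correct}}}{\text{acc}_{\text{error}} + \text{acc}_{\text{correct}}}\]

With \(\text{acc}_{\text{correct}} = 0.692\), even perfect error localization tops out around 81.8 on the table's 0-100 scale for OmniMath.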

Few-shot ablation

[Figure: few-shot ablation results]

Two retrieved examples give perfect F1 on GSM8K and the best average. More examples help hard splits (MATH: 85.7 to 90.4 at n=16) but hurt easy ones—likely because the 1,500-char truncation limit crowds out the actual solution being verified.

REPL critic

Giving the critic a Python REPL to verify arithmetic drops average F1 from 84.9 to 78.4. The model writes correct verification code but draws wrong conclusions from it, or skips verification entirely on hard problems.
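The harness behind the transcripts below has roughly the following shape; the prompt text, iteration cap, and (absent) sandboxing are assumptions made purely for illustration.

```python
# Rough shape of the REPL-critic loop: extract the model's fenced python code
# blocks, execute them, append the captured output as [Code output], and repeat
# until the model commits to a boxed verdict. Details are assumed.
import contextlib
import io
import re

def run_repl_critic(ask_model, prompt: str, max_iters: int = 4) -> int:
    """ask_model(text) -> model reply. Returns the boxed step index (-1 = no error found)."""
    transcript = prompt
    for _ in range(max_iters):
        reply = ask_model(transcript)
        boxed = re.search(r"\\boxed\{(-?\d+)\}", reply)
        if boxed:                                    # the model committed to a verdict
            return int(boxed.group(1))
        out = io.StringIO()
        with contextlib.redirect_stdout(out):        # NOTE: no sandboxing in this sketch
            for block in re.findall(r"`{3}python\n(.*?)`{3}", reply, re.S):
                exec(block, {})
        transcript += reply + "\n[Code output]\n" + out.getvalue()
    return -1  # never reached a verdict
```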

REPL catches an error (GSM8K, sample idx=5)

The model writes code to verify each step's arithmetic, confirms all pass, then identifies the error in step 3's logic.

[Iter 0]
```python
# Paragraph 0: Verify 1/4 * 60 = 15
result0 = (1/4) * 60
print("Paragraph 0:", result0, "expected 15", result0 == 15)
```

```python
# Paragraph 1: Verify 1/2 * 60 = 30
result1 = (1/2) * 60
print("Paragraph 1:", result1, "expected 30", result1 == 30)
```

```python
# Paragraph 2: Verify 15 + 30 = 45
result2 = 15 + 30
print("Paragraph 2:", result2, "expected 45", result2 == 45)
```

```python
# Paragraph 3: Verify 60 + 60 = 120
result3 = 60 + 60
print("Paragraph 3:", result3, "expected 120", result3 == 120)
```

[Code output]
Paragraph 0: 15.0 expected 15 True
Paragraph 1: 30.0 expected 30 True
Paragraph 2: 45 expected 45 True
Paragraph 3: 120 expected 120 True

[Iter 1]
\boxed{3}

Label: step 3. Prediction: step 3. Correct.

REPL gives up on a hard problem (MATH, sample idx=32)

On harder problems, the model skips code entirely:

[Iter 0]
\boxed{-1}

Label: step 3. Prediction: -1 (no error). Wrong.


Code execution helps on easy problems, where errors are arithmetic, but hard problems tend to contain logical errors that code can’t catch. On OmniMath, the model runs no code at all on ~60% of samples.

  • Paper draft (in progress)