# Checking reasoning steps

Where did we go wrong?

Let’s change the interface of the verifier so that it doesn’t just take an answer, but also a sequence of reasoning steps leading up to it. This way, we can check each step independently and get a probability that it’s correct.

**Representing and rendering reasoning steps**

First, let’s represent reasoning steps as a list (so that we can more easily manipulate them programmatically) and make a function to render them as a string (so that we can use them in prompts):
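A minimal sketch in Python (the step texts here are illustrative placeholders, not the chapter's actual `DEFAULT_STEPS`):

```python
# Represent reasoning as a plain list of strings: one entry per step.
# These placeholder steps are illustrative; any chain of steps works.
DEFAULT_STEPS = [
    "A quarter of 80 is 80 / 4 = 20.",
    "Half of 20 is 20 / 2 = 10.",
    "So the answer is 10.",
]

def render_steps(steps: list[str]) -> str:
    """Render steps as a numbered list, one step per line."""
    return "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))
```

Keeping the steps as a plain list makes it easy to slice off prefixes, replace individual steps, or resample them later.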

If we run `render_steps(DEFAULT_STEPS)`, we get back the original numbered list.

**Verifying a step**

Given a list of steps, let’s first think about how we can verify the last step, assuming all previous ones are correct.

This is effectively the same as the global verifier above, except that we need to render the steps before we make the prompt. We'll also factor the step verification out into a function `check_step` so that we can reuse it later.
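A sketch of what `check_step` might look like. Both the prompt wording and the `yes_probability` helper are assumptions: `yes_probability` stands in for the language-model call that returns the probability assigned to "Yes", stubbed with a constant here so the sketch runs without an API.

```python
import asyncio

async def yes_probability(prompt: str) -> float:
    # Hypothetical stand-in for a model call that returns the
    # probability the model assigns to answering "Yes".
    # Stubbed with a constant so the sketch runs offline.
    return 0.8

def render_steps(steps: list[str]) -> str:
    return "\n".join(f"{i}. {step}" for i, step in enumerate(steps, start=1))

async def check_step(question: str, steps: list[str]) -> float:
    """Probability that the last step is correct, assuming all
    previous steps are correct."""
    prompt = f"""Consider this question and the reasoning steps so far:

Question: "{question}"

Reasoning steps:

{render_steps(steps)}

Is step {len(steps)} correct, assuming the previous steps are correct? Answer Yes or No.

Answer:"""
    return await yes_probability(prompt)
```

The function only judges the final step; correctness of the earlier steps is taken as given, which is what lets us check each step independently.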

Running this with the default question and steps yields a single probability for the last step.

Note that, as we'd expect, the probability of the last step being correct is significantly higher than the probability the model assigned to the entire answer being correct.

**Verifying all steps**

To verify all steps, we simply replace `verify_answer` with an (async) map over each prefix of the step list:
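As a sketch, with `check_step` stubbed so it runs standalone (in the real version, `check_step` asks the model about the last step of each prefix):

```python
import asyncio

async def check_step(question: str, steps: list[str]) -> float:
    # Stub: the real check_step asks a language model for the
    # probability that the last step is correct.
    return 0.8

async def verify_steps(question: str, steps: list[str]) -> list[tuple[float, str]]:
    """Check every step by verifying each prefix of the step list,
    running the per-step checks concurrently."""
    probs = await asyncio.gather(
        *(check_step(question, steps[: i + 1]) for i in range(len(steps)))
    )
    return list(zip(probs, steps))
```

Running it as `asyncio.run(verify_steps(question, steps))` returns one `(probability, step)` pair per step.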

Instead of just returning the probabilities, we return pairs of probabilities and steps to make the result easier to read: for each step we get its probability of being correct alongside the step text.

The more difficult the math, the lower the probability the model assigns to the step being correct.

## Exercises

How could you use the probabilities we get for each step? One idea is to use a model to resample steps that are wrong. Can you use this to answer questions more correctly?

If we multiply the probabilities above to get the probability that the argument overall is correct, we get $0.76 \cdot 0.57 \cdot 0.51 \cdot 0.83 \approx 0.18$. In general, the more steps, the lower we should expect the product probability to be. If we can’t get high probability by just checking the answer, and we can’t get it by checking many steps, how can we ever confidently conclude that an answer is correct? What does your answer to this question mean for how to implement and check reasoning using language models?
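The product can be checked directly:

```python
import math

# Per-step probabilities from the example above.
step_probs = [0.76, 0.57, 0.51, 0.83]
print(round(math.prod(step_probs), 2))  # 0.18
```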

