## Does this sound plausible?
```python
from fvalues import F

from ice.recipe import recipe


def make_verification_prompt(question: str, answer: str) -> str:
    return F(
        f"""Consider this question: "{question}"

Potential answer: "{answer}"

Q: Is the potential answer above correct? Say "A: Yes" or "A: No".
A:"""
    )


async def verify_answer(
    question: str = "What is 1 + 1?",  # example defaults, overridable on the CLI
    answer: str = "2",
) -> float:
    prompt = make_verification_prompt(question=question, answer=answer)
    choice_probs, _ = await recipe.agent().classify(
        prompt=prompt, choices=(" Yes", " No")
    )
    return choice_probs.get(" Yes", 0.0)


recipe.main(verify_answer)
```
The interesting bit here is that we don't just want a boolean Yes/No answer from the model: we want the probability of the "Yes" answer to the correctness question. This gives us a graded signal that we can use, e.g., to only show or use model responses when they exceed a confidence threshold.
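As a minimal sketch of that thresholding idea (the helper name and threshold value are illustrative assumptions, not part of the recipe above):

```python
# Hypothetical helper: gate an answer on the verifier's "Yes" probability.
# The 0.8 threshold is an arbitrary example value.
def accept_answer(answer: str, yes_prob: float, threshold: float = 0.8):
    """Return the answer only when the verifier is confident it is correct."""
    return answer if yes_prob >= threshold else None


accept_answer("6", yes_prob=0.99)  # confident: returns "6"
accept_answer("8", yes_prob=0.07)  # not confident: returns None
```

In practice, `yes_prob` would be the float returned by `verify_answer`.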

## Sanity checks

Let's test it:

```
0.9948396822920341
```

Good.

```
0.0010152581398344962
```

```
0.0005455832226911594
```

Also correct. Basic sanity checks pass.

## A math problem

Let's try something harder: a problem from the GSM8K math problems dataset:

> Beth bakes 4x 2 dozen batches of cookies in a week. If these cookies are shared amongst 16 people equally, how many cookies does each person consume?
The correct answer is 6, but it takes a few steps of reasoning to work that out.
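The reasoning steps can be checked with a quick calculation:

```python
# Working out the GSM8K answer step by step.
batches = 4                   # Beth bakes 4 batches
cookies_per_batch = 2 * 12    # each batch is 2 dozen cookies
total_cookies = batches * cookies_per_batch   # 96 cookies in a week
per_person = total_cookies // 16              # shared equally among 16 people
print(per_person)  # -> 6
```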
```shell
python verify_answer.py --question "Beth bakes 4x 2 dozen batches of cookies in a week. If these cookies are shared amongst 16 people equally, how many cookies does each person consume?" --answer "6"
```

```
0.06723949284762187
```
The model can’t see that the answer is correct.
What if we also give the reasoning steps?
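One plausible variant, sketched here as an assumption rather than the recipe's actual next step, extends the verification prompt with a reasoning field:

```python
# Hypothetical sketch: a verification prompt that also shows reasoning steps.
# The function name and prompt wording are assumptions, not the recipe's code.
def make_verification_prompt_with_reasoning(
    question: str, reasoning: str, answer: str
) -> str:
    return f"""Consider this question: "{question}"

Reasoning: "{reasoning}"

Potential answer: "{answer}"

Q: Is the potential answer above correct? Say "A: Yes" or "A: No".
A:"""
```

The idea is that seeing the intermediate steps may let the verifier check each one locally instead of having to reproduce the whole derivation itself.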