How to improve LLM reasoning

Let’s reiterate the question and also think step by step.

September 04, 2024

TL;DR: Adding the prompt below can improve performance on reasoning tasks by around 13%.

EchoPrompt: Let’s reiterate the question and also think step by step.

Mekala, R.R., Razeghi, Y. and Singh, S., 2023.
EchoPrompt: Instructing the Model to Rephrase Queries for Improved In-context Learning.
Link: arXiv:2309.10687
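
If you want to try this programmatically rather than in a chat window, here is a minimal sketch of what "adding the prompt" looks like in practice. It's my own illustration using the Anthropic Python SDK; the model name and helper function are assumptions, not something from the paper or this post.

```python
# Minimal sketch: append the EchoPrompt suffix to whatever question you ask.
# Assumes the Anthropic Python SDK and an ANTHROPIC_API_KEY environment variable;
# the model name and helper function are illustrative choices, not from the paper.
import anthropic

ECHO_PROMPT = "Let's reiterate the question and also think step by step."

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def ask_with_echoprompt(question: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-20240620",  # assumed model, swap for whatever you use
        max_tokens=1024,
        messages=[{"role": "user", "content": f"{question}\n\n{ECHO_PROMPT}"}],
    )
    return response.content[0].text
```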

Prompt engineering is a moving target

Offering prompt engineering advice is always tricky: much of it has a short shelf life, and some of it only applies to specific models (especially those ‘one weird trick’ tips).

Also, the speed at which the LLM vendors are baking functionality directly into the chat interfaces is really impressive. Increasingly I’m seeing interesting oddities like “how many Rs are in strawberry” getting patched out, either in fine-tuning or via the vendors adding extra context (i.e. they’re mucking with your prompt behind the scenes).

Part of the reason issues like the strawberry example are so easily patched is that they’re mostly edge cases rather than the “proof LLMs can’t reason” people are looking for. The strawberry problem isn’t even a true reasoning problem; it just gives us a glimpse of the tokenization process happening under the hood (see my LinkedIn post if you’re interested).

Actual reasoning problems tend to fall into the following camps:

  1. Lack of context: the LLM lacks information that you take for granted.
  2. Incorrect domain knowledge: the LLM is asked a question in one context (e.g. physics) when it needs to be in another (e.g. biology).
  3. Sycophancy: the LLM is asked a leading question and has a strong bias to answer using your assumptions.

A lot of the improvements you see from prompt engineering techniques like COT or EchoPrompt come from having the LLM explore the problem space, which lets it self-load the correct domain knowledge rather than you supplying it manually.

The COT “…think step by step” prompt isn’t useful because it triggers some sudden epiphany; it’s useful because the output it generates becomes part of the context for everything that follows. This is why complex questions that fail on the first try will often work on the second: the problem is already half solved by the time you ask again.
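
As a rough illustration of that "second try" effect, here's a minimal sketch (again my own, using the Anthropic Python SDK with an assumed model name): keeping the first attempt in the conversation means its step-by-step output is already in context when you ask again.

```python
# Sketch of the "second try" effect: the first attempt's reasoning stays in the
# conversation, so the retry starts with the problem already half solved.
# Assumes the Anthropic Python SDK; model name and follow-up wording are illustrative.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-5-sonnet-20240620"  # assumed model


def second_try(question: str) -> str:
    first = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{"role": "user", "content": question}],
    ).content[0].text

    # Feed the first answer back in as context and ask again.
    second = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[
            {"role": "user", "content": question},
            {"role": "assistant", "content": first},
            {"role": "user", "content": "Critically review your answer above and try again."},
        ],
    )
    return second.content[0].text
```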

…think step by step

Some prompt engineering tips are consistently useful because they give you insight into how the LLM is approaching the problem. Chain-of-thought is useful, but the technique I’ve recently been using much more is EchoPrompt (paper linked above), which, combined with COT and “…critically review your assumptions”, gives you excellent insight into how an LLM has interpreted your prompt.

This turns it into almost a debugging tool for language: often with COT+EchoPrompt I’ll realise that I’m the one to blame, and simply increasing the clarity of my question gets the result I was after.

Other times it will simply show me where the LLM is going wrong, usually because multiple knowledge domains are required to solve the problem.

Here’s a great example, using a riddle that trips up Claude 3.5, which I found in the ‘Easy problems LLMs get wrong’ dataset.

Stumping Claude 3.5 with a tree "riddle"

Assume a spherical tree

The prompt below isn’t so much a riddle as a test of biology knowledge: the mass of a tree comes from the carbon dioxide in the air, not the soil.

[Screenshot: Tree Riddle]

The correct answer should be 10kg, or a tiny bit under 10kg if you want to be really precise. What I love here is that the LLM clarifies that “assuming a closed system where the tree’s growth comes entirely from the soil, this is the correct answer”.

It’s probably generous to joke that this is a “spherical cow” assumption; I think it’s more likely sycophancy bias. The model could just as easily be saying “assuming the world is flat…”.


Double down on wrong

Let’s see if our Chain-of-thought and EchoPrompt can fix this.
[Screenshot: Tree Riddle]

As we can see, COT and EchoPrompt didn’t make any improvement, and that’s for two reasons:

  1. It’s already taking the COT approach on the first prompt, something OpenAI/Anthropic are increasingly adding either by editing your prompt under the hood or through fine-tuning.
  2. As we’ll see later, these kinds of prompts will in certain situations make the model better at being wrong; in other words, it becomes an even better flat-earther.


Works every time

We can easily cheat here to get the model to return the correct answer by manually setting the biologist role in the prompt.

[Screenshot: Tree Riddle]

It seems counterintuitive to most people that it could “forget” information like biology, which it clearly knows when tested in other prompts, but this is a great example of where LLM reasoning is radically different from ours.

Its learned knowledge and skills are not all simultaneously activated; the key constraint on LLMs isn’t so much their context window (the maximum prompt size supported) but their attention, and how many features and knowledge domains they can activate at once.
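
For reference, here’s roughly what that role “cheat” looks like via the API; a minimal sketch assuming the Anthropic Python SDK (the runs in this post were actually done in Claude Sheets), with an illustrative model name and role wording.

```python
# Sketch of the role "cheat": prepend the biologist role to the riddle.
# Assumes the Anthropic Python SDK; the model name and exact wording are illustrative.
import anthropic

client = anthropic.Anthropic()

riddle = ("A 2kg tree grows in a planted pot with 10kg of soil. "
          "When the tree grows to 3kg, how much soil is left?")

response = client.messages.create(
    model="claude-3-5-sonnet-20240620",  # assumed model
    max_tokens=1024,
    messages=[{"role": "user", "content": f"As a biologist, answer the following: {riddle}"}],
)
print(response.content[0].text)
```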


60% of the time

However, this solution doesn’t actually solve the problem consistently. Often when people complain that the LLM isn’t working anymore or has been “dumbed down”, what’s really happening is that they’ve created a prompt that never worked 100% of the time and they’re running into failures that were always there.

[Screenshot: Tree Riddle]

Here it only took a change of a few characters (removing “given”) to get a wrong answer, which just shows how important testing becomes any time you’re looking to use LLMs in an ongoing process or for automation.

Again, what’s funny here is that it’s still activating part of the biologist role, in that it correctly states that the tree gains mass from the air, but that wasn’t enough to push it toward the correct answer we saw before.

If you ask Claude 3.5 why it does this, the reply is very telling:

My mistake was in thinking that the problem was intended as a simplified mathematical exercise, rather than a test of biological understanding.

The wording of the riddle itself is what’s tripping it up; its bias to stick to the incorrect assumptions implied by the prompt is very strong. We never told it explicitly to assume the mass comes from the soil, and it often points out the error yet still gives the wrong answer after protesting.


Problematically probabilistic

Somewhere in the multiverse is a beautiful world where the grammar of every language requires each sentence to specify whether the statement is objective or subjective, a concrete fact or ‘just a vibe’. One where every statement is inherently probabilistic: I don’t “kinda want to have Indian takeaways tonight” but “…I want Indian takeaways, ~42%”.

The closest thing we have to this here on Earth is the LLM: everything they do and say is probabilistic by nature. If you’re curious how this works, have a play with the Transformer Explainer tool by the ‘Polo Club of Data Science’ team at Georgia Tech:

https://poloclub.github.io/transformer-explainer/

To see this in action, let’s create 10 variants of the riddle, each differing by only a few tokens (a sketch of a simple scoring harness follows the list):

  1. A 2kg tree grows in a planted pot with 10kg of soil. When the tree grows to 3kg, how much soil is left?
  2. Given a 2kg tree grows in a planted pot with 10kg of soil. When the tree grows to 3kg, how much soil is left?
  3. With a 2kg tree growing in a planted pot with 10kg of soil. When the tree grows to 3kg, how much soil is left?
  4. A 2kg tree is growing in a planted pot with 10kg of soil. When the tree grows to 3kg, how much soil is left?
  5. A 2kg tree grows in a planted pot with 10kg of soil. When the tree has grown to 3kg, how much soil is left?
  6. With 2kg tree that grows in a planted pot with 10kg of soil. When the tree has grown to 3kg, how much soil is left?
  7. With a 2kg tree that grows in a planted pot with 10kg of soil. When the tree has grown to 3kg, how much soil is left?
  8. A 2kg tree grows in a planted pot with 10kg of soil, when the tree has grown to 3kg, how much soil is left?
  9. With a 2kg tree growing in a planted pot with 10kg of soil, when the tree has grown to 3kg, how much soil is left?
  10. A 2kg tree growing in a planted pot with 10kg of soil, when the tree has grown to 3kg, how much soil is left?
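
Here’s a sketch of the sort of harness you could use to score these automatically. This is my own Python approximation (the runs in this post were done via Claude Sheets); the crude string check and model name are assumptions.

```python
# Sketch of a tiny test harness: run each variant and check whether the answer
# lands on the correct ~10kg rather than the "maths puzzle" 9kg.
# Assumes the Anthropic Python SDK; the pass/fail check is deliberately crude.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-5-sonnet-20240620"  # assumed model

VARIANTS = [
    "A 2kg tree grows in a planted pot with 10kg of soil. When the tree grows to 3kg, how much soil is left?",
    "Given a 2kg tree grows in a planted pot with 10kg of soil. When the tree grows to 3kg, how much soil is left?",
    # ...the remaining eight variants from the list above
]


def looks_correct(answer: str) -> bool:
    # Crude check: the correct answer is (roughly) 10kg, the wrong one is 9kg.
    cleaned = answer.lower().replace(" ", "")
    return "10kg" in cleaned and "9kg" not in cleaned


passes = 0
for prompt in VARIANTS:
    reply = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    ).content[0].text
    ok = looks_correct(reply)
    passes += ok
    print(("PASS" if ok else "FAIL"), prompt[:40] + "...")

print(f"{passes}/{len(VARIANTS)} variants answered correctly")
```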

With this we get some really interesting insights:
[Chart: 10x Riddles]

Why did EchoPrompt on its own perform so poorly? It got better at being wrong: EchoPrompt amplified the “assume a flat earth” bias baked into our riddle, and although many responses included protests or caveats, their final answer was still incorrect.

Combining it with the “As a biologist” role, however, bumps the score back up to 90%, high enough that many people might never see this prompt fail in a chat UI; a good example of why testing is so important.

Getting to ~100%

How can we get to a prompt that solves 10/10 of the riddle variants? This is where a little extra prompt engineering can help: here we’re splitting EchoPrompt into its two parts (restate the question, and COT) and adding a third instruction to critically review the assumptions the LLM is making (a sketch of the assembled prompt follows the list).

  1. Critically review your assumptions and change them when false - It’s not enough to just state them; the LLM needs to be prompted to change them when wrong.
  2. Reiterate the question - EchoPrompt
  3. Think step by step - Chain-of-thought
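
Assembled into a single suffix, the prompt looks something like the sketch below. The exact wording here is my approximation of the three parts above, not a copy of the prompt in the screenshots.

```python
# Sketch of the combined suffix: critical review + EchoPrompt + chain-of-thought.
# Wording is an approximation of the three parts listed above.
COMBINED_SUFFIX = (
    "Critically review your assumptions and change them when false. "
    "Reiterate the question. "
    "Think step by step."
)


def with_reasoning_suffix(question: str) -> str:
    return f"{question}\n\n{COMBINED_SUFFIX}"


print(with_reasoning_suffix(
    "A 2kg tree grows in a planted pot with 10kg of soil. "
    "When the tree grows to 3kg, how much soil is left?"
))
```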

[Screenshot: Tree Riddle]

With this combination we get 10/10 on the riddles; you can see below that it identifies and corrects the mass-conservation / maths-puzzle assumption in the riddle and manages to avoid the mistake.

[Screenshot: Tree Riddle]

Mission Complete

Auto-Role

There is one last thing we need to fix. The prompt techniques above are generic and fairly universal, allowing them to be used across a wide range of possible riddles.

However, picking “As a biologist…” as the role was something I did, which assumes a human in the loop making that decision. For automated workflows we can get the LLM to decide on the correct role to use.

[Screenshot: AutoRole prompt]

The model’s choice of “Botanist” over “Biologist” still results in 10/10 riddles solved.
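
An automated version of this might look something like the sketch below: one call asks the model which expert should answer, a second call injects that role ahead of the question and the reasoning suffix. This is my own approximation of the workflow, not the exact AutoRole prompt from the screenshot; the model name and wording are assumptions.

```python
# Sketch of an auto-role workflow: let the model pick the expert, then feed that
# role back in front of the question. Assumes the Anthropic Python SDK; the
# prompts here are illustrative, not the exact AutoRole prompt from the post.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-5-sonnet-20240620"  # assumed model


def auto_role_answer(question: str) -> str:
    # Step 1: ask the model which expert should answer.
    role = client.messages.create(
        model=MODEL,
        max_tokens=50,
        messages=[{
            "role": "user",
            "content": f"In one or two words, which expert is best placed to answer this?\n\n{question}",
        }],
    ).content[0].text.strip()

    # Step 2: inject that role, plus the reasoning suffix from earlier.
    return client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                f"As a {role}, answer the following.\n\n{question}\n\n"
                "Critically review your assumptions and change them when false. "
                "Reiterate the question. Think step by step."
            ),
        }],
    ).content[0].text
```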

Constrained outputs

Something to note about these kinds of prompts (EchoPrompt, COT, CriticalReview) is that they don’t work when you constrain the output.

If you add this constraint to the prompts shown here:

Only output your answer as a number, don’t output anything else.

Then performance drops to 0% across the board for all but the final prompt, which solves 1 riddle in 10.

Claude Sheets

I ran all these tests using the Claude Sheets extension by Anthropic: https://docs.anthropic.com/en/docs/build-with-claude/claude-for-sheets

Just beware that these use API credits rather than the all-you-can-eat buffet that is claude.ai chat; the total cost for this blog post was $1.15 NZD.

Looking to learn more?

I hope this was useful. AI should be as big a productivity boost to you and your organisation as spreadsheets are today; now imagine nobody on your team knows how to use =SUM().

If you're an organisation looking for Foundational AI skills training for your teams then please get in touch.

The content in this blog post is more for our Advanced workshop but the core concepts of effective prompt engineering are the same.

Importantly, we tie this all back to concrete use cases in the workplace, minus the tech-vendor hype; in a word, we keep it practical.


Practical:AI Foundation

Our half-day Foundations workshop will equip you with:

  • AI Theory (That Won’t Put You to Sleep): Understand how AI ticks under the hood without needing a math PhD.
  • Extracting real business value: AI is an endless goldmine of value; learn how to ‘mine’ it safely.
  • Risks and challenges: Learn what we mean when we say AI “hallucinates”, the biases and risks involved, and how they can be managed.
  • The new frontier: AI has upended the status quo to the extent that everything will need to be re-evaluated; learn to separate the low-hanging fruit from the AI Slop.

Price: $499pp (half-day workshop)