Attention is all you need? Yeah, nah.

Measuring LLM cognitive load and the challenges of low-resource languages.

September 19, 2024

My attention is divided right now: behind me are three flowering kōwhai trees and about a dozen Tui birds noisily devouring their spring flowers. It’s hard to focus and not watch them (it’s also hard not to think about the mess they’re making, which I’ll need to sweep up).

New Zealand being a beautiful distraction, and the limits of attention, gave me an idea for testing one of the surprising tips we teach Practical:AI students:

DO NOT prompt the LLM to “speak in New Zealand English”.

Realise vs Realize

In addition to different spellings, New Zealand also has its own tone and style of writing, something the newer LLM models are pretty good at replicating. The problem is that prompting them to use this spelling and tone comes at a cost: the model performs much worse at the rest of the task.

Each LLM can only support a limited cognitive load, and something seemingly simple like asking it to “use New Zealand English” directs far too much of that capacity towards something that can easily be fixed later with a spell check.

[Image: Tui living their best life being distracting.]

LLM Horsepower

One of the most important skills when using LLMs is getting a gut sense of that limit and how to avoid exceeding it. The reason it’s a gut sense is that there’s no metric or feedback from the LLM telling you you’ve hit the limit, other than the quality of responses quickly degrading and hallucinations kicking in.

Interestingly, one area where these limits have been studied is security, specifically ‘jailbreaking’, where models are pushed past their limit and can no longer enforce their safety constraints.

Switching languages is a very effective way to break the constraints vendors have created, and with LLMs fast becoming baked into organisations’ systems and processes, these jailbreaks represent an entirely novel attack vector.

LLM vendors can’t easily give us visibility of this cognitive limit; they know it exists but they can’t measure it, which is why they focus so much on context windows. The current situation is like trying to buy a truck and only being told the storage space but not the horsepower.

LLM explainability is always lagging behind model capability, but eventually we’ll get that cognitive horsepower metric. For today, however, I think we can use our “speak in New Zealand English” task not to measure the total horsepower but to visualise the breaking point where we exceed it.

[Image: Context window size? It’s basically unlimited bro.]

Dogs & Parrots

TL;DR - Skip this section if you’ve read enough “stochastic parrot” vs “emergent capability” musings.

What trips people up with LLMs is just how alien they are: they’re not thinking and reasoning in anything like the way we do. That said, they are fundamentally based on us. Waving at a mirror and then getting frustrated that there isn’t enough room behind the mirror for another person to fit is not just a misunderstanding about how mirrors work, but also about who was doing the waving.

LLMs don’t reason through the question of 3+3+3= in the same iterative way a person or even a computer program would. They begin with a random noise of possible answers containing everything from “zero” to “spanner”, and step by step a bell curve of probability emerges that converges on the answer of “nine”. This is where the “stochastic parrot” argument comes from: that LLMs aren’t demonstrating any real intelligence, they’re just “autocomplete on steroids” predicting the most likely next word.

To quote Lebowski, “you’re not wrong…” but you’re missing the point: that percentage score isn’t how confident the LLM is, the percentage is the cognition process itself. A huge part of that convergence, that narrowing of the bell curve to select “nine”, is simply recall. It’s drawing on the training data to arrive at the right answer instead of reasoning, because why wouldn’t it?
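
As a rough illustration of that narrowing bell curve, here’s a minimal sketch of how raw model scores (logits) over a handful of candidate next tokens get turned into a probability distribution with softmax. The logit values are made up for illustration; a real model scores its entire vocabulary, tens of thousands of tokens, at every step.

```python
import math

# Made-up logits for a handful of candidate next tokens after "3+3+3=".
# A real model scores every token in its vocabulary, not just these five.
logits = {"nine": 9.1, "9": 8.7, "six": 2.3, "zero": 0.4, "spanner": -3.2}

def softmax(scores, temperature=1.0):
    """Turn raw scores into a probability distribution.
    Lower temperature sharpens the curve, higher temperature flattens it."""
    exps = {tok: math.exp(s / temperature) for tok, s in scores.items()}
    total = sum(exps.values())
    return {tok: e / total for tok, e in exps.items()}

for t in (1.0, 0.4):
    probs = softmax(logits, temperature=t)
    ranked = sorted(probs.items(), key=lambda kv: -kv[1])
    print(f"temperature={t}: " + ", ".join(f"{tok}={p:.3f}" for tok, p in ranked))
```

Note that temperature only re-shapes this distribution, it doesn’t change the underlying ranking of candidates, which is worth keeping in mind for the temperature sweeps later on.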

There is a theory that dogs aren’t as smart as wolves because it’s pointless to waste calories on a big brain when your best friend is a two-legged, hyper-intelligent Lovecraftian horror who will solve all the tricky problems for you (freeing you up to focus on fetching sticks).

In the same way, LLMs are leaning on us for the bulk of their intelligence. They’re incredible compression systems that convert terabytes of training data into gigabytes of knowledge, and they draw on that knowledge to cheaply arrive at correct conclusions in the same way dogs fetch us to solve their non-stick-related problems.

For the longest time that’s all LLMs could do: compress information and recall it. If you wanted to improve a model, the only option was to help it improve the rate of compression or increase the amount of data it could draw on. That dam broke with the “Attention Is All You Need” paper, which showed this was a local minimum, a dead end we could get out of. With the right architecture and an incredible amount of computation, models could start to form generalisable skills: instead of only getting better at compressing and recalling poems, they had learned to write them.

[Image: Once upon a bell curve]
You can see this in action using the prompt “Once upon a “ in the excellent Transformer Explainer.

10kg Kōwhai

So with that quick primer out of the way, the thing to know about cognitive load is that recall (e.g. “Once upon a time”) is cheap, but tapping into these generalisable skills (e.g. “Use New Zealand English”), which are often called emergent capabilities, is expensive. Just to make things harder, each task in a prompt is a combination of both recall and emergent capabilities, with no visibility of the ratio or of how “expensive” the task is overall.

This is where I think we can use the 10kg tree riddle from my previous blog post to help visualise the problem. To recap, we started with this riddle from the ‘Easy problems LLMs get wrong’ dataset:

A 2kg tree grows in a planted pot with 10kg of soil. When the tree grows to 3kg, how much soil is left?

We can then “expand” that to 10 riddles by creating variants that only differ by a few tokens:

  1. A 2kg tree grows in a planted pot with 10kg of soil. When the tree grows to 3kg, how much soil is left?
  2. Given a 2kg tree grows in a planted pot with 10kg of soil. When the tree grows to 3kg, how much soil is left?
  3. With a 2kg tree growing in a planted pot with 10kg of soil. When the tree grows to 3kg, how much soil is left?
  4. A 2kg tree is growing in a planted pot with 10kg of soil. When the tree grows to 3kg, how much soil is left?
  5. A 2kg tree grows in a planted pot with 10kg of soil. When the tree has grown to 3kg, how much soil is left?
  6. With 2kg tree that grows in a planted pot with 10kg of soil. When the tree has grown to 3kg, how much soil is left?
  7. With a 2kg tree that grows in a planted pot with 10kg of soil. When the tree has grown to 3kg, how much soil is left?
  8. A 2kg tree grows in a planted pot with 10kg of soil, when the tree has grown to 3kg, how much soil is left?
  9. With a 2kg tree growing in a planted pot with 10kg of soil, when the tree has grown to 3kg, how much soil is left?
  10. A 2kg tree growing in a planted pot with 10kg of soil, when the tree has grown to 3kg, how much soil is left?

Why are we creating the variants? Because when you’re approaching the cognitive limit, the impact of tokenization and embedding becomes more pronounced (i.e. the same things that trip up LLMs when counting how many R’s are in “Strawberry”).
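
To make the tokenization point concrete, here’s a small sketch using OpenAI’s tiktoken library (Claude uses its own tokenizer, so the exact splits will differ, but the principle is the same): near-identical wording becomes a different token sequence, and that token sequence is the input the model actually sees.

```python
# pip install tiktoken
import tiktoken

# cl100k_base is an OpenAI tokenizer; Claude's tokenizer differs, but the
# point stands: small wording changes shift the token boundaries.
enc = tiktoken.get_encoding("cl100k_base")

variants = [
    "A 2kg tree grows in a planted pot with 10kg of soil.",
    "Given a 2kg tree grows in a planted pot with 10kg of soil.",
    "With 2kg tree that grows in a planted pot with 10kg of soil.",
]

for text in variants:
    tokens = enc.encode(text)
    print(f"{len(tokens):2d} tokens: {[enc.decode([t]) for t in tokens]}")
```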

A prompt that’s successful may have close “neighbours” that are failing. You can also sweep across temperatures for even more visibility, and here’s how that looks:

[Image: Tree Riddle results across the variants and temperatures]

What jumps out here:

  1. Tokenization/embedding impacts the outcome more than temperature; a prompt that fails tends to keep failing across a range of temperatures.
  2. The word “Given” doesn’t radically change the meaning of the riddle but has a large negative impact on the LLM’s ability to solve it.
  3. The riddle that starts “With 2kg…” performs perfectly fine despite the odd grammar.
  4. I initially set this up with a temperature range of 0.1 to 0.2, but when I noticed this wasn’t performing as poorly as in my previous blog post I realised the default =CLAUDE() function must use a higher temperature. I tacked on 0.4 to 1, and the 0.4 setting got the same 50% score as I got in those earlier tests.
  5. I have no idea what’s happening at 0.6 temperature; I double-checked and it doesn’t look like a mistake in the test. Random good luck?
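
For anyone wanting to reproduce the grid above, here’s a rough sketch of the same idea in script form rather than a spreadsheet, assuming the Anthropic Python SDK. The specific model version and the simple “contains 10kg” pass check are my own assumptions, not the exact setup behind the screenshots.

```python
# pip install anthropic
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

variants = [
    "A 2kg tree grows in a planted pot with 10kg of soil. When the tree grows to 3kg, how much soil is left?",
    "Given a 2kg tree grows in a planted pot with 10kg of soil. When the tree grows to 3kg, how much soil is left?",
    # ... the remaining eight variants from the list above
]
temperatures = [0.1, 0.2, 0.4, 0.6, 0.8, 1.0]

def passes(answer: str) -> bool:
    # Crude check: the soil is unchanged, so a correct answer mentions 10kg.
    return "10kg" in answer.lower().replace(" ", "")

for temp in temperatures:
    results = []
    for prompt in variants:
        response = client.messages.create(
            model="claude-3-5-sonnet-20240620",  # assumed model version
            max_tokens=300,
            temperature=temp,
            messages=[{"role": "user", "content": prompt}],
        )
        results.append(passes(response.content[0].text))
    print(f"temperature={temp}: {sum(results)}/{len(results)} variants passed")
```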

So now let’s improve these results by adding some prompt engineering:

[Image: Tree Riddle results with prompt engineering added]

That boosts us up to 98%, but I suspect we’re now at the limit of how many instructions we can give the LLM. Let’s test that by also asking it to “Only use New Zealand English in your responses”:

[Image: Tree Riddle results with the New Zealand English instruction added]

There we have it: our prompt engineering had already pushed us to the LLM’s cognitive limit, and adding this one last instruction took it too far. Performance takes a substantial hit, and that’s despite the riddle answers not actually containing any spellings that differ in New Zealand English.

Again, there’s nothing stopping us from fixing the NZ spelling in a follow-up prompt or just by running a spellcheck. This is why we strongly recommend against this instruction in our classes: you’ll often miss that the LLM is returning nonsense (with perfect spelling).
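
Here’s a rough sketch of that follow-up-prompt approach, again assuming the Anthropic SDK and a model version of my choosing; the wording of the second prompt is mine, not a canonical one. The point is simply that the localisation happens in a separate, cheap pass that can’t compete with the reasoning.

```python
# pip install anthropic
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-5-sonnet-20240620"  # assumed model version

def solve_then_localise(prompt: str) -> str:
    # Pass 1: the actual task, with no style instructions competing for capacity.
    answer = client.messages.create(
        model=MODEL,
        max_tokens=300,
        messages=[{"role": "user", "content": prompt}],
    ).content[0].text

    # Pass 2: a cheap rewrite that only touches spelling and tone.
    rewrite_prompt = (
        "Rewrite the following text in New Zealand English. "
        "Change only spelling and tone, not the content:\n\n" + answer
    )
    return client.messages.create(
        model=MODEL,
        max_tokens=300,
        messages=[{"role": "user", "content": rewrite_prompt}],
    ).content[0].text
```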

Low-resource languages

Spelling mistakes are nothing compared to the issues we have with New Zealand’s oldest language, Te Reo Māori. We’ve known for a long time that there is a direct link between the capabilities of the LLM in a specific language and the training data available for it.

When we say Te Reo Māori is a “low-resource language”, what we’re referring to is that English makes up 43% of the Common Crawl dataset and 22% of Wikipedia; Te Reo Māori, in contrast, is 0.057% and 0.0013% respectively.

This probably explains why the tree riddle, when translated to Te Reo, fails across all 10 variants even with the extra prompt engineering. What it doesn’t explain is why adding even a single Te Reo word (kōwhai) reduces the pass rate to 33%.

[Image: Tree Riddle results in Te Reo and with the word kōwhai added]
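
For reference, the single-word test is as small a change as it sounds. A sketch of the substitution, assuming the riddle’s “tree” is simply swapped for “kōwhai” before re-running the same kind of harness as above:

```python
# Sketch of the single-Te-Reo-word test: swap one noun and re-run the harness.
# (Assumes "tree" is simply replaced with "kōwhai" in every variant.)
variants = [
    "A 2kg tree grows in a planted pot with 10kg of soil. When the tree grows to 3kg, how much soil is left?",
    # ... the other nine variants
]
kowhai_variants = [v.replace("tree", "kōwhai") for v in variants]
print(kowhai_variants[0])
```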

I’ve been using Claude 3.5 to try to learn some Te Reo and I’m increasingly convinced that it’s cognitively an American who just happens to be fluent in Te Reo. I’ve noticed that avoiding English in the prompt and speaking only in Te Reo (I translate in a separate window) results in better outputs, especially when the question is specifically about the meaning of Te Reo words.

OpenAI o1 is a Kūmara

It goes without saying that the recent OpenAI o1 model absolutely crushed the 10kg tree riddle, scoring 10/10 across all variants. I was actually struggling to demonstrate a cognitive load limit, but I did manage to trip it up (bit of a Kūmara boast).

Here is a little story/riddle in Te Reo Māori:

I tētahi rā, e tākaro ana ngā tuākana teina ki te taha o te awa. Kei te mau te tuakana i tētahi pēke, kāore te teina i te mōhio he aha kei roto. Ka kī atu te teina, “He tino toa au ki te kauhoe. Ka taea e au te kauhoe ki tērā taha o te awa me te hoki mai, kāore he āwhina.” Ka whakahoki te tuakana, “E tama, kei te kūmara koe i a koe anō!” Ka kata te teina, “Kāo, he pono taku kōrero. He māmā noa iho ki a au.” Ka haere tonu rāua ki te tākaro, ā, i te mutunga o te rā, ka huri mai ki a koe te pātai: “Ki ōu whakaaro, he aha kei roto i te pēke a te tuakana?

One day, siblings were playing by the river. The older sibling was holding a bag, and the younger sibling didn’t know what was inside. The younger sibling said, “I’m very good at swimming. I can swim to the other side of the river and back without any help.” The older sibling replied, “Boy, you’re praising yourself too much!” (literally: “You’re sweetpotato-ing yourself!”) The younger sibling laughed, “No, I’m telling the truth. It’s very easy for me.” They continued to play, and at the end of the day, the question turns to you: “What do you think was inside the older sibling’s bag?”

Kūmara is a New Zealand sweet potato (it’s excellent) but can also be used as an idiom in Te Reo for boasting:

“The kūmara does not speak of its own sweetness.”

When Claude 3.5 is given just the Te Reo version and asked what’s in the bag, it correctly replies in Te Reo that there isn’t enough information to say for sure. In fact, if you go further and suggest Kūmara as the answer, it points out that this is incorrect and patiently explains the idiom.

So how does OpenAI o1 perform?

[Image: o1’s answer]

OK, so not great. So what was the reasoning chain?

[Image: o1’s chain of thought]

This is the surprise: it understands the idiom, but its very first step was to translate the story into English. From that point on it’s operating in English against an English text.

It knows Kūmara is used as an idiom but still ends up suggesting it as the answer. This could just be a quirk of how it’s evaluated, but I suspect in time we’ll find countless examples of this across all low-resource languages.

Pure Speculation & Hot Takes

Here is some totally unsubstantiated speculation:

  1. The frontier models will eventually ship a cognitive load metric. They might not expose it at the UI level, but they’ll absolutely use it in the APIs and bake it into their chain of reasoning and training.
  2. I suspect cognitive load, and measuring it, will turn out to be an artefact of the superposition models use to store information: because the features are superimposed they can’t all be activated simultaneously, and this creates the limit we see.
  3. I also suspect that the inferior performance seen for Te Reo prompts isn’t just the lack of training data; that doesn’t explain why a single word, “kōwhai”, had such a negative impact. Instead, what we might be seeing is that Te Reo itself carries a heavy cognitive load: just one word is enough to activate an entire language, which “crowds out” the emergent capabilities needed to solve the riddle.
  4. LLMs may be excellent polyglots, but if the approach OpenAI has taken with o1 becomes mainstream then we’ll increasingly see them become cognitively ‘American English’ monoglots. If the chain-of-thought reasoning is in English, then even if it translates back at the end it’s always going to struggle with “Kūmara riddles”.

The issue of cognitive load, and measuring it, will solve itself; we can’t keep building trucks without some measure of horsepower.

The issue of low-resource languages, however, isn’t going to solve itself, and the long-term risks of that are… troubling. We might even need to explore novel models; it’s possible that low-resource languages might benefit more from specialised audio-to-audio models?

We can look to increase the size of our training data (e.g. NZ sponsoring the Te Reo Māori Wikipedia) or improve the capabilities of Small Language Models, but IMHO it’s going to take a nation-state level of infrastructure investment to properly solve this (NZ-GPT?).