Reasoning failures highlighted by Apple research on LLMs

A new paper from Apple's artificial intelligence scientists has found that engines based on large language models, such as those from Meta and OpenAI, still lack basic reasoning skills.

The group has proposed a new benchmark, GSM-Symbolic, to help others measure the reasoning capabilities of various large language models (LLMs). Their initial testing reveals that slight changes in the wording of queries can result in significantly different answers, undermining the reliability of the models.

The group investigated the "fragility" of mathematical reasoning by adding contextual information to their queries that a human could understand, but which should not affect the fundamental mathematics of the solution. This resulted in varying answers, which shouldn't happen.

"Specifically, the performance of all models declines [even] when only the numerical values in the question are altered in the GSM-Symbolic benchmark," the group wrote in their report. "Furthermore, the fragility of mathematical reasoning in these models [demonstrates] that their performance significantly deteriorates as the number of clauses in a question increases."
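The idea of altering only the names and numerical values in a question can be illustrated with a small sketch. This is not the GSM-Symbolic code or data: the template, the name list, the number ranges, and the functions `make_variant`, `reference_solver`, and `accuracy` are all hypothetical, chosen only to show how one word problem becomes many variants whose correct answers are known in advance.

```python
import random
import re

# Hypothetical template, not taken from the GSM-Symbolic benchmark itself.
TEMPLATE = (
    "{name} buys {a} pencils on Monday and {b} pencils on Tuesday. "
    "How many pencils does {name} have?"
)
NAMES = ["Oliver", "Sofia", "Liam", "Mei"]  # assumed name pool


def make_variant(rng):
    """Fill the template with a random name and numbers; return (question, answer)."""
    name = rng.choice(NAMES)
    a = rng.randint(10, 99)
    b = rng.randint(10, 99)
    question = TEMPLATE.format(name=name, a=a, b=b)
    return question, a + b  # the ground truth follows from the template


def reference_solver(question):
    """Stand-in for a model under test: adds every integer found in the question."""
    return sum(int(n) for n in re.findall(r"\d+", question))


def accuracy(model, n=50, seed=0):
    """Fraction of n generated variants that the model answers correctly."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(n):
        question, answer = make_variant(rng)
        if model(question) == answer:
            correct += 1
    return correct / n


print(accuracy(reference_solver))
```

A solver that genuinely reasons about the problem should score the same on every variant, since only surface details change. A drop in accuracy across variants is the kind of fragility the paper describes.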

The study found that adding even a single sentence that appears to offer relevant information to a given math question can reduce the accuracy of the final answer by up to 65 percent. "There is just no way you can build reliable agents on this foundation, where changing a word or two in irrelevant ways or adding a few bits of irrelevant info can give you a different answer," the study concluded.

An absence of critical thinking

One example that illustrates the issue was a math problem that required genuine understanding of the question. The task the team developed, called "GSM-NoOp," was similar to the kind of mathematical "word problems" an elementary student might encounter.

The query started with the information needed to formulate a result. "Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday."

The query then adds a clause that appears relevant but has no bearing on the final answer, noting that of the kiwis picked on Sunday, "five of them were a bit smaller than average." The question then simply asked "how many kiwis does Oliver have?"

The note about the size of some of the kiwis picked on Sunday should have no bearing on the total number of kiwis picked. However, OpenAI's model as well as Meta's Llama3-8b subtracted the five smaller kiwis from the total result.
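The arithmetic can be checked directly. The short sketch below uses only the numbers from the query. The variable names are illustrative, and the "distracted" total simply reproduces the subtraction described above rather than any model's actual output text.

```python
# Numbers taken from the kiwi query quoted in the article.
friday = 44
saturday = 58
sunday = 2 * friday  # "double the number of kiwis he did on Friday"

# Irrelevant detail: five Sunday kiwis were smaller than average.
smaller_than_average = 5

# Correct reasoning: size does not change how many kiwis were picked.
correct_total = friday + saturday + sunday

# Faulty reasoning: treating the size remark as a quantity to subtract.
distracted_total = correct_total - smaller_than_average

print(correct_total)     # 190
print(distracted_total)  # 185
```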

The faulty logic echoes a 2019 study that could reliably confuse AI models by asking a question about the ages of two previous Super Bowl quarterbacks. When the researchers added background and related information about the games they played in, along with a third person who was quarterback in another bowl game, the models produced incorrect answers.

"We found no evidence of formal reasoning in language models," the new study concluded. The behavior of LLMs "is better explained by sophisticated pattern matching," which the study found to be "so fragile, in fact, that [simply] changing names can alter results."

43 Comments

foregoneconclusion 13 Years · 2940 comments

About 5 months ago

The primary issue with LLM computing is the ridiculously high power requirements. It goes against all of the low power hardware development of the last couple of decades.

5 Likes · 0 Dislikes

hexclock 11 Years · 1345 comments

About 5 months ago

Of course they can’t reason. It’s not a living mind. It’s the illusion of intelligence.

7 Likes · 0 Dislikes

22july2013 12 Years · 3794 comments

About 5 months ago

hexclock said:

Of course they can’t reason. It’s not a living mind. It’s the illusion of intelligence.

And of course those Boston Dynamics robot dogs can't run. It's not a living body. It's the illusion of running. Illusion, shmillusion. If it works, that's all I care about. Maybe you agree with me, you're just quibbling over a word.

5 Likes · 0 Dislikes

iOSDevSWE 5 Years · 29 comments

About 5 months ago

The article lacks fact-checking and details such as when the tests on OpenAI's models were conducted and which model was used. When I submit the request, I get the following answer from ChatGPT 4o:

Question: “Oliver picks 44 kiwis on Friday. Then he picks 58 kiwis on Saturday. On Sunday, he picks double the number of kiwis he did on Friday. Of the kiwis picked on Sunday, five of them were a bit smaller than average. How many kiwis does Oliver have?”

Answer: “Let’s break this down:

• On Friday, Oliver picks 44 kiwis.

• On Saturday, he picks 58 kiwis.

• On Sunday, he picks double the number of kiwis he did on Friday, so he picks 44 × 2 = 88 kiwis.

The total number of kiwis he picks is:

44 (Friday) + 58 (Saturday) + 88 (Sunday) = 190 kiwis.

So, Oliver has 190 kiwis in total. The fact that five of the kiwis picked on Sunday are smaller doesn’t affect the total number.”

Perfect answer!

7 Likes · 0 Dislikes