Unreasonable Claims of Reasoning Ability in LLMs

Reasoning is critical for problem solving, decision making, and human intelligence in general. There have been various claims about the reasoning abilities of LLMs. Typically these claims are based on a few anecdotal examples, followed by some broad-brush conclusions. Several papers have debunked such claims, demonstrating how LLMs fail on non-trivial reasoning tasks. I will review two of those papers, showing how the so-called reasoning ability is an illusion. I will also show how success on simple reasoning tasks can be explained by co-occurrence pattern learning and In-Context Learning in the GPT Transformer underlying LLMs.

Although the focus of this post is on the reasoning ability of LLMs, the arguments presented apply equally to any claimed problem-solving ability of LLMs.

Underspecification, or spurious correlation, is a fundamental problem and limitation of deep learning models. These models are optimized to map input to output successfully during training, but that does not guarantee that the underlying data-generating process has been replicated, especially when that process is as complex as human cognition. To quote from the paper:

In many applications of machine learning (ML), a trained model is required to not only predict well in the training domain, but also encode some essential structure of the underlying system. Unfortunately, standard ML pipelines are poorly set up for satisfying these requirements. Standard ML pipelines are built around a training task that is characterized by a model specification, a training dataset, and an independent and identically distributed (iid) evaluation procedure. Importantly, the evaluations in this pipeline are agnostic to the particular inductive biases encoded by the trained model. While this paradigm has enabled transformational progress in a number of problem areas, its blind spots are now becoming more salient. In particular, concerns regarding “spurious correlations” and “shortcut learning” in trained models are now widespread.

The model may latch onto spurious features. Even if relevant features are selected, there is still no guarantee that the learned model will resemble the actual underlying data-generating process. The trained model will learn some function, and there is no way to tell what that function is. There are many examples of spurious correlation; in one case, an object-detection deep learning model identified animals based on features of the background in the image.

It is very unlikely that the Transformer-based architecture of an LLM can replicate a system as complex as human cognition. Many people try a few simple anecdotal examples successfully and then immediately draw broad conclusions about various cognitive abilities of LLMs such as reasoning, planning, and problem solving. This initial irrational exuberance is often debunked when the LLM is shown to fail on non-trivial and complex problems, whether in reasoning, planning, or problem solving in general.

When some knowledge about the underlying system is available, that knowledge can be used to nudge the model towards the real underlying system. The techniques used are regularizers or custom loss functions. This approach is known as Physics-Informed Deep Learning (PIDL). It has been used successfully for modeling many physical processes where some knowledge about the system is available.
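To make the idea concrete, here is a minimal sketch of what such a custom physics-informed loss might look like in PyTorch, assuming a toy system governed by the known ODE dy/dx = -y. The network size, collocation points, and the 0.5 weighting factor are my own illustrative choices, not taken from any of the papers discussed here.

```python
import torch

# Toy PIDL sketch: fit y(x) to noisy data while a physics regularizer
# enforces the known governing equation dy/dx + y = 0.
torch.manual_seed(0)

x_data = torch.linspace(0, 2, 20).unsqueeze(1)
y_data = torch.exp(-x_data) + 0.02 * torch.randn_like(x_data)   # noisy observations

net = torch.nn.Sequential(
    torch.nn.Linear(1, 32), torch.nn.Tanh(), torch.nn.Linear(32, 1)
)
opt = torch.optim.Adam(net.parameters(), lr=1e-2)

# Collocation points where the physics residual is evaluated
x_phys = torch.linspace(0, 2, 50).unsqueeze(1).requires_grad_(True)

for step in range(2000):
    opt.zero_grad()
    # ordinary data-fitting loss
    data_loss = torch.mean((net(x_data) - y_data) ** 2)
    # physics loss: residual of the known ODE dy/dx + y = 0
    y_phys = net(x_phys)
    dy_dx = torch.autograd.grad(y_phys, x_phys,
                                torch.ones_like(y_phys), create_graph=True)[0]
    physics_loss = torch.mean((dy_dx + y_phys) ** 2)
    loss = data_loss + 0.5 * physics_loss   # weighting is a hyperparameter
    loss.backward()
    opt.step()
```

The physics term constrains the learned function toward the known data-generating process in regions where data alone would leave it underspecified.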

For human cognition, a PIDL-style approach is extremely challenging because there is no accepted neuroscientific theory or model of human cognition that could play the role of the physics.

This was the first paper to study the reasoning ability of LLMs and essentially debunk it. A reasoning task consists of some given facts and rules defined over predicates as input, together with a query or goal. The problem is solved using deductive logic.

The facts and rules represent the available knowledge about the problem to be solved. For a human, or for an Expert System in the GOFAI tradition, the order in which these items appear in the input is irrelevant: regardless of the order in which the facts and rules are presented, humans and expert systems will find the same solution. For an LLM the order matters, indicating that the LLM does not really learn the axioms of logic. Instead it relies on co-occurrence sequence pattern matching in the attention heads of the Transformer.
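To see why order should not matter, here is a minimal forward-chaining sketch of a toy rule-based problem of the kind studied in the paper. The predicate names and rules are hypothetical, for illustration only; the point is that the deductive answer is invariant under every permutation of the facts and rules.

```python
from itertools import permutations

def forward_chain(facts, rules, query):
    """Deductive closure by forward chaining; rules are (premises, conclusion) pairs."""
    known = set(facts)
    changed = True
    while changed:
        changed = False
        for premises, conclusion in rules:
            if conclusion not in known and all(p in known for p in premises):
                known.add(conclusion)
                changed = True
    return query in known

# Hypothetical problem: predicates are just strings.
facts = ["A", "B"]
rules = [(["A", "B"], "C"), (["C"], "D")]

# The answer is identical for every ordering of the facts and rules.
answers = {
    forward_chain(list(f), list(r), "D")
    for f in permutations(facts)
    for r in permutations(rules)
}
print(answers)  # {True}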

They trained a BERT-based Transformer model to solve these logical problems. BERT trains to near-perfect accuracy on in-distribution test data, but the model does not generalize to other distributions within the same task space. Since the correct reasoning function is invariant to data distributions, it follows that the model has not learned to reason. They even established this interesting result:

For BERT with n layers, there exists a set of parameters such that the model can correctly solve any reasoning problem in SimpleLogic that requires ≤ n − 2 steps of reasoning.

To test on Out-Of-Distribution (OOD) data, they defined two sampling techniques for sampling the facts, predicates, rules, and query, called Rule Priority and Label Priority. In Rule Priority, the query is randomly sampled, and its label is computed by forward chaining over the facts and rules already sampled. In Label Priority, a True/False label is randomly assigned to a predicate, and then rules and facts consistent with the pre-assigned label are randomly sampled.
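The following is a loose sketch of the Rule Priority idea only, reusing the forward_chain function from the earlier snippet; it is my own simplification, not the authors' exact generator (Label Priority works backwards from the label and needs a consistency check that is omitted here). The predicate names and size parameters are made up.

```python
import random

def sample_rule_priority(predicates, n_facts, n_rules, rng=random):
    """Sample a problem first, then derive the label by forward chaining."""
    facts = rng.sample(predicates, n_facts)
    rules = []
    for _ in range(n_rules):
        body = rng.sample(predicates, rng.randint(1, 3))   # 1-3 premises per rule
        head = rng.choice(predicates)
        rules.append((body, head))
    query = rng.choice(predicates)
    label = forward_chain(facts, rules, query)   # label follows from the sampled world
    return facts, rules, query, label

preds = [f"P{i}" for i in range(10)]
print(sample_rule_priority(preds, n_facts=3, n_rules=5))
```

The two procedures define the same task space but induce different statistical distributions over problems, which is exactly what the OOD test exploits.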

When they used one sampling technique for the training data and the other for the test data, there was a steep decline in model performance, all the way down to 57% accuracy. If the model had truly learned the axioms of logic, it should not matter how the facts, predicates, rules, and queries are sampled to define a problem.

Rule-based data has certain statistical features, listed below. The authors hypothesize that the model might be exploiting these spurious features and failing to generalize because of spurious correlation, as alluded to earlier; a small sketch after the list shows how easily such shortcuts can be exploited.

  • With more rules, the query is more likely to evaluate to true
  • When rules are longer, i.e. contain more predicates in their bodies, the query is more likely to evaluate to false
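Here is a minimal illustration (my own, not the authors' probe) of how a predictor could score above chance on such data using only these shallow statistics, without doing any deduction at all. The thresholds are invented for the sketch; a trained model would tune such cut-offs implicitly.

```python
def shortcut_predict(facts, rules, query,
                     rule_count_threshold=4, avg_body_threshold=2.0):
    """Predict the label from shallow statistics, ignoring the logic entirely."""
    n_rules = len(rules)
    avg_body = sum(len(body) for body, _ in rules) / max(n_rules, 1)
    # More rules -> lean True; longer rule bodies -> lean False.
    score = (n_rules - rule_count_threshold) - (avg_body - avg_body_threshold)
    return score > 0
```

A classifier like this generalizes only as long as the test distribution preserves the same statistical quirks, which is precisely what breaks under the RP/LP distribution shift.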

Finally, here is a quote from the paper raising doubts about the true reasoning ability of LLMs; the model may be relying on statistical co-occurrence patterns, creating the illusion of a good reasoner:

We demonstrate that the BERT model has not learned to emulate the correct reasoning function: it is in fact learning statistical features, which inherently exist in logical reasoning problems.

This is another recent paper debunking the reasoning abilities of LLMs. The authors tested the generalization abilities of LLMs on new tasks using counterfactuals. There are two kinds of generalization for LLMs: in-distribution and out-of-distribution. In-distribution generalization involves a new task instance of a task type seen in the training data, and LLMs generally perform well here. Out-of-distribution generalization involves a new task, and here LLM performance drops sharply.

Given a world model, a task is defined as a mapping from A to B, i.e. given A, arrive at B. The world model is essentially the context for a task; for example, for math problems the world model could be base-10 arithmetic. Out-of-distribution tasks are created by changing the default world model to a counterfactual world model, with everything else about the task remaining the same.

For example, for math problems the base could be changed from 10 to something else to create a new task. We can safely assume that nearly all arithmetic examples in the training data use base 10. If the LLM had really learned the rules of arithmetic, those rules should apply successfully in any base, not just base 10.
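As a concrete illustration (mine, mirroring the spirit of the paper's setup rather than reproducing it), the same addition can be posed under the default base-10 world model and under a counterfactual base such as 9. The carrying rule is identical in every base; only the threshold changes, so a system that has learned the rule answers both correctly.

```python
def add_in_base(a_digits, b_digits, base):
    """Add two equal-length numbers given as digit lists (most significant first)."""
    result, carry = [], 0
    for da, db in zip(reversed(a_digits), reversed(b_digits)):
        s = da + db + carry
        result.append(s % base)      # digit in the current base
        carry = s // base            # same carrying rule for every base
    if carry:
        result.append(carry)
    return list(reversed(result))

# "27 + 15" interpreted under two different world models.
print(add_in_base([2, 7], [1, 5], base=10))  # [4, 2] -> 42 in base 10
print(add_in_base([2, 7], [1, 5], base=9))   # [4, 3] -> 43 in base 9
```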

They tested nine kinds of tasks, including logic-based reasoning; the task types are listed below. Under a counterfactual world model or context, there was a performance drop for every task type, although the amount of degradation varied across task types.

  • Arithmetic
  • Programming
  • Basic Syntactic Reasoning
  • Natural Language Reasoning with First-Order Logic
  • Spatial Reasoning
  • Drawing
  • Music
  • Chess
  • SET Game

Logical reasoning tasks contain a premise, stated in terms of predicates and rules, followed by a query. The premise is the world model for these tasks. For the counterfactual tests, premises were changed to deliberately violate common sense, and performance dropped significantly on those premises. Here is an example:

Correct premise:
If corgis is a mammal and a mammal is an animal, is corgis an animal?
Yes
Wrong premise:
If corgis is a reptile and a reptile is a plant, is corgis a plant?
Yes

With the wrong, counterfactual premise the answer is still correct as per deductive logic, but the premise is false, devoid of general knowledge and common sense. This makes it more of a test of common-sense reasoning than of deductive-logic-based reasoning. Here is a quote from the paper:

Across a suite of 11 tasks, we observe nontrivial performance on the counterfactual variants, but nevertheless find that performance substantially and consistently degrades compared to the default conditions. This suggests that while current LMs may possess abstract task solving skills to a degree, they often also rely on narrow, non-transferable procedures for task solving.

There are many successful examples of LLMs performing reasoning tasks. If the LLM is incapable of true reasoning, then how do we explain those successes? We will approach this through a feature of LLMs called In-Context Learning (ICL). ICL has emerged as a powerful learning paradigm for LLMs; strictly speaking, it is a meta-learning paradigm.

For ICL to work, you provide a few examples of a task and then ask the question, all in the prompt. In the case of Chain of Thought (CoT) prompting for problem solving, a problem is broken down into smaller sub-problems along with their answers; the sequence of sub-problems and solutions is provided through the prompt, and then a question is asked. Because ICL learns from a few examples, it has been called few-shot learning.
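For concreteness, here is a minimal sketch of what a few-shot ICL prompt might look like. The task, the examples, and the formatting are made up for illustration, and no particular model or API is assumed.

```python
# A few demonstrations of a toy task (last letter of a word), followed by the query.
# The model is expected to infer the mapping purely from the pattern in the prompt.
prompt = """\
Input: apple  -> Output: e
Input: banana -> Output: a
Input: cherry -> Output: y
Input: grape  -> Output:"""

# The prompt would be sent to an LLM as-is; given enough similar patterns seen in
# training, the expected completion is " e".
print(prompt)
```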

The essence of a Transformer attention head is the representation of tokens at various levels of abstraction. Each representation is a combination of the token itself and its associations with nearby tokens; the self-representation and the association representations are entangled. There are various theories of how ICL works. The one I find simple and intuitive is based on the concept of a task vector.
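As a reminder of the mixing that happens inside an attention head, here is a small numpy sketch of standard scaled dot-product attention with made-up dimensions (causal masking, multiple heads, and output projections are omitted). Each output row is a weighted combination of all value vectors, which is where the entanglement of self- and association-representations comes from.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d)) V."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V   # each output token is a mixture of all value vectors

rng = np.random.default_rng(0)
n_tokens, d = 5, 8
X = rng.normal(size=(n_tokens, d))                 # token embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
out = attention(X @ Wq, X @ Wk, X @ Wv)
print(out.shape)   # (5, 8): one mixed representation per token
```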

The prompt for ICL contains some demonstrations or examples (S) with answers for a class of problems, plus a query. The task vector represents the solution procedure for a class of problems that has been seen in the training data. The embedding of the last token of S at some layer is taken as the task vector, and a transformation of the query, parameterized by the task vector, yields the solution. ICL is essentially a meta-learner. Here is a quote from the paper:

In ICL, one provides an LLM with a prompt including demonstrations S of some task, and a query x. The model generates the output for x. We show that the underlying process can be broken down into two parts: A, a “learning algorithm”, computes a query agnostic vector θ(S), which we view as a parameter of a function in a hypothesis class. The second part, denoted by f and marked in yellow, is the application of the rule defined by θ on the query x, without direct dependence on S.

When you provide a prompt with a chain of thought for a logical problem, or multiple examples with solutions for a general problem, the model generates a task vector based on similar tasks seen in the training data. Applying the query to the task vector yields the answer. This mechanism, based on statistical correlation, has nothing to do with learning the rules of math or first-order logic.
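A rough sketch of the two-stage view described in the quote follows. It uses a hypothetical model interface (hidden_states, generate_with_patch) rather than any real library API, and the layer choice and patching step are simplified relative to the paper's actual intervention.

```python
# Conceptual sketch only: θ(S) is read off the last demonstration token at some
# layer L, and the answer is produced by running the bare query with that vector
# patched in, so the output depends on S only through the single vector θ.

def extract_task_vector(model, demonstrations, layer):
    states = model.hidden_states(demonstrations)   # hypothetical: [layer][token] -> vector
    return states[layer][-1]                       # θ(S): last demo token at layer L

def apply_task_vector(model, query, theta, layer):
    # f(x; θ): run the query alone, substituting θ at the chosen layer.
    return model.generate_with_patch(query, patch_layer=layer, patch_vector=theta)

# Hypothetical usage (layer 14 is an arbitrary choice for illustration):
# theta  = extract_task_vector(model, "France -> Paris\nItaly -> Rome\n", layer=14)
# answer = apply_task_vector(model, "Spain ->", theta, layer=14)   # expected: "Madrid"
```

The key point is that the query never sees the demonstrations directly; it only sees a vector summarizing which previously learned task the prompt resembles.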

If LLMs had truly learned the rules of math or logic, they would work equally well on OOD problems. But they do not, as has been demonstrated in many papers.

Wrapping Up

This post reviewed two of the many papers debunking the myth of reasoning and general problem-solving abilities of LLMs. We have also seen how In-Context Learning, founded on co-occurrence pattern matching, can explain the so-called successful examples of reasoning and problem solving.