
The illusion of thinking: large reasoning models (LRMs) collapse in the face of complicated tasks

This article is part of an ongoing series looking at AI in KM, and KM in AI.

Large language models (LLMs)1 are machine learning models designed to mathematically understand and generate human language. They underpin generative artificial intelligence (AI) chatbots such as ChatGPT.

Newly emerging large reasoning models (LRMs), also known as reasoning language models (RLMs)2, are LLMs that are further trained to solve tasks requiring several steps of reasoning3. They tend to perform better than standard LLMs on logic, maths, and programming tasks, can revisit and revise earlier steps, and can use extra computation at inference time as another way to scale performance. Examples of LRMs include OpenAI’s o1 and o3, DeepSeek-R1, and Alibaba’s QwQ.

But just how good are LRMs really? In a paper4 recently published by Apple Machine Learning Research, Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar caution that the fundamental benefits and limitations of LRMs are insufficiently understood.

To help address this knowledge gap, Shojaee and colleagues investigated the performance of LRMs in complicated puzzle environments. They actually describe these puzzle environments as ‘complex’ rather than ‘complicated’, but ‘complicated’ is used in this article for consistency with the terminology more commonly used in knowledge management (KM)5. That Shojaee and colleagues did not define what they meant by ‘complex’ and ‘complexity’ is a shortcoming of their paper; had they done so, they might have realized that ‘complicated’ is the more appropriate term for these puzzle environments.

The puzzle environments

As shown in Figure 1, the four puzzle environments are:

  • Tower of Hanoi – A puzzle featuring three pegs and disks of different sizes stacked on the first peg in size order (largest at bottom). The goal is to transfer all disks from the first peg to the third peg. Valid moves are constrained to moving only one disk at a time, taking only the top disk from a peg, and never placing a larger disk on top of a smaller one. The level of complication of this task can be controlled by the number of initial disks.
  • Checker Jumping – A one-dimensional puzzle arranging red checkers, blue checkers, and a single empty space in a line. The objective is to swap the positions of all red and blue checkers, effectively mirroring the initial configuration. Valid moves include sliding a checker into an adjacent empty space or jumping over exactly one checker of the opposite color to land in an empty space. No checker can move backward in the puzzle process. The level of complication of this task can be controlled by the number of checkers.
  • River Crossing – A constraint satisfaction planning puzzle involving actors and their corresponding agents who must cross a river using a boat. The goal is to transport all individuals from the left bank to the right bank. The boat can carry at most a certain number of individuals and cannot travel empty. Invalid situations arise when an actor is in the presence of another agent without their own agent present, as each agent must protect their client from competing agents. The level of complication of this task can also be controlled by the number of actor/agent pairs present.
  • Blocks World – A block-stacking puzzle requiring rearrangement of blocks from an initial configuration into a specified goal configuration. The objective is to find the minimum number of moves needed for this transformation. Valid moves are restricted to the topmost block of any stack, which can be placed either on an empty stack or on top of another block. The level of complication in this task can be controlled by the number of blocks present.
Illustration of the four puzzle environments.
Figure 1: Illustration of the four puzzle environments. Columns show the progression from initial
state (top) through intermediate state (middle) to target state (bottom) for puzzles: Tower
of Hanoi (disk transfer across pegs), Checker Jumping (position swapping of colored tokens), River
Crossing (transporting entities across a river), and Blocks World (stack reconfiguration). Source: Shojaee et al. 2025.
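To give a sense of how sharply the level of complication can be scaled in these environments, the following is a minimal sketch (not from the paper, with assumed peg names A, B, and C) of the classic recursive Tower of Hanoi solution. Each added disk doubles the work: the optimal solution for n disks requires 2^n − 1 moves.

```python
def hanoi(n, source="A", target="C", spare="B"):
    """Return the list of moves (disk, from_peg, to_peg) that transfers
    n disks from source to target, obeying the rules: one disk at a time,
    only the top disk of a peg, never a larger disk on a smaller one."""
    if n == 0:
        return []
    moves = hanoi(n - 1, source, spare, target)   # clear the n-1 smaller disks out of the way
    moves.append((n, source, target))             # move the largest disk to the target peg
    moves += hanoi(n - 1, spare, target, source)  # restack the smaller disks on top of it
    return moves

for n in (3, 5, 10):
    print(n, len(hanoi(n)))  # optimal move counts: 7, 31, 1023
```

This exponential growth in the minimum solution length is what lets the researchers dial the level of complication upward simply by adding disks, without changing the rules of the puzzle.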

Fundamental limitations in current models revealed

Through their puzzle environment investigations, Shojaee and colleagues have revealed what they say are fundamental limitations in current models. Despite sophisticated self-reflection mechanisms, these models fail to develop generalizable reasoning capabilities beyond certain thresholds.

Shojaee and colleagues identified three distinct reasoning regimes:

  1. Standard LLMs outperform LRMs at low levels of complication.
  2. LRMs excel in moderately complicated environments.
  3. Both collapse in highly complicated environments.

Particularly concerning is the counterintuitive reduction in reasoning effort as problems approach critical levels of complication, which Shojaee and colleagues say suggests an inherent compute scaling limit in LRMs. They note that their detailed analysis of reasoning traces further exposed complication-dependent reasoning patterns, from inefficient “overthinking” on simpler problems to complete failure on complicated ones.

Shojaee and colleagues conclude that these insights challenge prevailing assumptions about LRM capabilities and suggest that current approaches may be encountering fundamental barriers to generalizable reasoning.

The challenges in developing LLM-based agents that may, at some point in the future, be genuinely fully intelligent will be further explored in the RealKM Magazine series Advances & challenges in foundation agents.

Article source: Shojaee et al. 2025, CC BY 4.0.

Header image source: Stochastic Parrots at Work by IceMing & Digit / Better Images of AI / CC BY 4.0.

References:

  1. Wikipedia, CC BY-SA 4.0.
  2. Wikipedia, CC BY-SA 4.0.
  3. Besta, M., Barth, J., Schreiber, E., Kubicek, A., Catarino, A., Gerstenberger, R., … & Hoefler, T. (2025). Reasoning language models: A blueprint. arXiv preprint arXiv:2501.11223.
  4. Shojaee, P., Mirzadeh, I., Alizadeh, K., Horton, M., Bengio, S., & Farajtabar, M. (2025). The illusion of thinking: Understanding the strengths and limitations of reasoning models via the lens of problem complexity. arXiv preprint arXiv:2506.06941.
  5. Wikipedia, CC BY-SA 4.0.

Bruce Boyes

Bruce Boyes is editor, lead writer, and a director of RealKM Magazine and winner of the International Knowledge Management Award 2025 (Individual Category). He is an experienced knowledge manager, environmental manager, project manager, communicator, and educator, and holds a Master of Environmental Management with Distinction and a Certificate of Technology (Electronics). His many career highlights include:

  • establishing RealKM Magazine as an award-winning resource with more than 2,500 articles and 2 million reader views
  • leading the knowledge management (KM) community KM and Sustainable Development Goals (SDGs) initiative
  • using agile approaches to oversee the on-time and under-budget implementation of an award-winning $77.4 million recovery program for one of Australia's iconic river systems
  • leading a knowledge strategy process for Australia’s 56 natural resource management (NRM) regional organisations
  • pioneering collaborative learning and governance approaches to empower communities to sustainably manage landscapes and catchments in the face of complexity
  • being one of the first to join a new landmark aviation complexity initiative
  • initiating and teaching two new knowledge management subjects at Shanxi University in China
  • writing numerous notable environmental strategies, reports, and other works.
