As Apple struggles to integrate AI into Siri and its other software, Apple researchers posted a controversial paper provocatively titled “The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity.” The researchers are Parshin Shojaee, Iman Mirzadeh, Keivan Alizadeh, Maxwell Horton, Samy Bengio, and Mehrdad Farajtabar.
First off, who knew Apple was conducting AI research?
The paper questions the current benchmarks used to evaluate AI models, focusing on the models’ reasoning process rather than just final-answer accuracy.
As the Abstract states: “In this work, we systematically investigate these gaps with the help of controllable puzzle environments that allow precise manipulation of compositional complexity while maintaining consistent logical structures. This setup enables the analysis of not only final answers but also the internal reasoning traces, offering insights into how LRMs “think”. Through extensive experimentation across diverse puzzles, we show that frontier LRMs face a complete accuracy collapse beyond certain complexities. Moreover, they exhibit a counterintuitive scaling limit: their reasoning effort increases with problem complexity up to a point, then declines despite having an adequate token budget.”
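To make “precise manipulation of compositional complexity” concrete: in the Tower of Hanoi, one of the puzzle environments the paper uses, a single parameter (the number of disks) scales the problem, and the shortest solution grows exponentially. Here is a minimal Python sketch of that scaling; this is my illustration of the idea, not the authors’ code:

```python
# In the Tower of Hanoi, the number of disks n is the complexity knob:
# the minimal solution length grows exponentially as 2**n - 1 moves.

def hanoi_moves(n: int, src: str = "A", aux: str = "B", dst: str = "C") -> list[tuple[str, str]]:
    """Return the optimal move sequence for n disks as (from_peg, to_peg) pairs."""
    if n == 0:
        return []
    return (
        hanoi_moves(n - 1, src, dst, aux)      # move n-1 disks out of the way
        + [(src, dst)]                         # move the largest disk
        + hanoi_moves(n - 1, aux, src, dst)    # restack the n-1 disks on top of it
    )

for n in range(1, 11):
    assert len(hanoi_moves(n)) == 2**n - 1
    print(f"n={n:2d} disks -> {2**n - 1:4d} moves required")
```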
The paper asserts: “Our key contributions are:

“We uncover surprising limitations in LRMs’ ability to perform exact computation, including their failure to benefit from explicit algorithms and their inconsistent reasoning across puzzle types.

“We question the current evaluation paradigm of LRMs on established math benchmarks and design a controlled experimental testbed by leveraging algorithmic puzzle environments that enable controllable experimentation with respect to problem complexity.

“We show that state-of-the-art LRMs (e.g., o3-mini, DeepSeek-R1, Claude-3.7-Sonnet-Thinking) still fail to develop generalizable problem-solving capabilities, with accuracy ultimately collapsing to zero beyond certain complexities across different environments.

“We find that there exists a scaling limit in the LRMs’ reasoning effort with respect to problem complexity, evidenced by the counterintuitive decreasing trend in the thinking tokens after a complexity point.

“We question the current evaluation paradigm based on final accuracy and extend our evaluation to intermediate solutions of thinking traces with the help of deterministic puzzle simulators. Our analysis reveals that as problem complexity increases, correct solutions systematically emerge at later positions in thinking compared to incorrect ones, providing quantitative insights into the self-correction mechanisms within LRMs.”
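The “deterministic puzzle simulators” idea is worth unpacking: because puzzle rules are mechanical, every intermediate move a model proposes in its thinking trace can be checked exactly, not just the final answer. Below is a hedged sketch of such a validator for the Tower of Hanoi; the interface and function name are my assumptions, not the paper’s published evaluation code:

```python
# Sketch of a deterministic simulator: replay a model's proposed move
# sequence against the Tower of Hanoi rules and report where (if
# anywhere) it first breaks. Illustration only, not the authors' code.

def validate_hanoi(n: int, moves: list[tuple[str, str]]) -> tuple[bool, int | None]:
    """Return (solved, index_of_first_illegal_move_or_None) for n disks."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}  # peg A holds disks n..1, largest at bottom
    for i, (src, dst) in enumerate(moves):
        if not pegs[src]:
            return False, i                      # illegal: moving from an empty peg
        disk = pegs[src][-1]
        if pegs[dst] and pegs[dst][-1] < disk:
            return False, i                      # illegal: larger disk onto a smaller one
        pegs[dst].append(pegs[src].pop())
    solved = pegs["C"] == list(range(n, 0, -1))  # all disks stacked on the target peg
    return solved, None

# A correct trace passes; a trace with one bad move is caught at its index.
ok, bad = validate_hanoi(3, [("A","C"),("A","B"),("C","B"),("A","C"),("B","A"),("B","C"),("A","C")])
print(ok, bad)   # True None
ok, bad = validate_hanoi(2, [("A","C"), ("A","C")])
print(ok, bad)   # False 1  (disk 2 placed on disk 1)
```

A checker like this is what makes the paper’s claim about correct solutions appearing “at later positions in thinking” measurable: each step of the trace gets a ground-truth verdict.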
It will be interesting to hear how AI researchers at OpenAI, Anthropic, and DeepSeek respond to Apple’s assertions.