PhysReason Explained: Why Advanced AI Still Struggles with Physics Reasoning

Telecom Engineer & PhD student passionate about AI and its evolution. My research applies AI to tackle real-world problems – building systems that understand, reason, and act in the physical world.

Understanding the benchmark that evaluates whether large language models can truly reason through physical processes, equations and constraints

Large language models can solve equations, explain scientific concepts and generate detailed step-by-step answers. In mathematics and logic, their progress has been remarkable. But physics introduces a deeper challenge.

Solving a physics problem is not simply a matter of recalling a formula or performing a calculation. A model must understand what physical system is being described, identify which forces or constraints matter, determine how the situation evolves over time, select the correct laws and maintain accuracy through multiple reasoning steps.

A model may correctly remember Newton’s second law and still apply it to the wrong system. It may calculate accurately while misunderstanding the physical process. It may produce a plausible final answer even though an early reasoning step is fundamentally wrong.

This is the problem addressed by the paper PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning, written by Xinyu Zhang and colleagues and later accepted at ACL 2025.

PhysReason is not a new physics-solving model. It is a benchmark designed to test whether advanced artificial intelligence systems can genuinely reason through physics problems rather than merely reproduce familiar solution patterns.

Its conclusion is significant: even some of the strongest reasoning-oriented models still perform poorly when physics problems require long chains of reasoning, careful interpretation of physical conditions and reliable understanding of how real processes unfold.


Why Physics Reasoning Is Different from Mathematical Reasoning

At first glance, a physics problem may look similar to a mathematical exercise. It may contain equations, numerical values and an expected final answer. But physics adds an essential difficulty: equations must represent a real or imagined physical situation.

Consider a problem involving an object moving on an inclined plane. Before calculating anything, a model must determine whether friction exists, which direction it acts in, whether the system is accelerating, which forces belong in the free-body analysis and whether energy conservation can be applied without additional work terms.

A mathematically correct calculation based on an incorrect physical interpretation is still a wrong solution.
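The inclined-plane case can be made concrete with a short sketch. It assumes a block that is already sliding down the slope under kinetic friction, so the acceleration along the slope is g(sin θ − μ cos θ). The angle, the friction coefficient and the function name are illustrative choices, not values taken from the paper.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def incline_acceleration(angle_deg: float, mu_k: float = 0.0) -> float:
    """Acceleration down the slope for a block already sliding downhill.

    Assumes kinetic friction with coefficient mu_k acting up the slope.
    """
    theta = math.radians(angle_deg)
    return G * (math.sin(theta) - mu_k * math.cos(theta))


# Same algebra, two different physical interpretations of the same problem.
a_frictionless = incline_acceleration(30.0)        # friction ignored
a_with_friction = incline_acceleration(30.0, 0.2)  # friction included

print(f"ignoring friction:  {a_frictionless:.2f} m/s^2")
print(f"including friction: {a_with_friction:.2f} m/s^2")
```

With these illustrative values the frictionless reading gives about 4.9 m/s^2 and the reading with friction about 3.2 m/s^2. Both calculations are arithmetically correct, but only one describes the stated system.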

This is why physics reasoning requires several abilities at the same time:

  • identifying the relevant physical system;

  • understanding the process taking place;

  • selecting laws that are valid under the stated conditions;

  • handling diagrams, variables and constraints;

  • performing mathematical derivations correctly;

  • maintaining consistency across a long solution.

Existing benchmarks often measure only whether a model reaches the correct final answer. According to the authors of PhysReason, that approach is insufficient for complex physics, because a model can fail in important intermediate steps even when its final answer appears plausible.

PhysReason therefore evaluates not only whether the answer is correct, but also whether the reasoning process remains physically and mathematically valid step by step.


What Is PhysReason?

PhysReason is a comprehensive benchmark containing 1,200 physics problems designed to evaluate large language models and vision-language models on physics-based reasoning.

The benchmark is split evenly into four categories, each representing twenty-five percent of the problems: knowledge-based questions, which assess whether a model knows the relevant concepts and principles, and reasoning-based questions at easy, medium and hard levels. Reasoning-based questions therefore make up seventy-five percent of the benchmark.

The problems span several major areas of physics, including classical mechanics, quantum mechanics, fluid mechanics, thermodynamics, electromagnetics, optics and relativity. Collectively, they cover 147 physics theorems or principles.

PhysReason is also strongly multimodal. According to the paper, 81 percent of its problems include diagrams or visual elements. This matters because physical reasoning often depends on correctly interpreting a graph, geometric configuration, trajectory, circuit or force diagram.

The benchmark was assembled from publicly available physics education and competition materials, including international physics olympiad problems, Chinese college entrance examination materials, Indian Joint Entrance Examination Advanced problems and additional examination sources.

The official paper is available here: PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning.

The official project resources are available here: PhysReason Project Page.


Why the Benchmark Is More Difficult Than Simple Question Answering

One of the defining characteristics of PhysReason is the depth of the required solutions.

The paper reports that problems in the benchmark require an average of 8.1 reasoning steps. For hard problems, this average rises to 15.6 steps.

This is important because long solutions expose weaknesses that remain invisible in short exercises.

A model may begin a solution correctly by identifying an appropriate law. It may then make a small error when substituting a variable, misinterpret a boundary condition or apply an otherwise valid equation in a situation where its assumptions do not hold. As the number of reasoning steps increases, these errors accumulate.
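A rough way to see why length matters is to treat every step as an independent event with the same chance of being correct. This is a simplifying assumption made here for illustration, not a model proposed in the paper, and the per-step reliability of 0.95 is an arbitrary example value. Only the step counts (8.1 and 15.6) come from the benchmark statistics quoted above.

```python
def chain_success_probability(per_step_reliability: float, steps: float) -> float:
    """Probability that every step is correct, assuming independent steps.

    `steps` may be fractional because the benchmark reports averages.
    """
    return per_step_reliability ** steps


RELIABILITY = 0.95  # hypothetical per-step reliability, chosen for illustration

average_problem = chain_success_probability(RELIABILITY, 8.1)
hard_problem = chain_success_probability(RELIABILITY, 15.6)

print(f"average problem (8.1 steps): {average_problem:.2f}")
print(f"hard problem (15.6 steps):   {hard_problem:.2f}")
```

Under these assumptions, a model that is right 95 percent of the time per step completes an average-length solution without error roughly 66 percent of the time, and a hard one roughly 45 percent of the time. Real errors are not independent, so the numbers are only indicative of the trend.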

This means that a model capable of answering simple conceptual questions may still fail on realistic physical reasoning tasks.

In physical systems, one incorrect assumption can invalidate an entire derivation. If a model ignores friction, misunderstands an interaction, applies conservation of energy incorrectly or confuses the role of initial conditions, the final numerical result may become meaningless even if the algebra appears polished.

PhysReason was designed specifically to reveal these failures.


Evaluating More Than the Final Answer

A central contribution of the paper is the Physics Solution Auto Scoring Framework, abbreviated as PSAS.

The framework contains two complementary forms of evaluation: PSAS-A and PSAS-S.

PSAS-A evaluates answers at the final-answer level. It extracts the answers generated by a model and compares them with reference answers. This makes it suitable for efficient large-scale performance evaluation.

PSAS-S evaluates reasoning at the step level. Instead of asking only whether the final answer is correct, it examines the sequence of reasoning steps, identifies where a solution first deviates from the correct path and classifies the type of error that occurred.

This distinction is essential.

A final-answer evaluation can tell us that a model failed. A step-level evaluation can help us understand why it failed.
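The difference between the two views can be sketched in a few lines. This is not the authors' implementation: PSAS relies on language models to extract answers and judge steps, whereas the sketch below substitutes exact string matching and a precomputed list of per-step verdicts. Both function names are hypothetical.

```python
from typing import Optional, Sequence


def answer_level_score(predicted: Sequence[str], reference: Sequence[str]) -> float:
    """Fraction of reference answers matched after light normalisation.

    Stand-in for answer-level scoring; missing predictions count as wrong.
    """
    def normalise(text: str) -> str:
        return text.strip().lower().replace(" ", "")

    hits = sum(normalise(p) == normalise(r) for p, r in zip(predicted, reference))
    return hits / len(reference)


def first_error_step(step_is_valid: Sequence[bool]) -> Optional[int]:
    """1-based index of the first invalid step, or None if all steps hold.

    Stand-in for step-level scoring; the verdicts are assumed to be given.
    """
    for index, valid in enumerate(step_is_valid, start=1):
        if not valid:
            return index
    return None


score = answer_level_score(["3.2 m/s^2", "12 J"], ["3.2m/s^2", "15 J"])
location = first_error_step([True, True, False, True])
print(score, location)
```

The first function only says how much of the final answer was right. The second says where the solution first left the correct path, which is the information a diagnosis needs.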

The authors report that their PSAS evaluation framework achieves accuracy exceeding 98 percent in their validation experiments, substantially outperforming direct evaluation approaches in identifying answer correctness and the first erroneous reasoning step.

This makes PhysReason useful not only as a ranking benchmark, but also as a diagnostic tool for analysing the limitations of artificial intelligence in physics.


What Models Were Evaluated?

The authors evaluate a range of advanced language models and multimodal models, including conventional large language models and reasoning-oriented models.

The benchmark includes evaluations of models such as GPT-4o, Claude-3.5-Sonnet, Gemini-2.0 variants, DeepSeek-V3, DeepSeek-R1, o1-mini, o3-mini-high, QwQ-32B, QvQ-72B and other systems.

Some models can directly process images, while others receive image captions generated from the original visual inputs. This distinction is necessary because many PhysReason problems depend on diagrams or visual information.

The paper distinguishes between ordinary models and what it calls O-like models, meaning models designed or prompted for stronger reasoning behaviour. These reasoning-oriented systems generally perform better than conventional models, but the central finding remains clear: even the strongest systems still struggle with difficult physics reasoning.


The Main Result: Strong Models Still Remain Below 60 Percent

The benchmark results reveal a major limitation in current artificial intelligence systems.

Among the strongest evaluated systems, DeepSeek-R1 reaches an average answer-level score of 56.75 percent on the full PhysReason benchmark. Gemini-2.0-Flash-Thinking-0121 reaches 54.73 percent, while o3-mini-high reaches 53.32 percent.

In other words, even the strongest models reported in the study remain below 60 percent average answer-level performance.

More importantly, performance drops sharply as physics problems become harder.

DeepSeek-R1 achieves 75.11 percent on knowledge-based questions, but only 31.95 percent on hard reasoning problems. Gemini-2.0-Flash-Thinking-0121 reaches 73.44 percent on knowledge-based questions but only 31.90 percent on hard problems. o3-mini-high reaches 70.67 percent on knowledge questions and 30.12 percent on hard problems.
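The size of the drop is easier to compare once it is computed directly. The snippet below uses only the scores quoted in this section; the variable names are illustrative.

```python
# (knowledge score, hard-reasoning score) in percent, as quoted above
scores = {
    "DeepSeek-R1": (75.11, 31.95),
    "Gemini-2.0-Flash-Thinking-0121": (73.44, 31.90),
    "o3-mini-high": (70.67, 30.12),
}

# Percentage-point drop from knowledge questions to hard reasoning problems
drops = {
    model: round(knowledge - hard, 2)
    for model, (knowledge, hard) in scores.items()
}

for model, drop in drops.items():
    print(f"{model}: {drop:.2f} points lower on hard problems")
```

All three models lose more than 40 percentage points between knowledge questions and hard reasoning problems: 43.16, 41.54 and 40.55 points respectively.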

These results show that knowing physics facts is not the same as sustaining reliable physical reasoning.

A model may know a theorem, recognise a familiar equation or correctly answer conceptual questions, yet fail when it must connect multiple principles across a long chain of reasoning.

This is one of the paper’s most important conclusions: current models appear considerably stronger at physics knowledge than at deep, reliable physics reasoning.


Why Step-Level Evaluation Matters

The results also reveal an interesting difference between final-answer scores and step-level scores.

Step-level scores are generally higher than final-answer scores. This means that models often produce some valid intermediate reasoning steps even when they ultimately fail to reach the correct answer.

For example, a model might correctly identify the initial physical law and begin a derivation in the right direction, but then make a later mistake in the physical interpretation or numerical calculation.

This distinction matters for the development of future AI systems.

A model that fails immediately because it has no understanding of the problem is very different from a model that begins correctly but loses consistency during a long reasoning chain. The second model may benefit from better verification, error correction or internal process monitoring.

PhysReason therefore helps researchers ask a more precise question than “Did the model get the answer right?”

It allows them to ask:

At what point did the model first go wrong, and what kind of reasoning failure caused the error?


The Seven Error Categories Analysed by PhysReason

The PSAS-S framework considers seven categories of errors when analysing the first incorrect step in a model-generated solution.

The first category is Diagram Analysis Error. This occurs when a model misreads visual information, such as graph axes, curve trends, geometric relationships or important features in a diagram.

The second category is Physics Theorem Application Error. This occurs when a model applies an incorrect physical law or uses a correct law in a situation where its assumptions do not hold.

The third category is Physics Condition Analysis Error. This occurs when a model incorrectly assesses the physical system, its boundaries, relevant forces or interacting components.

The fourth category is Physics Process Understanding Error. This occurs when a model misunderstands how a physical phenomenon develops over time, how states change or how one event physically leads to another.

The fifth category is Variable Relationship Error. This occurs when a model misunderstands how physical quantities depend on one another, such as confusing direct and inverse relationships.

The sixth category is Calculation Process Error. This occurs when the physical setup may be correct, but the model makes algebraic, arithmetic, unit-conversion or numerical substitution mistakes.

The seventh category is Boundary Condition Analysis Error. This occurs when a model ignores limiting cases, initial conditions, approximation limits or special constraints required for a valid solution.

Although all seven categories are used in the detailed analysis framework, the authors identify four error types as the most prevalent obstacles across the evaluated models: Physics Theorem Application, Physics Process Understanding, Calculation Process and Physics Condition Analysis.
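For readers who want to tally errors themselves, the taxonomy maps naturally onto an enumeration. The category names below follow the article; the class name, the helper function and the sample labels are hypothetical and are not part of the official PhysReason code.

```python
from collections import Counter
from enum import Enum
from typing import Iterable


class ErrorCategory(Enum):
    DIAGRAM_ANALYSIS = "Diagram Analysis Error"
    PHYSICS_THEOREM_APPLICATION = "Physics Theorem Application Error"
    PHYSICS_CONDITION_ANALYSIS = "Physics Condition Analysis Error"
    PHYSICS_PROCESS_UNDERSTANDING = "Physics Process Understanding Error"
    VARIABLE_RELATIONSHIP = "Variable Relationship Error"
    CALCULATION_PROCESS = "Calculation Process Error"
    BOUNDARY_CONDITION_ANALYSIS = "Boundary Condition Analysis Error"


# The four categories the authors identify as most prevalent
MOST_PREVALENT = frozenset({
    ErrorCategory.PHYSICS_THEOREM_APPLICATION,
    ErrorCategory.PHYSICS_PROCESS_UNDERSTANDING,
    ErrorCategory.CALCULATION_PROCESS,
    ErrorCategory.PHYSICS_CONDITION_ANALYSIS,
})


def tally_first_errors(first_errors: Iterable[ErrorCategory]) -> Counter:
    """Count how often each category is the first error in a solution."""
    return Counter(first_errors)


# Made-up labels for three failed solutions, purely to show the usage
sample = [
    ErrorCategory.CALCULATION_PROCESS,
    ErrorCategory.PHYSICS_PROCESS_UNDERSTANDING,
    ErrorCategory.CALCULATION_PROCESS,
]
tally = tally_first_errors(sample)
print(tally.most_common(1))
```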


Physics Theorem Application: Knowing a Formula Is Not Enough

One of the dominant failure modes identified by PhysReason is Physics Theorem Application Error.

This type of failure occurs when a model uses an incorrect principle, misremembers a law or applies a correct formula in circumstances where it is not valid.

For example, a model may attempt to use conservation of mechanical energy in a system where friction performs significant work without accounting for that energy loss. It may apply Newtonian reasoning in a reference frame that requires additional fictitious forces. It may use a small-angle approximation when the relevant angle is too large for the approximation to remain valid.
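The small-angle case is easy to quantify. The sketch below measures how far the approximation sin θ ≈ θ drifts from the exact value; the function name and the two test angles are illustrative choices.

```python
import math


def small_angle_relative_error(angle_deg: float) -> float:
    """Relative error of replacing sin(theta) with theta (theta in radians)."""
    theta = math.radians(angle_deg)
    exact = math.sin(theta)
    return abs(theta - exact) / exact


for angle in (5.0, 60.0):
    error = small_angle_relative_error(angle)
    print(f"{angle:>4.0f} degrees: {error:.1%} relative error")
```

At 5 degrees the approximation is off by roughly 0.1 percent, which is usually harmless. At 60 degrees it is off by roughly 21 percent, so a solution that applies it there is wrong even if every subsequent line of algebra is flawless.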

These errors reveal an important distinction between formula retrieval and physics reasoning.

A model may be capable of recalling a famous equation, but genuine physical reasoning requires understanding the conditions under which the equation applies.

This problem is especially significant for artificial intelligence systems intended to reason about physical environments. In robotics or autonomous systems, selecting the wrong physical principle is not merely an academic error. It can lead to incorrect predictions about motion, stability, interaction or safety.


Physics Process Understanding: When AI Misunderstands How Reality Evolves

The second major failure mode is Physics Process Understanding Error.

This category is particularly important for research on world models and physical intelligence because it concerns a model's understanding of how a situation develops over time.

A model may identify relevant objects and calculate accurately, yet still misunderstand the process itself.

For example, it may incorrectly analyse projectile motion by failing to separate horizontal and vertical components. It may misunderstand how energy transforms between potential and kinetic forms. It may incorrectly predict the direction of motion from the forces involved. It may assume that an object requires a continuous net force in order to maintain constant velocity.
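The projectile case shows what a correct process description looks like. The sketch below assumes level ground and no air resistance, and keeps the horizontal and vertical motions separate: the horizontal velocity stays constant while gravity acts only on the vertical component. The launch speed and angle are illustrative.

```python
import math

G = 9.81  # gravitational acceleration in m/s^2


def projectile_flight(speed: float, angle_deg: float) -> tuple[float, float]:
    """Return (time of flight, horizontal range) on level ground, no drag."""
    theta = math.radians(angle_deg)
    v_x = speed * math.cos(theta)  # horizontal: constant velocity
    v_y = speed * math.sin(theta)  # vertical: uniformly decelerated by gravity
    time_of_flight = 2.0 * v_y / G  # time to rise and fall back to launch height
    horizontal_range = v_x * time_of_flight
    return time_of_flight, horizontal_range


flight_time, flight_range = projectile_flight(20.0, 45.0)
print(f"time of flight: {flight_time:.2f} s, range: {flight_range:.1f} m")
```

For a 20 m/s launch at 45 degrees this gives a range of about 40.8 m. A solver that lets gravity act on the horizontal component, or that uses the full launch speed in both directions, would produce a different number from equations that look equally tidy.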

These are not merely arithmetic errors. They indicate that the model has an incorrect internal representation of how physical states change.

For a system intended to build a useful model of the world, this is a serious weakness. A world model must be able to represent transitions: how objects move, how interactions unfold, how forces alter trajectories and how causes produce observable effects.

PhysReason therefore provides an important evaluation perspective for future physical world models. Before an AI system can reliably imagine future physical outcomes, it must demonstrate that it understands the structure of physical processes rather than merely generating equations that resemble correct solutions.


Calculation Process: Correct Physics Can Still Fail Mathematically

The third dominant error type is Calculation Process Error.

In this case, the model may understand the underlying physical setup and may even select the correct formula, but it fails during the mathematical execution of the solution.

These errors include incorrect algebraic rearrangement, arithmetic mistakes, incorrect unit conversions and incorrect substitution of numerical values into equations.

For instance, a model may correctly derive an expression for acceleration but substitute a distance in centimetres as if it were in metres. It may lose a square term during simplification or make an elementary multiplication mistake near the end of an otherwise valid solution.
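The unit slip described above is simple to reproduce. The sketch assumes an object starting from rest with constant acceleration, so that a = 2d / t^2; the distance and time are illustrative values.

```python
CM_PER_M = 100.0


def acceleration_from_rest(distance_m: float, time_s: float) -> float:
    """Constant acceleration from rest: d = a * t^2 / 2, so a = 2 * d / t^2."""
    return 2.0 * distance_m / time_s ** 2


distance_cm = 50.0
time_s = 2.0

correct = acceleration_from_rest(distance_cm / CM_PER_M, time_s)  # converted to metres
wrong = acceleration_from_rest(distance_cm, time_s)               # centimetres passed as metres

print(f"correct: {correct} m/s^2, unit slip: {wrong} m/s^2")
```

The derivation is identical in both calls. The missing conversion alone inflates the result by a factor of 100, from 0.25 m/s^2 to 25 m/s^2.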

This category matters because long physics problems require reliable coordination between conceptual reasoning and symbolic calculation.

A physically intelligent system must not only understand what should happen. It must also compute consequences accurately enough to act on that understanding.

Interestingly, the paper observes that some advanced reasoning models, including o1 and o3-mini-high, display relatively fewer calculation process errors but more physics process understanding errors. The authors cautiously suggest that this may reflect a trade-off between computational precision and conceptual understanding in some model behaviours.

This observation is especially valuable because it shows that better arithmetic performance does not automatically imply stronger physical reasoning.


Physics Condition Analysis: Understanding What System Is Being Studied

The fourth dominant category is Physics Condition Analysis Error.

These errors occur when a model misunderstands the physical configuration of the problem: the boundaries of the system, the relevant forces, the interacting components or whether important assumptions are valid.

For example, a model may neglect friction even when friction is essential to the problem. It may incorrectly treat a system as isolated despite the presence of external forces. It may fail to include all relevant forces acting on an object. It may select an incorrect boundary for an energy or momentum analysis.
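The isolated-system mistake can be illustrated with the impulse-momentum theorem in one dimension, Δp = F_ext Δt. The masses, velocities and friction force below are invented for the example, and the function name is hypothetical.

```python
def total_momentum_after(
    initial_momentum: float, external_force: float, duration: float
) -> float:
    """Total momentum of a system after an external force acts for `duration`.

    One-dimensional impulse-momentum theorem: delta_p = F_ext * delta_t.
    """
    return initial_momentum + external_force * duration


# Two carts: 2 kg moving at 3 m/s and 1 kg at rest
p_initial = 2.0 * 3.0 + 1.0 * 0.0

# Treating the system as isolated versus including a 1.5 N friction force for 2 s
p_isolated = total_momentum_after(p_initial, 0.0, 2.0)
p_with_friction = total_momentum_after(p_initial, -1.5, 2.0)

print(p_isolated, p_with_friction)
```

Momentum conservation is a valid law in both cases, but it only yields the right total (3 kg m/s rather than 6 kg m/s) when the external force is recognised as part of the problem's conditions.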

This category answers a very fundamental question:

Does the model correctly understand what is happening in the physical system before it begins calculating?

This is crucial because every equation depends on an interpretation of the system.

If an AI system chooses the wrong objects, forces, boundaries or constraints, even perfect calculation cannot rescue the solution. Physics reasoning begins before the first formula is written. It begins with correctly modelling the situation.

For embodied artificial intelligence, this is highly relevant. A robot interacting with the real world must identify which contacts matter, which forces constrain motion, which objects are coupled and which assumptions remain valid as conditions change.

PhysReason reveals that current advanced models still struggle with this foundational level of physical interpretation.


A Surprising Finding: Identifying the First Error Helps Models Improve

The paper does more than identify failures. It also investigates whether models can improve when they are informed about where their reasoning first became incorrect.

Using PhysReason-mini, a balanced 200-problem subset of the benchmark, the authors compare two correction strategies.

In the first strategy, a model simply receives its previous reasoning process and is asked to attempt the problem again. This direct concatenation approach actually decreases performance by approximately three to five percentage points.

In the second strategy, PSAS-S first identifies and analyses the earliest erroneous reasoning step. The model then receives the problem, its earlier reasoning and targeted information about the location and nature of its first error.

This guided error localisation approach improves performance across the evaluated models, although the size of the gain varies by model.

For example, DeepSeek-R1 improves from 56.60 percent to 58.33 percent on PhysReason-mini when guided by error localisation, a gain of 1.73 percentage points. DeepSeek-V3 improves more substantially, from 34.07 percent to 40.78 percent, a gain of 6.71 percentage points.
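In practice, the two strategies differ mainly in what the retry prompt contains. The sketch below shows that difference under stated assumptions: the function names and the prompt wording are hypothetical and are not taken from the paper or its code.

```python
def direct_retry_prompt(problem: str, previous_reasoning: str) -> str:
    """Strategy 1: hand back the earlier attempt with no diagnosis."""
    return (
        f"Problem:\n{problem}\n\n"
        f"Your previous reasoning:\n{previous_reasoning}\n\n"
        "Please solve the problem again."
    )


def guided_retry_prompt(
    problem: str,
    previous_reasoning: str,
    first_error_step: int,
    error_category: str,
    explanation: str,
) -> str:
    """Strategy 2: also state where the reasoning first failed and why."""
    return (
        f"Problem:\n{problem}\n\n"
        f"Your previous reasoning:\n{previous_reasoning}\n\n"
        f"The first error occurs at step {first_error_step} "
        f"({error_category}): {explanation}\n\n"
        "Correct the solution from that step onward."
    )


guided = guided_retry_prompt(
    "A block slides down a rough incline...",
    "Step 1 ... Step 2 ... Step 3 ...",
    3,
    "Calculation Process Error",
    "a squared term was dropped during simplification",
)
print(guided)
```

The second prompt gives the model a specific place to resume and a specific kind of mistake to look for, rather than asking it to re-read an attempt it already believed to be correct.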

This result suggests that simply encouraging a model to think again is not necessarily enough. Reliable reasoning may require structured diagnosis of where the reasoning path first diverged from physical validity.

That conclusion is highly relevant for future AI systems. Intelligent physical reasoning may depend not only on generating predictions, but also on detecting and correcting internal modelling errors before they propagate into incorrect actions.


What PhysReason Reveals About Current AI

PhysReason shows that current artificial intelligence systems possess meaningful physics knowledge but remain unreliable when required to apply that knowledge across complex reasoning chains.

The models evaluated in the study are often capable of beginning a solution correctly. They can recognise physical concepts, identify useful equations and perform some valid intermediate steps.

However, as the number of required steps grows, their accuracy declines sharply. The difficulty is not limited to calculation. It includes conceptual failures concerning physical laws, process evolution, system conditions and the interpretation of constraints.

This finding challenges a common assumption about advanced reasoning models. Producing long and detailed reasoning does not guarantee that the reasoning reflects a coherent internal understanding of the physical world.

A model may generate a persuasive explanation while making a crucial mistake near the beginning of the solution. Without step-level verification, that mistake may remain hidden behind fluent language and correct-looking equations.

PhysReason provides a framework for exposing those hidden weaknesses.


Why PhysReason Matters for World Models

Although PhysReason is not itself a world model benchmark, its findings are deeply relevant to world model research.

A world model is expected to represent how an environment changes over time and how actions influence future states. In physical environments, this requires more than visual prediction. It requires reliable understanding of forces, constraints, interactions, transformations and causal sequences.

The errors identified by PhysReason map closely onto challenges that a physical world model must eventually overcome.

A model that makes Physics Process Understanding Errors may struggle to predict how a physical scene evolves.

A model that makes Physics Condition Analysis Errors may represent the wrong objects, forces or system boundaries.

A model that makes Physics Theorem Application Errors may apply invalid physical principles when predicting future outcomes.

A model that makes Calculation Process Errors may convert a correct conceptual prediction into an incorrect numerical or symbolic result.

For this reason, PhysReason offers an important diagnostic perspective for researchers working on physical intelligence and causal world models. It reminds us that predicting future observations is not sufficient if the internal reasoning behind those predictions violates physical principles.

A robust physical world model should eventually combine perceptual representation, dynamic prediction, correct physical constraints and reliable reasoning across multiple steps.


From Physics Reasoning to Causal World Models

The connection between PhysReason and causal world models must be stated carefully.

PhysReason does not directly test formal causal inference. It does not ask models to learn structural causal graphs, estimate interventions or answer systematic counterfactual questions.

Instead, it evaluates whether models can reason through physics problems requiring laws, conditions, diagrams and multi-step derivations.

Nevertheless, several of its error categories are closely connected to the ambitions of causal world modelling.

Physics Process Understanding concerns whether a model understands how one physical state leads to another.

Physics Condition Analysis concerns whether a model identifies the relevant interacting system and its constraints.

Boundary Condition Analysis concerns whether a model respects the assumptions under which a predicted relationship remains valid.

These abilities are necessary ingredients for building AI systems that do more than recognise patterns. A causal world model must eventually understand how interventions change outcomes, why events unfold in a particular way and whether learned mechanisms remain stable under new situations.

PhysReason therefore does not provide a complete test of causal world modelling, but it highlights one of its central requirements: an intelligent agent must be able to reason accurately about physical processes and detect when its internal explanation of the world is wrong.


The Limitations of the Study

PhysReason is an important benchmark, but it should not be interpreted as a complete measurement of physical intelligence.

First, the benchmark evaluates problem-solving in an academic format. Solving examination and competition questions is relevant to physics reasoning, but it is not the same as interacting with a changing physical environment through perception and action.

Second, many questions involve diagrams, but the benchmark does not replace interactive robotics or real-world experimental evaluation.

Third, the step-level scoring framework relies on language models as evaluators, although the authors validate PSAS against manually annotated data and report evaluation accuracy above 98 percent.

Finally, strong performance on PhysReason would not automatically prove that a model possesses a physically grounded or causal understanding of the world. It would indicate stronger competence in structured physics reasoning tasks, which is necessary but not sufficient for embodied intelligence.

These limitations do not reduce the value of the benchmark. They clarify its role: PhysReason is a diagnostic instrument for measuring an important capability that many current AI systems still lack.


What Readers Should Remember

PhysReason introduces a rigorous benchmark for evaluating whether large language models can reason through complex physics problems.

It contains 1,200 problems across multiple fields of physics, with long reasoning chains, varied difficulty levels and substantial visual content.

Its evaluation framework measures both final-answer correctness and step-level reasoning quality.

The results show that even leading reasoning-oriented AI systems remain below 60 percent average answer-level performance and deteriorate sharply on difficult problems.

Most importantly, the benchmark identifies four recurring obstacles: incorrect application of physics laws, misunderstanding of physical processes, calculation mistakes and incorrect analysis of physical conditions.

These findings carry a broader message for the future of artificial intelligence.

An AI system cannot be considered physically intelligent merely because it recalls equations or generates plausible explanations. It must understand which principles apply, how systems evolve, which conditions constrain them and where its own reasoning first becomes unreliable.


Conclusion

The paper PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning offers an important evaluation framework for one of artificial intelligence’s most difficult challenges: reasoning reliably about the physical world.

Its results reveal a clear gap between physics knowledge and physics understanding. Advanced models may recognise formulas and begin solutions correctly, yet they continue to fail when problems demand long reasoning chains, correct physical interpretation and careful respect for constraints.

This matters far beyond academic problem solving.

Future intelligent agents, robots and physical world models will need to anticipate how environments evolve, identify the consequences of actions and reason under changing conditions. To do that safely and reliably, they must not only calculate. They must understand processes, conditions and causal structure.

PhysReason provides a valuable step toward measuring that ability.

Before AI can build trustworthy internal models of the physical world, it must first demonstrate that it can reason correctly about the physical world itself.


References and Further Reading

Zhang, X., Dong, Y., Wu, Y., Huang, J., Jia, C., Fernando, B., Shou, M. Z., Zhang, L., & Liu, J. (2025). PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning. ACL 2025.

Direct PDF: Download the original PhysReason paper

Official Project Page: PhysReason Resources and Results

Official Code Repository: PhysReason on GitHub

Official Dataset: PhysReason Dataset on Hugging Face

For readers interested in the relationship between internal simulation and intelligent action, see also Ha, D., & Schmidhuber, J. (2018). World Models.
