Evaluating GPT- and reasoning-based large language models on Physics Olympiad problems: Surpassing human performance and implications for educational assessment

Journal article · Research · Peer-reviewed

Publication data


By: Paul Tschisgale, Holger Maus, Fabian Kieser, Ben Kroehs, Stefan Petersen, Peter Wulff
Original language: English
Published in: Physical Review Physics Education Research, 21(2), Article 020115
Pages: 21
Publisher: American Physical Society
ISSN: 2469-9896
DOI/Link: https://doi.org/10.1103/6fmx-bsnl (Open Access)
Publication status: Published, August 2025

Large language models (LLMs) are now widely accessible, reaching learners across all educational levels. This development has raised concerns that their use may circumvent essential learning processes and compromise the integrity of established assessment formats. In physics education, where problem solving plays a central role in both instruction and assessment, it is therefore essential to understand the physics-specific problem-solving capabilities of LLMs. Such understanding is key to informing responsible and pedagogically sound approaches to integrating LLMs into instruction and assessment. This study therefore compares the problem-solving performance of a general-purpose LLM (GPT-4o, using varying prompting techniques) and a reasoning-optimized model (o1-preview) with that of participants in the German Physics Olympiad, based on a set of well-defined Olympiad problems. In addition to evaluating the correctness of the generated solutions, the study analyzes the characteristic strengths and limitations of LLM-generated solutions. The results of this study indicate that both tested LLMs (GPT-4o and o1-preview) demonstrate advanced problem-solving capabilities on Olympiad-type physics problems, on average outperforming the human participants. Prompting techniques had little effect on GPT-4o's performance, and o1-preview almost consistently outperformed both GPT-4o and the human benchmark. The main implications of these findings are twofold: LLMs pose a challenge for summative assessment in unsupervised settings, as they can solve advanced physics problems at a level that exceeds top-performing students, making it difficult to ensure the authenticity of student work. At the same time, their problem-solving capabilities offer potential for formative assessment, where LLMs can support students in evaluating their own solutions to problems.