Christian Stump and his colleagues have given current large language models math problems to solve.

© RUB, Marquard

Mathematics

Large Language Models Solve High-Level Math Problems

In a benchmark study, 49 international researchers put artificial intelligence to the test with 100 problems. Only a few remained unsolved.

A group of 49 international mathematicians put current large language models to the test with 100 questions: Which high-level math problems can they solve? Which ones can’t they solve yet? “We were impressed by the results,” reports Professor Christian Stump of Ruhr University Bochum, Germany, the initiative’s organizer. “Only two problems remained unsolved. This shows that artificial intelligence’s mathematical problem-solving abilities have improved significantly.”

The researchers gathered for a three-day workshop at the Max Planck Institute for Mathematics in the Sciences in Leipzig. There, using the ScienceBench platform, they compiled a benchmark consisting of 100 mathematical questions. These questions were at least as complex as those found in doctoral dissertations. The answers had to be unambiguous and known to the authors, but must not have been explicitly published in any academic journal.

They posed these questions to five current large language models (LLMs) just once. Afterward, 41 tasks remained unsolved. They then presented the same questions to the top three models from the first round 20 more times. “There is significant variation in the answers a model provides to the exact same question across different rounds,” explains Christian Stump. “After 20 rounds, we already see significantly more solved questions than in a single round. Only 16 unsolved questions remained.”
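One way to read the effect of repeated rounds is as an "any attempt counts" tally. The following is a minimal sketch under that assumption; the article does not spell out the exact scoring rule, and the function name and the toy data are hypothetical, not taken from the study.

```python
def solved_in_any_attempt(results):
    """Return the ids of questions with at least one correct attempt.

    `results` maps a question id to a list of booleans, one per attempt.
    Assumption: a question counts as solved if any attempt is correct.
    """
    return {qid for qid, attempts in results.items() if any(attempts)}


# Hypothetical toy data for illustration only (not from the benchmark).
toy_results = {
    "q1": [False, True, False],   # solved on the second attempt
    "q2": [False, False, False],  # never solved
    "q3": [True, True, True],     # solved every time
}

solved = solved_in_any_attempt(toy_results)
unsolved = sorted(set(toy_results) - solved)
print("solved:", sorted(solved))
print("unsolved:", unsolved)
```

Under this rule, additional rounds can only shrink the set of unsolved questions, which is consistent with the drop the researchers report.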

Finally, they posed the questions three times in a row to two so-called “heavy-thinking” models. These models solved an additional 14 problems, leaving only two completely unsolved in the end.
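The arithmetic behind the three stages can be checked with a short tally. This is only a sketch: the unsolved counts (41, 16, 2) and the total of 100 come from the text above, while the stage labels and the helper name `stage_summary` are illustrative.

```python
TOTAL_QUESTIONS = 100

# Unsolved counts after each stage, as reported in the article.
unsolved_after = [
    ("single run, five models", 41),
    ("20 further runs, top three models", 16),
    ("three runs, two heavy-thinking models", 2),
]


def stage_summary(total, stages):
    """Return (stage, newly_solved, cumulative_solved) for each stage."""
    rows = []
    previous_unsolved = total
    for name, unsolved in stages:
        newly_solved = previous_unsolved - unsolved
        rows.append((name, newly_solved, total - unsolved))
        previous_unsolved = unsolved
    return rows


summary = stage_summary(TOTAL_QUESTIONS, unsolved_after)
for name, newly, cumulative in summary:
    print(f"{name}: +{newly} solved, {cumulative}/{TOTAL_QUESTIONS} in total")
```

The last stage yields 14 newly solved questions, matching the figure quoted for the heavy-thinking models.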

The models tested were GPT-5.5, Gemini 3.1 Pro, Claude Opus 4.7, DeepSeek-V4-Pro, Grok 4.3, GPT-5.5 Pro (Extended Thinking), and Gemini 3.1 Pro Deep Think.

Original publication

Christian Stump et al.: Benchmarks in Leipzig, published online on arxiv.org, 2026, DOI: 10.48550/arXiv.2606.05818

Press contact

Prof. Dr. Christian Stump
Algebraic Combinatorics
Faculty of Mathematics
Ruhr University Bochum
Germany
Email: christian.stump@ruhr-uni-bochum.de


Published

Tuesday
09 June 2026
2:40 pm

By

Meike Drießen (md)

Translated by

Machine translation
