Christian Stump and colleagues posed mathematical practice problems to current large language models.
Mathematics
Large Language Models Solve High-Level Math Problems
In a benchmark study, 49 international researchers put artificial intelligence to the test with 100 problems. Only a few remained unsolved.
A group of 49 international mathematicians put current large language models to the test with 100 questions: Which high-level math problems can they solve? Which ones can’t they solve yet? “We were impressed by the results,” reports Professor Christian Stump of Ruhr University Bochum, Germany, the initiative’s organizer. “Only two problems remained unsolved. This shows that artificial intelligence’s mathematical problem-solving abilities have improved significantly.”
The researchers gathered for a three-day workshop at the Max Planck Institute for Mathematics in the Sciences in Leipzig. There, using the ScienceBench platform, they compiled a benchmark consisting of 100 mathematical questions. These questions were at least as complex as those found in doctoral dissertations. The answers had to be unambiguous and known to the authors, but could not have been explicitly published in any academic journals.
In a first round, they posed each question once to five current large language models (LLMs). Afterward, 41 questions remained unsolved. They then presented the same questions to the top three models from the first round 20 more times. “There is significant variation in the answers a model provides to the exact same question across different rounds,” explains Christian Stump. “After 20 rounds, we already see significantly more solved questions than in a single round. Only 16 unsolved questions remained.”
Finally, they posed the questions three times in a row to two so-called “heavy-thinking” models. These models solved an additional 14 questions, leaving only two completely unsolved in the end.
The models tested were GPT-5.5, Gemini 3.1 Pro, Claude Opus 4.7, DeepSeek-V4-Pro, Grok 4.3, GPT-5.5 Pro (Extended Thinking), and Gemini 3.1 Pro Deep Think.
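The stage-by-stage figures reported above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the study itself: the stage labels and the function name `stage_summary` are illustrative, and it assumes the counts are cumulative, meaning a question solved at an earlier stage stays solved.

```python
# Unsolved counts reported in the article after each stage (out of 100 questions).
TOTAL_QUESTIONS = 100
unsolved_after = {
    "single pass, five models": 41,
    "20 repeats, top three models": 16,
    "3 repeats, two heavy-thinking models": 2,
}


def stage_summary(total, unsolved_by_stage):
    """Return (stage, cumulative solved, newly solved) for each stage in order.

    Assumes cumulative counting: once solved, a question stays solved.
    """
    rows = []
    previous_unsolved = total
    for stage, unsolved in unsolved_by_stage.items():
        rows.append((stage, total - unsolved, previous_unsolved - unsolved))
        previous_unsolved = unsolved
    return rows


summary = stage_summary(TOTAL_QUESTIONS, unsolved_after)
for stage, solved, gained in summary:
    print(f"{stage}: {solved} solved in total, {gained} newly solved")
```

Under that assumption, the repeated runs account for 25 of the solved questions and the heavy-thinking models for the 14 mentioned in the text.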