> Math Word Problem (MWP) Benchmarks. Certain benchmarks inherently possess richer logical
structures that facilitate targeted perturbations. MWPs exemplify this, as their logic can be readily abstracted
into reusable templates. Researchers use this property to generate variants by sampling numeric values
(Gulati et al., 2024; Qian et al., 2024; Li et al., 2024b) or substituting irrelevant entities (Shi et al., 2023;
Mirzadeh et al., 2024). Structural transformations – such as exchanging known and unknown components
(Deb et al., 2024; Guo et al., 2024a) or applying small alterations that change the logic needed to solve
problems (Huang et al., 2025b) – further highlight deeper robustness limitations.
I'm willing to bet this is no longer true as well. We have models that are doing better than humans at IMO.
> We have models that are doing better than humans at IMO.
Not really. From my brief experience they can guess the final answer, but the intermediate justifications and proofs are completely hallucinated bullshit.
(Possibly because the final answer is usually some sort of neat and beautiful answer, and human evaluators don't care about the final answer anyway; in any olympiad you're graded on the soundness of your reasoning.)