
>Math Word Problem (MWP) Benchmarks. Certain benchmarks inherently possess richer logical structures that facilitate targeted perturbations. MWPs exemplify this, as their logic can be readily abstracted into reusable templates. Researchers use this property to generate variants by sampling numeric values (Gulati et al., 2024; Qian et al., 2024; Li et al., 2024b) or substituting irrelevant entities (Shi et al., 2023; Mirzadeh et al., 2024). Structural transformations – such as exchanging known and unknown components (Deb et al., 2024; Guo et al., 2024a) or applying small alterations that change the logic needed to solve problems (Huang et al., 2025b) – further highlight deeper robustness limitations.

I'm willing to bet this is no longer true as well. We have models that are doing better than humans at IMO.
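For anyone who wants to test that bet, the template-based perturbation described in the quoted passage can be sketched roughly as below. This is only an illustration under my own assumptions: the template text, the names, the value ranges, and the helpers `make_variant` and `make_variants` are all hypothetical and not taken from any of the cited benchmarks.

```python
import random

# Hypothetical template; wording and names are illustrative only,
# not drawn from any of the cited benchmarks.
TEMPLATE = (
    "{name} has {a} apples and buys {b} more. "
    "How many apples does {name} have now?"
)
NAMES = ["Ava", "Noor", "Kenji"]


def make_variant(rng):
    """Sample numeric values and an irrelevant entity (the name),
    then return the question text with its ground-truth answer."""
    a = rng.randint(2, 99)
    b = rng.randint(2, 99)
    name = rng.choice(NAMES)
    question = TEMPLATE.format(name=name, a=a, b=b)
    return question, a + b


def make_variants(n, seed=0):
    """Generate n variants of the same underlying problem logic."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    return [make_variant(rng) for _ in range(n)]


if __name__ == "__main__":
    for question, answer in make_variants(3):
        print(question, "->", answer)
```

The idea would be to compare a model's accuracy on the original problem against its accuracy across many such variants; a large gap would suggest the kind of robustness limitation the quote describes.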



> We have models that are doing better than humans at IMO.

Not really. From my brief experience they can guess the final answer, but the intermediate justifications and proofs are completely hallucinated bullshit.

(Possibly because the final answer is usually some sort of neat and beautiful answer, and human evaluators don't care about the final answer anyways; in any olympiad you're graded on the soundness of your reasoning.)


what's the best way to falsify it?


You could start by reading research on the topic instead of disregarding expert opinion based on your own gut feeling.

E.g. https://www.anthropic.com/research/tracing-thoughts-language...


It’s specific to Claude.


Falsify what? The claim that LLMs are good at olympiad problems?

I'm just an end user who tried to use these "frontier models" to actually solve real olympiad problems. They're useless.



