>Math Word Problem (MWP) Benchmarks. Certain benchmarks inherently possess riche...

otabdeveloper4 · 2026-02-21T13:28:45 1771680525

> We have models that are doing better than humans at IMO.

Not really. From my brief experience they can guess the final answer, but the intermediate justifications and proofs are completely hallucinated bullshit.

(Possibly because the final answer is usually some sort of neat and beautiful answer and human evaluators don't care about the final answer anyways, in any olympiad you're graded on the soundness of your reasoning.)

simianwords · 2026-02-21T13:39:04 1771681144

what's the best way to falsify it?

tveita · 2026-02-21T15:42:59 1771688579

You could start by reading research on the topic instead of disregarding expert opinion based on your own gut feeling.

E.g. https://www.anthropic.com/research/tracing-thoughts-language...

simianwords · 2026-02-21T16:26:53 1771691213

It’s specific to Claude.

otabdeveloper4 · 2026-02-22T09:46:50 1771753610

Falsify what? The claim that LLMs are good for olympiad problems?

I'm just an end user who tried to use these "frontier models" to actually solve real olympiad problems. They're useless.