>Basic Arithmetic. Another fundamental failure is that LLMs quickly fail in arithmetic as operands
increase (Yuan et al., 2023; Testolin, 2024), especially in multiplication. Research shows models rely on
superficial pattern-matching rather than arithmetic algorithms, thus struggling notably in middle-digits
(Deng et al., 2024). Surprisingly, LLMs fail at simpler tasks (determining the last digit) but succeed in harder
ones (first digit identification) (Gambardella et al., 2024). Those fundamental inconsistencies lead to failures
for practical tasks like temporal reasoning (Su et al., 2024).
This is very misleading and I think flat out wrong. What's the best way to falsify this claim?
I gave it really hard 20-digit multiplications without tools. If you look at the reasoning trace, it does what you'd normally expect and gets it right. I think this is enough to suggest that the claims made in the paper are not valid and that LLMs do reason well.
To anyone who would disagree, can you provide a counter example that can't be solved using GPT 5 pro but that a normal student could do without mistakes?
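One way to make a test like this reproducible is to generate the operands randomly and score the answer against exact integer arithmetic. The following is only a minimal sketch, assuming Python; `make_problem` and `digit_errors` are hypothetical helper names (not from the paper or any library), and the model's answer is assumed to be pasted in by hand as a string.

```python
import random

def make_problem(n_digits, rng):
    """Return two random integers with exactly n_digits digits each."""
    lo, hi = 10 ** (n_digits - 1), 10 ** n_digits - 1
    return rng.randint(lo, hi), rng.randint(lo, hi)

def digit_errors(claimed, a, b):
    """Compare a claimed product (as a string) with the exact product.

    Returns the 0-based positions, counted from the most significant digit,
    where the two differ. A length mismatch marks every position as wrong.
    """
    truth = str(a * b)
    claimed = claimed.replace(",", "").strip()
    if len(claimed) != len(truth):
        return list(range(len(truth)))
    return [i for i, (c, t) in enumerate(zip(claimed, truth)) if c != t]

rng = random.Random(0)  # fixed seed so the same problems can be re-run
a, b = make_problem(20, rng)
print(f"{a} * {b} = ?")
```

An empty list from `digit_errors` means the answer is exact; a non-empty list shows whether the mistakes sit in the first, middle, or last digits, which is the pattern the quoted passage attributes to Deng et al. and Gambardella et al. Note that this only scores correctness; it says nothing about how the answer was produced.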
I see that your prompt includes 'Do not use any tools. If you do, write "I USED A TOOL"'
This is not a valid experiment, because GPT models always have access to certain tools and will use them even if you tell them not to. They will fib the chain of thought after the fact to make it look like they didn't use a tool.
This isn't an experiment a consumer of the models can actually run. If you have a chance to read the article I linked, it is difficult even for the model maintainers (OpenAI, Anthropic, etc.) to look into the model and see what it actually used in its reasoning process. The models will purposefully hide information about how they reasoned. And they will ignore instructions without telling you.
The problem really isn't that LLMs can't get math/arithmetic right sometimes. They certainly can. The problem is that there's a very high probability that they will get the math wrong. Python or similar tools were the answer to the inconsistency.
"I should explain that both the “python” and “python_user_visible” tools execute Python code and are stateful. The “python” tool is for internal calculations and won’t show outputs to the user, while “python_user_visible” is meant for code that users can see, like file generation and plots."
But really the most important thing is that we as end-users cannot with any certainty know if the model used python, or didn't. That's what the alignment faking article describes.
> To avoid timeouts, try using background mode. As our most advanced reasoning model, GPT-5 pro defaults to (and only supports) reasoning.effort: high. GPT-5 pro does not support code interpreter.
Going by the link you shared, you are wrong. It was about ChatGPT, not the API.
The documentation makes it unambiguously clear that GPT-5 pro does not support code interpreter. Unless you think they secretly run it which is a conspiracy, is it enough to falsify?
> Unless you think they secretly run it which is a conspiracy
tbh this doesn't sound like a conspiracy to me at all. There's no reason why they couldn't have an internal subsystem in their product which detects math problems and hands off the token generation to an intermediate, more optimized Rust program or something, which does math on the cheap instead of burning massive amounts of GPU resources. This would just be a basic cost optimization that would make their models both more effective and cheaper. And there's no reason why they would need to document this in their API docs, because they don't document any other internal details of the model.
I'm not saying they actually do this, but I think it's totally reasonable to think that they would, and it would not surprise me at all if they did.
Let's not get hung up on the "conspiracy" thing though - the whole point is that these models are closed source and therefore we don't know what we are actually testing when we run these "experiments". It could be a pure LLM or it could be a hybrid LLM + classical reasoning system. We don't know.
“Code interpreter” is a product feature the customer can use that isn’t being discussed.
They can obviously support it internally, and the feature exists for ChatGPT, but they’re choosing not to expose that combo in the API yet because of product rollout constraints.
Alright let's say I'm wrong about the details/nuances. That's still really not the point.
The point is this:
> we as end-users cannot with any certainty know if the model used python, or didn't
These tools can and do operate in ways opposite to their specific instructions all the time. I've had models make edits to files when I wasn't in agent mode (just chat mode). Chat mode is supposedly a sandboxed environment. So how does that happen? And I am sure we've all seen models plainly disregard an instruction for one reason or another.
The models, like any other software tool, have undocumented features.
You as an end-user cannot falsify the use of a python tool regardless of what the API docs say.
I know what falsifiable means--you're misusing it and I simply adopted your misuse. A claim is falsifiable or not ... it can't be made falsifiable. The way you're using it is "Can we come up with a test to show that it's false"--no, we can't, because it's not false.
Again, there's nothing that one can do to prove that something that isn't false is false. Sheesh. I won't respond to you again as there's no need to simply repeat it.
Please don't cross into posting like this, no matter how wrong someone else is or you feel they are. It's not what this site is for, and destroys what it is for.
It's a well-known fact that LLMs struggle with basic arithmetic on large numbers; that's not what they are made for. Most chatbots will just call a Python interpreter in the background.
How do you know GPT-5 does not call a Python interpreter remotely on OpenAI servers when you ask it to do arithmetic? Your prompt goes to their servers, and you have no way to know what happens there.
The only way to be sure a model calls no tool is to run it locally and control the network.
I chose Mistral because it's a large SOTA model that allows me to disable tool calling.
If you try using ChatGPT, you'll see it goes into reasoning mode (It flashes the text "Working") and takes a while when asked to do arithmetic. Chatbots that don't forbid the user from viewing the reasoning chain of thought, such as Deepseek or Mistral, reveal that these chatbots work exactly as I described:
====
Thought for 1s

Thinking: Okay, the user wants to multiply two large numbers: 63,157,997,633 and 6,311,490,009. Since these numbers are quite large, I should use the code interpreter to handle this calculation accurately. Let's write a simple piece of code to compute this product.

Code (python):
# Calculating the product of the two numbers
number1 = 63157997633
number2 = 6311490009
product = number1 * number2
product

Result:
398621071049125148697

Thinking: Great, the code interpreter has given us the product of the two numbers. Now, let's present the result to the user.
====
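For reference, this is roughly what "doing the arithmetic" means when no interpreter is available: the schoolbook algorithm, one digit product and one carry at a time. It is a minimal sketch in Python for illustration only; `schoolbook_multiply` is a hypothetical name, and nothing here claims to describe what any model does internally.

```python
def schoolbook_multiply(x, y):
    """Multiply two non-negative integers digit by digit, as done on paper."""
    xs = [int(d) for d in reversed(str(x))]  # least significant digit first
    ys = [int(d) for d in reversed(str(y))]
    acc = [0] * (len(xs) + len(ys))          # a product has at most len(xs)+len(ys) digits
    for i, dx in enumerate(xs):
        carry = 0
        for j, dy in enumerate(ys):
            total = acc[i + j] + dx * dy + carry
            acc[i + j] = total % 10          # digit that stays in this column
            carry = total // 10              # rest moves to the next column
        acc[i + len(ys)] += carry            # final carry of this row
    digits = "".join(str(d) for d in reversed(acc)).lstrip("0")
    return int(digits or "0")

# Same operands as in the transcript above.
print(schoolbook_multiply(63157997633, 6311490009))
```

For these two operands the inner loop runs 11 × 10 = 110 times, and every step depends on the carry from the previous one, which is why a single slip in a middle column corrupts the digits to its left.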
Falsifying my claim is simple: Find a model that verifiably does basic arithmetic correctly without relying on any external tools.
I run Qwen3-32b locally without any tools (just llama.cpp) and it can do basic arithmetic for smaller numbers (like 134566), but I didn't check it for much larger numbers. I'm not at the PC right now, but trying to do it via OpenRouter on much larger numbers overflows the context and it stops without giving a result :)
I think the point of the line of questioning is to illustrate that "tools" like a code interpreter act as scratch space for models to do work in, because the reasoning/thinking process has limitations much like our own.
>Math Word Problem (MWP) Benchmarks. Certain benchmarks inherently possess richer logical
structures that facilitate targeted perturbations. MWPs exemplify this, as their logic can be readily abstracted
into reusable templates. Researchers use this property to generate variants by sampling numeric values
(Gulati et al., 2024; Qian et al., 2024; Li et al., 2024b) or substituting irrelevant entities (Shi et al., 2023;
Mirzadeh et al., 2024). Structural transformations – such as exchanging known and unknown components
(Deb et al., 2024; Guo et al., 2024a) or applying small alterations that change the logic needed to solve
problems (Huang et al., 2025b) – further highlight deeper robustness limitations.
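To make the quoted idea concrete, here is a minimal sketch of template-based perturbation: the logic of the problem is fixed, while the numbers and the irrelevant entity are resampled. The template text, the names, and `make_variant` are all made up for illustration and are not taken from any of the cited benchmarks.

```python
import random

# Hypothetical template: the logic (a + b) is fixed, the surface details vary.
TEMPLATE = "{name} has {a} apples and buys {b} more. How many apples does {name} have now?"

def make_variant(rng):
    """Sample one (question, answer) pair from the template."""
    name = rng.choice(["Ava", "Ben", "Chen"])      # irrelevant entity substitution
    a, b = rng.randint(2, 99), rng.randint(2, 99)  # numeric value sampling
    return TEMPLATE.format(name=name, a=a, b=b), a + b

rng = random.Random(0)  # fixed seed so the variants are reproducible
for _ in range(3):
    question, answer = make_variant(rng)
    print(question, "->", answer)
```

A model that has learned the underlying logic should score the same on every variant; a drop in accuracy across variants is the kind of robustness gap the passage describes.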
I'm willing to bet this is no longer true as well. We have models that are doing better than humans at IMO.
> We have models that are doing better than humans at IMO.
Not really. From my brief experience they can guess the final answer, but the intermediate justifications and proofs are completely hallucinated bullshit.
(Possibly because the final answer is usually some sort of neat and beautiful answer, and human evaluators don't care about the final answer anyway; in any olympiad you're graded on the soundness of your reasoning.)
Just look at the dates of the cited articles. 2023, 2024: that's prehistory, before thinking models anyway. It's like concluding that humans don't understand arithmetic because they can't multiply large numbers at sight.
I'm not sure what the paper is really about, despite the enthusiasm of the LLM haters here. Certainly there isn't something called "LLMs" that stayed reasonably the same over the last 4 years: GPT-2 is an LLM, but a finding on it most likely doesn't apply to Opus 4.6. You can't document a failure on a 2024 model and claim "LLMs can't do this".
Edit: I tried falsifying it.
https://chatgpt.com/share/6999b72a-3a18-800b-856a-0d5da45b94...
https://chatgpt.com/share/6999b755-62f4-800b-912e-d015f9afc8...