I asked it when apple announced 64 bit support for the iPhone, and I'll ask it a...

terhechte · on Aug 12, 2014

It's not only about memory:

"In short, the improvements to Apple's runtime make it so that object allocation in 64-bit mode costs only 40-50% of what it does in 32-bit mode. If your app creates and destroys a lot of objects, that's a big deal." [1]

One example: in 64-bit mode, NSNumber and (short) NSString objects on iOS can be stored entirely inside the pointer itself (tagged pointers), without having to create anything on the heap. That's possible because a 64-bit pointer is large enough to contain the required information. Creating and accessing one of these objects becomes far faster. Based on what I gathered at WWDC this year, Apple is also inclined to move as many other kinds of objects there as possible (as long as they can be stored in a 64-bit pointer).

[1] https://www.mikeash.com/pyblog/friday-qa-2013-09-27-arm64-an...
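To make the tagged pointer idea above concrete, here is a minimal sketch of packing a small integer into a 64-bit word. The bit layout (tag flag in bit 0, a 3-bit type id, a 60-bit payload) and all names are assumptions made for illustration; this is not Apple's actual encoding.

```python
# Hypothetical layout, NOT Apple's real one:
#   bit 0      tag flag (1 = value lives in the word, 0 = real heap pointer)
#   bits 1-3   type id
#   bits 4-63  signed 60-bit payload
TAG_BIT = 0x1
TYPE_SHIFT = 1
TYPE_MASK = 0x7
PAYLOAD_SHIFT = 4
PAYLOAD_BITS = 60
WORD_MASK = (1 << 64) - 1

TYPE_SMALL_INT = 3  # arbitrary id chosen for this sketch


def make_tagged_int(value):
    """Pack a small signed integer into a 64-bit 'pointer' word."""
    lowest = -(1 << (PAYLOAD_BITS - 1))
    highest = (1 << (PAYLOAD_BITS - 1)) - 1
    if not lowest <= value <= highest:
        raise OverflowError("value does not fit; a heap object would be needed")
    payload = value & ((1 << PAYLOAD_BITS) - 1)  # two's complement encoding
    word = (payload << PAYLOAD_SHIFT) | (TYPE_SMALL_INT << TYPE_SHIFT) | TAG_BIT
    return word & WORD_MASK


def is_tagged(word):
    """Heap pointers are aligned, so their lowest bit is always 0."""
    return bool(word & TAG_BIT)


def tagged_int_value(word):
    """Recover the integer from the word alone, with no memory access."""
    assert is_tagged(word)
    assert (word >> TYPE_SHIFT) & TYPE_MASK == TYPE_SMALL_INT
    payload = word >> PAYLOAD_SHIFT
    if payload >= 1 << (PAYLOAD_BITS - 1):  # sign-extend negative values
        payload -= 1 << PAYLOAD_BITS
    return payload
```

The point of the sketch is that "creating" such an object is a few shifts and ORs, and reading it back never dereferences anything, which is why the approach avoids the allocator entirely.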

DCKing · on Aug 12, 2014

It is worth noting that this is likely to be an iOS-on-AArch64 specific tweak. Apple doesn't do this on x64 OS X (right?).

Furthermore, nothing has been said (AFAIK) about similar tricks Google is doing with Android on AArch64 or 64 bit architectures in general. From what I've heard they just made the Android runtime capable of emitting AArch64 (and x64, MIPS64) instructions instead of 32-bit ARMv7 ones.

That alone gives a sizable performance boost, as AArch64 instructions are supposedly quite a bit faster than the old 32-bit ones. It remains to be seen, though, whether Google can make as much use of clever allocation tricks as Apple can: Android is much more platform agnostic and uses a garbage collector for some of these tasks.

pjmlp · on Aug 12, 2014

Google is already doing it in ART.

ART also brings a new GC, which is quite improved compared to the fossilized Dalvik version that has barely changed since Android 2.3.

pohl · on Aug 12, 2014

There is nothing iOS-specific about it. Mac OS X has been benefiting from this optimization since 10.7.

DCKing · on Aug 12, 2014

I stand corrected. So it's a feature of Apple's runtime on all 64 bit systems.

frankchn · on Aug 12, 2014

64-bit support on ARM entails the implementation of AArch64, not just widening the current registers to 64 bits and nothing else.

AArch64 has a better instruction set and more general purpose registers (31 vs 16), so applications recompiled to target AArch64 should see performance benefits even if they only use 1 MB of RAM and never encounter integers greater than 2^32.

rayiner · on Aug 12, 2014

You can't just release a 64 bit chip and expect a full stack of 64-bit clean software to be available on day 1. Indeed, last time I checked Dalvik wasn't fully 64-bit clean. Releasing earlier than you need it ensures that by the time you do need it, the software is ready to go.

Also, besides the correct points made in sister comments, there is the issue that virtual address space is used for things other than simply mapping physical memory. There was a time, for example, when Linux mapped all of physical memory into the kernel's virtual address space. This was simple. These days, on 32-bit systems, it splits the 4GB address space into 3GB for the user and 1GB for the kernel, and maps physical memory in and out of a 128MB window in kernel space, as needed. This is obviously more complicated.
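The numbers in the comment above can be worked through with a little arithmetic. This is a simplified sketch: it assumes the kernel keeps everything outside the 128MB window permanently mapped and ignores memory-mapped IO and other kernel uses of address space. The function names are hypothetical.

```python
GiB = 1 << 30
MiB = 1 << 20

ADDRESS_SPACE = 1 << 32                      # 4 GiB of 32-bit virtual addresses
USER_SPACE = 3 * GiB                         # the 3GB user portion of the split
KERNEL_SPACE = ADDRESS_SPACE - USER_SPACE    # the remaining 1 GiB
TEMP_WINDOW = 128 * MiB                      # window used to map memory in and out


def directly_mappable(physical_ram):
    """Bytes of RAM the kernel can keep mapped at all times (simplified)."""
    return min(physical_ram, KERNEL_SPACE - TEMP_WINDOW)


def needs_window(physical_ram):
    """Bytes of RAM reachable only by mapping them into the window on demand."""
    return physical_ram - directly_mappable(physical_ram)
```

Under these assumptions a machine with 512 MiB is fully mapped, while one with 1 GiB already has 128 MiB that must go through the window, and one with 4 GiB has most of its memory in that situation.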

antimagic · on Aug 12, 2014

Ugh, I started to write a response, but then I realised I was more or less just going over the same ground that mikeash has more eloquently covered. https://mikeash.com/pyblog/friday-qa-2013-09-27-arm64-and-yo...

One thing that Mike missed though is that you can do large bandwidth operations faster. You can blend two pixels at a time with 64-bit registers, so your graphical blend can be twice as fast (of course, if you're really worried about the speed of these types of operation, you go SIMD, but that's a pain in the butt and requires inline assembly language to get the biggest boosts. In particular, ARM Neon's C function wrappers on SIMD instructions generally gives you a much smaller boost than if you hand code the assembly language calls. With 64-bit, you can get some of the advantage without the hassle).
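A rough sketch of the "two pixels at a time" trick: treat a 64-bit word as eight packed 8-bit channels (two RGBA pixels) and average all of them with a handful of integer operations. This only shows a 50/50 blend, not general alpha blending, and the helper names are made up for the example. Python integers stand in for 64-bit registers here.

```python
# Mask with the lowest bit of every byte cleared, so that the shift below
# cannot leak a bit from one channel into its neighbour.
LOW_BITS_CLEARED = 0xFEFEFEFEFEFEFEFE


def average_packed(a, b):
    """Per-byte floor average of two 64-bit words (two RGBA pixels each).

    Uses the identity a + b == 2 * (a & b) + (a ^ b), applied to all eight
    bytes at once.
    """
    return (a & b) + (((a ^ b) & LOW_BITS_CLEARED) >> 1)


def pack(channels):
    """Pack eight 0-255 channel values into one 64-bit word."""
    word = 0
    for i, c in enumerate(channels):
        word |= c << (8 * i)
    return word


def unpack(word):
    """Split a 64-bit word back into its eight channel values."""
    return [(word >> (8 * i)) & 0xFF for i in range(8)]
```

With 32-bit registers the same code handles four channels per operation, so the 64-bit version does the same work in half as many steps.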

exDM69 · on Aug 12, 2014

> In particular, ARM Neon's C function wrappers on SIMD instructions generally gives you a much smaller boost than if you hand code the assembly language calls.

My practical experience using intrinsics is quite the opposite. Using C + NEON or SSE intrinsics yielded a lot better performance than my hand-written assembly could.

I've heard similar stories to yours from others but they were from years ago. Later versions of compilers seem to be a lot better at this.

In particular, the C compilers (GCC, Clang) were able to further optimize code written using intrinsics, where using assembler code practically inhibits all compiler optimizations.

While I am able to do pretty good instruction selection by hand (which the compiler wasn't very good at), what blew me away was the compiler's ability to do instruction scheduling and register allocation. I could not have matched that without spending a huge amount of time reading architecture specific optimization manuals.

So I got pretty neat, readable and portable code (x86+SSE and ARM+NEON) using a single code base where I had only written some "primitive" functions using low level intrinsics. I could write nice high level code and the compiler would inline my primitive function calls and then further optimize and re-organize the instructions.

antimagic · on Aug 13, 2014

Um, I don't think my GCC was that old, it was for an Android tablet from about two years ago - GCC 4.3 or 4.4 if memory serves me correctly. But yes, it was definitely a failure to optimise that was causing the problem. To be honest, I generated my "hand-written" assembly by taking the output of gcc compiling intrinsic-based code, and then removing all of the dumb, unnecessary shuffles to main memory by making better use of the registers... That said, there were clearly bugs in the compiler - you could crash GCC by doing certain innocuous things in your assembly code...

fulafel · on Aug 12, 2014

Your virtual address space gets cramped before your physical RAM hits your addressable VA limit. See the pain points in 32-bit desktop operating systems trying to run with 4GB of RAM.

brigade · on Aug 12, 2014

Linus has said that the pain point actually starts around 1GB, especially when using a 3/1 userspace/kernel address split. At 1GB, you can't map the entirety of physical ram into kernel address space simultaneously, since you also have memory-mapped IO taking up space. Let alone mapping the same physical memory to multiple virtual addresses with different cache attributes, which is useful because approximately none of these SoCs have full cache coherence across all hardware blocks.

qwerta · on Aug 12, 2014

We have an embedded system with 128MB that runs into the 32-bit addressing limit as well. You cannot mmap files...

ris · on Aug 12, 2014

Being able to mmap large files, even if you don't have the memory to hold them completely, is a great advantage. Address space isn't just about physical memory.
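For readers who have not used it, here is a small sketch of what mmap buys you: the file is mapped into the address space and only the pages you touch are read. It uses Python's standard `mmap` module; the helper names and the test file are invented for the example. On a 32-bit process the mapping itself would fail once the file no longer fits in the free address space, which is the limit discussed above.

```python
import mmap
import os
import tempfile


def read_slice_via_mmap(path, offset, length):
    """Return `length` bytes at `offset` without read()-ing the whole file."""
    with open(path, "rb") as f:
        # Length 0 maps the entire file; the OS pages data in on demand.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            return view[offset:offset + length]


def make_test_file(size):
    """Create a temporary file of `size` bytes ending in b'marker'."""
    fd, path = tempfile.mkstemp()
    with os.fdopen(fd, "wb") as f:
        f.seek(size - 6)
        f.write(b"marker")
    return path
```

Usage is just `read_slice_via_mmap(path, offset, 6)`; the caller never decides how much of the file to buffer.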

exDM69 · on Aug 12, 2014

This! I hope that 32 bit processors will soon be history and we will be able to use mmap (+ madvise, fadvise, etc) instead of read/write without having to worry about portability.

stock_toaster · on Aug 12, 2014

> I asked it when apple announced 64 bit support for the iPhone, and I'll ask it again now: what is the point in 64 bit cpus in devices that will be deprecated and replaced before we get around to having more than 32 bits worth of addressable ram in these devices?

I imagine the additional cpu registers (general purpose and floating point) are handy for crypto and encoder/decoder stuff.

pjmlp · on Aug 12, 2014

Yes, you can make use of extra bits for metadata like type tagging for treating objects as value types and GC information.

Quite useful when moving away from C into languages like Objective-C, Swift, Java, C#, ..., which are the ones that get first-class treatment in the vendor SDKs.

sliken · on Aug 12, 2014

Double the performance, not because of 64-bit. For instance the dual core 64-bit apple chip does quite well in benchmarks and application performance against numerous 32 bit android devices with 4 cores.

personZ · on Aug 12, 2014

Cortex-A15 designs and up already offer 36 to 40 bits of addressable memory, or from 64GB to 1TB. This is at the OS level, of course, so individual 32-bit processes would still be limited to 2-3GB without intentional PAE, but that's hardly a limit in a mobile design.

But ultimately it isn't about memory in the near-term. ARMv8 offers more, larger registers, instructions, a higher hardware baseline, and in the future higher memory.
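The 64GB and 1TB figures quoted above follow directly from the address widths, as this short calculation shows. The function name is just for illustration.

```python
GiB = 1 << 30


def addressable_bytes(address_bits):
    """Number of distinct byte addresses for a given address width."""
    return 1 << address_bits


# Address widths mentioned in the thread, expressed in GiB.
sizes_in_gib = {bits: addressable_bytes(bits) // GiB for bits in (32, 36, 40)}
```

Here `sizes_in_gib` comes out as 4, 64 and 1024 GiB for 32, 36 and 40 bits respectively.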