Has anyone done a study on the optimal number of registers to have? The website ...

Scaevolus · on Oct 11, 2013

Yes, it's been studied. You rapidly run into diminishing returns.

http://arxiv.org/ftp/arxiv/papers/1205/1205.1871.pdf

Here's a good thread discussing this: https://groups.google.com/forum/#!searchin/comp.arch/number$...

joosters · on Oct 11, 2013

Awesome! Thank you for the link.

brigade · on Oct 11, 2013

The studies would vary over time because CPU design and bottle necks have changed. Early designs were of course limited by transistor count, now we have OoOe and physical registers are limited by muxers and latency (see the presentations by the mill CPU guy [1]

Saving registers in functions is mostly irrelevant - you only save what you'd use, so saving more means fewer spills within the function.

Saving on context switches (interrupts alone aren't a big deal) was indeed a problem back when AltiVec was designed, thus it has a special register to keep track of which registers need to be saved. In modern designs this is less of a problem, between higher frequencies, multiple cores, and the other effects of a context switch dominating (effective flush of l1 cache and predictors).

The interesting bits nowadays are that load/store is expensive power-wise, which was what ARM identified as the major motivation behind having 32 registers (fewer spills in functions) and OoOe designs.

[1] http://m.youtube.com/watch?v=QGw-cy0ylCc&desktop_uri=%2Fwatc...

joosters · on Oct 11, 2013

Saving registers in functions is mostly irrelevant - you only save what you'd use, so saving more means fewer spills within the function.

Ah, but I'm sure that if you have more registers available, you'd use more registers. Up to a certain point. But what point? Just how many registers?

brigade · on Oct 11, 2013

No one uses more registers just to use more registers - in OoOe designs the main reason to use more registers is to reduce spilling and reloading. So in effect a compiler isn't going to use a register it has to save, unless in doing so it saves a spill+reload, which would result in the same number of load/store as without the additional register.

In-order designs have more reasons to use more registers, but again they aren't going to use more registers unless they gain something.

masklinn · on Oct 11, 2013

> The website answers the register question well, but leads to a further question: If registers are so great, why stick with just 16/32/64/n registers?

TFA gives at least one reason:

> Registers use an expensive and power-hungry active design. They're continuously powered, and when reading them, this means that their value can be read quickly. Reading a register bit is a matter of activating the right transistor and then waiting a short time for the powerful register hardware to push the read line to the appropriate state.

Registers use up a lot of silicon, and consume a lot of energy to power it. They also need to stay physically close to computing circuits, otherwise you end up with an L1 cache more than a register.

Furthermore, although ISA expose a number of registers A, OOO architectures (and their friends parallel and speculative executions) pretty much require the CPU to have > A registers and do register renaming, which lowers the number of registers the ISA can define. For instance the Alpha ISA defines 32 integer registers, but the Alpha 21264 had 80 physical integer registers.

joosters · on Oct 11, 2013

That's definitely another factor. Again though, I doubt it's the limiting one. No-one (as far as I know) has produced a power-hungry CPU with (say) 5000 registers on it.

terrymah · on Oct 11, 2013

I've heard that modern Intel processors have 100 < x < 200 physical registers. I'm not sure they actually document the exact number.

hornetblack · on Oct 12, 2013

Itanium has at least 256. (128 Integer + 128 Float + 128 predicate (1 bit), which are essentially flip-flops.)

solarexplorer · on Oct 11, 2013

Register windows are a way to put 1000 registers in a CPU. See the SPARC and Itanium instruction sets for how this can be done. There are also plenty of studies about both.

Vector registers are another way to use 1000 registers.

But directly coding 1000 registers into each instruction does not seem to be such a good idea. You might as well use a 1st level cache. The difference between the cache and the register file ist mostly how the instruction set architecture references it. Registers are usually easier to acces because each one has a single name and the CPU can detect dependencies and conflicts easily. Memory accesses and caches are more complex because you need to calculate the addresses before you can detect dependencies/conflicts.

PD: Yet another way to use 1000 registers is massive multi-threading like the Tera MTA.

terrymah · on Oct 11, 2013

It's complicated, but modern processors actually do have many more registers than you can name in the instructions. They use things like "register renaming" to avoid false conflicts between instructions.

Registers that you name in assembly != physical registers. And when you use a register in two different instructions, you won't necessarily get the same physical register each time.

edderly · on Oct 11, 2013

I thought this was an interesting insight in to that: http://ootbcomp.com/docs/belt/index.html

vilda · on Oct 11, 2013

Note that the actual number of registers is considerably different than the number of registers you can access through instruction set. They are used via register renaming and optimizations of complex instructions.

joosters · on Oct 11, 2013

Yes. As other commentors have said, if you are doing out-of-order execution well, the CPU will have many more 'hidden' registers and do register renaming to use them. But this has an interesting interaction with compilers.

Say you have a simple function that is going to add 1 to a bunch of variables. In an ARM-like assembly code, this could be written as:

  LDR r1, [r0, #0]
  ADD r1, r1, #1
  STR r1, [r0, #0]
  LDR r1, [r0, #4]
  ADD r1, r1, #1
  STR r1, [r0, #4]
  LDR r1, [r0, #8]
  ADD r1, r1, #1
  STR r1, [r0, #8]

Now, if your CPU can do OoOE, it can spot that register r1 is used for three independent loads, adds and stores, and can internally use three different registers for them, allowing the operations to be done in parallel. But, equally, the compiler could have written the code as:

  LDR r1, [r0, #0]
  ADD r1, r1, #1
  STR r1, [r0, #0]
  LDR r2, [r0, #4]
  ADD r2, r2, #1
  STR r2, [r0, #4]
  LDR r3, [r0, #8]
  ADD r3, r3, #1
  STR r3, [r0, #8]

Compilers and register renaming are fighting each other. In traditional compiler writing, you try to minimise the register usage and output the first code listing. But if you have plenty of registers, you could output the second code instead, and let the CPU do parallel execution without the need for register renaming.

In other words, once you have enough 'real' registers does it get rid of the need for register renaming? Intel added it to their pentiums to improve existing x86 code, but I wonder if it has that much of a benefit with newer ISAs that have 'enough' registers and properly tuned compilers?

brigade · on Oct 11, 2013

You still need OoOe to execute your second example optimally since you didn't schedule the instructions, which points to why OoOe isn't going away - there are going to be code sequences that the compiler cannot schedule optimally, particularly around branches. Additionally, cache misses are impossible to predict statically, and OoOe helps hide those.

And no one does OoOe without register renaming.

joosters · on Oct 11, 2013

Yeah, I avoided any other changes to avoid confusing the issue. But any reordering I could have done, the compiler could have done too. Your point about branches is fair though, as the 'active' renamed registers after a branch can only be known at runtime.

Still, I wonder whether some of the features of modern CPUs could be dropped if it wasn't for legacy code. On the other hand, Itanium tried to push the parallelism work onto the compiler and look where that ended up!

Symmetry · on Oct 11, 2013

Most high performance CPUs will have ~100 physical registers or so, possibly divided up in multiple segments.

But abstracting those you have your architectural registers that are presented by your ISA, and the CPU uses register renaming to map those onto the physical registers.

The tradeoffs involving ISA registers are more intense. You have to load and store all of them on thread swaps, but that's pretty tivial. More importantly the bits you have to use to specify which register you're using are bits that you're paying in every single instruction you have, increasing the size of your executable and the pressure on your caches.

Different sorts of architectures have their sweet spots at different places. In order processors doing lots of matrix math and such benefit from lots of architectural registers, the Itanium had 128 integer and 128 floating point registers and that was the right amount for a VLIW architecture with it's features. Modern GPUs are similar.

On the other hand, your typical OoO CPU will have either 16 or 32 registers you can address at a time, and that seems to be close to optimal. It's hard to say since instructions come in discrete chunks and your number of registers has to be a power of 2 as a practical matter.

codex · on Oct 11, 2013

Fundamentally, having more registers increases the speed of light delays in accessing the register. If it did not, we would just operate on main memory itself. However, two few registers and you lose the ability to perform complex computations efficiently. So I believe it is, indeed, a compromise between speed and a need to maintain scratch state. I would be surprised if Intel and AMD didn't constantly run simulations of common computations in an effort to find the optimal size of all on-chip structures.

joosters · on Oct 11, 2013

That's definitely another factor but I suspect it isn't the limiting factor. Sure, design a chip with a million registers and you'll end up constructing them like RAM. But with orders-of-magnitude fewer registers, 16 or 32 or whatever, the size of the register banks on the CPU can't be that significant to incur speed-of-light style delays, surely?

sliverstorm · on Oct 11, 2013

WITH 16x fewer registers, that equates to about 1 chip's worth of registers (remember a stick of DRAM often has 8-16 individual chips on it). While this is already clearly a huge problem, consider additionally that DRAM is made with trench capacitors, unlike SRAM. DRAM is dramatically slower and more dense than SRAM. So we either sacrifice speed, or bloat our one-chip's-worth of area by a few factors, say x4-8.

Then there's practicalities like sense amp design. Large register arrays are not read in a digital fashion, and current L2 and L3 sizes already press their sense amps to their limits. DRAM also uses sense amps, but the amps are again slower and larger.

http://en.wikipedia.org/wiki/Sense_amplifier

masklinn · on Oct 11, 2013

Probably not, but there are definitely delay effects at play or L2 and L3 cache would be unnecessary, you could just have humongous L1s.