The testing code is not open source yet, but we plan to make it available in the near future. We are also working on a code analysis tool, similar to Intel's IACA, that uses the instruction data from http://www.uops.info/. Finally, we also plan to add data for AMD processors.
Oh cool, seems like we'll jump from zero to two direct competitors to IACA soon. I look forward to your tool (hopefully it can use the same IACA marker bytes, so binaries can be compiled once for all these tools).
There is also llvm-mca, which is kinda neat since it uses the information that LLVM has about instruction costs. Contrasting that with IACA or OSACA is valuable as well.
You are right, I had forgotten about that one. Do they actually provide results for various CPUs in human-readable form anywhere, or is it only a toolset for generating the benchmarks and collecting the results locally (plus PDF extraction to process the Intel SDM)?
I think they have internal tools for humans, but the open source project only produces the machine-readable information intended to be consumed by compilers. If you look at their slide deck, they have some kind of visualizer that explains dependencies and port dispatch for a short snippet of asm.
Yeah I wasn't talking so much about machine vs human readable, but whether the results for various CPUs are available, published somewhere in the git repo or elsewhere. If not, then it's just a tool for getting the results yourself, on hardware you have - which is also very nice, but a different type of resource as compared to Agner's published guides for 20+ CPUs, and the uops.info tables, etc.
Latency numbers will differ (a lot) because the definitions of latency these sites use differ. Agner uses the delay that the instruction generates in a dependency chain. uops.info uses the number of clock cycles required for the execution core to complete the execution of all of the μops that form an instruction. instlatx64 defines latency as the time it takes for the next dependent same-type instruction to start. These are very different definitions.
For example, Agner gives a latency of 2 for pop r while uops.info gives 6.
The "number of clock cycles that are required for the execution core to complete the execution of all of the μops that form an instruction" is the definition that Intel uses in its manuals to define latency.
It is actually not the definition that uops.info uses. Instead, it uses a definition that takes into account that different operands of an instruction might be ready at different times (see section 4.1 of the paper that mentions Intel's definition as a "common definition" and then continues to introduce the new definition). Furthermore, uops.info also considers latency differences that can occur if an instruction uses the same register for multiple operands (the SHLD instruction is an example for this).
Is it even possible to measure the latency by Intel's definition? I think "operational" definitions that correspond to something you can actually measure (and by extension apply to measured performance in real code too) are far preferable.
In theory "it is complicated" because an instruction might have N inputs and M outputs, and the latency matrix might have different values for every element of that N x M matrix, but in practice instructions have one output and the inputs are usually symmetric with regard to latency, so one figure is enough for most instructions. It's worth calling out the exceptions though, and uops.info is supposed to do that I think (an example would be great).
Section 7.3 of the paper describes several such examples.
On Sandy Bridge, for example, the result of the "AESDEC XMM1, XMM2" instruction is ready 8 cycles after XMM1 becomes available. If only XMM2 is on the critical path, however, the result is already available after a bit more than one cycle.
On Nehalem, the SHLD R1, R2, imm instruction has, according to Intel's manual, http://instlatx64.atw.hu/, and IACA, a latency of 4 cycles. Agner Fog reports a latency of 3 cycles. The measurements on uops.info show that the latency from R1 to R1 is 3 cycles, while the latency from R2 to R1 is 4 cycles.
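The per-operand numbers in these examples boil down to a small model: the result is ready at the maximum, over all sources, of (source-ready time + per-source latency). Here is a minimal Python sketch using the Nehalem SHLD figures above (3 cycles from R1, 4 cycles from R2); the function and variable names are mine, not anything from uops.info:

```python
def result_ready(src_ready, lat):
    """Cycle at which the destination becomes ready, given per-source
    ready times and per-source->destination latencies."""
    return max(src_ready[s] + lat[s] for s in lat)

# Nehalem SHLD R1, R2, imm: latency R1->R1 is 3 cycles, R2->R1 is 4.
shld_lat = {"R1": 3, "R2": 4}

# Both inputs ready at cycle 0: R2 is on the critical path.
print(result_ready({"R1": 0, "R2": 0}, shld_lat))   # 4

# R2 ready 2 cycles early: now only the R1 edge matters.
print(result_ready({"R1": 0, "R2": -2}, shld_lat))  # 3
```

A single "latency" number for SHLD has to pick one of these edges; the per-operand table lets you compute the critical path correctly for either case.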
Right, I read the paper and saw the examples, but I meant more an example of if/how this information is surfaced in the results as collected at uops.info.
Anyways, it is all there, using your SHLD example:
All of the memory-source instructions in Agner's guide don't really have a meaningful latency. In particular, the input and output domains are different. In the case of pop, the input is memory and the output is a register, so you can't really measure the latency of pop in isolation: you need to chain it together with at least one other instruction whose inputs and outputs are in the opposite domain (e.g. push, or mov [memory], reg).
Agner says himself that he "arbitrarily" divides the latency between load and store type instructions in these cases which gives the mostly meaningless value of 2 for load-type instructions.
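That convention is just arithmetic on a measurable round trip. A throwaway sketch of it (the 8-cycle round-trip figure below is made up for illustration; only pop's "2" comes from Agner's tables):

```python
def split_round_trip(round_trip_cycles, load_share):
    """Agner's convention: measure a load/store round trip (e.g. a
    pop/push chain through the same stack slot), which is the only
    thing you can actually time, then split the total arbitrarily
    between the load-type and the store-type instruction."""
    return load_share, round_trip_cycles - load_share

# Hypothetical 8-cycle pop -> push -> pop round trip, pop assigned 2:
pop_latency, push_latency = split_round_trip(8, 2)
print(pop_latency, push_latency)  # 2 6
```

The individual numbers are a bookkeeping choice; only their sum is a property of the hardware, which is why the per-instruction values are "mostly meaningless" on their own.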
They mention in the "Limitations" section that "Except for the division instructions, we do not consider performance differences that might be due to different values in registers, or different immediate values."
Besides division, are there other instructions that are known/expected/suspected to have different execution times based on the values in the input registers?
I have found that load instructions take one less cycle (4 vs 5) if they use an index register, where the index register is zero and was set to zero via a zeroing idiom. More details on RWT [1].
You didn't include immediates in your question, but they were included in the quote you referred to, so it's worth mentioning adc with an immediate zero, which is "twice as fast" [2] from Sandy Bridge through Haswell.
Of course many FP instructions have value-dependent performance, particularly with denormals (although it occurs even when denormals aren't involved).
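For anyone who hasn't bumped into denormals (subnormals): they are the floating-point values between zero and the smallest normal number, and they're the classic trigger for value-dependent FP timing. A quick Python illustration of what those values look like (Python won't show the slowdown itself, since interpreter overhead dominates):

```python
import sys

# Smallest positive subnormal double: 2**-1074, printed as 5e-324.
tiny = sys.float_info.min * sys.float_info.epsilon
assert tiny > 0.0                  # subnormals keep tiny results nonzero...
assert tiny / 2 == 0.0             # ...until they finally underflow to zero
assert 0.0 < sys.float_info.min / 2 < sys.float_info.min  # a subnormal
print(tiny)  # 5e-324
```

On many x86 cores, producing or consuming such values takes a microcode assist that costs far more cycles than the normal-input case, which is exactly the kind of value dependence the quoted "Limitations" section excludes.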
FWIW recent AMD chips seem to have fixed-latency integer dividers.
Is the testing code open source? There is no obvious link from the main page.
[1] https://www.agner.org/optimize/#manuals
[2] http://users.atw.hu/instlatx64/