/prog/ - AMD optimising against itself

Name: Cudder !MhMRSATORI!fR8duoqGZdD/iE5 2013-07-07 8:32

From http://support.amd.com/us/Processor_TechDocs/47414_15h_sw_opt_guide.pdf p.126:
Avoid using the LOOP instruction.
The LOOP instruction has a latency of 7 cycles in 32-bit protected mode and 8 cycles in 64-bit
protected mode.

Agner's instruction tables list Bulldozer as having a latency of only 1-2 cycles for this instruction, vs. 3 for K10, 3-4 for K8 and K7, 4 for Bobcat, 4 for Nehalem, and 5 for Sandy Bridge. The alternative, a fused ALU+jcc pair, has exactly the same timing on Bulldozer, but only 2 on Nehalem and 1-2 on Sandy Bridge. Interestingly enough, on VIA Nanos LOOP is also faster.
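
For anyone who wants to see the two loop-closing forms side by side, here is a minimal sketch in C with GCC-style inline assembly. It assumes x86-64 and a GCC/Clang-compatible compiler (a plain C fallback is used elsewhere), and the function names are made up for illustration. It only shows the two instruction sequences being compared; it does not measure latency.

```c
#include <stdint.h>

/* Sum n + (n-1) + ... + 1, closing the loop with LOOP.
 * Hypothetical helper, not taken from the AMD guide or Agner's tables. */
static uint64_t sum_with_loop(uint64_t n)
{
    uint64_t acc = 0;
    if (n == 0)
        return 0;           /* LOOP decrements first, so RCX == 0 would wrap around */
#if defined(__x86_64__) && defined(__GNUC__)
    __asm__ volatile(
        "1:\n\t"
        "add %[cnt], %[acc]\n\t"   /* acc += counter */
        "loop 1b\n\t"              /* RCX -= 1; jump back if RCX != 0 */
        : [acc] "+r"(acc), [cnt] "+c"(n)   /* "c" pins the counter to RCX */
        :
        : "cc");
#else
    do { acc += n; } while (--n != 0);     /* portable fallback */
#endif
    return acc;
}

/* Same sum, closing the loop with the ALU+jcc pair (dec + jnz). */
static uint64_t sum_with_dec_jnz(uint64_t n)
{
    uint64_t acc = 0;
    if (n == 0)
        return 0;           /* dec on 0 would wrap around as well */
#if defined(__x86_64__) && defined(__GNUC__)
    __asm__ volatile(
        "1:\n\t"
        "add %[cnt], %[acc]\n\t"   /* acc += counter */
        "dec %[cnt]\n\t"           /* counter -= 1, sets ZF */
        "jnz 1b\n\t"               /* jump back while counter != 0 */
        : [acc] "+r"(acc), [cnt] "+r"(n)   /* any general register will do */
        :
        : "cc");
#else
    do { acc += n; } while (--n != 0);     /* portable fallback */
#endif
    return acc;
}
```

Both functions return the same value for the same n; the only difference is which instruction(s) close the loop. Note that LOOP forces the counter into RCX, while dec/jnz works with any register. To actually compare timings you would need a long dependent chain and a cycle counter, which this sketch leaves out.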

AMD's document then contradicts itself on p.246, where it lists LOOP as a FastPath Single with a latency of 1. In other words, if you want to make Bulldozer look faster than Intel's CPUs, you SHOULD use LOOP, and it's also a significant improvement over the previous generations of AMD. I have no idea why they didn't take this opportunity, despite likely having spent effort on that improvement. Then again, what'd you expect from the company that came up with AMD64...
