/prog/ - AMD optimising against itself

1 Name: Cudder !MhMRSATORI!fR8duoqGZdD/iE5 2013-07-07 8:32

From http://support.amd.com/us/Processor_TechDocs/47414_15h_sw_opt_guide.pdf p.126:
Avoid using the LOOP instruction.
The LOOP instruction has a latency of 7 cycles in 32-bit protected mode and 8 cycles in 64-bit protected mode.

Agner's instruction tables list Bulldozer as having a latency of only 1-2 cycles for this instruction, vs. 3 for K10, 3-4 for K8 and K7, 4 for Bobcat, 4 for Nehalem, and 5 for Sandy Bridge. The alternative, a fused ALU+jcc pair, has exactly the same timing on Bulldozer, but takes only 2 on Nehalem and 1-2 on Sandy Bridge. Interestingly enough, LOOP is also faster on VIA Nanos.

AMD's document then contradicts itself on p.246, where it lists LOOP as a FastPath Single with a latency of 1. In other words, if you want to make Bulldozer look faster than Intel's chips, you SHOULD use LOOP, and it's also a significant improvement over previous generations of AMD. I have no idea why they didn't take this opportunity, despite likely having spent effort on that improvement. Then again, what'd you expect from the company that came up with AMD64...

7 Name: Anonymous 2013-07-07 21:28

This is on Nehalem:



  mov ecx, 1000000    ; iteration count
 tohere:
  loop tohere         ; ecx -= 1, jump if ecx != 0 (flags untouched)

3818322 cycles or around 3.8 cycles/iteration



  mov ecx, 1000000    ; iteration count
 tohere:
  dec ecx             ; sets ZF when ecx reaches 0
  jnz tohere          ; repeat while ecx != 0

1909224 cycles or around 1.9 cycles/iteration



  mov ecx, 1000000    ; iteration count
 tohere:
  sub ecx, 1          ; like dec, but also updates CF
  jnz tohere          ; repeat while ecx != 0

1909218 cycles or around 1.9 cycles/iteration

So just an empty LOOP is half the speed of DEC/JNZ or SUB/JNZ, which matches the times in the OP. What about putting something in the loop, like this?



  mov ecx, 1000000    ; iteration count
 tohere:
  mov eax, [var_x]    ; load the value stored by the previous iteration
  xor eax, 12345678
  add eax, 87654321
  mov [var_x], eax    ; store it back
  loop tohere

Now LOOP is 4772868 cycles or 4.8 cycles/iteration, while DEC/JNZ and SUB/JNZ are 6681900 cycles or 6.7 cycles/iteration, about 40% longer! What's going on here?

Can anyone with a Bulldozer test this? According to the manual, LOOP should be the same speed as DEC/JNZ or SUB/JNZ.
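
For anyone who wants to check the arithmetic on the numbers above, here's a quick Python sketch. The helper names (cycles_per_iteration, percent_longer) are made up for illustration, and the only inputs are the cycle counts posted above plus the 1000000 loaded into ecx. It doesn't measure anything itself.

```python
# Cycle counts as posted above (Nehalem); each run does 1,000,000 iterations.
ITERATIONS = 1_000_000

EMPTY = {"loop": 3818322, "dec/jnz": 1909224, "sub/jnz": 1909218}
WITH_BODY = {"loop": 4772868, "dec/jnz or sub/jnz": 6681900}


def cycles_per_iteration(total_cycles, iterations=ITERATIONS):
    """Average cost of one trip around the loop."""
    return total_cycles / iterations


def percent_longer(slow_cycles, fast_cycles):
    """How much longer the slow run took, as a percentage of the fast one."""
    return (slow_cycles / fast_cycles - 1.0) * 100.0


for label, table in (("empty", EMPTY), ("with body", WITH_BODY)):
    for name, cycles in table.items():
        print(f"{label:9s} {name:18s} {cycles_per_iteration(cycles):.1f} cycles/iteration")

# Empty loop: LOOP takes about twice as long as DEC/JNZ.
print(f"empty LOOP vs DEC/JNZ: {EMPTY['loop'] / EMPTY['dec/jnz']:.2f}x")

# Loop with the load/xor/add/store body: DEC/JNZ is the slower one.
slower = percent_longer(WITH_BODY["dec/jnz or sub/jnz"], WITH_BODY["loop"])
print(f"with body, DEC/JNZ vs LOOP: {slower:.0f}% longer")
```

The ratios are all that matters here; the absolute counts obviously depend on how the timing was done.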

8 Name: Cudder !MhMRSATORI!fR8duoqGZdD/iE5 2013-07-08 2:07

>>2
Rationale? If I were to guess, they're keeping it to themselves and seeing what Intel will do, because if they advise using LOOP then you can bet Intel's next generation is going to make LOOP go even faster. But clearly AMD has lost its ability to improve performance much (unlike in the Athlon days).

>>3
There's no reason why LOOP can't be decoded immediately into the corresponding dec/jnz uops (with the small but important difference that flags are NOT affected), and as I mentioned in the OP, there's evidence that Bulldozer does do so, from both AMD's official manual and Agner's independent tests.
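
To make the flags point concrete, here's a rough toy model in Python. It is not a real emulator: the function names (loop_step, dec_jnz_step, run) are invented for illustration, only ZF is tracked (DEC also touches OF, SF, AF and PF), and it says nothing about how any actual decoder splits the instruction into uops.

```python
MASK32 = 0xFFFFFFFF


def loop_step(ecx, zf):
    """LOOP: decrement ECX and branch if the result is non-zero. ZF is untouched."""
    ecx = (ecx - 1) & MASK32
    taken = ecx != 0
    return ecx, zf, taken


def dec_jnz_step(ecx, zf):
    """DEC ECX / JNZ: DEC sets ZF from the result, JNZ branches if ZF is clear."""
    ecx = (ecx - 1) & MASK32
    zf = ecx == 0
    taken = not zf
    return ecx, zf, taken


def run(step, ecx, zf):
    """Run an empty loop body until the branch falls through.

    Returns (iterations, final ecx, final ZF).
    """
    iterations = 0
    while True:
        iterations += 1
        ecx, zf, taken = step(ecx, zf)
        if not taken:
            return iterations, ecx, zf


# Same trip count and same final ECX; only the flag state on exit differs.
print("loop   :", run(loop_step, 3, False))
print("dec/jnz:", run(dec_jnz_step, 3, False))
```

Both versions go around the same number of times and leave ECX at 0. The only visible difference is that ZF comes out of the LOOP version exactly as it went in, which is the "small but important difference" a decoder would have to preserve.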

>>5
This has nothing to do with race. I couldn't care less who -- or what -- designs these things.

>>7
You've just discovered that accessing memory is slow no matter what, and the reason LOOP is faster now might be contention-related. I don't have a ~~Fail~~Bulldozer to test this though. (Maybe it's not quite a fail after all...?)

