/prog/ - Branch Misses

Name: Anonymous 2011-09-26 17:26

How bad do branch misses hurt a programs performance? In C and Java?

for example how big of an impact would :



int shitinanus()

{

    int val = getanus();

    if(val != 0)

        return ANALTONIGHT;

    return ALONETONIGHT;

}

be knowing that val will most likely be 0 99% of the time compared to the below function



int shitinanus()

{

    int val = getanus();

    if(val == 0)

        return ALONETONIGHT;

    return ANALTONIGHT;

}

Name: Anonymous 2011-09-26 17:57

Shouldn't make any difference because the compiler would optimize the branch away.

Name: Anonymous 2011-09-26 18:46

Surely you could have taken the five minutes to actually benchmark that code. I suspect you just wanted to shit up /prog/ with your puerility.

Ah, well. With a bit of luck, you would find that in your example it doesn't make a lick of a difference, due to one of two factors:
1. Any modern CPU keeps a history buffer with taken branches, so no matter how you organize your code, it will predict a branch to the common case for frequently executed branches.
2. If you don't have a modern CPU, i.e., one with OoOE, your example should compile to a conditional move, eliminating the branch altogether.

To give some actual numbers: the branch misprediction penalty on a Cortex-A8 is 13 cycles, and on a Cortex-A9 it's 8 cycles.
I found some interesting data on other processors at http://www.7-cpu.com/ , though I can't vouch for its accuracy.

Name: Anonymous 2011-09-26 18:57

>>1
Theoretically, branching is prejudicial to performance in general, because it tends to force pipeline flushes or stalls. The predicting circuitry avoid great part of the penalty. Thus, a branch miss hits performance somewhat bad in the machine-level, but not as bad as a cache miss (except, evidently, if the branch miss also causes an instruction cache miss).

This means that you'll have visible performance differences in certain situations; for example, if you're branching dozens of thousands of times in a tight loop. Otherwise, it won't even contribute to the execution time noise; such latencies are typically in the nanosecond range. Also, depending on the final code layout, the static predictor always hits if certain conditions are met (for example, conditional backwards jumps are always seen as taken). In C, you can't decide the code layout, except by hinting the compiler with builtin directives (which he'll probably ignore anyway, since programmers are very bad at predicting bottlenecks).

Thus, "branch misses" are a concern far, far, far away from the average application code, specially in the kind of conditional you've suggested in the snippet. In Java (and probably every language which is not C), it's even more true: Java has an ocean of low-level overhead between source code and the final instruction stream, making any attempt of justifying some "optimized" construct look just plain ridiculous.

Also, x86 processors have an optimization circuitry called "loop stream detector" which is able to optimize very heavily the execution of small loops up to 16 bytes in code length. (Remember that a loop always have a conditional branch associated with it.)

Ultimately, people who're paranoid about such issues rarely understand anything about the subject. They typically have bad programming practices (like littering the code with __builtin_expect() directives), kludges data structures with "optimized" expressions, and overinline function calls, causing an even greater performance loss (due to the forementioned and imminent instruction cache miss caused by big thunks of linear code). People don't profile, because profiling makes you look uncool, since it always proves that all of "those highly and expertly optimized sites in your C code" is just girlyish bullshit.

IOW: don't waste energy "optimizing" branch sites in Java. In C, it'll make sense only in very specific scenarios, with which you'll probably never meet.

Name: Anonymous 2011-09-26 19:14

why dont you compile each and run them 9 billion times each and compare

Name: Anonymous 2011-09-26 19:20

ONE WORD: THE FORCED PROFILE-GUIDED OPTIMIZATION OF THE CODE THREAD OVER

>>3
http://www.7-cpu.com/
Cortex-A9
Whoah, I didn't expect toy CPUs to perform that well. OTOH at these low frequencies relative RAM latency is much lower and this benchmark is highly dependent on that, but still it's kinda impressive to have a CPU which can compete with x86 in IPC for once. Now they just need to put 12 cores in and bump the clock threefold.

Name: Anonymous 2011-09-26 19:21

>>3
Surely you could have taken the five minutes to actually benchmark that code. I suspect you just wanted to shit up /prog/ with your puerility.
what a faggot you sound like

Name: Anonymous 2011-09-26 19:26

>>4
Not OP, but this post was highly informative. If you could give a good source in which I can read about the subject to help me with my girlish optimization fantasies I would be thankful.

Name: Anonymous 2011-09-26 19:44

>>8
There's only one reliable source regarding assembly-level code optimization: the processor manuals. Everything else is either bullshit, or just resaying what the manuals already say.

Intel has a book dedicated to that subject, for IA-32: http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html. It's a very interesting book, albeit the reading is rather deep.

But it'll probably be of more use if you look for profiling techniques. Search for oprofile and gprof for a starting point. These tools are typically enough to detect real bottlenecks on profilable code.

Name: Anonymous 2011-09-26 19:47

>>9
Thank you.

Name: Anonymous 2011-09-26 19:52

>>9
The processor manuals don't always tell the whole story in an useful way.

http://www.agner.org/optimize/ is required reading, specially as far as x86 is concerned.

Name: Anonymous 2011-09-26 19:59

>>11
That is the greatest thing I've seen on the internet for some time now. Its like walking in on a treasure trove.

Name: Anonymous 2011-09-26 20:00

!=0 should still be the faster option..?

Name: Anonymous 2011-09-26 20:01

>>6
According to ARM's figures, the Cortex-A15 will have increased IPC, goes up to 2.5 GHz, and is designed for quad-core.
So far, so good, but if they want the desktop and server market they'll have to introduce a 64-bit architecture. LPAE is just a stopgap measure.

Name: Anonymous 2011-09-26 21:31

>>4
Okay, I can't tell if you're just plain fucking stupid, a jew, trolling, or any combination of the three. Anyways, from the view of the C abstract machine, there is no difference in the code.

BTW faggot, you're still confusing the abstract machine with the implementation itself. Now go scrub another toilet.

Name: Anonymous 2011-09-26 21:33

>>4
And your description really breaks down for Java. I'm suspecting this is because you don't know the difference between the Java language itself and the JVM.

Name: Anonymous 2011-09-26 21:34

>>15
You're not funny.

Name: Anonymous 2011-09-26 21:38

>>17
I'm not trying to be. Go read one the many fucking C standards you moron.

Name: Anonymous 2011-09-26 21:39

>>18
*one of the many*

Name: Anonymous 2011-09-26 21:45

>>15,18
kill yourself faggot

Name: Anonymous 2011-09-26 21:54

>>20
Listen little girl, there is a reason why those of us program who living preach the standards. This is because some of us have to compile the same fucking thing on 8 different platforms. If you have non standard/non conforming code, then the works becomes a lot more difficult.

But you wouldn't know anything about this. Now go scrub another toilet and keep studying SICP you fucking nowhere bitch.

Name: Anonymous 2011-09-26 22:00

it's nothing compared to memory allocation, deallocation, and cache coherency.

Name: Anonymous 2011-09-26 22:07

Are branch misses a problem in interpreted languages?

Name: Anonymous 2011-09-26 22:21

>>21
I only support GNU-C. Now kill yourself, faggot.

Name: Anonymous 2011-09-26 22:58

>>24
I only support GNU-C.
Then you're not an EXPERT PROGRAMMER

Name: sage 2011-09-26 23:03

>>15
You're visibly a complete illiterate, in the most obscene basic level. What the fuck are you talking about? Have you understood or even read anything that's been said?

Seriously, I fear you're retarded to a point way past of even minor recovery.

Name: Anonymous 2011-09-26 23:22

>>11
This is indeed an interesting source of information. I disagree on a couple of points described by the author, however; in some of them it seems to me that he's completely wrong.

For example, the author states that some common boolean arithmetic is typically implemented with branch trees. This seems rather absurd to me. Also, dynamic_cast is -extremely- costly if a type analysis is actually performed; in particular, much more costly than a virtual call -- maybe he should have stressed that. And, while pointer dereferencing is typically fast, sometimes it's slower than direct access by a considerable factor (pointer load, LEA and friends). The situation is much worse if pointer aliasing exists.

The material is nonetheless really valuable in the whole. Almost everything described is neutral knowledge to the assembly programmer, but he exposes it on a more tractable language. Despite that, some of his tips are really infeasible in my opinion (messing with code legibility or semantics to 'hint' an optimization to the compiler is particularly out of question to me).

Name: Anonymous 2011-09-27 0:31

^{S^{L^{A^{C^KW}A}R}E}_{S_{U_{P_{R_EM}A}C}Y}

Name: Anonymous 2011-09-27 0:58

>>15

this is just the "i wouldn't hire you", "go scrub a toilet", "go back to your job at arby's", etc etc guy.

Name: Anonymous 2011-09-27 1:32

^{S^{L^{A^{C^KW}A}R}E}_{S_{U_{P_{R_EM}A}C}Y}

Name: Grandma 2011-09-27 2:21

enum { SUNDAY, MONDAY, TUESDAY, WEDNESDAY, THURSDAY, FRIDAY, SATURDAY };

if(day % 7 == TUESDAY)
bananas_chase_the_teddies();
else
normal_day();

Name: Anonymous 2011-09-27 7:51

>>24
I support your cause and will gladly fight alongside you.

Name: Anonymous 2011-09-27 7:57

>>32
Terrible!

Name: Anonymous 2011-09-27 9:07

>>33
Nice du[spoiler][spoiler]bz!

Name: Anonymous 2011-09-27 9:09

>>33
Nice dubz!

Name: Anonymous 2011-09-27 9:33

>>34-35
Fuck off, spammer.

Name: Anonymous 2011-09-27 11:24

__gnu_gcc_builtin_func_car_cdr_eval_apply(void *);

Name: Anonymous 2011-09-27 11:32

>>37
#include <lspintrin.h>

Name: Anonymous 2011-09-27 11:58

if (val) is much faster than if (val != 0) I benchmarked the old code with the faster if test and it resulted in a 230% speedup.

Name: FrozenVoid 2011-09-27 12:06

>>39
If your code is so inefficient that one 'if' results in 230% speedups post it here, i could save you at least 50% more speed.

Branch Misses

1 Name: Anonymous 2011-09-26 17:26

2 Name: Anonymous 2011-09-26 17:57

3 Name: Anonymous 2011-09-26 18:46

4 Name: Anonymous 2011-09-26 18:57

5 Name: Anonymous 2011-09-26 19:14

6 Name: Anonymous 2011-09-26 19:20

7 Name: Anonymous 2011-09-26 19:21

8 Name: Anonymous 2011-09-26 19:26

9 Name: Anonymous 2011-09-26 19:44

10 Name: Anonymous 2011-09-26 19:47

11 Name: Anonymous 2011-09-26 19:52

12 Name: Anonymous 2011-09-26 19:59

13 Name: Anonymous 2011-09-26 20:00

14 Name: Anonymous 2011-09-26 20:01

15 Name: Anonymous 2011-09-26 21:31

16 Name: Anonymous 2011-09-26 21:33

17 Name: Anonymous 2011-09-26 21:34

18 Name: Anonymous 2011-09-26 21:38

19 Name: Anonymous 2011-09-26 21:39

20 Name: Anonymous 2011-09-26 21:45

21 Name: Anonymous 2011-09-26 21:54

22 Name: Anonymous 2011-09-26 22:00

23 Name: Anonymous 2011-09-26 22:07

24 Name: Anonymous 2011-09-26 22:21

25 Name: Anonymous 2011-09-26 22:58

26 Name: sage 2011-09-26 23:03

27 Name: Anonymous 2011-09-26 23:22

28 Name: Anonymous 2011-09-27 0:31

29 Name: Anonymous 2011-09-27 0:58

30 Name: Anonymous 2011-09-27 1:32

31 Name: Grandma 2011-09-27 2:21

32 Name: Anonymous 2011-09-27 7:51

33 Name: Anonymous 2011-09-27 7:57

34 Name: Anonymous 2011-09-27 9:07

35 Name: Anonymous 2011-09-27 9:09

36 Name: Anonymous 2011-09-27 9:33

37 Name: Anonymous 2011-09-27 11:24

38 Name: Anonymous 2011-09-27 11:32

39 Name: Anonymous 2011-09-27 11:58

40 Name: FrozenVoid 2011-09-27 12:06