I had to wrap the >>20 in a function to stop my compiler from simply substituting the results at compile time, since all the operands are constants. With even first-level optimizations, every version takes ~18 cycles (the RDTSC latency on my CPU), because the compiler optimizes the code away entirely and only the timer reads are left to measure. I also added a loop to repeat each one 1048576 (2^20, roughly 1M) times. These are all without optimization:
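For reference, a harness of this kind could look roughly like the sketch below. It is not my exact code: the names shift_right_20, time_shift, read_ticks and ITERATIONS are made up for illustration, the noinline attribute assumes GCC or Clang, and on non-x86 targets it falls back to a nanosecond clock instead of RDTSC.

```c
#define _POSIX_C_SOURCE 199309L  /* for clock_gettime in the non-x86 fallback */
#include <stdint.h>
#include <stdio.h>

#if defined(__x86_64__) || defined(__i386__)
#include <x86intrin.h>
/* Read the time-stamp counter (cycles) via the compiler intrinsic. */
static inline uint64_t read_ticks(void) { return __rdtsc(); }
#else
#include <time.h>
/* Assumption: without RDTSC, fall back to a monotonic nanosecond clock. */
static inline uint64_t read_ticks(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}
#endif

#define ITERATIONS (1u << 20)  /* 1048576 repetitions */

/* Hypothetical wrapper: keeping the shift behind a real call stops the
 * compiler from folding "constant >> 20" into a literal. */
__attribute__((noinline)) uint32_t shift_right_20(uint32_t x) {
    return x >> 20;
}

/* Time ITERATIONS calls and return the elapsed ticks. The last result is
 * written through result_out, and every result goes through a volatile
 * sink, so the work cannot be discarded as dead code. */
uint64_t time_shift(uint32_t seed, uint32_t *result_out) {
    volatile uint32_t sink = 0;
    uint64_t start = read_ticks();
    for (uint32_t i = 0; i < ITERATIONS; i++) {
        sink = shift_right_20(seed + i);
    }
    uint64_t end = read_ticks();
    *result_out = sink;
    return end - start;
}
```

Calling time_shift once per variant and dividing the returned tick count by ITERATIONS gives a per-call average that includes loop and call overhead, so only the differences between variants mean much. The unoptimized timings follow: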