>>9
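Loop unrolling is mostly obsolete on modern x86 processors thanks to better hardware. It was most beneficial on the Pentium 4 with its very deep pipeline, became contentious throughout the Core line, and can actually slow things down on i7s and above, where a tight loop can run straight out of the loop buffer or uop cache and an unrolled one may no longer fit.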
http://www.agner.org/optimize/blog/read.php?i=142
http://x264dev.multimedia.cx/archives/201
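For anyone who hasn't seen the tradeoff spelled out, here's a rough sketch in C. The function names are made up for illustration, and a compiler at -O2/-O3 may well unroll or vectorise either version on its own, so treat this as a picture of the code-size difference, not a benchmark.

```c
#include <stddef.h>
#include <stdint.h>

/* Rolled: tiny loop body, one add and one branch per element. */
uint32_t sum_rolled(const uint8_t *p, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += p[i];
    return s;
}

/* Manually unrolled 4x: one branch per four elements, but the
 * loop body is roughly four times larger and needs a tail loop
 * for the leftover 0..3 elements. */
uint32_t sum_unrolled4(const uint8_t *p, size_t n)
{
    uint32_t s = 0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s += p[i];
        s += p[i + 1];
        s += p[i + 2];
        s += p[i + 3];
    }
    for (; i < n; i++)  /* tail */
        s += p[i];
    return s;
}
```

Both return the same sum for any input. Which one is faster depends on the CPU and the compiler, so time it on the actual target instead of assuming the unrolled one wins.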
tl;dr: The future of optimisation is code and data density. Unaligned access penalties and the old size-for-speed tradeoffs matter far less now, because smaller usually IS faster. (Sucks for you, RISC fans...)