Extensive loop unrolling is counterproductive on any good architecture: the hardware effectively unrolls the loop for you inside the CPU (loop buffers, uop caches, branch prediction), replaying the body without re-fetching each instruction from memory, so all those unrolled copies buy you is a massive amount of wasted cache space.
It might look a little faster in microbenchmarks on that piece of code because you've shaved off the barely existent branch overhead, but outside a microbenchmark you've also evicted a bunch of useful code from the cache, and branch overhead is TINY compared to a cache miss. Those idiots who write 4KB memcpy()s have exactly this problem: yes, in isolation your code probably is the fastest way to copy a chunk of data around. No, it's not going to make the programs that use it faster, because whatever used to live in the 1/8th of the L1 cache it just evicted now has to be read back in.