I'm currently working on a precomputed version. >>19 wasn't fast enough on my machine (it was actually 600 ms slower; Athlon II X2 240, which apparently doesn't like branching).
This algorithm is extremely simple and there could be further optimizations, but I can't find any yet (except precomputing a lot of values).