>>16
SSE and MMX are good, but you will have increased latency, and your data needs to be structured to accommodate these instructions (typically vectors of similar variables).
Example
The following simple example demonstrates the advantage of using SSE. Consider an operation like vector addition, which is used very often in computer graphics applications. To add two single-precision, four-component vectors together using x86 requires four floating-point addition instructions:
vec_res.x = v1.x + v2.x;
vec_res.y = v1.y + v2.y;
vec_res.z = v1.z + v2.z;
vec_res.w = v1.w + v2.w;
This would correspond to four x86 FADD instructions in the object code. On the other hand, as the following pseudo-code shows, a single 128-bit 'packed-add' instruction can replace the four scalar addition instructions.
movaps xmm0,address-of-v1 ;xmm0 = v1.w | v1.z | v1.y | v1.x
addps xmm0,address-of-v2 ;xmm0 = v1.w+v2.w | v1.z+v2.z | v1.y+v2.y | v1.x+v2.x
movaps address-of-vec_res,xmm0
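If you would rather not write the assembly by hand, the same packed add can be sketched in C with the SSE intrinsics from <xmmintrin.h>. This is only a rough sketch: the vec4 type and the vec4_add name are made up for illustration, and it assumes an x86 target with SSE enabled plus a C11 compiler for _Alignas. The 16-byte alignment matters because _mm_load_ps and _mm_store_ps are the aligned forms, like movaps.

```c
#include <xmmintrin.h>  /* SSE intrinsics: __m128, _mm_load_ps, _mm_add_ps, _mm_store_ps */

/* Hypothetical four-component vector type (not from the post above).
   f[0..3] hold x, y, z, w. The 16-byte alignment is needed because the
   aligned load/store intrinsics fault on unaligned addresses. */
typedef struct {
    _Alignas(16) float f[4];
} vec4;

/* Adds all four components with one packed add instead of four scalar adds. */
static vec4 vec4_add(const vec4 *a, const vec4 *b)
{
    vec4 r;
    __m128 va = _mm_load_ps(a->f);          /* aligned load of a (like movaps) */
    __m128 vb = _mm_load_ps(b->f);          /* aligned load of b */
    _mm_store_ps(r.f, _mm_add_ps(va, vb));  /* packed add (addps), aligned store */
    return r;
}
```

Calling vec4_add(&v1, &v2) gives the same result as the four scalar additions above. If your data cannot be kept 16-byte aligned, _mm_loadu_ps and _mm_storeu_ps are the unaligned variants.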