/prog/ - Dot product

Name: Anonymous 2011-08-02 22:21

>>1
Your routine is both inefficient and incorrect. First, the System V AMD64 ABI passes SSE vector arguments directly in registers XMM0 through XMM7; it does not pass them on the stack. At a minimum it should be:



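sse3_dot:
    mulps xmm0, xmm1      ; xmm0 = a * b, lane by lane
    haddps xmm0, xmm0     ; pairwise sums: (x+y, z+w, x+y, z+w)
    haddps xmm0, xmm0     ; full sum, broadcast to all four lanes
    ret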
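If the double HADDPS looks like magic, here is a rough scalar model of what the three instructions do. This is only an illustration: `vec4`, `model_mulps`, `model_haddps` and `model_sse3_dot` are made-up names, not anything from an SSE header.

```c
#include <stddef.h>

/* Hypothetical 4-lane vector type, used only for this illustration. */
typedef struct { float v[4]; } vec4;

/* Scalar model of MULPS: lane-by-lane product. */
static vec4 model_mulps(vec4 a, vec4 b) {
    vec4 r;
    for (size_t i = 0; i < 4; ++i)
        r.v[i] = a.v[i] * b.v[i];
    return r;
}

/* Scalar model of HADDPS dst, src: adjacent pairs of dst fill the low
   two lanes, adjacent pairs of src fill the high two lanes. */
static vec4 model_haddps(vec4 dst, vec4 src) {
    vec4 r = {{ dst.v[0] + dst.v[1], dst.v[2] + dst.v[3],
                src.v[0] + src.v[1], src.v[2] + src.v[3] }};
    return r;
}

/* Same instruction sequence as sse3_dot, expressed with the models. */
static vec4 model_sse3_dot(vec4 a, vec4 b) {
    vec4 t = model_mulps(a, b);
    t = model_haddps(t, t);     /* (x+y, z+w, x+y, z+w) */
    return model_haddps(t, t);  /* x+y+z+w in every lane */
}
```

After the second horizontal add every lane holds the same sum, so the caller can read the result from whichever lane is convenient.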

However, SSE4.1 adds the DPPS instruction for computing dot products:



sse41_dot:
    dpps xmm0, xmm1, 255  ; 0xFF: multiply all four lanes, broadcast the sum
    ret
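The immediate operand is the part of DPPS people get wrong, so here is a scalar sketch of how it is interpreted: the high four bits pick which lane products go into the sum, the low four bits pick which result lanes receive it (the rest are zeroed). `vec4` and `model_dpps` are hypothetical names for this illustration, and the real instruction may add the products in a different order, so rounding can differ slightly.

```c
#include <stddef.h>

/* Hypothetical 4-lane vector type, used only for this illustration. */
typedef struct { float v[4]; } vec4;

/* Scalar model of DPPS with an 8-bit immediate. */
static vec4 model_dpps(vec4 a, vec4 b, unsigned imm8) {
    float sum = 0.0f;
    vec4 r;

    /* Bits 4..7: include the product of lane i in the sum. */
    for (size_t i = 0; i < 4; ++i)
        if (imm8 & (0x10u << i))
            sum += a.v[i] * b.v[i];

    /* Bits 0..3: write the sum to lane i, otherwise write zero. */
    for (size_t i = 0; i < 4; ++i)
        r.v[i] = (imm8 & (1u << i)) ? sum : 0.0f;

    return r;
}
```

With 0xFF all four products are summed and the sum lands in every lane; with 0x71 only x, y and z are summed and only lane 0 receives the result.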

But your function is also badly designed: it incurs a lot of call overhead for such a simple operation. When you take the dot product of two vectors, you are usually taking the dot product of many vectors. Your math library should either make liberal use of inline functions wrapping SSE intrinsics, or be designed to process large sets of vectors in both AoS (array of structures) and SoA (structure of arrays) layouts, not single vectors one at a time. Otherwise your code ends up spending most of its time branching into functions and setting up and tearing down call frames instead of doing actual meaningful work.
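To make the SoA point concrete, here is a minimal sketch of a batch routine. The layout (`vec3_soa`), the function name `dot3_soa` and the `PROG_HAVE_SSE` macro are all assumptions made up for this example. With one array per component, four dot products fall out of three multiplies and two adds, with no horizontal adds or masks at all.

```c
#include <stddef.h>

#if defined(__SSE__) || defined(_M_X64)
#include <xmmintrin.h>  /* SSE1: _mm_loadu_ps, _mm_mul_ps, _mm_add_ps, _mm_storeu_ps */
#define PROG_HAVE_SSE 1 /* hypothetical feature macro for this sketch */
#endif

/* Hypothetical SoA layout: one array per component. */
typedef struct {
    const float *x, *y, *z;
} vec3_soa;

/* out[i] = dot(a[i], b[i]) for every i in [0, n). */
static void dot3_soa(vec3_soa a, vec3_soa b, float *out, size_t n) {
    size_t i = 0;
#ifdef PROG_HAVE_SSE
    /* Four dot products per iteration; unaligned loads keep the sketch simple. */
    for (; i + 4 <= n; i += 4) {
        __m128 acc = _mm_mul_ps(_mm_loadu_ps(a.x + i), _mm_loadu_ps(b.x + i));
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(a.y + i), _mm_loadu_ps(b.y + i)));
        acc = _mm_add_ps(acc, _mm_mul_ps(_mm_loadu_ps(a.z + i), _mm_loadu_ps(b.z + i)));
        _mm_storeu_ps(out + i, acc);
    }
#endif
    /* Scalar tail, and the whole loop when SSE is unavailable. */
    for (; i < n; ++i)
        out[i] = a.x[i] * b.x[i] + a.y[i] * b.y[i] + a.z[i] * b.z[i];
}
```

One call now covers the whole array, so the call overhead is paid once per batch instead of once per vector.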

Therefore, your code should probably look more like this:



// prog/vector_sse.h

#ifndef PROG_VECTOR_SSE_H
#define PROG_VECTOR_SSE_H

// <mmintrin.h> only declares MMX; <immintrin.h> pulls in the SSE headers,
// including SSE3 (_mm_hadd_ps) and SSE4.1 (_mm_dp_ps).
#include <immintrin.h>

typedef __m128 float4;

extern float4 float4_mask_xy;  // defined as { 0xFFFFFFFF, 0xFFFFFFFF, 0, 0 }
extern float4 float4_mask_xyz; // defined as { 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0 }

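#if defined(__SSE4_1__)

// DPPS immediate: the high nibble selects which lanes are multiplied and
// summed; the low nibble selects which result lanes receive the sum. A low
// nibble of 0xF broadcasts it to all four lanes, matching the SSE3 versions.
static inline float4 dot2(float4 a, float4 b) {
    return _mm_dp_ps(a, b, 0x3F);
}

static inline float4 dot3(float4 a, float4 b) {
    return _mm_dp_ps(a, b, 0x7F);
}

static inline float4 dot4(float4 a, float4 b) {
    return _mm_dp_ps(a, b, 0xFF);
}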



#elif defined(__SSE3__)

static inline float4 dot2(float4 a, float4 b) {
    float4 temp = _mm_mul_ps(a, b);
    temp = _mm_and_ps(temp, float4_mask_xy);  // zero the z and w products
    temp = _mm_hadd_ps(temp, temp);
    return _mm_hadd_ps(temp, temp);           // sum in all four lanes
}

static inline float4 dot3(float4 a, float4 b) {
    float4 temp = _mm_mul_ps(a, b);
    temp = _mm_and_ps(temp, float4_mask_xyz); // zero the w product
    temp = _mm_hadd_ps(temp, temp);
    return _mm_hadd_ps(temp, temp);           // sum in all four lanes
}

static inline float4 dot4(float4 a, float4 b) {
    float4 temp = _mm_mul_ps(a, b);
    temp = _mm_hadd_ps(temp, temp);
    return _mm_hadd_ps(temp, temp);           // sum in all four lanes
}



#else
#error SSE1/SSE2 version left as an exercise for the reader!
#endif

#endif // PROG_VECTOR_SSE_H

With a simple, no-overhead abstraction layer, it's now very easy to write prog/vector_altivec.h and prog/vector_neon.h headers for Power and ARM support that have the same interface, and make your code portable.
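As a rough idea of what "the same interface" means in practice, here is a sketch of a plain C fallback backend. The file name `prog/vector_scalar.h`, the struct-based `float4` and the `float4_splat` helper are all hypothetical and not part of the header above; the only assumption carried over is that `dot2`, `dot3` and `dot4` return the sum in every lane, as the SSE3 versions do.

```c
/* prog/vector_scalar.h (hypothetical fallback backend for illustration) */

#ifndef PROG_VECTOR_SCALAR_H
#define PROG_VECTOR_SCALAR_H

/* Four floats standing in for a hardware vector register. */
typedef struct { float v[4]; } float4;

/* Hypothetical helper: copy one scalar into all four lanes. */
static inline float4 float4_splat(float s) {
    float4 r = {{ s, s, s, s }};
    return r;
}

static inline float4 dot2(float4 a, float4 b) {
    return float4_splat(a.v[0] * b.v[0] + a.v[1] * b.v[1]);
}

static inline float4 dot3(float4 a, float4 b) {
    return float4_splat(a.v[0] * b.v[0] + a.v[1] * b.v[1] + a.v[2] * b.v[2]);
}

static inline float4 dot4(float4 a, float4 b) {
    return float4_splat(a.v[0] * b.v[0] + a.v[1] * b.v[1] +
                        a.v[2] * b.v[2] + a.v[3] * b.v[3]);
}

#endif /* PROG_VECTOR_SCALAR_H */
```

Code written against `float4`, `dot2`, `dot3` and `dot4` does not care which backend header it was compiled with, which is the whole point of keeping the layer this thin.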
