Dot product

Name: Anonymous 2011-08-02 14:18

Single precision floating point.
System V AMD64 ABI convention.


sse3_dot:
    movaps xmm0, [rdi]
    mulps xmm0, [rsi]
    haddps xmm0, xmm0
    haddps xmm0, xmm0
    ret

Name: Anonymous 2011-08-02 22:21

>>1
Your routine is inefficient and not correct. Firstly, the System V AMD64 ABI passes the first eight SSE vector arguments to functions directly in registers XMM0 through XMM7 and returns a vector result in XMM0; it does not pass them through memory. At a minimum it should be:

sse3_dot:
    mulps xmm0, xmm1
    haddps xmm0, xmm0
    haddps xmm0, xmm0
    ret


However, SSE 4.1 adds the DPPS instruction for performing dot-products:

sse41_dot:
    dpps xmm0, xmm1, 255
    ret
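
The immediate operand is what makes DPPS flexible, so here is a rough scalar model of it in C. This is only a sketch for reasoning about masks: vec4_model and dpps_model are made-up names, not part of any real header. The high nibble of the immediate selects which lane products go into the sum, and the low nibble selects which destination lanes receive that sum (the others are zeroed).

```c
#include <stdint.h>

/* Four packed single-precision lanes, standing in for an XMM register. */
typedef struct { float lane[4]; } vec4_model;

/* Scalar model of DPPS / _mm_dp_ps (hypothetical helper, illustration only).
 * imm8 bits 4-7: which lane products take part in the sum.
 * imm8 bits 0-3: which destination lanes receive the sum; the rest become 0. */
static vec4_model dpps_model(vec4_model a, vec4_model b, uint8_t imm8)
{
    float p[4];
    for (int i = 0; i < 4; ++i) {
        /* Unselected products are replaced by +0.0 before summing. */
        p[i] = (imm8 & (0x10u << i)) ? a.lane[i] * b.lane[i] : 0.0f;
    }

    /* The instruction adds the products pairwise, not left to right. */
    float sum = (p[0] + p[1]) + (p[2] + p[3]);

    vec4_model r;
    for (int i = 0; i < 4; ++i) {
        r.lane[i] = (imm8 & (1u << i)) ? sum : 0.0f;
    }
    return r;
}
```

So 255 (0xFF) multiplies all four lanes and writes the sum to every lane, while something like 0x71 would sum only x, y and z and leave the result in lane 0 alone.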


But your function is just bad design, as it has a lot of call overhead for such a simple operation. Generally, where you're taking the dot product of two vectors, you're also taking the dot product of lots of vectors. Your math library should liberally use inline functions wrapping SSE intrinsics, or should be designed to process large sets of vectors in both AoS and SoA formats, not single vectors one at a time. Otherwise your code just ends up spending most of its time branching into functions and setting up and tearing down call frames instead of doing actual meaningful work.
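
To make the batch idea concrete, here is a minimal sketch of the SoA case. The names vec4_soa and dot4_batch_soa are hypothetical, and the layout (one array per component) is an assumption about how you would store the data. With SoA, four dot products fall out of plain vertical multiplies and adds, so no horizontal add or DPPS is needed at all.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

/* Hypothetical SoA batch: each component lives in its own array. */
typedef struct {
    const float *x, *y, *z, *w;
} vec4_soa;

/* Computes out[i] = dot(a[i], b[i]) for every i in [0, count). */
static void dot4_batch_soa(vec4_soa a, vec4_soa b, float *out, size_t count)
{
    size_t i = 0;
#if defined(__SSE__)
    /* Four dot products per iteration using only vertical operations. */
    for (; i + 4 <= count; i += 4) {
        __m128 sum = _mm_mul_ps(_mm_loadu_ps(a.x + i), _mm_loadu_ps(b.x + i));
        sum = _mm_add_ps(sum, _mm_mul_ps(_mm_loadu_ps(a.y + i), _mm_loadu_ps(b.y + i)));
        sum = _mm_add_ps(sum, _mm_mul_ps(_mm_loadu_ps(a.z + i), _mm_loadu_ps(b.z + i)));
        sum = _mm_add_ps(sum, _mm_mul_ps(_mm_loadu_ps(a.w + i), _mm_loadu_ps(b.w + i)));
        _mm_storeu_ps(out + i, sum);
    }
#endif
    /* Scalar tail; also the whole loop on targets without SSE. */
    for (; i < count; ++i) {
        out[i] = a.x[i] * b.x[i] + a.y[i] * b.y[i]
               + a.z[i] * b.z[i] + a.w[i] * b.w[i];
    }
}
```

Unaligned loads are used here so the sketch works on any arrays. If you can guarantee 16-byte alignment, _mm_load_ps and _mm_store_ps would be the usual substitution.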

Therefore, your code should probably look more like this:

// prog/vector_sse.h

#ifndef PROG_VECTOR_SSE_H
#define PROG_VECTOR_SSE_H

#include <immintrin.h> // umbrella header: __m128 plus every SSE level the compiler flags enable

typedef __m128 float4;

extern float4 float4_mask_xy;  // lane bit patterns (not float values): { 0xFFFFFFFF, 0xFFFFFFFF, 0, 0 }
extern float4 float4_mask_xyz; // lane bit patterns (not float values): { 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0 }

#if defined(__SSE4_1__)

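// imm8 for _mm_dp_ps: the high nibble picks the lanes to multiply and sum,
// the low nibble picks the lanes that receive the sum. 0xF in the low nibble
// broadcasts the result to every lane, as the SSE3 versions below do.
static inline float4 dot2(float4 a, float4 b) {
    return _mm_dp_ps(a, b, 0x3F);
}

static inline float4 dot3(float4 a, float4 b) {
    return _mm_dp_ps(a, b, 0x7F);
}

static inline float4 dot4(float4 a, float4 b) {
    return _mm_dp_ps(a, b, 0xFF);
}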

#elif defined(__SSE3__)

static inline float4 dot2(float4 a, float4 b) {
    float4 temp = _mm_mul_ps(a, b);
    temp = _mm_and_ps(temp, float4_mask_xy);  // zero the z and w products
    temp = _mm_hadd_ps(temp, temp);
    return _mm_hadd_ps(temp, temp);           // sum ends up in all four lanes
}

static inline float4 dot3(float4 a, float4 b) {
    float4 temp = _mm_mul_ps(a, b);
    temp = _mm_and_ps(temp, float4_mask_xyz); // zero the w product
    temp = _mm_hadd_ps(temp, temp);
    return _mm_hadd_ps(temp, temp);
}

static inline float4 dot4(float4 a, float4 b) {
    float4 temp = _mm_mul_ps(a, b);
    temp = _mm_hadd_ps(temp, temp);
    return _mm_hadd_ps(temp, temp);
}

#else
#error SSE1/SSE2 version left as an exercise for the reader!
#endif

#endif
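
For what it's worth, the #else branch left as an exercise could be filled in along these lines. This is a sketch that assumes only SSE1 and repeats the float4 typedef so it compiles on its own. Two shuffle-and-add steps stand in for the two horizontal adds, and the sum again lands in every lane.

```c
#include <xmmintrin.h> /* SSE1 only */

typedef __m128 float4;

/* dot4 without HADDPS or DPPS: two shuffle + add steps. */
static inline float4 dot4(float4 a, float4 b) {
    float4 prod = _mm_mul_ps(a, b);  /* (x, y, z, w) products */

    /* Swap neighbouring lanes: (y, x, w, z). */
    float4 swap = _mm_shuffle_ps(prod, prod, _MM_SHUFFLE(2, 3, 0, 1));
    float4 sums = _mm_add_ps(prod, swap);  /* (x+y, x+y, z+w, z+w) */

    /* Swap the two halves: (z+w, z+w, x+y, x+y). */
    swap = _mm_shuffle_ps(sums, sums, _MM_SHUFFLE(1, 0, 3, 2));
    return _mm_add_ps(sums, swap);  /* full sum in all four lanes */
}
```

dot2 and dot3 would presumably follow the same pattern as the SSE3 versions: mask the product with float4_mask_xy or float4_mask_xyz first, then do the same two shuffle-and-add steps.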


With a simple, no-overhead abstraction layer, it's now very easy to write prog/vector_altivec.h and prog/vector_neon.h headers for Power and ARM support that have the same interface, and make your code portable.
