
Dot product

Name: Anonymous 2011-08-02 14:18

Single precision floating point.
System V AMD64 ABI convention.


sse_dot:
    movaps xmm0, [rdi]
    mulps xmm0, [rsi]
    haddps xmm0, xmm0
    haddps xmm0, xmm0
    ret

Name: Anonymous 2011-08-02 14:29

(define (dot-product v w)
  (foldr + 0 (map * v w)))

Name: Anonymous 2011-08-02 14:34

>>2
That's nice, but when the GC has to collect v and w 5*10^9 times, the program freezes.

Name: Anonymous 2011-08-02 15:00

I wish more code were platform dependent.

Name: Anonymous 2011-08-02 15:05

>>3
With a decent JIT, any usage of >>2 compiles to >>1.  0/10

Name: Anonymous 2011-08-02 15:46

>>5
That is just false.

Name: Anonymous 2011-08-02 15:50

>>6
Prove it, faggot.

Name: not >>5 2011-08-02 15:52

>>6
Why not?

Name: Anonymous 2011-08-02 15:54

>>8
The proper question to ask is ``Why (is it false)?''.

Name: Anonymous 2011-08-02 16:02

>>9
My question will never get a proper answer anyway because >>6 is false.

Name: Anonymous 2011-08-02 17:12

>>3


(define (dot-product v w)
  (letrec [(helper (lambda (v w sum)
                     (if (or (null? v) (null? w))
                       sum
                       (helper (cdr v)
                               (cdr w)
                               (+ (* (car v) (car w)) sum)))))]
    (helper v w 0)))


Name: Anonymous 2011-08-02 17:37

PIG DISGUSTING!

Name: Anonymous 2011-08-02 17:56

>>11
(define (dot-product v w)
 (foldl (λ (x y r) (+ (* x y) r)) 0 v w))

Name: Anonymous 2011-08-02 18:44

>>13
>9000 assembly instructions (and that's before counting the heap memory lookups).

>>1
~10 instructions.

Name: Anonymous 2011-08-02 19:27

>>14
lol over nien tosand myright? xD

Name: Anonymous 2011-08-02 19:42

def dot_product(v, w):
    return sum(map(lambda p: p[0] * p[1], zip(v, w)))

Name: Anonymous 2011-08-02 21:39

>>14
compilers are smarter than that.

Name: Anonymous 2011-08-02 21:45

>>17
I really doubt that. Especially with the heap lookups, which take hundreds of assembly instructions if the data is spread out (which it is in Lisp).

Name: Anonymous 2011-08-02 22:21

>>1
Your routine is inefficient and incorrect. Firstly, the System V AMD64 ABI passes SSE vector values to functions directly in the XMM0 through XMM7 registers; it does not pass them via the stack. At a minimum it should be:

sse3_dot:
    mulps xmm0, xmm1
    haddps xmm0, xmm0
    haddps xmm0, xmm0
    ret


However, SSE 4.1 adds the DPPS instruction for performing dot-products:

sse41_dot:
    dpps xmm0, xmm1, 255
    ret


But your function is just bad design, as it has a lot of call overhead for such a simple operation. Generally, where you're taking the dot product of two vectors, you're also taking the dot products of lots of vectors. Your math library should liberally use inline functions wrapping SSE intrinsics, or should be designed to process large sets of both AoS- and SoA-formatted vectors rather than single vectors one at a time. Otherwise your code just ends up spending most of its time branching into functions and setting up and tearing down call frames instead of doing actual meaningful work.

Therefore, your code should probably look more like this:

// prog/vector_sse.h

#ifndef PROG_VECTOR_SSE_H
#define PROG_VECTOR_SSE_H

#include <immintrin.h> /* SSE through SSE4.1 intrinsics; mmintrin.h is MMX only */

typedef __m128 float4;

extern float4 float4_mask_xy; // defined as { 0xFFFFFFFF, 0xFFFFFFFF, 0, 0 }
extern float4 float4_mask_xyz; // defined as { 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0 }

#if defined(__SSE4_1__)

static inline float4 dot2(float4 a, float4 b) {
    return _mm_dp_ps(a, b, 0x33);
}

static inline float4 dot3(float4 a, float4 b) {
    return _mm_dp_ps(a, b, 0x77);
}

static inline float4 dot4(float4 a, float4 b) {
    return _mm_dp_ps(a, b, 0xFF);
}

#elif defined(__SSE3__)

static inline float4 dot2(float4 a, float4 b) {
    float4 temp = _mm_mul_ps(a, b);
    temp = _mm_and_ps(temp, float4_mask_xy);
    temp = _mm_hadd_ps(temp, temp);
    return _mm_hadd_ps(temp, temp);
}

static inline float4 dot3(float4 a, float4 b) {
    float4 temp = _mm_mul_ps(a, b);
    temp = _mm_and_ps(temp, float4_mask_xyz);
    temp = _mm_hadd_ps(temp, temp);
    return _mm_hadd_ps(temp, temp);
}

static inline float4 dot4(float4 a, float4 b) {
    float4 temp = _mm_mul_ps(a, b);
    temp = _mm_hadd_ps(temp, temp);
    return _mm_hadd_ps(temp, temp);
}

#else
#error SSE1/SSE2 version left as an exercise for the reader!
#endif

#endif


With a simple, no-overhead abstraction layer, it's now very easy to write prog/vector_altivec.h and prog/vector_neon.h headers with the same interface for Power and ARM support, making your code portable.

Name: Anonymous 2011-08-02 22:33

Also, to the ``lithp faggots'' in this thread, does your Lisp compiler generate SSE SIMD instructions out of the box from regular code? I guarantee you that it does not.

Instead, you have to hack it in, and write your own XMM register allocator.

http://www.pvk.ca/Blog/Lisp/hacking_SSE_intrinsics-part_1.html

Name: Anonymous 2011-08-02 22:51

>>20
This. And then you need to write your own versions of operator *, operator +, map and foldr that are SSE-aware and can detect when the operands are SSE vectors.

Also, to the FIOC faggot >>16, do you honestly think that CPython, IronPython, or any other Python implementation will generate SSE instructions for those functions? They do not. Instead you will end up with a bunch of shitty FIOC table lookups to infer types at runtime, and scalar code hidden behind multiple layers of function invocations.

Fucking faggots the lot of you.

Name: 1 2011-08-02 23:47

>>19
Interesting. I didn't use my version in any real code because of the function call overhead but your C inline version seems to fix that.

Well, I also didn't know how the System V AMD64 ABI handles SSE vectors... I just passed the vectors as pointers to float. Inefficient.

Thanks for the explanation.

Name: Anonymous 2011-08-03 2:10

>>21
implying all python implementations aren't blazing fast already

Name: Anonymous 2011-08-03 11:41

>>23
blazing fast
in reality it's 30-75x slower than C/C++ single-threaded, and 100-200x slower than C/C++ with 4-core multi-threading

http://shootout.alioth.debian.org/u64q/which-programming-languages-are-fastest.php

Name: Anonymous 2011-08-03 11:53

>>23-24
To /g/, both of you.

>>24
If you believe in benchmarks, I'm sorry for you. Python implementations are slow anyway. YHBT and IHBTBY.

Name: Anonymous 2011-08-03 12:42

Fuck yeah. I'm going to rewrite my ray tracer's dot product code following >>19's instructions. I will save many seconds and grow my virtual penis many millimeters!

Name: Anonymous 2011-08-03 13:49

>>20,21,24
That's why you interface with C code. Duh.
Also, Python really is slow as shit going uphill. But that is what you pay for when you use an interpreter.

Name: Anonymous 2011-08-03 13:49

>>26
You should see my 8-way SoA cross-product using the newer 256-bit AVX extensions for Intel Sandy/Ivy Bridge and AMD Bulldozer. It'll put at least another inch on top of your current length and add a bit to your girth too. It's like jelqing on steroids!

Name: Anonymous 2011-08-03 13:53

>>26
Don't forget to provide us benchmarks.

Name: Anonymous 2011-08-03 14:48

>>21
Do you honestly think you can code your implementation in ASM faster than I can code mine in Python? Do you honestly think I can't make it significantly faster by just importing NumPy? Is tens of milliseconds of saved run time really worth hours if not days of your development time?

Enjoy your OCD and unemployment.

Name: Anonymous 2011-08-03 17:41

>>30
what is it with you python users being all high-and-mighty

Name: Anonymous 2011-08-03 20:06

prog is too easy to troll

Name: Anonymous 2011-08-03 20:39

>>32
spoilers, ``please''!!

Name: >>32 2011-08-03 20:41

>>33
What did I tell you?  Real easy.

Name: Anonymous 2011-08-03 20:45

progn is too easy to evaluate forms, in the order in which they are given.

Name: Anonymous 2011-08-03 21:28

Name: Anonymous 2011-08-03 21:37

faggot language detected

Name: Anonymous 2011-08-03 23:02

>>37
Anyone who thinks Lisp sucks obviously cannot comprehend it. You must not have a primitive mind in order to study it.

Interestingly enough, the same could be said for C++.

Name: Anonymous 2011-08-03 23:38

>>38
Haha, and you don't even know Lisp.

C++ is actually a very powerful language, and it can be used to produce very efficient programs.

Every Lisp program I tried is slow as hell... wait, I haven't used any, since they all suck.

Name: Anonymous 2011-08-03 23:43

>>39
>C++ is actually a very powerful language, and it can be used to produce very efficient programs.
read http://yosefk.com/c++fqa/defective.html
