
Dot product

Name: Anonymous 2011-08-02 14:18

Single precision floating point.
System V AMD64 ABI convention.


sse_dot:
    movaps xmm0, [rdi]
    mulps xmm0, [rsi]
    haddps xmm0, xmm0
    haddps xmm0, xmm0
    ret

Name: Anonymous 2011-08-02 14:29

(define (dot-product v w)
  (foldr + 0 (map * v w)))

Name: Anonymous 2011-08-02 14:34

>>2
That's nice, but when the GC has to collect v and w 5*10^9 times the program freezes

Name: Anonymous 2011-08-02 15:00

I wish more code were platform dependent.

Name: Anonymous 2011-08-02 15:05

>>3
With a decent JIT, any usage of >>2 compiles to >>1.  0/10

Name: Anonymous 2011-08-02 15:46

>>5
That is just false.

Name: Anonymous 2011-08-02 15:50

>>6
Prove it, faggot.

Name: not >>5 2011-08-02 15:52

>>6
Why not?

Name: Anonymous 2011-08-02 15:54

>>8
The proper question to ask is ``Why (is it false)?''.

Name: Anonymous 2011-08-02 16:02

>>9
My question will never get a proper answer anyway because >>6 is false.

Name: Anonymous 2011-08-02 17:12

>>3


(define (dot-product v w)
  (letrec [(helper (lambda (v w sum)
                     (if (or (null? v) (null? w))
                       sum
                       (helper (cdr v)
                               (cdr w)
                               (+ (* (car v) (car w)) sum)))))]
    (helper v w 0)))

                                  
                      

Name: Anonymous 2011-08-02 17:37

PIG DISGUSTING!

Name: Anonymous 2011-08-02 17:56

>>11
(define (dot-product v w)
 (foldl (λ (x y r) (+ (* x y) r)) 0 v w))

Name: Anonymous 2011-08-02 18:44

>>13
>9000 (would have to count heap memory lookup too) assembly instructions.

>>1
~10 instructions.

Name: Anonymous 2011-08-02 19:27

>>14
lol over nien tosand myright? xD

Name: Anonymous 2011-08-02 19:42

def dot_product(v, w):
    return sum(map(lambda p: p[0] * p[1], zip(v, w)))

Name: Anonymous 2011-08-02 21:39

>>14
compilers are smarter than that.

Name: Anonymous 2011-08-02 21:45

>>17
I really doubt that. Especially with the heap lookup thing, which takes idk 100's of assembly instructions if they are spread out (which they are in lisp).

Name: Anonymous 2011-08-02 22:21

>>1
Your routine is inefficient and not correct. First, the System V AMD64 ABI passes SSE vector arguments to functions directly in registers XMM0 through XMM7; it does not pass them via the stack. At a minimum it should be:

sse3_dot:
    mulps xmm0, xmm1
    haddps xmm0, xmm0
    haddps xmm0, xmm0
    ret


However, SSE 4.1 adds the DPPS instruction for performing dot-products:

sse41_dot:
    dpps xmm0, xmm1, 255
    ret


But your function is just bad design as it has a lot of call overhead for such a simple operation. Generally, where you're taking the dot product of two vectors, you're also taking the dot product of lots of vectors. Your math library should liberally use inline functions wrapping SSE intrinsics, or should be designed to process large sets of both AoS and SoA formatted vectors, not single vectors one at a time. Otherwise your code just ends up spending most of its time branching into functions and setting up and tearing down call frames instead of doing actual meaningful work.

Therefore, your code should probably look more like this:

// prog/vector_sse.h

#ifndef PROG_VECTOR_SSE_H
#define PROG_VECTOR_SSE_H

#include <immintrin.h>

typedef __m128 float4;

extern float4 float4_mask_xy; // defined as { 0xFFFFFFFF, 0xFFFFFFFF, 0, 0 }
extern float4 float4_mask_xyz; // defined as { 0xFFFFFFFF, 0xFFFFFFFF, 0xFFFFFFFF, 0 }

#if defined(__SSE4_1__)

static inline float4 dot2(float4 a, float4 b) {
    return _mm_dp_ps(a, b, 0x33);
}

static inline float4 dot3(float4 a, float4 b) {
    return _mm_dp_ps(a, b, 0x77);
}

static inline float4 dot4(float4 a, float4 b) {
    return _mm_dp_ps(a, b, 0xFF);
}

#elif defined(__SSE3__)

static inline float4 dot2(float4 a, float4 b) {
    float4 temp = _mm_mul_ps(a, b);
    temp = _mm_and_ps(temp, float4_mask_xy);
    temp = _mm_hadd_ps(temp, temp);
    return _mm_hadd_ps(temp, temp);
}

static inline float4 dot3(float4 a, float4 b) {
    float4 temp = _mm_mul_ps(a, b);
    temp = _mm_and_ps(temp, float4_mask_xyz);
    temp = _mm_hadd_ps(temp, temp);
    return _mm_hadd_ps(temp, temp);
}

static inline float4 dot4(float4 a, float4 b) {
    float4 temp = _mm_mul_ps(a, b);
    temp = _mm_hadd_ps(temp, temp);
    return _mm_hadd_ps(temp, temp);
}

#else
#error SSE1/SSE2 version left as an exercise for the reader!
#endif

#endif


With a simple, no-overhead abstraction layer, it's now very easy to write prog/vector_altivec.h and prog/vector_neon.h headers for Power and ARM support that have the same interface, and make your code portable.

Name: Anonymous 2011-08-02 22:33

Also, to the ``lithp faggots'' in this thread, does your Lisp compiler generate SSE SIMD instructions out of the box from regular code? I guarantee you that it does not.

Instead, you have to hack it in, and write your own XMM register allocator.

http://www.pvk.ca/Blog/Lisp/hacking_SSE_intrinsics-part_1.html

Name: Anonymous 2011-08-02 22:51

>>20
This. And then you need to write your own versions of operator *, operator +, map and foldr that are SSE-aware and can detect when the operands are SSE vectors.

Also, to the FIOC faggot >>16, do you honestly think that CPython, IronPython, or any other Python will generate SSE instructions for those functions? They do not. Instead you will end up with a bunch of shitty FIOC table lookups to infer types at runtime, and scalar code hidden behind multiple layers of function invocations.

Fucking faggots the lot of you.

Name: 1 2011-08-02 23:47

>>19
Interesting. I didn't use my version in any real code because of the function call overhead but your C inline version seems to fix that.

Well, I also didn't know how the System V AMD64 ABI handles SSE vectors... I just passed the vectors as pointers to float. Inefficient.

Thanks for the explanation.

Name: Anonymous 2011-08-03 2:10

>>21
>implying all python implementations aren't blazing fast already

Name: Anonymous 2011-08-03 11:41

>>23
>blazing fast
In reality: 30-75x slower than C/C++ single-threaded, and 100-200x slower with 4-core multi-threading.

http://shootout.alioth.debian.org/u64q/which-programming-languages-are-fastest.php

Name: Anonymous 2011-08-03 11:53

>>23-24
To /g/, both of you.

>>24
If you believe in benchmarks, I'm sorry for you. Python implementations are slow anyway. YHBT and IHBTBY.

Name: Anonymous 2011-08-03 12:42

Fuck yeah. I'm going to rewrite my ray tracer's dot product code following >>19's instructions. I will save many seconds and grow my virtual penis many millimeters!

Name: Anonymous 2011-08-03 13:49

>>20,21,24
That's why you interface with C code. Duh.
Also, Python really is slow as shit going uphill. But that is what you pay for when you use an interpreter.

Name: Anonymous 2011-08-03 13:49

>>26
You should see my 8-way SoA cross-product using the newer 256-bit AVX extensions for Intel Sandy/Ivy Bridge and AMD Bulldozer. It'll put at least another inch on top of your current length and add a bit to your girth too. It's like jelqing on steroids!

Name: Anonymous 2011-08-03 13:53

>>26
Don't forget to provide us benchmarks.

Name: Anonymous 2011-08-03 14:48

>>21
Do you honestly think you can code your implementation in ASM faster than I can code mine in Python? Do you honestly think I can't make it significantly faster by just importing NumPy? Is tens of milliseconds of saved run time really worth hours if not days of your development time?

Enjoy your OCD and unemployment.

Name: Anonymous 2011-08-03 17:41

>>30
what is it with you python users being all high-and-mighty

Name: Anonymous 2011-08-03 20:06

prog is too easy to troll

Name: Anonymous 2011-08-03 20:39

>>32
spoilers, ``please''!!

Name: >>32 2011-08-03 20:41

>>33
What did I tell you?  Real easy.

Name: Anonymous 2011-08-03 20:45

progn is too easy to evaluate forms, in the order in which they are given.

Name: Anonymous 2011-08-03 21:28

Name: Anonymous 2011-08-03 21:37

faggot language detected

Name: Anonymous 2011-08-03 23:02

>>37
Anyone who thinks Lisp sucks obviously cannot comprehend it. You must not have a primitive mind in order to study it.

Interestingly enough, the same could be said for C++.

Name: Anonymous 2011-08-03 23:38

>>38
Haha, and you don't even know Lisp.

C++ is actually a very powerful language and it can be used to produce very efficient programs.

Every Lisp program I tried is slow as hell... wait, I haven't used any, since they all suck.

Name: Anonymous 2011-08-03 23:43

>>39
>C++ is actually a very powerful language and it can be used to produce very efficient programs.
read http://yosefk.com/c++fqa/defective.html

Name: Anonymous 2011-08-04 0:21

>>40
A lot of his points aren't true; they're based on ignorance or on old information from before the 1998 standardization, or from before C++0x/C++11.

Name: Anonymous 2011-08-04 1:02

>>41
The newer standard is even uglier and more complex. The original C was already ugly. Standardization made it uglier. C++ set an unbelievable record of ugliness. Now C++0x has jumped even higher on the scale of ugliness. It's like COBOL, which only becomes uglier with time.

Name: Anonymous 2011-08-04 4:51

>>40
>I was born in Moscow, and live in Jerusalem.
JEW ALERT

Name: Anonymous 2011-08-04 5:27

>>43
Even jews hate C/C++. And jews love everything ugly.

Name: Anonymous 2011-08-04 5:37

>>41
The standard doesn't matter until compilers implement it.

Name: Anonymous 2011-08-04 8:33

>>45
Similar bullshit applies to C.

Name: Anonymous 2011-08-04 9:26

>>46
Everyone does C99 except Microsoft, but who cares about them?

Name: Anonymous 2011-08-04 9:40

>>47
I find that a bit odd since Microsoft is like the only company that

a) Helped write the OAuth RFC draft.
b) Made a reasonable effort to implement POSIX threads.

Name: Anonymous 2011-08-04 9:43

>>48
The latter is kind of important because ANSI/ISO (?) has now drafted a series of proposals to make threads standard. I guess POSIX's "we are totally vague how the shit" wasn't cutting it :-(.

Name: Anonymous 2011-08-04 9:44

* "we are totally vague about how the shit should work" *

Name: Anonymous 2011-08-04 9:51

>>49
If you look at the C1x threads proposal, it's a bit simpler than POSIX threads, but it should be more widely available on non-POSIX platforms. It appears to provide procedural alternatives to the C++11/C++0x threading primitives.

Also, since when did Microsoft adopt POSIX threads or provide a POSIX threads layer for Windows? I think you're drinking the toilet water.

Name: Anonymous 2011-08-04 11:37

>>51
>Also, since when did Microsoft adopt POSIX threads or provide a POSIX threads layer for Windows? I think you're drinking the toilet water.

Unless I'm missing it, you really can't implement POSIX threads. The best you can do is provide a reasonable interpretation of it. And as far as I know, Microhoth has been using a (subset?) of POSIX threads as far back as maybe Windows XP?

Name: Anonymous 2011-08-04 12:24

The only sane threading library is ooc.util.concurrent

Name: Anonymous 2011-08-04 15:38

>>53
>Microhoth has been using a (subset?) of POSIX threads as far back as maybe Windows xp?
No, they have their own threads API which is commonly known as Winthreads. It's completely different from POSIX threads.

Does this look like pthread_create(3) to you?

http://msdn.microsoft.com/en-us/library/ms682453%28v=vs.85%29.aspx

Perhaps you're thinking of BSD sockets, which they more or less support through Winsock. They also have User-Mode Scheduled cooperative-threading worker threads for building very fast task schedulers.

Name: Anonymous 2011-08-04 15:42

>>53
He doesn't know about ConcRT, TBB, FastFlow, or heterogeneous parallel compute languages like OpenCL, DirectCompute, or C++AMP.

Name: Anonymous 2011-08-04 15:59

>>55
Go back to /g/, ``please''!!

Name: Anonymous 2011-08-04 17:07

>>54
http://technet.microsoft.com/en-us/library/cc771672.aspx
http://technet.microsoft.com/en-us/library/cc754234.aspx

I bet you didn't know Win32 is just one subsystem running on top of the Windows kernel, it's not the "native" API.

Name: Anonymous 2011-08-04 18:52

>>55
None of those were conceived by a sentient C compiler, also ooc.util.concurrent.gpgpu.

THREAD OVER

Besides everyone knows about those, stop trying to look smarter than you are, it makes you look like a fucking moron.

Name: Anonymous 2011-08-04 19:03

>>54
I thought pthread_create(3) was an implementation of POSIX threads.

Name: Anonymous 2011-08-04 21:07

>>58
you're a huge fag, kill yourself

Name: Anonymous 2011-08-04 21:46

>>57
I fucking asked you if Microsoft offered a POSIX threads layer and you implied "no" by going all out and saying that Microsoft uses POSIX threads internally, as if it were its native threads interface.

Well guess what? This UNIX subsystem for Windows provides a POSIX threads layer, and a UNIX System V layer, on top of the native OS facilities. The native OS API for threading is still NOT POSIX threads. It's Winthreads.

Microsoft does not use POSIX threads natively.

Furthermore, "Win32" has been a deprecated term since around 2003/2004. You now refer to it as simply Windows or "Win" for short. As in, the Windows API or "Win API."

http://msdn.microsoft.com/en-us/library/ee663300%28v=VS.85%29.aspx

Does it say "Win32" or "Win32 API" anywhere? No it does not. Personally, I don't use Windows as my primary OS, I use a free as freedom operating system. But I still have the decency and thoughtfulness to use the correct terminology, even for operating systems I somewhat despise.

>>58
>also ooc.util.concurrent.gpgpu

C++AMP supports targeting GPU, FPGA, and other heterogeneous computing devices as first-class citizens in a unified programming model. For example, this is code that runs on the host CPU and partitions work on a buffer to be carried out on one or more GPU devices across one or more GPU wavefronts:

#include <amp.h>
#include <vector>

using namespace concurrency;

int main(int argc, char** argv) {
    const int buffer_size = 32 * 1024 * 1024;
    std::vector<int> buffer(buffer_size);
    array_view<int, 1> data(buffer_size, buffer);

    parallel_for_each(data.extent, [=](index<1> i) restrict(amp) {
        data[i] = (i[0] * 8) & 0x7FFF;
    });

    data.synchronize();

    // ...

    return 0;
}


Note how I'm not using external source files for the GPU compute kernels. Notice how I don't have to specify how to partition the work, or marshal data between the CPU and GPU(s)--it does it all for me automatically, although if I wish, I can use my own custom partitioner.

I'm looking at the occ source code and I don't see where this magical concurrency library of yours resides. So I can't determine the extent to which occ's gpgpu stuff goes, but I have a feeling it's probably just a wrapper around OpenCL host library APIs.

This looks like a community-driven language without much thought behind the library design; it's all just thrown together in an ad-hoc manner like every other community-driven project. Why should I use this over an ISO/IEC standardized language and library extension with multiple vendor and FOSS community support?

>>59
It is. Did you follow the link? It's not pthread_create(3), it was CreateThread, you swine.

Name: Anonymous 2011-08-04 22:01

>>57
In my heated discourse, I forgot to mention that while the Windows client threading APIs are not exactly the same as the ones used in the kernel, there's almost a one-to-one mapping between the Windows API calls and the kernel system calls.

For example, CreateThread maps to RtlCreateUserThread in ntdll. In effect, they are one and the same.

Name: Anonymous 2011-08-05 5:33

>>60
Wow, are you really buttwrenched over that?
I guess you can't handle getting owned very well.

>>61
It's not occ, it's ooc, and even if you managed to get that right you're looking at the wrong ooc.

Name: Anonymous 2011-08-05 5:52

>>63
I was looking at http://ooc-lang.org/

Name: Anonymous 2011-08-05 8:35

>>64
Not much to see.

Name: Anonymous 2011-08-05 11:58

ooc just seems to be a crappier version of C++

Name: Anonymous 2011-08-05 15:45

>>66
unpossible
