Name: Anonymous 2011-08-13 23:02
Carmack does hope that Sony avoids the Cell architecture all together due to the difficulty in development.
Fuck cell! For the Carmack's ために!
Fuck cell! For the Carmack's ために!
vector<int> values; // initialized with say a million integers
constexpr int value_to_find = 20415001;
structured_task_group group;
combinable<ptrdiff_t> result(PTRDIFF_MAX);
// search for the offset of the first occurrence
parallel_for(values.begin(), values.end(), group, [&](vector<int>::iterator i) {
if (*i == value_to_find) {
ptrdiff_t offset = i - values.begin();
result.value(offset);
group.cancel_work_unit();
}
});
// final reduction step
int final_offset = result.reduce([&] (int x, int y) {
return (x < y) ? x : y;
}parallel_for uses a linear fixed-size partitioner. If your machine has 8 logical cores, and the vector<int> values container has a 220 = 1024 * 1024 elements, then partitioner will schedule and dispatch 8 jobs to the task-scheduler, the first job iterating over the elements [0, 220 / 8) elements, the second job will iterate over elements [220 / 8, 2 * 220 / 8), and so on. Each element may only be touched once, and in fact, each job will terminate as soon as it finds the value it was searching for as we're only interested in the first such element in the entire container. Multiple cores aren't touching the same piece of data, they're touching disjoint parts of it. parallel_for example above still would scale linearly.
mov eax, [esi+8*edi+12345678]
mov ebx, [esi+4*edi+23456780]
mov ecx, [esi+2*edi+34567894]
mov edx, [esi+edi+4567890C]
mov [esi+8*edi+12345678], eax
mov eax, 12h
and eax, [esi+8*edi+11521530]movups all with full addressing modes in the same cycle too, and you can see that this wouldn't need more than compiler (or assembler) level changes (if any) to take advantage of.