>>20
>Great, now you're raping the cache by loading various parts of that array into the other cores.
You don't know how a multi-level cache actually works, do you? Reading data from memory scales linearly: you can read the same cache line on as many physical cores simultaneously as you like and there won't be a single cache miss. It's "write-sharing" that you need to worry about. Writing, or read-modify-writing, to a shared location of memory from multiple cores simultaneously invalidates that cache line mid-flight, which causes a cache miss the next time a different core tries to access it, forcing the memory controller to serialize access. However, writing to a cache line from a single core, while other cores aren't attempting to read or write it, also scales fine. So the trick is to remove write-sharing.
>You're wasting time just to setup the thread pool.
Setting up the task scheduler is only done once, when the program first starts. That's it. There is no additional overhead whenever you dispatch concurrent tasks. And in fact, if you use cooperative user-mode scheduling with your task scheduler (Windows 7 UMS threads, for example), there isn't even any kernel syscall overhead or kernel-mode context switching. (Also, it's not a mere thread pool: each worker thread multiplexes tasks from local lock-free queues.)
>In other words, I don't get why it's a better idea to have multiple cores caress a single piece of data at a time rather than having each core caress its own piece of data.
They do only get their own piece of data.
parallel_for uses a linear fixed-size partitioner. If your machine has 8 logical cores, and the vector<int> values container has 2^20 = 1024 * 1024 elements, then the partitioner will schedule and dispatch 8 jobs to the task scheduler: the first job iterating over the elements [0, 2^20 / 8), the second job iterating over the elements [2^20 / 8, 2 * 2^20 / 8), and so on. Each element may only be touched once, and in fact each job will terminate as soon as it finds the value it was searching for, as we're only interested in the first such element in the entire container. Multiple cores aren't touching the same piece of data; they're touching disjoint parts of it.
So even if reading the same memory from multiple cores didn't scale linearly, the parallel_for example above would still scale linearly.