>>10
Yes. That is the big problem. Part of that problem, however, is that a lot of people are stuck in the old school of thought when it comes to writing scalable, parallel code. When you ask them to make a piece of code threaded, the first thing they think of is to use a mutex or a reader-writer lock to serialize access to that code, then instantiate a few threads to execute it, and maybe use a semaphore or condition variable to signal events or pass messages between threads. Of course that's not going to scale. I get the feeling this is what you're still thinking of.
The reality is that it's actually quite feasible to parallelize a lot of code, or transformations on the underlying data, in ways that don't serialize access to the data. You just need to adapt, think outside of the box and take a look at what other people are doing to solve these problems.
Threading code around the idea of modules or subsystems is old-school, and it doesn't scale. As you mentioned, you'd dedicate an extra core to physics or AI or some other subsystem in a game engine. It doesn't work.
The way around it is to use task-oriented and data-oriented parallelism. Part of that involves getting rid of fine-grained object-orientation. Just fucking stop using OOP; it's what's holding back a lot of code from being made to scale. Suddenly you'll find yourself able to write code that scales up to hundreds of cores, even on x86. And not only that, you'll also find that the code you write is smaller and simpler.
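To make the data-oriented point concrete, here's a minimal sketch using only the standard library (the name parallel_scale and the chunking scheme are my own, not from any particular toolkit): each thread owns a disjoint slice of the data, so there are no locks and no write-sharing at all.

#include <algorithm>
#include <cstddef>
#include <thread>
#include <vector>

// Scale every element in place. Each thread writes only to its own
// disjoint chunk, so no mutex is needed: there is nothing to serialize.
inline void parallel_scale(std::vector<double>& v, double factor,
                           unsigned nthreads = std::thread::hardware_concurrency()) {
    if (nthreads == 0) nthreads = 1;
    std::vector<std::thread> pool;
    std::size_t chunk = (v.size() + nthreads - 1) / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        std::size_t begin = t * chunk;
        if (begin >= v.size()) break;
        std::size_t end = std::min(begin + chunk, v.size());
        pool.emplace_back([&v, factor, begin, end] {
            for (std::size_t i = begin; i < end; ++i)
                v[i] *= factor;  // disjoint ranges: no contention
        });
    }
    for (auto& th : pool) th.join();
}

The point isn't the thread management (a real task scheduler hides that); it's that the transformation is expressed over the data, so partitioning it across cores is trivial.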
For example, with my own task-oriented toolkit that I've been working on for C++11 (formerly C++0x), based on Intel's TBB, searching for the offset of a particular value within a vector using a map-reduce style pattern looks like:
vector<int> values; // initialized with, say, a million integers
constexpr int value_to_find = 20415001;
structured_task_group group;
combinable<ptrdiff_t> result(PTRDIFF_MAX);

// search for the offset of the first occurrence
parallel_for(values.begin(), values.end(), group, [&](vector<int>::iterator i) {
    if (*i == value_to_find) {
        ptrdiff_t offset = i - values.begin();
        result.value(offset);
        group.cancel_work_unit();
    }
});

// final reduction step: the smallest offset wins
ptrdiff_t final_offset = result.reduce([](ptrdiff_t x, ptrdiff_t y) {
    return (x < y) ? x : y;
});
There's no overhead of starting up new threads, because they're already sitting idle waiting for work. The partitioning is done automatically based on the number of hardware cores available (it's possible to write your own partitioner for different use cases to better fit the data). The task scheduler is based on the SLAW adaptive work-stealing scheduler, so it automatically load-balances work units across cores. And because the combinable template container uses built-in TLS mechanisms, there's no write-sharing to cause memory performance issues.
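If you don't have a toolkit like that handy, the same chunk-then-reduce idea can be sketched with nothing but the standard library. This is my own illustrative version (the name parallel_find_offset is made up, std::async stands in for a real task scheduler, and there's no cancellation step): each task keeps its own best offset, playing the role combinable's thread-local slots play, and the final reduction takes the minimum.

#include <algorithm>
#include <cstddef>
#include <future>
#include <limits>
#include <vector>

// Find the offset of the first occurrence of `target`, or
// the maximum ptrdiff_t value if it isn't present.
inline std::ptrdiff_t parallel_find_offset(const std::vector<int>& v, int target,
                                           unsigned ntasks = 4) {
    std::vector<std::future<std::ptrdiff_t>> futures;
    std::size_t chunk = (v.size() + ntasks - 1) / ntasks;
    for (unsigned t = 0; t < ntasks; ++t) {
        std::size_t begin = t * chunk;
        if (begin >= v.size()) break;
        std::size_t end = std::min(begin + chunk, v.size());
        // each task scans its own chunk; no shared writes
        futures.push_back(std::async(std::launch::async,
            [&v, target, begin, end]() -> std::ptrdiff_t {
                for (std::size_t i = begin; i < end; ++i)
                    if (v[i] == target) return static_cast<std::ptrdiff_t>(i);
                return std::numeric_limits<std::ptrdiff_t>::max();
            }));
    }
    std::ptrdiff_t best = std::numeric_limits<std::ptrdiff_t>::max();
    for (auto& f : futures)
        best = std::min(best, f.get()); // final reduction: minimum offset
    return best;
}

It's slower than the scheduled version because tasks can't cancel their siblings once a match is found, but the structure is the same.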
This is just a simple example, but doing things like parallel quick/intro or radix sorts, parallel tree traversals, updating all of your entities in a game engine each frame in parallel, asynchronous file/network I/O, etc. isn't that much more code, maybe a few dozen extra lines.
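For instance, here's roughly what a task-parallel quicksort looks like as a bare standard-library sketch (again my own illustrative code, not from any toolkit; the 1024-element cutoff and depth limit are arbitrary tuning knobs): partition once, then recurse on the two halves as independent tasks.

#include <algorithm>
#include <cstddef>
#include <future>
#include <vector>

// Task-parallel quicksort sketch. Small or deep subranges fall back to
// std::sort so task-spawning overhead stays bounded.
inline void parallel_quicksort(std::vector<int>& v, std::ptrdiff_t lo,
                               std::ptrdiff_t hi, int depth = 3) {
    if (hi - lo < 1024 || depth <= 0) {
        std::sort(v.begin() + lo, v.begin() + hi);
        return;
    }
    int pivot = v[lo + (hi - lo) / 2];
    auto mid = std::partition(v.begin() + lo, v.begin() + hi,
                              [pivot](int x) { return x < pivot; });
    std::ptrdiff_t m = mid - v.begin();
    // the two halves are disjoint, so they can sort concurrently
    auto left = std::async(std::launch::async, [&v, lo, m, depth] {
        parallel_quicksort(v, lo, m, depth - 1);
    });
    parallel_quicksort(v, m, hi, depth - 1);
    left.get();
}

A real work-stealing scheduler would replace the std::async calls and pick the cutoff for you, but the shape of the algorithm doesn't change.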
This is the kind of stuff game developers are starting to use in their engines, and what Lispers and Haskellers have been raving about all these years.