
something some guy wrote somewhere

Name: Anonymous 2007-08-04 0:36 ID:0eNriJPv

I saw this on the front page of fpgacpu.org.

What bothers me about it is that it doesn't explain just how or why doing this with linked lists causes frequent, wasteful cache misses.  I know what cache misses are, but I fail to see the connection to the iterative loop mentioned in the post.  Nor does the author recommend a better way to do it.  Your input, /prog/?

------------------------------------------------------------------

First let's do a gedanken experiment. Take your Windows PC, or your OS X Macintosh, or your Linux box, and freeze it in mid-computation, and save a snapshot of the entire memory image. Maybe you have a 128 MB dump. Now you put on your "software archaeologist" pith helmet, and you spend the next three years of your life poring over the dump, picking through the strata, cataloging the arcane bits of code and data that you uncover. If you do that, I assure you that you will find, amongst other things, hundreds or even thousands of separate, mediocre, linked list and hash table implementations. You will find tens of thousands of loops written in C/C++ that amount to

  for (ListElement* p = head; p; p = p->next) { ... }

That is the dominant paradigm. It stinks. It is holding us back. This little idiom and its brethren, carved in diamond in untold billions of dollars worth of genuinely useful software intellectual property, is the carrot that leads Intel and AMD and the rest to build 100-million and soon one-billion-transistor processors where the actual computing takes place in less than a million of those transistors.

What is the intention behind this code? To do something with a growable collection of values -- search it, map it, collect a subset.

On modern machines, this code stinks. A full cache miss wastes many hundreds of instruction issue slots. Your hundreds of millions of transistors sit around twiddling their thumbs. Until each preceding pointer completes its odyssey...

    (Upon determining that the miscreant pointer value has gone AWOL from on-chip caches, the processor sends out a subpoena compelling its appearance; this writ is telegraphed out to the north bridge and thence to the DRAM; the value, and its collaborators in adjacent cells in the line, then wend their way, from DRAM cells, through sense amps, muxes, drivers, queueing for embarkation at the DRAM D/Q pins, then sailing across the PCB, then by dogsled across the north bridge, then by steamer across the PCB again, finally landing at the processor pins, and then, dashing across the processor die and into the waiting room of the L1 D-cache ...)

... until that happens, the poor processor can't make any significant progress on the next iteration of the computation.

This scenario is so bad and so common that the microprocessor vendors use 80% of their transistor budgets for on-chip caches -- Intel as a glorified SRAM vendor.

Unfortunately it is really hard to make compilers and computer architectures transform this hopelessly serial pointer-following problem into something that can go faster.

Name: Anonymous 2011-05-31 4:02

>>18
std::deque in C++ uses two levels of arrays: a top-level array of pointers that maps to fixed-size blocks, and each block holds N elements.

However, with virtual memory allocators, you can reserve a huge chunk of address space. On 64-bit operating systems, you can reserve, say, 32 GB of address space each for various things, without actually eating up any physical memory. And when you need a new page, you just commit a page of reserved memory at the end of one of your contiguous arrays, and a physical page is mapped to it by the operating system.

Current AMD/Intel x86-64 CPUs use 48 bits for addressing, so that gives you 2^48 bytes -- about 281,475 GB (256 TiB) -- of address space to play with in your process.
