
something some guy wrote somewhere

Name: Anonymous 2007-08-04 0:36 ID:0eNriJPv

I saw this on the front page of fpgacpu.org.

What bothers me about it is that it doesn't explain just how or why doing this with linked lists causes frequent, wasteful cache misses. I know what cache misses are, but I fail to see the connection with the iterative loop mentioned in the post. Nor does the author recommend a better way to do it. Your input, /prog/?

------------------------------------------------------------------

First let's do a gedanken experiment. Take your Windows PC, or your OS X Macintosh, or your Linux box, and freeze it in mid-computation, and save a snapshot of the entire memory image. Maybe you have a 128 MB dump. Now you put on your "software archaeologist" pith helmet, and you spend the next three years of your life poring over the dump, picking through the strata, cataloging the arcane bits of code and data that you uncover. If you do that, I assure you that you will find, amongst other things, hundreds or even thousands of separate, mediocre, linked list and hash table implementations. You will find tens of thousands of loops written in C/C++ that amount to

  for (ListElement* p = head; p; p = p->next) { ... }

That is the dominant paradigm. It stinks. It is holding us back. This little idiom and its brethren, carved in diamond in untold billions of dollars' worth of genuinely useful software intellectual property, are the carrot that leads Intel and AMD and the rest to build 100-million- and soon one-billion-transistor processors in which the actual computing takes place in less than a million of those transistors.

What is the intention behind this code? To do something with a growable collection of values -- search it, map it, collect a subset.
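As a concrete illustration (the `ListElement` struct and these function names are my own invention, not from the post), here is what "search it, map it, collect a subset" typically looks like over a linked list. Note that all three operations reduce to the exact same pointer-chasing loop:

```c
#include <stddef.h>

/* Hypothetical node type -- a stand-in for the thousands of ad-hoc
   list implementations the post describes. */
typedef struct ListElement {
    int value;
    struct ListElement* next;
} ListElement;

/* Search: return the first node matching needle, or NULL. */
ListElement* list_find(ListElement* head, int needle) {
    for (ListElement* p = head; p; p = p->next)
        if (p->value == needle)
            return p;
    return NULL;
}

/* Map: scale every element in place (a concrete stand-in for map). */
void list_scale(ListElement* head, int k) {
    for (ListElement* p = head; p; p = p->next)
        p->value *= k;
}

/* Collect a subset: count elements above a threshold. */
size_t list_count_above(ListElement* head, int threshold) {
    size_t n = 0;
    for (ListElement* p = head; p; p = p->next)
        if (p->value > threshold)
            n++;
    return n;
}
```

Every one of these loops carries a load-to-load dependency: the address of the next node is itself data fetched in the previous iteration, so the processor cannot even begin the load for iteration i+1 until the load for iteration i has completed.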

On modern machines, this code stinks. A full cache miss wastes many hundreds of instruction issue slots. Your hundreds of millions of transistors sit around twiddling their thumbs. Until each preceding pointer completes its odyssey...

    (Upon determining that the miscreant pointer value has gone AWOL from on-chip caches, the processor sends out a subpoena compelling its appearance; this writ is telegraphed out to the north bridge and thence to the DRAM; the value, and its collaborators in adjacent cells in the line, then wend their way, from DRAM cells, through sense amps, muxes, drivers, queueing for embarkation at the DRAM D/Q pins, then sailing across the PCB, then by dogsled across the north bridge, then by steamer across the PCB again, finally landing at the processor pins, and then, dashing across the processor die and into the waiting room of the L1 D-cache ...)

... until that happens, the poor processor can't make any significant progress on the next iteration of the computation.

This scenario is so bad and so common that the microprocessor vendors use 80% of their transistor budgets for on-chip caches -- Intel as a glorified SRAM vendor.

Unfortunately it is really hard for compilers and computer architectures to transform this hopelessly serial pointer-following problem into something that can go faster.

Name: Anonymous 2007-08-04 3:26 ID:VYUfPxTB

I don't understand. He found linked lists inside memory? That doesn't even make any sense at all...
