
something some guy wrote somewhere

Name: Anonymous 2007-08-04 0:36 ID:0eNriJPv

I saw this on the front page of fpgacpu.org.

What bothers me about it is, it doesn't explain just how/why exactly doing this with linked lists causes frequent, wasteful cache misses.  I know what cache misses are, but I fail to see the connection with this iterative loop mentioned in the post.  Nor does the author recommend a better way to do it.  Your input, /prog/?

------------------------------------------------------------------

First let's do a gedanken experiment. Take your Windows PC, or your OS X Macintosh, or your Linux box, and freeze it in mid-computation, and save a snapshot of the entire memory image. Maybe you have a 128 MB dump. Now you put on your "software archaeologist" pith helmet, and you spend the next three years of your life poring over the dump, picking through the strata, cataloging the arcane bits of code and data that you uncover. If you do that, I assure you that you will find, amongst other things, hundreds or even thousands of separate, mediocre, linked list and hash table implementations. You will find tens of thousands of loops written in C/C++ that amount to

  for (ListElement* p = head; p; p = p->next) { ... }

That is the dominant paradigm. It stinks. It is holding us back. This little idiom and its brethren, carved in diamond in untold billions of dollars worth of genuinely useful software intellectual property, are the carrot that leads Intel and AMD and the rest to build 100 million and soon one billion transistor processors where the actual computing takes place in less than a million of those transistors.

What is the intention behind this code? To do something with a growable collection of values -- search it, map it, collect a subset.

On modern machines, this code stinks. A full cache miss wastes many hundreds of instruction issue slots. Your hundreds of millions of transistors sit around twiddling their thumbs. Until each preceding pointer completes its odyssey...

    (Upon determining that the miscreant pointer value has gone AWOL from on-chip caches, the processor sends out a subpoena compelling its appearance; this writ is telegraphed out to the north bridge and thence to the DRAM; the value, and its collaborators in adjacent cells in the line, then wend their way, from DRAM cells, through sense amps, muxes, drivers, queueing for embarkation at the DRAM D/Q pins, then sailing across the PCB, then by dogsled across the north bridge, then by steamer across the PCB again, finally landing at the processor pins, and then, dashing across the processor die and into the waiting room of the L1 D-cache ...)

... until that happens, the poor processor can't make any significant progress on the next iteration of the computation.

This scenario is so bad and so common that the microprocessor vendors spend 80% of their transistor budgets on on-chip caches -- Intel as a glorified SRAM vendor.

Unfortunately it is really hard to make compilers and computer architectures transform this hopelessly serial pointer-following problem into something that can go faster.
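
To make the contrast concrete, here is a minimal C++ sketch (names and sizes made up, not from the original rant) of the two traversal patterns: the pointer-chasing loop quoted above, where each load address depends on the previous load completing, versus a contiguous array, where the access pattern is predictable.

```cpp
#include <vector>

// Hypothetical node type, matching the loop quoted above.
struct ListElement {
    int value;
    ListElement* next;
};

// Pointer-chasing traversal: each iteration must wait for p->next to
// arrive from memory before the address of the next load is even known,
// so a single cache miss stalls the whole loop.
int sum_list(const ListElement* p) {
    int sum = 0;
    for (; p; p = p->next) sum += p->value;
    return sum;
}

// Contiguous traversal: addresses are sequential and predictable, so the
// hardware prefetcher can stream cache lines ahead of the loop.
int sum_array(const std::vector<int>& v) {
    int sum = 0;
    for (int x : v) sum += x;
    return sum;
}
```

Both functions compute the same thing; the difference is purely in how the memory system sees the loads.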

Name: Anonymous 2007-08-04 0:46 ID:64YQBAp4

Man, someone thinks he's a better writer than he is. I almost tl;dred on that garbage, but here's what I'm getting:

The likelihood that the contents of the next list item will already be in the cache is not good, so each iteration of the loop is likely to produce a miss and leave the processor waiting for data to be fetched from memory. Unfortunately, it's hard to optimize this, since how would we find out what data to preload into the cache? We'd traverse the list, and we're back to square one.

Name: Anonymous 2007-08-04 0:47 ID:64YQBAp4

>>2
Also, I want to punch him in the eye for not just saying that.

Name: Anonymous 2007-08-04 1:13 ID:uVAe/05G

celerons arent good ya

Name: Anonymous 2007-08-04 3:26 ID:VYUfPxTB

I don't understand. He found linked lists inside memory? That doesn't even make any sense at all...

Name: Anonymous 2007-08-04 3:31 ID:a8106UXX

Fuck they just need to come out with SRAM sticks already.  I'll fuckin pay $1000 for one if it's fast enough.

Name: Anonymous 2007-08-04 3:53 ID:64YQBAp4

>>5
Your post makes no sense at all. What do you mean by "he found"? Why wouldn't linked lists be in memory?

Name: Anonymous 2007-08-04 4:21 ID:VYUfPxTB

>>7

>If you do that, I assure you that you will find, amongst other things, hundreds or even thousands of separate, mediocre, linked list and hash table implementations. You will find tens of thousands of loops written in C/C++ that amount to

  for (ListElement* p = head; p; p = p->next) { ... }

Name: Anonymous 2007-08-04 4:23 ID:64YQBAp4

Poor >>8, did his terrible writing confuse you? He just meant that linked lists and hash tables are very common.

Name: Anonymous 2007-08-04 6:13 ID:VYUfPxTB

>>9
ah. i guess that makes sense.
:(

Name: Anonymous 2007-08-04 6:16 ID:Heaven

>>8
Fool. Read his post again. He's saying that within the memory dump, he found code which looks like that for loop. You are aware that both code and data are stored in memory, right?

Name: Anonymous 2011-05-31 3:42

Fail /prog/rammers assuming the cpu's cache and memory are the same thing.

The CPU's cache is the memory it uses to do operations. When it doesn't contain what it needs, it must retrieve it from memory.

Linked lists and hash tables cause frequent cache misses, causing more memory retrieval than actual computation (it must go find the next item in the list/table).

The solution? Fuck if I know.

I actually find it funny that the author considered linked list and hash table implementations mediocre when he/she did not recommend a solution.

Name: Anonymous 2011-05-31 3:49

>The solution? Fuck if I know.

I would like to upvote contiguous arrays of homogeneous objects that fit nicely inside of cache lines and exploit SIMD vectorization and cache-prefetching where applicable. Memory should be allocated in page-sized blocks using the operating system's virtual memory page allocator.

These work much better than linked lists or other linked data structures for iteration.
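
As a hypothetical sketch (not from the thread) of the layout being recommended: one flat, homogeneous buffer, where the loads are sequential and a vectorizing compiler is free to emit SIMD code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Contiguous, homogeneous data: one flat buffer instead of one heap
// allocation per element. The loop body has no pointer chasing, so the
// prefetcher streams cache lines in order and the compiler can
// auto-vectorize the multiply-accumulate.
float dot(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i)
        sum += a[i] * b[i];
    return sum;
}
```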

Name: Anonymous 2011-05-31 3:54

>>16 here

>>17
Yes, they would, but unfortunately they aren't very versatile (in terms of the size of the iterable space, anyway).

What about a data structure that existed as such an array, but the last element was reserved for a pointer to another such array when more space is needed?
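
That structure is usually called an unrolled linked list. A minimal sketch, with made-up names and chunk size (a real version would also free its chunks in a destructor): iteration follows one pointer per N elements instead of one pointer per element.

```cpp
#include <cstddef>

constexpr std::size_t kChunkCapacity = 64; // arbitrary for illustration

// Fixed-size chunk of contiguous elements; the "last element" is the
// reserved pointer to the next chunk, as proposed above.
struct Chunk {
    int values[kChunkCapacity];
    std::size_t count = 0;   // slots in use
    Chunk* next = nullptr;
};

struct UnrolledList {
    Chunk head;
    Chunk* tail = &head;

    void push_back(int v) {
        if (tail->count == kChunkCapacity) {
            tail->next = new Chunk;  // grow by one chunk when full
            tail = tail->next;
        }
        tail->values[tail->count++] = v;
    }

    long sum() const {
        long s = 0;
        // One cache-unfriendly pointer hop per 64 cache-friendly elements.
        for (const Chunk* c = &head; c; c = c->next)
            for (std::size_t i = 0; i < c->count; ++i)
                s += c->values[i];
        return s;
    }
};
```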

Name: Anonymous 2011-05-31 4:02

>>18
std::deque in C++ uses two levels of arrays: an array of pointers to fixed-size blocks, where each block can contain N nodes.

However, with virtual memory allocators, you can reserve a huge chunk of address space. On 64-bit operating systems, you can reserve like 32GB of address space each for various things, without actually eating up any physical memory. And when you need a new page, you just commit a page of reserved memory at the end of one of your contiguous arrays, and a physical page is mapped to it by the operating system.

Current AMD/Intel x86-64 CPUs use 48 bits for virtual addressing, so that gives you 256 TB of address space to play with in your process.
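
A minimal POSIX sketch of the reserve-then-commit pattern described above (function name is made up; uses only standard mmap/mprotect calls): reserve a big span with PROT_NONE, then commit one page at a time as the array grows.

```cpp
#include <cstddef>
#include <sys/mman.h>

// Reserve address space without committing physical memory, then commit
// one page and write to it. Returns the value read back, or -1 on failure.
int reserve_and_commit_demo() {
    constexpr std::size_t kReserveBytes = std::size_t(1) << 30; // 1 GB span
    constexpr std::size_t kPageBytes = 4096;

    // PROT_NONE + MAP_NORESERVE: an address-space reservation only; no
    // physical memory or swap is committed yet.
    void* base = mmap(nullptr, kReserveBytes, PROT_NONE,
                      MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (base == MAP_FAILED) return -1;

    // Commit the first page; the OS maps a physical page on first touch.
    if (mprotect(base, kPageBytes, PROT_READ | PROT_WRITE) != 0) return -1;

    int* p = static_cast<int*>(base);
    p[0] = 42;                      // touching the committed page is fine
    int result = p[0];

    munmap(base, kReserveBytes);    // release the whole reservation
    return result;
}
```

On Windows the equivalent is VirtualAlloc with MEM_RESERVE followed by MEM_COMMIT.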

Name: Anonymous 2011-05-31 4:08

>>18
Thanks, I was unaware. I suppose that's essentially the same thing.

Also, that's neat. Is there a way to use virtual memory allocators (in C++, for instance)?

Name: Anonymous 2011-05-31 4:11

>>20
Yes, using the operating system APIs.

You use the mmap POSIX function on Linux and other UNIX-like operating systems. On Windows, you use the VirtualAlloc function.

Name: Anonymous 2011-05-31 4:16

>>20
>I suppose that's essentially the same thing.
Also, it's not quite the same thing, as the page mapping is all handled by the actual hardware using the CPU's page table. So you get memory indirection for essentially free.

http://en.wikipedia.org/wiki/X86-64#Virtual_address_space_details

Name: Anonymous 2011-05-31 5:08

>>22
No, I meant the deque class used the same idea as the data structure I proposed.

And thank you for the link, fellow /prog/rammer!

Name: Anonymous 2011-05-31 7:15

>>17
>>18
>>19
I can see an issue.

You'd need to maintain a list of the valid entries for each chunk (preferably as a bit array), and insertion wouldn't be done in O(1) as with simple linked lists, unless you like wasting memory.
I guess the best choice entirely depends on which of the operations is the most frequently used, map, insert, or delete.
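
The per-chunk occupancy idea can be sketched like this (hypothetical names, 64-slot chunks so the bit array fits in one word): deletion just clears a bit, and insertion scans for a free slot, so it is O(chunk size) rather than O(1).

```cpp
#include <cstddef>
#include <cstdint>

// Sketch of per-chunk occupancy tracking with a 64-bit mask: bit i set
// means slot i holds a live element. Erasing leaves a hole that a later
// insert can reuse -- the memory-waste trade-off mentioned above.
struct BitmapChunk {
    static constexpr std::size_t kSlots = 64;
    int values[kSlots];
    std::uint64_t occupied = 0;

    // Returns the slot index used, or -1 if the chunk is full.
    int insert(int v) {
        for (std::size_t i = 0; i < kSlots; ++i) {
            std::uint64_t bit = std::uint64_t(1) << i;
            if (!(occupied & bit)) {
                values[i] = v;
                occupied |= bit;
                return static_cast<int>(i);
            }
        }
        return -1; // caller would move on to the next chunk
    }

    void erase(std::size_t i) { occupied &= ~(std::uint64_t(1) << i); }
};
```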

Name: Anonymous 2011-05-31 7:42

>>16
>The CPU's cache is the memory it uses to do operations.

You are confusing cache with registers.

Name: Anonymous 2011-05-31 7:46

>>18
>What about a data structure that existed as such an array, but the last element was reserved for a pointer to another such array when more space is needed?
So, a VList? http://en.wikipedia.org/wiki/VList

Name: Anonymous 2011-05-31 8:53

Holy shit, most of /prog/ is/are/was/were fucking retarded.

Name: Anonymous 2011-05-31 10:17

>>27
If you ignore the scarce off-topic posts and the idiots confusing memory types, this is actually an excellent thread by /prog/'s standards.

Name: Anonymous 2011-05-31 12:06

>>28
We had better threads than this.

Name: Anonymous 2011-06-01 1:06

>>27
Hi! You've just attained "regular" status here on /prog/. Not everything you hoped it would be, huh?
