/prog/ - motherfucking hahstables

Name: Anonymous 2010-06-14 10:34

Hello /prog/

Im trying to implement a hash-table algorithm, the problem is: i dont know shit about hash-tables.

Now i might seem noobish for asking, but can someone explain them to me, Wikipedia didnt help a dick.
All i could find out was that the key has to be encoded in some way.

On a side note, why the fuck is it called a linked list? Isnt it obvious that it is linked, else it wouldnt be a list would it.

Name: Anonymous 2010-06-14 10:39

you want to do something you don't know a shit about?

Wikipedia actually does have a nice article on this: are you sure you even know what you're doing? Maybe you'd better learn about hash functions first.

It's called a linked list because a hash table is a subcategory in the linked list category.

Name: Anonymous 2010-06-14 10:40

also, nice misspelling in your subject...

Name: Anonymous 2010-06-14 11:07

Hashtables are reasonably trivial, but why do you want to implement them without understanding them. If you just need one, just use the one your language provides, or get a hashtable library.

Anyway, a linked list is either:
a) end of the list, NIL
b) a 2 element structure (which I will call a cons, with two fields, first being the car and the second being the cdr). One element, the car, traditionally points to the first element of the list, and the other element, the cdr, points to the rest of the list.

So a list like (1 2 3), can be written as:
(cons 1 (cons 2 (cons 3 nil)))
where (cons car cdr) = (car . cdr).
In non-lispy, languages, linked-lists tend to be reinvented a lot and people may put more fields in the linked-list type (what I called the cons cell), but there will always be a next field.

Linked lists are usually used to store lists of arbitrary length. Adding a new element in front is O(1), traversal is O(n), adding an element at the end is O(n) (unless, you make it a tconc, which stores a pointer to the end of the list, then it becomes a O(1)). There's also doubly linked lists which allow moving backwards from one element, but they're not used as often (there's another field which points back to the previous structure).

A hashtable is usually a data type which allows you to index into it by some data (either arbitrary or of some specific type) called the key and get a value associated with that key. They work internally by having a hash function, which when applied to the key, will give a hash (usually an integer of sorts). A hash should represent an object uniquely, so hashes will usually be designed to either test for object equivalence (they point to the same place, so hash the pointer), or object equality (same as hashing all the object data, doing a deep compare), or maybe some string hashing functions (case insentivity options), and so on. While the hash will usually be different for different objects, it doesn't have to be (collisions can occur and usually do, especially if the hash function sucks or the key size is too small), so there's also an equality function that comes together with the key function, it tests if 2 objects are equal given the criteria the user picked.

The internal representation of the hashtable can vary, but here's one possible internal representations (of many):
1 - hash function
2 - equality test
3 - value vector (could be a resizable vector of a certain size... when more values are added past its capacity, resize it)
4 - keyhash vector (runs parallel to the value vector, maybe always sorted to allow binary searches, but doesn't have to be)

To look up a value, you, hash the key, do a binary search for the hash in the keyhash vector, once found, use that index to look up the value in the value vector. Each value vector element is a linked list of values, and each value in this linked list would have the same hash. One then proceeds to iterate through the linked list until one finds the element which is equal(according to the equality test), and returns that element (it could even be a key-value pair, if you choose to store it). There's really many variations possible here. Another common one would be to reduce the hash size to something manageable (0x100 or 0x10000), don't have a hash vector, and the hash is used as an index in the value vector, after which you just look what you need up. Each approach has its advantages/disadvantages...

So why do you need to implement your own? Even if I'm using a language that doesn't have hashtables, I just pick a good hashtable library and just use that.

Name: Anonymous 2010-06-14 11:18

Im quite a noob when it comes to data structures and CS type of things, i never really read SICP all trough so i dont know if its good for that kind of stuff.

Can /prog/ suggest me any good articles that i can read online or better yet books that i can download?

It should be practical stuff tough, something that explains these things in such a way that i could implement it.

Name: Anonymous 2010-06-14 11:57

>>4

To look up a value, you, hash the key, do a binary search
I read somewhere that hash tables should have a constant overhead, but if i were to implement a search would not the key with say the lowest number, if we assume it would use a numerical index, be the first one to be found.

Im concerned about the trade off between memory and speed.

I would i not have to allocate many empty keys that i would not even use for such a search algorithm.

So why do you need to implement your own? Even if I'm using a language that doesn't have hashtables, I just pick a good hashtable library and just use that.

Because i am making ma own lisp dialect, so i have to reinvent the wheel again.
I need for the symboltable in first line.

Name: >>4 2010-06-14 12:36

>>6
I said that there are many ways to implement one.
The second approach I described (sparse array) is the most traditional way to implement a hashtable, but I've seen many which do something more close to the first.

Im concerned about the trade off between memory and speed.
Removing the key array, and just having a value array means it'll take up more memory, and lookup would get slower if there are many elements (symbols in your case), but as long as the hash function is good, the lookup is still faster than where you have to go through the key hash table.

Because i am making ma own lisp dialect, so i have to reinvent the wheel again. I need for the symboltable in first line.

At least for Common Lisp, I know SBCL has a highly modified/tuned hashtable for packages (collections of symbols), different from the generic hashtable that is used for everything else. The generic hashtable uses something similar to the first approach I described, but a bit more efficient.

Since in your case, the key and the value are the same (interning symbols), you can optimize your hashtable a bit (no key storage).

Here's what SBCL used for symbols, quoted straight from the source, and a live contents of a symbol hashtable was dumped for your convenience:



;;; comment from CMU CL:

;;;      Packages are implemented using a special kind of hashtable. It is

;;;   an open hashtable with a parallel 8-bit I-vector of hash-codes. The

;;;   primary purpose of the hash for each entry is to reduce paging by

;;;   allowing collisions and misses to be detected without paging in the

;;;   symbol and pname for an entry. If the hash for an entry doesn't

;;;   match that for the symbol that we are looking for, then we can

;;;   go on without touching the symbol, pname, or even hastable vector.

;;;      It turns out that, contrary to my expectations, paging is a very

;;;   important consideration the design of the package representation.

;;;   Using a similar scheme without the entry hash, the fasloader was

;;;   spending more than half its time paging in INTERN.

;;;      The hash code also indicates the status of an entry. If it zero,

;;;   the entry is unused. If it is one, then it is deleted.

;;;   Double-hashing is used for collision resolution.



(def!type hash-vector () '(simple-array (unsigned-byte 8) (*)))



(def!struct (package-hashtable

             (:constructor %make-package-hashtable

                           (table hash size &aux (free size)))

             (:copier nil))

  ;; The g-vector of symbols.

  (table (missing-arg) :type simple-vector)

  ;; The i-vector of pname hash values.

  (hash (missing-arg) :type hash-vector)

  ;; The total number of entries allowed before resizing.

  ;;

  ;; FIXME: CAPACITY would be a more descriptive name. (This is

  ;; related to but not quite the same as HASH-TABLE-SIZE, so calling

  ;; it SIZE seems somewhat misleading.)

  (size (missing-arg) :type index)

  ;; The remaining number of entries that can be made before we have to rehash.

  (free (missing-arg) :type index)

  ;; The number of deleted entries.

  (deleted 0 :type index))

Name: >>4 2010-06-14 12:38

Well, damn, it's too big for Shiichan.
http://paste.lisp.org/display/111429

Name: Anonymous 2010-06-14 13:50

>>1
On a side note, why the fuck is it called a linked list? Isnt it obvious that it is linked, else it wouldnt be a list would it.
Almost got me.

Name: 1 2010-06-14 18:57

Ok, so i read about this whole thing and found out that its somewhat a thing of luck to find a good method. Now im still dont know much more about this.
As far as i could figure is that a hashtable in essence is a sorted linked list with hash function encoded key entries and coresponding values or two indexed arrays with keys and values assigned to each other by their index. Would that be somewhat correct?

Im sure im not the first one doing this here, so i would like to know which way would be the best.

Im off sleeping, good night /prog/ i love you.

Name: Anonymous 2010-06-14 19:10

>>10
hash(data) == fixed size key (not too big).
Key is either indexed directly into an array to get to possible values matching that key. Recursive iteration+comparison over the linked list corresponding to that index (hash) is performed until the right value is found. Some versions may not use linked lists and just use flat arrays, others use sparse arrays where you index the hash directly at a higher memory cost, but faster lookup times, while others use multiple (2-3) non-sparse arrays to lookup up the key, usually slightly slower, but at less memory cost. A hashtable is defined by it having the key<->value association as an abstract data structure, and the fact that the key is hashed into a smaller value which can somehow be used to locate or reduce the needed search time when finding the value.

Name: Anonymous 2010-06-14 21:52

Are we really having an earnest discussion about very basic data structures and getting most of it wrong? Seriously, /prog/?

Name: Anonymous 2010-06-14 23:11

>>12
What, you think people who know this topic like the back of their hand would find it interesting enough to participate in its discussion?

motherfucking hahstables

1 Name: Anonymous 2010-06-14 10:34

2 Name: Anonymous 2010-06-14 10:39

3 Name: Anonymous 2010-06-14 10:40

4 Name: Anonymous 2010-06-14 11:07

5 Name: Anonymous 2010-06-14 11:18

6 Name: Anonymous 2010-06-14 11:57

7 Name: >>4 2010-06-14 12:36

8 Name: >>4 2010-06-14 12:38

9 Name: Anonymous 2010-06-14 13:50

10 Name: 1 2010-06-14 18:57

11 Name: Anonymous 2010-06-14 19:10

12 Name: Anonymous 2010-06-14 21:52

13 Name: Anonymous 2010-06-14 23:11