/prog/ - How to store "infixes"

Name: Anonymous 2009-09-16 7:58

I need to store all possible infixes of words (and their location) to be able to quickly make queries.

(eg. anus -> a, n, u, s, an, nu, us, anu, nus, anus)

Problem is, even with small files, if you store "infix->list of positions", the index gets really really really large (obviously).

I have been trying to think of a data structure to reduce storage needed, but... I failed. Any ideas?

Name: Anonymous 2009-09-16 8:00

We all realize this is set-up, so please just gtfo. You ask a question you've answered a priori. Even the answer post might be written a priori; but not necessarily. What you gain from this, is that you're able to associate your stupid junk with programming posts, thus giving the illusion that programmers here actually talk about this bullshit (your stupid shit).

Name: Anonymous 2009-09-16 8:05

ONE WORD, SUFFIX TREES, THREAD OVER

Name: Anonymous 2009-09-16 8:10

>>2

No it's not

>>3

I doubt that a suffix tree uses less space.

Name: Anonymous 2009-09-16 8:13

>>1
so the idea is that the input to this program will be an infix ('nu'), and it should return the words / indices which contain that infix ('anus' and 'nucleus'?)

Name: Anonymous 2009-09-16 8:14

>>5

just the positions are fine, yeah

so for example if the input is "anus nucleus" and you are searching for "nu", return 2 and 6.

Name: Anonymous 2009-09-16 8:17

>>4
Then use a hree with buttsorting.

Name: Anonymous 2009-09-16 8:17

man strstr

Name: Anonymous 2009-09-16 8:19

I bet that there's a Haskell function that does that.

Name: Anonymous 2009-09-16 8:19

>>6
what is the input data range like (gigabytes of text, html pages, ???)

Name: Anonymous 2009-09-16 8:20

Jagged arrays.

Name: Anonymous 2009-09-16 8:22

use one of the fast string matching algorithms (moris-pratt, rabin-karp, knuth-morris-pratt, etc...)
see >>8

Name: Anonymous 2009-09-16 8:22

>>10

not really gigabytes, as a test just a 1.2mb file with 100.000 lines. building an index of
infix1->pos1,pos2,...
...

uses around 150mb. which huge. there are two million infixes.

Name: Anonymous 2009-09-16 8:25

>>12
obviously for this method, you would be obtaining indices of matches as needed. if you absolutely most be storing all possible infixes, then use a suffix-tree.

Name: Anonymous 2009-09-16 12:30

>>12

too slow if there are a lot of matches, and if you use a short (2-3 letters) query.

Name: Anonymous 2009-09-16 22:26

>>4
You doubt suffix trees use less space than brute force storage? I wasn't able to verify, but I'm pretty sure it's O(input size). Your algorithm is O(2^(input size)). Suffix trees win.

Name: Anonymous 2009-09-16 22:28

>>15
Whoops. O((input size)^2). Two different things.

Name: Anonymous 2009-09-17 3:24

>>16

Yeah but when searching for short patterns (length m) with many matches (n) it will be slower: O(m+n) Also: The tree has many many many nodes and leaves (especially if you allow unicode code points instead of just a-z), which results in huge memory consumption. If you have ten thousand different unicode code points, the first node will have ten thousand children. Good luck with that.

How to store "infixes"

1 Name: Anonymous 2009-09-16 7:58

2 Name: Anonymous 2009-09-16 8:00

3 Name: Anonymous 2009-09-16 8:05

4 Name: Anonymous 2009-09-16 8:10

5 Name: Anonymous 2009-09-16 8:13

6 Name: Anonymous 2009-09-16 8:14

7 Name: Anonymous 2009-09-16 8:17

8 Name: Anonymous 2009-09-16 8:17

9 Name: Anonymous 2009-09-16 8:19

10 Name: Anonymous 2009-09-16 8:19

11 Name: Anonymous 2009-09-16 8:20

12 Name: Anonymous 2009-09-16 8:22

13 Name: Anonymous 2009-09-16 8:22

14 Name: Anonymous 2009-09-16 8:25

15 Name: Anonymous 2009-09-16 12:30

16 Name: Anonymous 2009-09-16 22:26

17 Name: Anonymous 2009-09-16 22:28

18 Name: Anonymous 2009-09-17 3:24

Name: Anonymous 2009-09-16 7:58

Name: Anonymous 2009-09-16 8:00

Name: Anonymous 2009-09-16 8:05

Name: Anonymous 2009-09-16 8:10

Name: Anonymous 2009-09-16 8:13

Name: Anonymous 2009-09-16 8:14

Name: Anonymous 2009-09-16 8:17

Name: Anonymous 2009-09-16 8:17

Name: Anonymous 2009-09-16 8:19

Name: Anonymous 2009-09-16 8:19

Name: Anonymous 2009-09-16 8:20

Name: Anonymous 2009-09-16 8:22

Name: Anonymous 2009-09-16 8:22

Name: Anonymous 2009-09-16 8:25

Name: Anonymous 2009-09-16 12:30

Name: Anonymous 2009-09-16 22:26

Name: Anonymous 2009-09-16 22:28

Name: Anonymous 2009-09-17 3:24