/prog/ - How would I go about this?

Name: Anonymous 2011-08-18 12:46

I have a ton of grouped pdf files that I need to store in such a way that I can read and search them in the fastest way possible. Write/Annotate/Delete times are not important.

* The pdfs are already grouped by various themes and alphanumerically. This should facilitate searching.

* Average file size is about 2MB.

* There are about 2.5M files

How would I go about this?

Name: Anonymous 2011-08-18 12:51

>>1
I have never seen a pornographic pdf, much less a collection of those.

Name: Anonymous 2011-08-18 13:01

>>2

It's erotic literature.

Name: Anonymous 2011-08-18 13:04

>>3

Oh, and thanks for reminding me. What I want to achieve is something similar to literotica.com, albeit with PDFs instead of plain text.

Name: (　≖‿≖) 2011-08-18 13:06

>>1
Use a PDF API to extract the text from each PDF.
Build a database with that text.
Extract statistics (use the vector model or similar) for terms and themes.

Everything except the PDF API part is explained here:
http://nlp.stanford.edu/IR-book/information-retrieval-book.html
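
A minimal sketch of the "vector model" step, assuming the PDF text has already been extracted into plain strings (the extraction API is not shown). The sample documents, the helper names and the particular tf-idf weighting are illustrative assumptions, one of several variants the book covers.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(docs):
    """Return (tf-idf vector per document, idf table) for {doc_id: text}."""
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())  # count each term once per document
    n_docs = len(docs)
    idf = {term: math.log(n_docs / df) for term, df in doc_freq.items()}
    vectors = {
        doc_id: {term: tf * idf[term] for term, tf in counts.items()}
        for doc_id, counts in term_counts.items()
    }
    return vectors, idf

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def search(query, vectors, idf):
    """Rank documents by cosine similarity to the query, best first."""
    q_counts = Counter(t for t in tokenize(query) if t in idf)
    q_vec = {t: tf * idf[t] for t, tf in q_counts.items()}
    scored = [(cosine(q_vec, vec), doc_id) for doc_id, vec in vectors.items()]
    return [doc_id for score, doc_id in sorted(scored, reverse=True) if score > 0.0]

# Hypothetical extracted text, keyed by file name.
docs = {
    "a.pdf": "the captain sailed the ship across the sea",
    "b.pdf": "the garden was full of roses",
    "c.pdf": "a ship in the harbour and a garden by the sea",
}
vectors, idf = build_index(docs)
results = search("ship sea", vectors, idf)
print(results)
```

With 2.5M files you would keep the vectors in an inverted index on disk rather than in dicts, but the scoring idea stays the same.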

Name: Anonymous 2011-08-18 13:54

>>5
http://nlp.stanford.edu/IR-book/html/htmledition/an-example-information-retrieval-problem-1.html

How do they get Antony and Cleopatra and Hamlet from

110100 AND 110111 AND 101111 = 100100

Name: (　≖‿≖) 2011-08-18 15:09

>>6
I don't recommend using the boolean model because it's shit. Take it only as a reference; any other model should improve on it.

Name: (　≖‿≖) 2011-08-18 15:16

>>6
Explanation:
110100 is the boolean vector of the documents that contain Brutus.
110111 is the boolean vector of the documents that contain Caesar.
101111 is the complement of the boolean vector for Calpurnia, i.e. the documents that do not contain it.

100100 is the boolean vector of the documents that satisfy all three conditions.

Most models elaborate on this by giving each (document, term) pair a value. For each query, the score of each document is computed from the values of its pairs with the query terms. In the boolean model those values are absolute (1 = term present, 0 = term not present), but for more advanced models (e.g. non-specialized search, like Google) you have to refine this.
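
For reference, the arithmetic quoted in >>6 can be reproduced with plain integers. This is a sketch assuming the vectors come from the book's Shakespeare example, where the query is Brutus AND Caesar AND NOT Calpurnia; the variable names are mine.

```python
N_DOCS = 6
MASK = (1 << N_DOCS) - 1  # keep only the six document bits

# One bit per document, leftmost bit = first document in the table.
brutus = 0b110100
caesar = 0b110111
calpurnia = 0b010000

# NOT has to be limited to six bits, otherwise Python yields a negative int.
not_calpurnia = ~calpurnia & MASK
result = brutus & caesar & not_calpurnia

print(format(not_calpurnia, "06b"))
print(format(result, "06b"))
```

Each set bit in `result` marks one document that satisfies the whole query.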

Name: Anonymous 2011-08-18 15:39

>>8
Maybe I'm just being dumb, but I don't see how they arrive at

Antony and Cleopatra and Hamlet

Did they refer to the matrix table to get those names? If so, how?

Name: (　≖‿≖) 2011-08-18 15:51

>>9
Those are terms extracted from the document collection. They build the matrix from the terms taken from each document, putting a 1 in every cell whose column is a document that contains the term in that row.
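
That construction can be sketched as follows, using a made-up three-document collection (the titles and texts are placeholders, not the book's data). Each term gets one row, stored as an integer whose leftmost bit stands for the first document.

```python
def build_incidence(docs):
    """docs: list of (title, text). Returns {term: row as an int bit vector}.

    The leftmost of the len(docs) bits stands for docs[0].
    """
    n = len(docs)
    matrix = {}
    for col, (_title, text) in enumerate(docs):
        bit = 1 << (n - 1 - col)
        # set() so a repeated word still sets the cell only once
        for term in set(text.lower().split()):
            matrix[term] = matrix.get(term, 0) | bit
    return matrix

docs = [
    ("doc0", "caesar met brutus"),
    ("doc1", "caesar met calpurnia"),
    ("doc2", "brutus left"),
]
matrix = build_incidence(docs)
for term in sorted(matrix):
    print(term, format(matrix[term], "03b"))
```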

Name: Anonymous 2011-08-18 16:35

>>10
I meant how do they arrive at those terms in the final answer.

Name: (　≖‿≖) 2011-08-18 17:34

>>11
Bitwise and?

Name: Anonymous 2011-08-18 17:36

>>11
Also, they are not trying to get terms, they are trying to get documents with terms.

Name: Anonymous 2011-08-18 18:31

>>11
No. Let's try this again.

How do they get Antony and Cleopatra and Hamlet from 100100?

Name: Anonymous 2011-08-18 19:12

Never mind. I just asked a real computer programmer the same question, and instead of giving me the runaround, they gave me the answer I was looking for. Funny how this person had no issues comprehending my question.

Name: Anonymous 2011-08-18 19:14

>>14
On the off chance you are actually asking, and not just trolling and acting dense: the idea is that each bit position represents an entry in some ordered data structure (array, list, tree, whatever) that holds the information needed to resolve the rest of the search data. When the bit patterns are ANDed or ORed together, depending on the type of search, each set bit in the result represents one of these entries. So you map the bit position to the position in the data structure, collect the corresponding entries into a result set, and you have your search results.
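
To make that concrete, here is a small sketch that maps the set bits of 100100 back to titles. The list assumes the column order of the book's term-document table, so check it against the chapter; `decode` is a name I made up.

```python
# Assumed column order of the book's table, leftmost bit first.
PLAYS = [
    "Antony and Cleopatra",
    "Julius Caesar",
    "The Tempest",
    "Hamlet",
    "Othello",
    "Macbeth",
]

def decode(bits, titles):
    """Return the titles whose bit is set; the leftmost bit is titles[0]."""
    n = len(titles)
    return [title for i, title in enumerate(titles) if bits & (1 << (n - 1 - i))]

hits = decode(0b100100, PLAYS)
print(hits)
```

The first and fourth bits are set, so the result set is the first and fourth entries of the list.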

Name: Anonymous 2011-08-18 19:15

>>15
Thanks for maturely sharing your real computer programmer's response instead of acting like a whining bitch.

Name: Anonymous 2011-08-18 19:16

Fucking easy

Install Google Desktop Search
Point it to the folder with all the PDFs.
Wait a day for GDS to index all that shit

PROFIT!

If you want, there's a Python command line tool that you can use to interface with GDS.

Name: (　≖‿≖) 2011-08-18 19:31

>>18
OMG, and I was thinking you wanted to do some programming... I am disappoint.

Name: Anonymous 2011-08-18 20:03



$ man updatedb; man locate

Name: Anonymous 2011-08-18 20:06

Well, interfacing with the Python tool is interesting. I played with it for a day or two at work, then I got bored.
>>19
I like programming, but it is much better when you build something that is needed instead of a toy. In this particular case, just a few tools do the work, and you can even "hide" that by building on top of the Python tool, if you really want to.

But writing something that GDS does very well is kinda pointless, IMO.

Still want to play with this?
http://www.sqlite.org/fts3.html
is a good starting point.
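
If you do want to try the FTS route, a minimal sketch with Python's built-in sqlite3 module might look like this. It assumes the SQLite library behind your Python was compiled with FTS3 (most are, but not all) and that the PDF text has already been extracted; the table name, column names and sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# FTS3 virtual table: every column is full-text indexed.
conn.execute("CREATE VIRTUAL TABLE docs USING fts3(path, body)")
conn.executemany(
    "INSERT INTO docs (path, body) VALUES (?, ?)",
    [
        ("a/first.pdf", "the captain sailed the ship across the sea"),
        ("b/second.pdf", "the garden was full of roses"),
        ("c/third.pdf", "a ship in the harbour and a garden by the sea"),
    ],
)

def search(query):
    """Return the paths of documents matching all terms in the query."""
    rows = conn.execute(
        "SELECT path FROM docs WHERE docs MATCH ? ORDER BY path", (query,)
    )
    return [path for (path,) in rows]

print(search("ship sea"))
print(search("garden"))
```

Space-separated terms in a MATCH query are implicitly ANDed, so this is the boolean model again, only with the index handled for you.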

Name: Anonymous 2011-08-18 21:06

< MAN I'M ON A ROLL!

Name: Anonymous 2011-08-19 2:15

>>5

Thanks for the link. That's what I was looking for!

Name: Anonymous 2011-08-19 2:25

　　　　　　　　　　　　　　　　　　　　　　　　　　　　　　　　　 /＼＿＿_／ヽ　　　　　　　　　　
　　　 (.｀ヽ(｀＞、　　　　　　　　　　　　　　　　　　　　　　／''''''　　　'''''':::::　　　　　It's VIP quality
　　　 `'＜｀ゝr'ﾌ＼　　　　　　　　　　　　　　　　　＋　　|（●）,　　､（●）､.:|　＋　　 http://www.world4ch.org/prog/
　 ⊂ｺ二Lﾌ^´　ノ, ／⌒)　　　　　　　　　　　　　　　　　|　　,,,ﾉ(､_, )ヽ､,,　.::::|　
　　⊂l二L7_　/　-ゝ-')´　　　　　　　　　　　　　　　.+　|　　｀-=ﾆ=- '　.::::::|　+　.　　　　
　　　　　　＼_　　､__,.ｲ＼　　　　　　　　　　　＋　　　　＼　　｀ﾆﾆ´　 .:::／　　　　+　
　　　　　　　　(T＿_ノ　　 Tヽ　　　　　　　 , -r'⌒!￣ `":::7ヽ.｀- ､　　　./|　　.　
　　　　　　　　　ヽ￢.　　　/　ﾉ`ｰ-､ﾍ<ｰ1´|　ヽ　| :::::::::::::ﾄ、＼　(　　./ヽ　　　　
　　　　　　　　　　＼l__,.／　　　　 i　l.ヽ!　|　　　.| ::::::::::::::l ヽ　｀７ｰ.､‐'´ |＼-､

Name: Anonymous 2011-08-19 5:17

>>21
Too bad SQLite is SLOW AS FUCK.
