
How would I go about this?

Name: Anonymous 2011-08-18 12:46

I have a ton of grouped PDF files that I need to store so that reading and searching them is as fast as possible. Write/annotate/delete times are not important.

* The PDFs are already grouped by theme and sorted alphanumerically. This should facilitate searching.

* Average file size is about 2MB.

* There are about 2.5M files.

How would I go about this?

Name: ( ≖‿≖) 2011-08-18 13:06

>>1
Use a PDF API to extract the text from each PDF.
Build a database from that text.
Extract statistics for terms and themes (use a vector space model or similar).

Everything except the PDF API part is explained here:
http://nlp.stanford.edu/IR-book/information-retrieval-book.html
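
A rough sketch of the "build a database, then use a vector model" steps, assuming the text has already been pulled out of the PDFs by whatever PDF library you pick. It builds an in-memory inverted index and ranks documents by a tf-idf cosine-style score. The names here (tokenize, build_index, search) and the log-weighted tf-idf variant are my own choices for illustration, and at 2.5M files you would want the postings on disk or in a real search engine rather than in Python dicts.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_index(docs):
    """Build an inverted index from {doc_id: extracted_text}.

    Returns (postings, idf, norms):
      postings: term -> {doc_id: raw term frequency}
      idf:      term -> log(N / document frequency)
      norms:    doc_id -> length of the document's tf-idf vector
    """
    postings = defaultdict(dict)
    for doc_id, text in docs.items():
        for term, tf in Counter(tokenize(text)).items():
            postings[term][doc_id] = tf

    n_docs = len(docs)
    idf = {term: math.log(n_docs / len(p)) for term, p in postings.items()}

    squared = defaultdict(float)
    for term, p in postings.items():
        for doc_id, tf in p.items():
            weight = (1 + math.log(tf)) * idf[term]
            squared[doc_id] += weight * weight
    norms = {doc_id: math.sqrt(v) for doc_id, v in squared.items()}
    return postings, idf, norms


def search(query, postings, idf, norms, k=5):
    """Return up to k doc ids, best match first."""
    scores = defaultdict(float)
    for term, qtf in Counter(tokenize(query)).items():
        if term not in postings:
            continue  # unknown terms contribute nothing
        q_weight = (1 + math.log(qtf)) * idf[term]
        for doc_id, tf in postings[term].items():
            scores[doc_id] += q_weight * (1 + math.log(tf)) * idf[term]

    # Normalise by document vector length so long documents do not win by size.
    ranked = sorted(
        ((score / norms[doc_id], doc_id)
         for doc_id, score in scores.items() if norms[doc_id] > 0),
        reverse=True,
    )
    return [doc_id for _, doc_id in ranked[:k]]


if __name__ == "__main__":
    # Hypothetical extracted text standing in for real PDF contents.
    docs = {
        "a.pdf": "inverted index construction and compression",
        "b.pdf": "vector space model scoring with tf idf weights",
        "c.pdf": "index compression techniques",
    }
    postings, idf, norms = build_index(docs)
    print(search("vector space", postings, idf, norms))
```

Only the query's terms are looked up, so search time depends on the length of their postings lists and not on the total number of files, which is the point of indexing the text up front.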
