/prog/ - How would I go about this?

Name: Anonymous 2011-08-18 12:46

I have a ton of grouped pdf files that I need to store in such a way that I can read and search them in the fastest way possible. Write/Annotate/Delete times are not important.

* The pdfs are already grouped by theme and sorted alphanumerically. This should facilitate searching.

* Average file size is about 2MB.

* There are about 2.5M files.

How would I go about this?

Name: (　≖‿≖) 2011-08-18 15:16

>>6
Explanation:
110100 is the boolean vector of the documents that contain Anthony.
110111 is the boolean vector of the documents that contain Cleopatra.
101111 is the boolean vector of the documents that contain Hamlet.

100100 is the bitwise AND of those three vectors, i.e. the boolean vector of the documents that contain all three terms.
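
The AND above can be sketched in a few lines of Python. This is only an illustration using the three vectors from this post; the names `incidence`, `boolean_and` and `matching_docs` are made up for the example, and numbering the documents 1 to 6 from the leftmost bit is an assumption.

```python
# Term -> incidence vector over six documents.
# Assumption: the leftmost bit is document 1, the rightmost is document 6.
NUM_DOCS = 6
incidence = {
    "Anthony": 0b110100,
    "Cleopatra": 0b110111,
    "Hamlet": 0b101111,
}

def boolean_and(terms):
    """AND together the incidence vectors of every query term."""
    result = (1 << NUM_DOCS) - 1  # start with all documents selected
    for term in terms:
        result &= incidence[term]
    return result

def matching_docs(vector):
    """Turn a result vector into a list of 1-based document numbers."""
    return [i + 1 for i in range(NUM_DOCS)
            if vector & (1 << (NUM_DOCS - 1 - i))]

hits = boolean_and(["Anthony", "Cleopatra", "Hamlet"])
print(format(hits, "06b"))   # 100100
print(matching_docs(hits))   # [1, 4]
```

With 2.5M documents each vector is 2.5M bits (roughly 300KB) per term, so a real index would normally store sorted lists of document ids instead of full bit vectors, but the query logic stays the same.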

Most models elaborate on this, giving each pair (Document, Term) a value. For each query, the score of each document is computed from the values of its pairs with the query terms. In the boolean model those values are absolute (1 = term present, 0 = term not present), but for more advanced models (e.g. non-specialized search, like Google) you have to refine this.
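
One way to picture the "value per (Document, Term) pair" idea is tf-idf weighting. The post does not name a specific weighting, so tf-idf is an assumption here, picked only because it is a common choice. The documents, `weight` and `score` are all invented for this sketch.

```python
import math
from collections import Counter

# Toy corpus, invented for illustration.
docs = {
    "d1": "antony and cleopatra antony",
    "d2": "hamlet prince of denmark",
    "d3": "cleopatra queen of egypt",
}

tokenized = {name: text.split() for name, text in docs.items()}

# Document frequency: in how many documents does each term appear?
df = Counter()
for tokens in tokenized.values():
    df.update(set(tokens))

def weight(term, doc):
    """Value of the (doc, term) pair: term frequency times inverse document frequency."""
    tf = tokenized[doc].count(term)
    if tf == 0:
        return 0.0  # same as the boolean model's 0: term not present
    return tf * math.log(len(docs) / df[term])

def score(query):
    """Rank documents by the summed weights of the query terms."""
    terms = query.split()
    scores = {d: sum(weight(t, d) for t in terms) for d in docs}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

for doc, s in score("antony cleopatra"):
    print(doc, round(s, 3))
```

Unlike the boolean model, this gives an ordering: a document that mentions a rare query term several times ranks above one that only mentions a common term once.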
