I need a data structure and data storage solution that is optimal for the following scenario:
Storing: more than 20M sets of attributes. Each set has variable length; none are empty.
Matching: match an input set against all the stored sets, comparing and contrasting attributes, and return the most favorably matched sets.
Keep the attributes in a Bloom filter per set, then just compute the Hamming distance between the target bit vector and each set's bit vector. This works acceptably if you have a lot of attributes per set but not that many sets.
If you do have a lot of sets and they're distributed fairly normally, it may be faster to enumerate all bit vectors within Hamming distance x of the target (where x is maybe 1 or 2) and look them up directly in a hash index, falling back to the linear-scan algorithm if the result set comes back empty.
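The enumeration step might look like this, assuming the stored filters have been inverted into a dict keyed by filter value (names and the `index` layout are illustrative):

```python
from itertools import combinations

def within_hamming(target, m, x):
    """Yield every m-bit vector whose Hamming distance from target is at most x."""
    yield target
    for d in range(1, x + 1):
        for positions in combinations(range(m), d):
            v = target
            for p in positions:
                v ^= 1 << p  # flip each chosen bit
            yield v

def probe(target, index, m, x=2):
    """Look up near-exact matches in a hash index of {filter_value: [set_ids]}.
    Returns [] when nothing is within distance x, so the caller can fall
    back to the full linear scan."""
    hits = []
    for v in within_hamming(target, m, x):
        hits.extend(index.get(v, []))
    return hits
```

Note the cost grows as C(m, x), so this only pays off for small x and moderate filter widths; that's why the fallback matters.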
The real answer is to use an ENTERPRISE TURNKEY SOLUTION and make it someone else's problem. That's only feasible if someone else is paying and you have no interest in knowing how things actually work, though.