Most programmers will be familiar with the concept of hashing. Basically, a hash function takes an arbitrary long message and transforms it into a fixed size (generally smaller) one. This is used for a number of things. For example, to create efficient data structures like hash maps where the hash of any object determines in which bucket the object can be placed. It also finds its uses in digital cryptography where it is used for all kinds of patterns, e.g. digital signatures etc. Since in general the hashes are smaller than the original message (or at least the number of possible messages are smaller than the number of possible hashes), they tend to be a one-way street: a message uniquely maps to a hash, but a single hash might have multiple original messages. And if the hash function works in such a way that it very difficult to create two original messages that fit the same hash, it is suitable for cryptography.
A hashed car? (photo by marcovdz)
All of these use cases work best if even the most minor change in the original message gives a wildly different hash. This way, data structures using hashes will nicely balance out due to, more or less, random assignments to buckets. And it will make it very difficult to create a message fitting a hash, even if you have a message that is already quite close to your target hash.
But what happens if you need a hash that behaves exactly the opposite? This is what we recently needed for a project.
Imagine a large set of names in a database and you want to search them. Of course, names are notorious for being completely unstructured and have multiple ways of being spelled. So to find a name you’ll need some sort of fuzzy matching. A fuzzy match function gives you a distance between two words, in other words: how far apart or how many typing mistakes do you have to make to get from one word to another. Common examples are the Hamming distance, the Jaro-Winkler distance and the Levenshtein distance (no, that is not a typo), or one of their derivatives. These functions take your search word and a potential match in your dataset and give you the distance between them. For example, ‘book’ and ‘back’ have a Levenshtein distance of 2: change the first ‘o’ to an ‘a’, and the second ‘o’ to a ‘c’. It gets complicated quite quickly if you, like Levenshtein, also take into account insertions and additions.
So these are expensive functions to run but what makes it worse is that there is no way that you can put an index on your dataset. Since one parameter of the function is dynamic (one could search for any random string), there’s no way to precalculate the distance. That means that every record in the table has to be examined. And that makes for quite a slow search.
A trick you can use are so called Locality-Sensitive Hashes (LSM). Basically what these hashes do is make sure that if two original messages are similar, the hashes are similar as well. So you could hash down all the names in the database using a LSM, create an index on that and use that to only do fuzzy matching on the names that have the same hash. As a simple example, one could define a simple hash function that strips down a name into just its vowels by removing all the consonants. (My last name (Lamers) would thus be ‘ae’). If you do the same to your search string (‘Lammers’ would also hash to ‘ae’), you can quickly get a subset of names that are likely candidates for your search.
Of course, this is not perfect; not all typing mistakes in the search query will get you the same hash (e.g. searching for ‘Laamers’ would not deliver the same hash). Of course, you could create a second hash with just the consonants and use those two indexes together. As always, it boils down to engineering decisions: do you need speed or accuracy? How much time can you spend on tweaking? Creating these kind of hashes is very specific for your use case. The closer it works in cooperation with your fuzzy match function the better the result. Bit sampling for example as a hash function works nicely with Hamming distances. What is similar in names of natural persons might not be as similar if we are talking about e.g. colors or images. And since it is by no means perfect, what percentage of false positives or negatives can you afford?
A statistically more sound hashing function (one that produces far fewer false positives or negatives) might be difficult to explain to the end customer or QA team. If by accident they pick a search query that doesn’t fall into the same hash as the target, you’ll have a hard time explaining why. Sometimes it is best to stick to something that at least is simple to explain and less magical. After all, computer science is all about working with humans.