Fuzzy hashing could help eliminate that, unless the customer horribly misspells shit. Instead of hashing the entire address, split it into chunks of, say, 5 bytes, then hash those chunks. Now if they make a single misspelling, there will still be a percentage match against the previously stored chunk hashes that is statistically higher than normal. This takes a bit more work, though. You also generally want a large input when working with fuzzy hashes, and an address doesn't give a whole lot of data to work with (versus, say, a 1 megabyte image that you want to detect even after some graphics editing). If the chunks are too small, brute forcing gets easier and salting becomes much more necessary; and since the input is so small and addresses are sometimes similar to each other, it could produce some false positives.

Right now I would like to find a list of all the legitimate ways to write out the same address, so my script can standardize them prior to hashing.

edit: since it was Pine's idea and he/she/it seems keen to write the script, I will discontinue mine. What language are you going to write it in, Pine?
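For what it's worth, here's a minimal sketch of the chunk-hashing idea in Python. The function names, the lowercasing step, and the use of SHA-256 are my own choices for illustration, not from any particular library, and the 5-byte chunk size is just the example number from above:

```python
import hashlib

CHUNK_SIZE = 5  # bytes per chunk, per the example above; tune to taste

def chunk_hashes(address: str, chunk_size: int = CHUNK_SIZE) -> list[str]:
    """Normalize the address, split it into fixed-size chunks, hash each chunk."""
    data = address.lower().encode("utf-8")
    chunks = [data[i:i + chunk_size] for i in range(0, len(data), chunk_size)]
    return [hashlib.sha256(c).hexdigest() for c in chunks]

def similarity(a: str, b: str) -> float:
    """Fraction of chunk hashes that match position-for-position."""
    ha, hb = chunk_hashes(a), chunk_hashes(b)
    matches = sum(1 for x, y in zip(ha, hb) if x == y)
    return matches / max(len(ha), len(hb))
```

With 5-byte chunks, `similarity("123 Main Street", "123 Main Streat")` comes out to 2/3 (the first two chunks still match, only the chunk containing the typo differs), while an unrelated address scores near 0. Note this position-based comparison is brittle against insertions/deletions, which shift every chunk after the edit; that's part of why real fuzzy-hashing schemes (like ssdeep's context-triggered piecewise hashing) pick chunk boundaries from the content rather than fixed offsets.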