The spell checker in Microsoft Word 97 had a quirk that earned a lot of giggles. If you typed zzzzz, the suggested correction was sex. Some people saw the quirk as intentional humor or sabotage, and a psychoanalyst probably could have had a field day with it. But the likely culprit was the SOUNDEX algorithm, which SQL Server implements with its built-in SOUNDEX() function. The SOUNDEX algorithm converts a word to an encoded representation of its pronunciation. So, SOUNDEX('Smith') and SOUNDEX('Smythe') both return the encoded value S530. And SOUNDEX('zzzzz') returns Z200, which is a close match to the result SOUNDEX('sex') produces: S200.
My company's record-management products needed a similar phonetic word-matching functionality. We wanted to offer fuzzy searching so that spelling errors or variations could be included in search results, and we also wanted spell checking of new data that included intelligent suggestions for misspellings. The SQL Server SOUNDEX() function was a key part of our initial implementation of both these features. As a built-in function, SOUNDEX() was easy to use, but we were unhappy with some of the matches it suggested.
Unfortunately, SOUNDEX()—patented in 1918—assumes that every consonant has a single consistent pronunciation, which is too simple a formula for most languages. In American English, most consonants have varying sounds and can even be silent. For example, SOUNDEX('phone') and SOUNDEX('pony') both return the encoded value P500, but the words aren't pronounced the same.
In the C/C++ Users Journal article "The Double Metaphone Search Algorithm," June 2000 (http://www.cuj.com/articles/2000/0006/0006d/0006d.htm?topic=articles), Lawrence Philips published a revised version of his Metaphone algorithm, which he wrote in 1990 as a replacement for SOUNDEX. Like SOUNDEX, the revised version of Metaphone, named Double Metaphone, is designed primarily to encode American English names (though it also encodes most English words well) while taking into account the fact that such words can have more than one acceptable pronunciation. Double Metaphone can compute a primary and a secondary encoding for a given word or name to indicate both the most likely pronunciation as well as an optional alternative pronunciation (hence the "double" in the name).
I've implemented the Double Metaphone algorithm in a set of SQL Server—based client applications for use in spell checking. When an application encounters an unknown word, it uses Double Metaphone to find likely substitutions from a table of known correct words. A previous implementation used the SQL Server SOUNDEX() function, but as I noted earlier, it produced too many poor suggestions to be valuable in most cases. Double Metaphone provides much better quality of matches, but SQL Server doesn't have a built-in function that implements this algorithm. So, I turned to SQL Server 2000's user-defined functions (UDFs) as the core building blocks to translate the Double Metaphone algorithm from C++ to T-SQL.
Encoding Language
SOUNDEX and Double Metaphone translate each consonant into a limited set of characters. Table 1 shows some examples of SOUNDEX and Double Metaphone encoding. Similar sounds use the same character; for example, the Metaphone algorithm encodes both b and p sounds as p. So the encoding of a misspelled word will often match the encoding of the word that was intended. The secondary Metaphone encodings can represent less common pronunciation variations in names and can encode many character patterns that have varied pronunciations. For example, the letter t has varying pronunciations that Double Metaphone can usually determine by its adjacent letters. When ion follows t, the combination is usually pronounced shun, but when h follows t, the letters usually combine as in the word thanks. Notice in Table 1 that whereas the primary Metaphone encoding is accurate for benign and poignant, the secondary encoding is better than the primary in the case of benignant. Depending on the level of fuzziness that's acceptable or desired in your environment, you might want to use both types of Metaphone encodings or simply ignore the secondary encoding altogether.
Although my company's application stores word lists and precomputed Double Metaphone values in SQL Server tables, the client applications were performing all the calculations. Implementing these calculations in each client application was a maintenance problem for us. Any corrections or enhancements to the algorithm required simultaneous updates of all applications. Some client code is C++ and some is Visual Basic (VB), and because consistency of the shared data was the primary goal, it was vital that all clients shared the same logic. Centralizing these calculations in SQL Server UDFs eliminates this maintenance burden. Web Listing 1 contains the UDFs that Table 2 describes. (To download Web Listing 1, enter InstantDoc ID 26094 at http://www.tsqlsolutions.com and click Download the code.)