Ask Glenda

Volume I, Number 1

Soundex History and Methods

How do genealogists deal with the problems of surname spellings? Confronted with 30 pages of passenger lists, how does one find the names of interest? When indexing entries, what can one do to arrange the entries in such a way that one can find all the names of interest?

A word that starts with W in English or German is likely to be GU in Spanish (Waltman - Gualteman) and one that starts with S becomes ES (Scott - Escott). Let a French scribe write the Spanish modified name and you will get Gualtheman. Would you recognize David Gualtheman of Natchitoches as the same person as David Waltman of Natchez and Pointe Coupee? Have you ever experienced the frustration of spending hours at the microfilm reader looking for Van Hoesen only to realize once you have returned home that it could also have been spelled Van Huizen?

This indexing problem has two aspects:

names that are the same but do not sound the same, disguised as spelling variants (Rhinesmith, Reinschmidt), and
names that sound alike but are spelled differently (Van Hoesen, Van Housen, Van Huizen).

In the first instance, spelling variants that may not sound the same, you must "know" the rules for what constitutes a spelling variant or have access to a reference that contains spelling variants. An example of this technique is the one used for the International Genealogical Index. It lists sound-alike names together and follows them with cross-references to spelling variants. The theory behind this technique has not been made public, but the "LDS Genealogist's Handbook" refers to automated catalogs of spellings for given names and surnames for each country and to standardized spellings for the indices.

Bonner, William
Boughner, William Asbury
Boner, Wilson
Bonnes, ** see Bunis
Bonnet, ** see Bonnett
Bonnett, ** see Bennett
Herrich, ** see Herrick
Reinschmidt, ** see Reinsmith
Reinthal, ** see Renthall
Reynard, ** see Rinehart
Renolas, ** see Reynolds
etc......................

In the latter instance, names that sound the same, you could convert each name to a string of characters representing how it sounds. If your conversion technique is valid, all names that sound the same are converted to the same string. Arranging all the entries by the code would result in grouping all like sounding names together. And the researcher would find all spelling variations of the surname.

What kind of conversion would you use?

INTERNATIONAL PHONETIC ALPHABET (IPA)

If you want to capture strictly pronunciation variations, you would look into the International Phonetic Alphabet (IPA) developed by anthropologists and linguists for transcribing actual pronunciation. Some language textbooks use this alphabet to indicate proper pronunciation.

The International Phonetic Alphabet (IPA) is sponsored by the International Phonetic Association. The society, founded in 1886, works for the advancement of the study of phonetics. The alphabet consists of letters that symbolize the position of the articulating organs. The same sound has the same symbol irrespective of the language, or the stage in the development of a language, in which the sound occurs. It is frequently used in field studies for transcribing pre-literate languages, and it is not specific to any language. Although it is based on the Roman alphabet, it has additional characters for other sounds, such as the Welsh ll and the click sounds (!) of some African languages.

In genealogy we are more likely to consult written than spoken sources so what we call a "soundex" technique is a better bet than a truly phonetic alphabet such as the IPA. For example if you were researching ALLAIN in Louisiana you might assume it would be ALAN as in "clan" and be disappointed to learn that all the relatives pronounce it Al-Layne. A strictly phonetic alphabet would render these as two separate words. Soundex coding treats them as if they were the same. Theoretically, using a soundex system you should be able to index a name so that you can find it no matter how one spells it.

Soundex -- but that means coding how something sounds, doesn't it? The name is deceiving, and soundex is often loosely defined as a representation based on the way a name sounds rather than the way one spells it. But soundex techniques (regardless of their name) are more a method of capturing spelling variations than pronunciation variations.

For example, using the Russell coding technique, B100 gives one not only BABY and its known spelling variants BABE, BABI, and BABIE but also such pronunciation variations as BOBO and BEEBE. For some researchers those variations may be boo boos. The reason for this is that soundex coding techniques ignore vowels. The consonant families are represented, but it is the vowels that really cause differences in pronunciation and soundex ignores them.

RUSSELL SOUNDEX CODING

In the 1930s the WPA did a complete Soundex of the 1880, 1900, and 1920 censuses. The census information was copied onto file cards, coded, and filed by state.

The coding rules were:

1. Take the first letter as is.

2. Code the following letters to three digits using 0 at the end if needed.

3. Ignore A, E, I, O, U, Y, W, and H.

4. Code double letters as one letter.

5. Caution: prefixes (van, Von, Di, de, le, du, d', dela, etc) are sometimes disregarded.

      Letter           Value    
      B P F V            1
      C S K G J Q X Z    2
      D T                3
      L                  4
      M N                5
      R                  6
Bonner-B560  Smith-S530   Rea-R000    Van Hoesen-V525
Boner-B560   Smythe-S530  Rhea-R000   Van Huizen-V525
Bohner-B560               Ray-R000    Van Housen-V525
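
The rules and table above can be tried out in a few lines of Python. This is only a minimal sketch, not the procedure the census coders followed: the function name russell_soundex is made up for illustration, and two details the rules leave open are assumptions. A vowel or Y between two letters of the same group separates them (so BABY still comes out as B100), while an H or W between them does not. Prefixes such as Van are kept as part of the name.

```python
# Letter groups from the coding table above.
CODES = {}
for letters, digit in (("BPFV", "1"), ("CSKGJQXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def russell_soundex(name):
    """Return a letter plus three digits, e.g. 'Bonner' -> 'B560'."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    first = letters[0]
    digits = []
    previous = CODES.get(first, "")        # the first letter counts for doubles
    for letter in letters[1:]:
        code = CODES.get(letter, "")
        if code:
            if code != previous:           # rule 4: doubles are coded once
                digits.append(code)
            previous = code
        elif letter not in "HW":
            previous = ""                  # assumption: a vowel or Y separates
    # Rule 2: keep three digits, padding with zeros if needed.
    return (first + "".join(digits) + "000")[:4]


for surname in ("Bonner", "Smythe", "Rhea", "Van Hoesen"):
    print(surname, russell_soundex(surname))
```

Running it on the names in the example table gives the same codes as shown there.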

1880 U.S. FEDERAL CENSUS

A soundex index to the 1880 Census was made that lists families with at least one child ten years of age or younger in the household in June 1880. In case you have forgotten the soundex rules, you will find the instructions at the beginning of the 1880 soundex film numbers listed in the book United States Census 1790-1880. Using this book, find your state or territory and use your soundex code to determine the film number.

    Wisconsin: H-416 thru H-525 (N)      449069                   
    Wisconsin: H-525 (O) thru H-634      449070                   
    Wisconsin: H-635 thru J-520 (L)      449071                   
    Wisconsin: J-520 (M) thru K-146      449072

On the soundex film you will see index cards. These are the Soundex cards, not the census itself. Look at the upper left-hand corner of each card to find the code and family name. Several names may have the same code number. Last names are not in alphabetical order because of the coding system, but first names are filed alphabetically within the code. Remember, only households with a child ten years of age or younger were included in the 1880 Soundex. Even then, some could have been missed. If you don't find a particular name, you can still obtain the census film and search it. Also look for variations of the given name, such as initials and nicknames.
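
If you keep the film list in a small table, the lookup can be automated. The sketch below uses only the four Wisconsin rows quoted above; the names FILMS and find_film are hypothetical. It also assumes that a letter in parentheses, such as "(N)", marks the given-name initial at which one film stops and the next begins, and it fills in A and Z as the first and last initials where the list gives none.

```python
# (first code, first given-name initial), (last code, last initial), film number
FILMS = [
    (("H416", "A"), ("H525", "N"), "449069"),
    (("H525", "O"), ("H634", "Z"), "449070"),
    (("H635", "A"), ("J520", "L"), "449071"),
    (("J520", "M"), ("K146", "Z"), "449072"),
]


def find_film(code, given_name):
    """Return the film number covering this code and given name, or None."""
    key = (code.replace("-", "").upper(), given_name[:1].upper())
    for low, high, film in FILMS:
        # Codes are one letter plus three digits, so plain string
        # comparison puts them in filing order.
        if low <= key <= high:
            return film
    return None


print(find_film("H-525", "Peter"))
print(find_film("J-520", "Anna"))
```

A code outside the listed ranges returns None, which is the signal to check the film list in the book by hand.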

1900 U.S. FEDERAL CENSUS

A Soundex index to the 1900 Census exists for every state. The soundex rules are at the beginning of the 1900 soundex film numbers in the book, United States Census 1900. Use the same procedures as for the 1880 census to find your entries.

   Illinois: S-316 John M. **  1243346   
           thru S-322 Orpha  

   Illinois: S-322 Park S.     1243347   
           thru S-330 Wm.  **

1910 U.S. FEDERAL CENSUS

A Soundex index to the 1910 Census exists for Alabama, Arkansas, California, Florida, Georgia, Illinois, Kansas, Kentucky, Louisiana, Michigan, Mississippi, Missouri, North Carolina, Ohio, Oklahoma, Pennsylvania, South Carolina, Tennessee, Texas, Virginia, and West Virginia. The soundex rules are at the beginning of the 1910 soundex film numbers in the book, United States Census 1910. Use the same procedures as for the 1880 census to find your entries.

    Kentucky:  R-400 Arch         1370590  
               thru R-512 Annie M           

    Kentucky:  R-512 Benj. thru   1370591  
               R-550 William T.           

    Kentucky:  R-550 Willie E.    1370592  
               thru R-665 Asa C.

History of the Russell Coding Technique

Who actually developed this clever and useful code and when?

Anyone who has worked enough with certain years of the U.S. Census microfilms has seen the Remington Rand copyright notices with a lightning bolt thrusting up through the logo: "Soundex, Quick as a Flash!". A little bit of corporate genealogy here. The company we now know as Unisys is the descendant of Burroughs. Burroughs acquired Sperry Rand. Sperry Rand was a creation of Remington Rand which was the child of Remington (a typewriter manufacturer).

Because Remington did the coding of the censuses, many people credit it with its invention. And Remington's literature gives the wrong impression that Remington's Library Bureau Division invented Soundex in house in 1912. Remington was very successful in promoting this indexing technique. A 1948 brochure from Remington Rand Systems Division "Office Manual of Filing Systems" gives some insight into why that group was hired to soundex index the U.S. Census for 1880, 1900, 1920, and some states for 1910.

THE AUTOMATIC INDEX

In 1912 it became possible for the first time to check papers BACK INTO a file with accuracy and speed through a combination of alphabetic and numerical designations.

WHERE TO USE SOUNDEX

Soundex is most efficient under these conditions:

1. In files of 50,000 or more names, especially if positive locating of information is vital.

2. Where reference is frequent and speed important.

3. Where names of individuals predominate.

DETECTS AND "COLLECTS" NAME VARIATIONS

Summarizing its advantages, SOUNDEX ...

1. Provides a positive and unchanging number for every name.

2. Automatically groups 98% of all family names regardless of spelling.

3. Detects duplications, and prevents future duplications.

4. Offers unlimited expansion.

5. Puts responsibility for results on the system.

6. Uses 6 numbers instead of 26 letters.

7. Permits numeric sorting, filing, and finding -- the fastest of all methods.

8. Provides a rapid and unfailing way of checking for accuracy.

9. Counteracts most transcribing errors.

10. Permits all minds, on all occasions, to file and find alike.

11. Reduces clerical and supervisory expense, executive delays, and losses from erroneous information.

Remington Rand will assume either supervisory or complete responsibility for putting name indexes on an efficient Soundex basis. Our Contract Service Department has performed this work for many of the most important files in America....

** courtesy of the Unisys Corporation archives in Detroit Michigan.

Remington actually acquired the license for the Soundex system when it bought up a company named Library Bureau prior to 1940. The Library Bureau was licensing pre-existing patents, which were issued at least as early as 1907 to Robert C. Russell of Pittsburgh, PA for his Russell Definite Index -- which seems to have been first marketed by the Boston Index Card Company (later acquired by the Library Bureau?) sometime prior to 1918.

The U.S. Patent Office lists a Soundex developed and patented by Margaret K. Odell and Robert C. Russell, U.S. Patents 1261167 (1918) and 1435663 (1922). In the 1920s and 1930s there were many national and international attempts at spelling reform and international languages like Esperanto. Perhaps the invention of the soundex was part of this overall movement.

Miracode/Nu-Soundex

There are variations on the basic soundex. About 1940 a modernized variation of the Soundex index cards called Miracode index cards was used. These cards were being typed on automated data processing machines. Another variation was Nu-Soundex which added a second field for encoded date information and eliminated the need for cross-indexing in many applications.

Soundzee

George Hlavka, a genealogist in Santa Monica, uses SOUNDZEE in his research. It is a simple change to the existing Russell technique. George codes the first letter as well. So Kucera using Russell would be K260. Using Soundzee it is 2260 as is Cucera and Quecera. He has not promoted the use of his technique beyond his own personal records.
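
For those who want to experiment, here is a minimal sketch of the idea in Python. The function name soundzee is hypothetical, and so is the handling of a name whose first letter has no digit (a vowel, H, W, or Y): the description above does not say what to do, so this sketch simply keeps that letter. As with the Russell rules, doubles are coded once, and it is assumed that a vowel between two letters of the same group separates them.

```python
# Letter groups from the Russell coding table.
CODES = {}
for letters, digit in (("BPFV", "1"), ("CSKGJQXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundzee(name):
    """Russell-style code in which the first letter is coded as a digit too."""
    letters = [c for c in name.upper() if c.isalpha()]
    out = []
    previous = ""
    for position, letter in enumerate(letters):
        code = CODES.get(letter, "")
        if code:
            if code != previous:       # doubles are coded once
                out.append(code)
            previous = code
        elif position == 0:
            out.append(letter)         # assumption: keep an uncoded first letter
            previous = ""
        elif letter not in "HW":
            previous = ""              # assumption: a vowel or Y separates
    return ("".join(out) + "000")[:4] if out else ""


for surname in ("Kucera", "Cucera", "Quecera"):
    print(surname, soundzee(surname))
```

All three spellings come out as 2260, which is the point of coding the first letter.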

Soundex Problems

Soundex techniques are too simplistic and give false positives.
- B560 is Bonner, Boner, and Bohner. But it also is Baymore.
- S530 is Smith, Smythe. It also is Saneed.
- V525 is Van Hoesen, Van Huizen, and Van Housen. It also is Vincenzo.
Discarding all vowels is misleading.
- BRxxD is not the same as BxRxD. How about a vowel marker? Sometimes, though, an extra syllable has been added (MANDELOVSKI has become MANDELOVESKI), and a vowel marker would be deceiving. It works the other way as well, as when HOROWITZ has become HORWITZ.
The more information you drop (so as to avoid false negatives), the greater the number of names that will be put into each group (and the greater the number of false positives). Retaining information to avoid false positives risks a greater number of false negatives.
- One must be willing to settle for less than perfection and recognize that it is impossible to please all the people all of the time. The aim is to achieve the maximum number of positive hits with the least number of negative hits.
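
The false positives listed above are easy to reproduce by grouping a list of names under their codes. The sketch below is illustrative only: soundex and group_by_code are hypothetical names, the coder is a compact version of the Russell rules given earlier in this article, and it assumes that a vowel between two letters of the same group separates them while H and W do not.

```python
from collections import defaultdict

CODES = {}
for letters, digit in (("BPFV", "1"), ("CSKGJQXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for letter in letters:
        CODES[letter] = digit


def soundex(name):
    """Compact Russell-style coder: first letter plus three digits."""
    letters = [c for c in name.upper() if c.isalpha()]
    digits, previous = [], CODES.get(letters[0], "")
    for letter in letters[1:]:
        code = CODES.get(letter, "")
        if code and code != previous:
            digits.append(code)
        if code or letter not in "HW":
            previous = code            # a vowel resets, H and W do not
    return (letters[0] + "".join(digits) + "000")[:4]


def group_by_code(names):
    """Map each code to the names that share it, in input order."""
    groups = defaultdict(list)
    for name in names:
        groups[soundex(name)].append(name)
    return dict(groups)


names = ["Bonner", "Boner", "Bohner", "Baymore", "Smith", "Smythe", "Saneed",
         "Van Hoesen", "Van Huizen", "Van Housen", "Vincenzo"]
for code, members in group_by_code(names).items():
    print(code, members)
```

Eleven names collapse into three groups, and each group picks up one name that a researcher would probably not want there.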

The search for a better soundex has continued. The advent of the personal computer has opened up the field to many more people. The power of the personal computer now exceeds the power of the large computers of the 1940s. That means the algorithms can be much more complex.

DAITCH-MOKOTOFF

Gary Mokotoff, the president of the Association of Jewish Genealogical Societies and publisher of AVOTAYNU, computerized in 1984 the names of persons who legally changed their names while living in Palestine under the British Mandate. He used the standard Russell Coding Soundex. However, he found that the system did not work well with Slavic and German spellings of Yiddish surnames.

So he revised the coding to accommodate those surnames and published the new rules in AVOTAYNU in "Proposal for a Jewish Soundex Code". Randy Daitch, another member of the Jewish Genealogical Society, expanded on these rules and in 1985 published the AVOTAYNU article "The Jewish Soundex - A Revised Format".

The basic enhancements were:

expanded the soundex code from four to six digits,
coded the first letter,
assigned a single code to some double-letter combinations,
coded more than once those letters or combinations that are pronounced in different ways,
changed Romance language pronunciations to the Slavic/German tongue (e.g. W is coded as V, not silent), and
addressed the problem of the letter C by separating the hard and soft sounds.

DAITCH-MOKOTOFF CODING

Names are coded to six digits, each digit representing a sound listed in the coding chart.
When a name lacks enough coded sounds for six digits, use zeros to fill to six digits. GOLDEN which has only four coded sounds is coded as 583600.
The letters A, E, I, O, U, J, and Y are always coded at the beginning of a name as in Alpert 087930. In any other situation, they are ignored except when two of them form a pair and the pair comes before a vowel, as in Breuer 791900 but not Freud.
The letter H is coded at the beginning of a name as in Haber 579000 or preceding a vowel as in Manheim 665600, otherwise it is not coded.
When adjacent sounds can combine to form a larger sound, they are given the code number of the larger sound, as in Mintz, which is not coded MIN-T-Z but MIN-TZ 664000.
When adjacent letters have the same code number, they are coded as one sound, as in TOPF, which is not coded TO-P-F 377000 but TO-PF 370000. Exceptions to this rule are the letter combinations MN and NM whose letters are coded separately, as in Kleinman, which is coded 586660 not 586600.
When a surname consists of more than one word, it is coded as if one word, such as Ben Aron which is treated as Benaron.
Several letter and letter combinations pose the problem that they may sound in one of two ways. The letter and letter combinations CH, CK, C, J, and RS are assigned two possible code numbers.

    Letter  Alternate       Initial    Before a     Any other
            Spelling        Letter     vowel        situation
    AI       AJ, AY         0          1         Not coded (NC)  
    AU                      0          7            NC
    A                       0          NC           NC
    B                       7          7             7
    CHS                     5         54            54
    CH      Try KH (5) and TCH (4)
    CK      Try K (5) and TSK (45)
    CZ      CS, CSZ, CZS    4          4             4
    C       Try K (5) and TZ (4)
    DRZ     DRS             4          4             4
    DS      DSH, DSZ        4          4             4
    DZ      DZH, DZS        4          4             4
    D       DT              3          3             3
    EI      EJ, EY          0          1            NC
    EU                      1          1            NC
    E                       0         NC            NC
    FB                      7          7             7
    F                       7          7             7
    G                       5          5             5
    H                       5          5            NC
    IA      IE, IO, IU      1         NC            NC
    I                       0         NC            NC
    J       Try Y (1) and DZH (4)
    KS                      5         54            54
    KH                      5          5             5
    K                       5          5             5
    L                       8          8             8
    MN                                66            66
    M                       6          6             6
    NM                                66            66
    N                       6          6             6
    OI      OJ, OY          0          1            NC
    O                       0         NC            NC
    P       PF, PH          7          7             7
    Q                       5          5             5
    RZ, RS  Try RTZ (94) and ZH (4)
    R                       9          9             9    
    SCHTSCH SCHTSH, SCHTCH  2          4             4        
    SCH                     4          4             4        
    SHTCH   SHCH, SHTSH     2          4             4        
    SHT     SCHT, SCHD      2         43            43        
    SH                      4          4             4
    STCH    STSCH, SC       2          4             4
    STRZ    STRS, STSH      2          4             4
    ST                      2         43            43
    SZCZ    SZCS            2          4             4
    SZT     SHD, SZD, SD    2         43            43
    SZ                      4          4             4
    S                       4          4             4
    TCH     TTCH, TTSCH     4          4             4
    TH                      3          3             3
    TRZ     TRS             4          4             4
    TSCH    TSH             4          4             4
    TS      TTS, TTSZ, TC   4          4             4
    TZ      TTZ, TZS, TSZ   4          4             4
    T                       3          3             3
    UI      UJ, UY          0          1            NC
    U       UE              0         NC            NC
    V                       7          7             7
    W                       7          7             7
    X                       5         54            54
    Y                       1         NC            NC
    ZDZ     ZDZH, ZHDZH     2          4             4
    ZD      ZHD             2         43            43
    ZH      ZS, ZSCH, ZSH   4          4             4
    Z                       4          4             4

Auerbach   A-UE-R-B-A-CH      097500   Lipshitz  L-I-P-SH-I-TZ  874400
Ohrbach    O-H-R-B-A-CH       097500   Lippszyc  L-I-P-P-SZ-Y-C 874400

Szlamavitz SZ-L-A-M-A-V-I-TZ  486740                                       
Shlamowicz SH-L-A-M-O-W-I-CZ  486740
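
The rules and chart above can be turned into a short program. What follows is a minimal sketch and not the official Daitch-Mokotoff implementation: the names RULES, _add and daitch_mokotoff are made up for illustration. Several points are assumptions: "before a vowel" is taken to mean before A, E, I, O, or U; a single O before a vowel is treated as not coded, following the written rule for vowels; MN and NM are given 66 in every position; and letters outside A-Z are skipped. Because some rows say "Try ... and ...", the function returns every possible code rather than just one.

```python
RULES = {}


def _add(patterns, *alternatives):
    """Register chart rows sharing (initial, before vowel, other) codes."""
    for pattern in patterns.split():
        RULES[pattern] = alternatives


# "" stands for "not coded" (NC in the chart).
_add("AI AJ AY EI EJ EY OI OJ OY UI UJ UY", ("0", "1", ""))
_add("AU", ("0", "7", ""))
_add("A E I O U UE", ("0", "", ""))
_add("EU", ("1", "1", ""))
_add("IA IE IO IU Y", ("1", "", ""))
_add("B F FB P PF PH V W", ("7", "7", "7"))
_add("G K KH Q", ("5", "5", "5"))
_add("H", ("5", "5", ""))
_add("CHS KS X", ("5", "54", "54"))
_add("D DT T TH", ("3", "3", "3"))
_add("L", ("8", "8", "8"))
_add("M N", ("6", "6", "6"))
_add("MN NM", ("66", "66", "66"))
_add("R", ("9", "9", "9"))
_add("CZ CS CSZ CZS DRZ DRS DS DSH DSZ DZ DZH DZS SCH SH SZ S TCH TTCH "
     "TTSCH TRZ TRS TSCH TSH TS TTS TTSZ TC TZ TTZ TZS TSZ ZH ZS ZSCH ZSH Z",
     ("4", "4", "4"))
_add("SCHTSCH SCHTSH SCHTCH SHTCH SHCH SHTSH STCH STSCH SC STRZ STRS STSH "
     "SZCZ SZCS ZDZ ZDZH ZHDZH", ("2", "4", "4"))
_add("SHT SCHT SCHD ST SZT SHD SZD SD ZD ZHD", ("2", "43", "43"))
# Rows marked "Try ... and ..." carry two alternatives.
_add("CH C", ("5", "5", "5"), ("4", "4", "4"))
_add("CK", ("5", "5", "5"), ("45", "45", "45"))
_add("J", ("1", "", ""), ("4", "4", "4"))
_add("RZ RS", ("94", "94", "94"), ("4", "4", "4"))

VOWELS = set("AEIOU")
MAX_PATTERN = max(len(pattern) for pattern in RULES)


def daitch_mokotoff(name):
    """Return the sorted list of six-digit codes for a surname."""
    s = "".join(c for c in name.upper() if c.isalpha())   # Ben Aron -> BENARON
    branches = {("", "")}          # (digits so far, code of the previous sound)
    i = 0
    while i < len(s):
        # Use the longest chart entry that starts at position i.
        for size in range(min(MAX_PATTERN, len(s) - i), 0, -1):
            if s[i:i + size] in RULES:
                break
        else:
            i += 1                 # letter not in the chart: skip it
            continue
        following = s[i + size:i + size + 1]
        column = 0 if i == 0 else 1 if following in VOWELS else 2
        extended = set()
        for digits, previous in branches:
            for alternative in RULES[s[i:i + size]]:
                code = alternative[column]
                if code and code != previous:   # same adjacent code: once
                    extended.add((digits + code, code))
                else:
                    extended.add((digits, code))
        branches = extended
        i += size
    return sorted({(digits + "000000")[:6] for digits, _ in branches})


for surname in ("Golden", "Kleinman", "Auerbach", "Lippszyc"):
    print(surname, daitch_mokotoff(surname))
```

Names containing C, CH, CK, J, RS, or RZ come back with more than one code; a search would then look under each of them.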

METAPHONE

This technique was developed by Lawrence Philips, an artificial intelligence specialist at NAC Reinsurance. His problem was to design a program that would retrieve items that sound like the ones that were entered. Metaphone ignores vowels after the first letter and simplifies thereafter by equating "D" and "T" and so on. It attempts to apply commonplace rules of English pronunciation (e.g. "c" before "i" or "e" is pronounced like "s"). It also reduces the alphabet to 16 consonant sounds, although vowels are kept when they are the first letter. Zero (Ø) is used to represent the "th" sound (it looks a lot like the Greek theta when it has that line through it). "X" is used for the "sh" sound (the Chinese now use it that way when spelling Chinese words for westerners, as in Deng Xiaoping).

METAPHONE CODING

The 16 consonant sounds: B X S K J T F H L M N P R Ø W Y

      Letter  Code    Comments                        

        B       B     unless at the end of a word
                      after "m" as in "dumb"          
        C       X     (sh) if "-cia-" or "-ch-"       
                S     if "-ci-", "-ce-", or "-cy-"
                      silent if "-sci-", "-sce-", or  
                      "-scy-"                         
                K     otherwise, including in        
                      "-sch-"                         
        D       J     if in "-dge-", "-dgy-",         
                      "-dgi".                         
                T     otherwise                       
        F       F                                       
        G             silent if in "-gh-" and not at  
                      end or before a vowel           
                      in "-gn" or "-gned"             
                      in "-dge-" etc., as in above    
                      rule                            
                J     if before "i", or "e", or "y"   
                      if not double "gg"              
                K     otherwise                       
        H             silent if after vowel and no    
                      vowel follows                   
                H     otherwise                       
        J       J                                       
        K             silent if after "c"             
                K     otherwise                       
        L       L                                       
        M       M                                       
        N       N                                       
        P       F     if before "h"
                P     otherwise
        Q       K
        R       R
        S       X     (sh) if before "h" or in "-sio-" or "-sia-"
                S     otherwise
        T       X     (sh) if "-tia-" or "-tio-"
                Ø     (th) if before "h"
                      silent if in "-tch-"
                T     otherwise
        V       F
        W             silent if not followed by a
                      vowel
                W     if followed by a vowel
        X       KS
        Y             silent if not followed by a
                      vowel
                Y     if followed by a vowel
        Z       S

Exceptions:

Initial "kn-", "gn-", "pn-", "ae-", "wr-": drop the first letter

Initial "x": change it to "s"

Initial "wh-": change to "w"

For practical purposes, usually only the first four (if there are that many) letters of the phonetic spelling are used, giving, for example SKL for school and XBRT for Shubert. Metaphone can be modified to account for the "SH" sound in casual and so on, according to your intended domain, but English spelling is so weird that after a certain point you run into the contradictory cases ("-sua" in "casual" is "SH" but in "persuade" the "s" sounds like an "S"). Other inconsistencies include common words like "chemical," "technical," and "mechanic," where "ch" is pronounced like "k" since the words are derived from Greek.

    Bonner - BNR    Smith - SMØ    Van Hoesen - FNHSN
    Boner - BNR     Smythe - SMØ   Van Huizen - FNHSN
    Bohner - BNR    Saneed - SNT   Van Housen - FNHSN
    Baymore - BMR                  Vincenzo - FNSNS
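
For readers who would like to experiment, the chart can be approximated in Python. This is a rough sketch and not Lawrence Philips' published code: the function name metaphone and its max_len parameter are hypothetical. It follows the chart as printed (so V is coded as F) and adds three assumptions the chart does not spell out: doubled letters other than C count once, an H directly after C, G, P, S, or T is treated as part of that pair and not coded again, and the "th" sound is written as a plain 0 in place of Ø.

```python
VOWELS = set("AEIOU")
FRONT = set("EIY")          # letters that soften a preceding C or G
UNCHANGED = set("FJLMNR")   # consonants coded as themselves
PAIRED = set("CGPST")       # assumption: H after these belongs to a pair


def metaphone(word, max_len=None):
    """Approximate Metaphone code of a word, following the chart above."""
    w = "".join(c for c in word.upper() if c.isalpha())
    # Exceptions for the start of a word.
    if w[:2] in ("KN", "GN", "PN", "AE", "WR"):
        w = w[1:]
    elif w[:1] == "X":
        w = "S" + w[1:]
    elif w[:2] == "WH":
        w = "W" + w[2:]
    out = []
    for i, c in enumerate(w):
        prev = w[i - 1] if i else ""
        nxt = w[i + 1:i + 2]
        nxt2 = w[i + 1:i + 3]
        if c == prev and c != "C":
            continue                      # assumption: doubled letters once
        code = ""
        if c in VOWELS:
            code = c if i == 0 else ""    # vowels survive only at the start
        elif c in UNCHANGED:
            code = c
        elif c == "B":
            code = "" if prev == "M" and i == len(w) - 1 else "B"
        elif c == "C":
            if prev == "S" and nxt in FRONT:
                code = ""                 # -sci-, -sce-, -scy-
            elif nxt2 == "IA" or (nxt == "H" and prev != "S"):
                code = "X"
            elif nxt in FRONT:
                code = "S"
            else:
                code = "K"                # includes -sch-
        elif c == "D":
            code = "J" if nxt2 in ("GE", "GY", "GI") else "T"
        elif c == "G":
            after_h = w[i + 2:i + 3]
            if nxt == "H" and after_h and after_h not in VOWELS:
                code = ""                 # -gh- not at end, not before a vowel
            elif w[i:] in ("GN", "GNED") or (prev == "D" and nxt in FRONT):
                code = ""
            else:
                code = "J" if nxt in FRONT else "K"
        elif c == "H":
            silent = prev in PAIRED or (prev in VOWELS and nxt not in VOWELS)
            code = "" if silent else "H"
        elif c == "K":
            code = "" if prev == "C" else "K"
        elif c == "P":
            code = "F" if nxt == "H" else "P"
        elif c == "Q":
            code = "K"
        elif c == "S":
            code = "X" if nxt == "H" or nxt2 in ("IO", "IA") else "S"
        elif c == "T":
            if nxt2 in ("IA", "IO"):
                code = "X"
            elif nxt == "H":
                code = "0"                # the "th" sound
            elif nxt2 == "CH":
                code = ""
            else:
                code = "T"
        elif c == "V":
            code = "F"
        elif c in ("W", "Y"):
            code = c if nxt in VOWELS else ""
        elif c == "X":
            code = "KS"
        elif c == "Z":
            code = "S"
        out.append(code)
    result = "".join(out)
    return result[:max_len] if max_len else result


for word in ("Bohner", "Smythe", "school", "Shubert", "Vincenzo"):
    print(word, metaphone(word))
```

Pass max_len=4 to keep only the first four letters of the phonetic spelling, as described above.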

Guth Coding

Brian Randell at the University of Newcastle upon Tyne has the last word. All of the discussions above concentrate on spelling variations that would arise from "phonological" reasons such as mis-hearing, changes in spoken accents, etc. However, another obvious cause of spelling errors is transliteration from difficult-to-decipher handwriting. For example, in my own family my mother (whose parents had died when she was young) had believed for many years that her mother's maiden name was Buse, rather than Beese, because of the writing on the marriage certificate.

In looking into the literature on computer-based work by historical demographers on "family reconstitution" I came across one algorithm for name matching which is mainly aimed at this sort of problem, namely that by Guth (G.J.A. Guth, "Surname Spellings and Computerized Record Linkage," Historical Methods Newsletter, Vol. 10, pp. 10-19, 1976). This method provides a surname matching algorithm that takes account of letter ordering, rather than being phonetic in character. It was used in a project to link various sets of early 18th century Norwich records: pollbooks, land tax assessments, window-tax assessments and registers of freemen.

(It was also used) in connection with a large-scale parish register and census data linking project in Quebec, for matching data concerning couples (rather than individuals). Five different types of name variation were described: spelling variations, phonetic variations, double-names, double first names, and alternate first names. The article "Name Variations and Computerized Record Linkage" by Bouchard and Pouyez in Historical Methods, Vol. 13, pp119-125, 1980, details the techniques used for dealing with two of these. Spelling variations were handled by a specially created scheme of phonetic encoding for the French language, containing 64 rules. The article points out that soundex is not really a phonetic encoding scheme, but rather just a crude sorting device.

The phonetic variations were handled by a technique based mainly on Guth's scheme for assessing the extent to which two words contain the same letters in the same order. Another aspect of the scheme is that it builds up a dictionary of equivalent names, using separate criteria for isolated names, and names in the context of all the name data available for a couple. On one test, involving 2,000 records the scheme succeeded in finding 98.5% of the possible links, whereas only two-thirds of the possible links could be found if one tested simply for name identicality.

De Brou and Olsen discuss experiments using the Guth algorithm against language-specific surname matching algorithms such as Soundex. "Although relatively successful for the particular projects for which they were designed, each language-specific system suffers from a major drawback. Its effective application is limited to the language for which it was originally created... (Guth's) algorithm does not depend on recognition of phonetic similarity and she is correct to claim that her algorithm has an important advantage over other systems..... it is well-suited to the linking of a multi-ethnic population."

The bottom line is clearly that blind faith in any one system such as Soundex is unwise, particularly when used in circumstances which do not match those for which the system was designed.

CREDITS

Gary Mokotoff, "Introducing the Daitch-Mokotoff Soundex System", Ancestry Newsletter, VII:3, p5-7.

Lavona L. Ness, "Census Schedules - U.S. Federal" Syllabus for the Priesthood Genealogy Seminar, Saratoga, CA 1979.

Lawrence Philips, "Hanging on the Metaphone", Computer Language, December 1990.

Publications of the Corporation of the President of the Church of Jesus Christ of Latter-day Saints. Census information handouts from the Genealogical Library, Salt Lake City, Utah.

Genealogy(SF) electronic contributors to this paper:

Mary Lou Barrett, Seattle, Washington
Barbara Bennett, Columbia, MD
Sue Budlong, Columbia, MD
George Hlavka, Santa Monica, CA
Steve Hobbs, The Time Warp,
Robert McLaren, Columbia, MD
Dick Miller, Natick, MA
Richard Pence, Arlington, VA
Roland Roy, Florida
Eric Simon, Arlington, VA
Jerry Smith, Sacramento, CA
Vicki Titus
