Volume I, Number 1
International Phonetic Alphabet | Russell Soundex Coding | 1880 U.S. Census | 1900 U.S. Census | 1910 U.S. Census | History of the Russell Coding Technique | Miracode/Nu-Soundex | Soundzee | Soundex Problems | Daitch-Mokotoff | Daitch-Mokotoff Coding Rules | Metaphone | Guth | Credits
How do genealogists deal with the problems of surname spellings? Confronted with 30 pages of passenger lists, how does one find the names of interest? When indexing entries, what can one do to arrange the entries in such a way that one can find all the names of interest?
A word that starts with W in English or German is likely to be GU in Spanish (Waltman - Gualteman) and one that starts with S becomes ES (Scott - Escott). Let a French scribe write the Spanish modified name and you will get Gualtheman. Would you recognize David Gualtheman of Natchitoches as the same person as David Waltman of Natchez and Pointe Coupee? Have you ever experienced the frustration of spending hours at the microfilm reader looking for Van Hoesen only to realize once you have returned home that it could also have been spelled Van Huizen?
This indexing problem has two aspects: spelling variants that may not sound the same, and names that sound the same.
In the first instance, spelling variants that may not sound the same, you must "know" the rules for what constitutes a spelling variant or have access to a reference that contains spelling variants. An example of this technique is the one used for the International Genealogical Index. It groups names together and indicates spelling variants. It lists sound-alike names together followed by spelling variants. The theory behind this technique has not been made public but the "LDS Genealogist's Handbook" refers to automated catalogs of spellings for given names and surnames for each country and standardized spellings for the indices.
Bonner, William
Boughner, William Asbury
Boner, Wilson
Bonnes, ** see Bunis
Bonnet, ** see Bonnett
Bonnett, ** see Bennett
Herrich, ** see Herrick
Reinschmidt, ** see Reinsmith
Reinthal, ** see Renthall
Reynard, ** see Rinehart
Renolas, ** see Reynolds
etc......................
In the latter instance, names that sound the same, you could convert each name to a string of characters representing how it sounds. If your conversion technique is valid, all names that sound the same are converted to the same string. Arranging all the entries by the code would result in grouping all like sounding names together. And the researcher would find all spelling variations of the surname.
What kind of conversion would you use?
If you want to capture strictly pronunciation variations, you would look into the International Phonetic Alphabet (IPA) developed by anthropologists and linguists for transcribing actual pronunciation. Some language textbooks use this alphabet to indicate proper pronunciation.
The International Phonetic Alphabet (IPA) is sponsored by the International Phonetic Association. The society, founded in 1886, works for the advancement of the study of phonetics. The alphabet consists of letters that symbolize the position of the articulating organs. The same sound has the same symbol irrespective of the language, or the stage in the development of a language, in which the sound occurs. It is frequently used in field studies for transcribing pre-literate languages. It is not specific to any language. Although it uses the Roman alphabet, it has additional characters for other sounds such as the Welsh ll/fl and the !click of African languages.
In genealogy we are more likely to consult written than spoken sources so what we call a "soundex" technique is a better bet than a truly phonetic alphabet such as the IPA. For example if you were researching ALLAIN in Louisiana you might assume it would be ALAN as in "clan" and be disappointed to learn that all the relatives pronounce it Al-Layne. A strictly phonetic alphabet would render these as two separate words. Soundex coding treats them as if they were the same. Theoretically, using a soundex system you should be able to index a name so that you can find it no matter how one spells it.
Soundex -- but that means coding how something sounds, doesn't it? The name is deceiving: soundex is often loosely defined as a representation based on the way a name sounds rather than the way one spells it. But soundex techniques (regardless of their name) are more a method of capturing spelling variations than pronunciation variations.
For example, using the Russell coding technique, B100 gives one not only BABY and its known spelling variants BABE, BABI, and BABIE but also such pronunciation variations as BOBO and BEEBE. For some researchers those variations may be boo boos. The reason for this is that soundex coding techniques ignore vowels. The consonant families are represented, but it is the vowels that really cause differences in pronunciation and soundex ignores them.
In the 1930s the WPA did a complete Soundex of the 1880, 1900, and 1920 censuses. The census information was copied onto file cards, alphabetically coded, and filed by state.
The coding rules were:
1. Take the first letter as is.
2. Code the following letters to three digits using 0 at the end if needed.
3. Ignore A, E, I, O, U, Y, W, and H.
4. Code double letters as one letter.
5. Caution: prefixes (van, Von, Di, de, le, du, d', dela, etc) are sometimes disregarded.
Letter             Value
B P F V            1
C S K G J Q X Z    2
D T                3
L                  4
M N                5
R                  6

Bonner-B560    Smith-S530     Rea-R000     Van Hoesen-V525
Boner-B560     Smythe-S530    Rhea-R000    Van Huizen-V525
Bohner-B560                   Ray-R000     Van Housen-V525
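The rules and letter values above can be sketched in a few lines of Python. This is a minimal sketch that follows the rules exactly as listed here, not a definitive implementation: the names `russell_code` and `build_index` are hypothetical, prefixes (rule 5) are left to the researcher, and any refinements the census indexers applied beyond the rules stated above are not modeled.

```python
import re
from collections import defaultdict

# Letter groups and values from the table above.
CODES = {}
for letters, digit in (("BPFV", "1"), ("CSKGJQXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def russell_code(name):
    """Code a surname using only the rules listed in this article."""
    letters = re.sub(r"[^A-Z]", "", name.upper())  # drop spaces, apostrophes
    if not letters:
        return ""
    # Rule 4: code double letters as one letter.
    collapsed = re.sub(r"(.)\1+", r"\1", letters)
    # Rule 1: take the first letter as is.
    first = collapsed[0]
    # Rule 3: A, E, I, O, U, Y, W and H have no value and are skipped.
    digits = [CODES[ch] for ch in collapsed[1:] if ch in CODES]
    # Rule 2: keep three digits, using 0 at the end if needed.
    return first + "".join(digits[:3]).ljust(3, "0")


def build_index(names):
    """Group names by code so that like-coded spellings file together."""
    index = defaultdict(list)
    for name in names:
        index[russell_code(name)].append(name)
    return dict(index)
```

Filing entries under the code rather than the spelling is what brings Bonner, Boner and Bohner together under B560, and it is also why BABY, BOBO and BEEBE all land under B100.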
A soundex index to the 1880 Census was made that lists families with at least one child ten years of age or younger in the household in June 1880. In case you have forgotten the soundex rules, you will find the instructions at the beginning of the 1880 soundex film numbers listed in the book United States Census 1790-1880. Using this book, find your state or territory and use your soundex code to determine the film number.
Wisconsin: H-416 thru H-525 (N)      449069
Wisconsin: H-525 (O) thru H-634      449070
Wisconsin: H-635 thru J-520 (L)      449071
Wisconsin: J-520 (M) thru K-146      449072
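Looking up a film in a listing like the Wisconsin one amounts to finding the last film whose starting point is not past your code. The sketch below is illustrative only. It assumes that a parenthesized letter such as (O) is the first letter of the given names at which a film begins, which the listing itself does not state, and the names `WISCONSIN_1880`, `LAST_CODE` and `film_for` are hypothetical.

```python
from bisect import bisect_right

# Starting point of each film as (code, given-name initial), plus film number.
# An empty initial means the film starts at the beginning of that code.
# Assumption: "(O)" and "(M)" are given-name initials where a film begins.
WISCONSIN_1880 = [
    (("H416", ""), "449069"),
    (("H525", "O"), "449070"),
    (("H635", ""), "449071"),
    (("J520", "M"), "449072"),
]
LAST_CODE = "K146"  # last code covered by the final film in this excerpt


def film_for(code, given_name=""):
    """Return the film number covering this code, or None if out of range."""
    code = code.replace("-", "").upper()
    key = (code, given_name.upper())
    starts = [start for start, _ in WISCONSIN_1880]
    pos = bisect_right(starts, key)  # number of films starting at or before key
    if pos == 0 or code > LAST_CODE:
        return None
    return WISCONSIN_1880[pos - 1][1]
```

Under these assumptions, a Peter filed under H-525 falls on film 449070, while a Nancy with the same code is still on 449069.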
On the soundex film you will see index cards. These are the Soundex cards, not the census itself. Look at the upper left-hand corner of each card to find the code and family name. Several names may have the same code number. Last names are not in alphabetical order because of the coding system, but first names are filed alphabetically within the code. Remember, only households with a child under the age of 10 were included in the Soundex. Even then, some could have been missed. If you don't find a particular name, you can still obtain the census film and search it. Also look for variations of the given name, such as initials and nicknames.
A Soundex index to the 1900 Census exists for every state. The soundex rules are at the beginning of the 1900 soundex film numbers in the book, United States Census 1900. Use the same procedures as for the 1880 census to find your entries.
Illinois: S-316 John M. ** thru S-322 Orpha      1243346
Illinois: S-322 Park S. thru S-330 Wm. **        1243347
A Soundex index to the 1910 Census exists for Alabama, Arkansas, California, Florida, Georgia, Illinois, Kansas, Kentucky, Louisiana, Michigan, Mississippi, Missouri, North Carolina, Ohio, Oklahoma, Pennsylvania, South Carolina, Tennessee, Texas, Virginia, and West Virginia. The soundex rules are at the beginning of the 1910 soundex film numbers in the book, United States Census 1910. Use the same procedures as for the 1880 census to find your entries.
Kentucky: R-400 Arch thru R-512 Annie M          1370590
Kentucky: R-512 Benj. thru R-550 William T.      1370591
Kentucky: R-550 Willie E. thru R-665 Asa C.      1370592
Who actually developed this clever and useful code and when?
Anyone who has worked enough with certain years of the U.S. Census microfilms has seen the Remington Rand copyright notices with a lightning bolt thrusting up through the logo: "Soundex, Quick as a Flash!". A little bit of corporate genealogy here. The company we now know as Unisys is the descendant of Burroughs. Burroughs acquired Sperry Rand. Sperry Rand was a creation of Remington Rand which was the child of Remington (a typewriter manufacturer).
Because Remington did the coding of the censuses, many people credit it with its invention. And Remington's literature gives the wrong impression that Remington's Library Bureau Division invented Soundex in house in 1912. Remington was very successful in promoting this indexing technique. A 1948 brochure from Remington Rand Systems Division "Office Manual of Filing Systems" gives some insight into why that group was hired to soundex index the U.S. Census for 1880, 1900, 1920, and some states for 1910.
THE AUTOMATIC INDEX
In 1912 it became possible for the first time to check papers BACK INTO a file with accuracy and speed through a combination of alphabetic and numerical designations.
WHERE TO USE SOUNDEX
Soundex is most efficient under these conditions:
1. In files of 50,000 or more names, especially if positive locating of information is vital.
2. Where reference is frequent and speed important.
3. Where names of individuals predominate.
DETECTS AND "COLLECTS" NAME VARIATIONS
Summarizing its advantages, SOUNDEX ...
1. Provides a positive and unchanging number for every name.
2. Automatically groups 98% of all family names regardless of spelling.
3. Detects duplications, and prevents future duplications.
4. Offers unlimited expansion.
5. Puts responsibility for results on the system.
6. Uses 6 numbers instead of 26 letters.
7. Permits numeric sorting, filing, and finding -- the fastest of all methods.
8. Provides a rapid and unfailing way of checking for accuracy.
9. Counteracts most transcribing errors.
10. Permits all minds, on all occasions, to file and find alike.
11. Reduces clerical and supervisory expense, executive delays, and losses from erroneous information.
Remington Rand will assume either supervisory or complete responsibility for putting name indexes on an efficient Soundex basis. Our Contract Service Department has performed this work for many of the most important files in America....
** courtesy of the Unisys Corporation archives in Detroit, Michigan.
Remington actually acquired the license for the Soundex system when it bought up a company named Library Bureau prior to 1940. The Library Bureau was licensing pre-existing patents, which were issued at least as early as 1907 to Robert C. Russell of Pittsburgh, PA for his Russell Definite Index -- which seems to have been first marketed by the Boston Index Card Company (later acquired by the Library Bureau?) sometime prior to 1918.
The U.S. Patent Office lists a Soundex developed and patented by Margaret K. Odell and Robert C. Russell, U.S. Patents 1261167 (1918) and 1435663 (1922). In the 1920s and 1930s there were many national and international attempts at spelling reform and international languages like Esperanto. Perhaps the invention of the soundex was part of this overall movement.
There are variations on the basic soundex. About 1940 a modernized variation of the Soundex index cards called Miracode index cards was used. These cards were being typed on automated data processing machines. Another variation was Nu-Soundex which added a second field for encoded date information and eliminated the need for cross-indexing in many applications.
George Hlavka, a genealogist in Santa Monica, uses SOUNDZEE in his research. It is a simple change to the existing Russell technique. George codes the first letter as well. So Kucera using Russell would be K260. Using Soundzee it is 2260 as is Cucera and Quecera. He has not promoted the use of his technique beyond his own personal records.
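The Soundzee change is small enough to sketch directly. The following is a minimal sketch, assuming the letter values and rules of the Russell technique given earlier in this article; the function name `soundzee` is hypothetical, and keeping a first letter that has no value (a vowel, H, W or Y) unchanged is an assumption, since the description above does not cover that case.

```python
import re

# Letter groups and values of the Russell technique.
CODES = {}
for letters, digit in (("BPFV", "1"), ("CSKGJQXZ", "2"), ("DT", "3"),
                       ("L", "4"), ("MN", "5"), ("R", "6")):
    for ch in letters:
        CODES[ch] = digit


def soundzee(name):
    """Russell-style code in which the first letter is coded as well."""
    letters = re.sub(r"[^A-Z]", "", name.upper())
    if not letters:
        return ""
    collapsed = re.sub(r"(.)\1+", r"\1", letters)  # double letters count once
    # Assumption: a first letter without a value is kept as a letter.
    first = CODES.get(collapsed[0], collapsed[0])
    digits = [CODES[ch] for ch in collapsed[1:] if ch in CODES]
    return first + "".join(digits[:3]).ljust(3, "0")
```

Because K, C and Q share the value 2, Kucera, Cucera and Quecera all come out as 2260, at the cost of no longer being able to file the index by initial letter.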
The search for a better soundex has continued. The advent of the personal computer has opened up the field to many more people. The power of the personal computer now exceeds the power of the large computers of the 1940s. That means the algorithms can be much more complex.
Gary Mokotoff, the president of the Association of Jewish Genealogical Societies and publisher of AVOTAYNU, computerized in 1984 the names of persons who legally changed their names while living in Palestine under the British Mandate. He used the standard Russell Coding Soundex. However, he found that the system did not work well with Slavic and German spellings of Yiddish surnames.
So he revised the coding to accommodate those surnames and published the new rules in AVOTAYNU in "Proposal for a Jewish Soundex Code". Randy Daitch, another member of the Jewish Genealogical Society, expanded on these rules and in 1985 published the AVOTAYNU article "The Jewish Soundex - A Revised Format".
The basic enhancements were:
Letter   Alternate        Initial  Before a  Any other
         Spelling         Letter   vowel     situation
AI       AJ, AY           0        1         Not coded (NC)
AU                        0        7         NC
A                         0        NC        NC
B                         7        7         7
CHS                       5        54        54
CH                        Try KH (5) and TCH (4)
CK                        Try K (5) and TSK (45)
CZ       CS, CSZ, CZS     4        4         4
C                         Try K (5) and TZ (4)
DRZ      DRS              4        4         4
DS       DSH, DSZ         4        4         4
DZ       DZH, DZS         4        4         4
D        DT               3        3         3
EI       EJ, EY           0        1         NC
EU                        1        1         NC
E                         0        NC        NC
FB                        7        7         7
F                         7        7         7
G                         5        5         5
H                         5        5         NC
IA       IE, IO, IU       1        NC        NC
I                         0        NC        NC
J                         Try Y (1) and DZH (4)
KS                        5        54        54
KH                        5        5         5
K                         5        5         5
L                         8        8         8
MN                                 66        66
M                         6        6         6
NM                                 66        66
N                         6        6         6
OI       OJ, OY           0        1         NC
O                         0        0         NC
P        PF, PH           7        7         7
Q                         5        5         5
RZ, RS                    Try RTZ (94) and ZH (4)
R                         9        9         9
SCHTSCH  SCHTSH, SCHTCH   2        4         4
SCH                       4        4         4
SHTCH    SHCH, SHTSH      2        4         4
SHT      SCHT, SCHD       2        43        43
SH                        4        4         4
STCH     STSCH, SC        2        4         4
STRZ     STRS, STSH       2        4         4
ST                        2        43        43
SZCZ     SZCS             2        4         4
SZT      SHD, SZD, SD     2        43        43
SZ                        4        4         4
S                         4        4         4
TCH      TTCH, TTSCH      4        4         4
TH                        3        3         3
TRZ      TRS              4        4         4
TSCH     TSH              4        4         4
TS       TTS, TTSZ, TC    4        4         4
TZ       TTZ, TZS, TSZ    4        4         4
T                         3        3         3
UI       UJ, UY           0        1         NC
U        UE               0        NC        NC
V                         7        7         7
W                         7        7         7
X                         5        54        54
Y                         1        NC        NC
ZDZ      ZDZH, ZHDZH      2        4         4
ZD       ZHD              2        43        43
ZH       ZS, ZSCH, ZSH    4        4         4
Z                         4        4         4

Auerbach      A-UE-R-B-A-CH        097500
Ohrbach       O-H-R-B-A-CH         097500
Lipshitz      L-I-P-SH-I-TZ        874400
Lippszyc      L-I-P-P-SZ-Y-C       874400
Szlamavitz    SZ-L-A-M-A-V-I-TZ    486740
Shlamowicz    SH-L-A-M-O-W-I-CZ    486740
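One way to apply the table is sketched below. It is a hedged illustration rather than the published Daitch-Mokotoff procedure, because the table alone does not spell out how it is applied. The sketch assumes that the longest listed letter sequence is matched first, that adjacent sounds with the same code are coded once, that only A, E, I, O and U count as vowels for the "before a vowel" column, that codes are six digits padded with zeros, and that every "Try ... and ..." row yields two alternative codes. The names `TABLE_TEXT` and `dm_codes` are hypothetical.

```python
import re

# Rows transcribed from the table above: sequence(s), then the codes for
# "initial letter", "before a vowel" and "any other situation"; "-" = not coded.
# "|" separates the two readings of a "Try ... and ..." row.
# Assumption: MN/NM use 66 at the start of a name (the table leaves it blank).
TABLE_TEXT = """
AI,AJ,AY 0 1 -; AU 0 7 -; A 0 - -; B 7 7 7; CHS 5 54 54
CH 5 5 5 | 4 4 4; CK 5 5 5 | 45 45 45; CZ,CS,CSZ,CZS 4 4 4
C 5 5 5 | 4 4 4; DRZ,DRS 4 4 4; DS,DSH,DSZ 4 4 4; DZ,DZH,DZS 4 4 4
D,DT 3 3 3; EI,EJ,EY 0 1 -; EU 1 1 -; E 0 - -; FB 7 7 7; F 7 7 7
G 5 5 5; H 5 5 -; IA,IE,IO,IU 1 - -; I 0 - -; J 1 - - | 4 4 4
KS 5 54 54; KH 5 5 5; K 5 5 5; L 8 8 8; MN,NM 66 66 66; M 6 6 6
N 6 6 6; OI,OJ,OY 0 1 -; O 0 0 -; P,PF,PH 7 7 7; Q 5 5 5
RZ,RS 94 94 94 | 4 4 4; R 9 9 9; SCHTSCH,SCHTSH,SCHTCH 2 4 4
SCH 4 4 4; SHTCH,SHCH,SHTSH 2 4 4; SHT,SCHT,SCHD 2 43 43; SH 4 4 4
STCH,STSCH,SC 2 4 4; STRZ,STRS,STSH 2 4 4; ST 2 43 43; SZCZ,SZCS 2 4 4
SZT,SHD,SZD,SD 2 43 43; SZ 4 4 4; S 4 4 4; TCH,TTCH,TTSCH 4 4 4
TH 3 3 3; TRZ,TRS 4 4 4; TSCH,TSH 4 4 4; TS,TTS,TTSZ,TC 4 4 4
TZ,TTZ,TZS,TSZ 4 4 4; T 3 3 3; UI,UJ,UY 0 1 -; U,UE 0 - -; V 7 7 7
W 7 7 7; X 5 54 54; Y 1 - -; ZDZ,ZDZH,ZHDZH 2 4 4; ZD,ZHD 2 43 43
ZH,ZS,ZSCH,ZSH 4 4 4; Z 4 4 4
"""

TABLE = {}
for row in TABLE_TEXT.replace("\n", ";").split(";"):
    row = row.strip()
    if row:
        keys, rest = row.split(" ", 1)
        alts = [tuple("" if c == "-" else c for c in alt.split())
                for alt in rest.split("|")]
        for key in keys.split(","):
            TABLE[key] = alts

MAX_KEY = max(len(key) for key in TABLE)
VOWELS = set("AEIOU")  # assumption: the table does not define "vowel"


def dm_codes(name):
    """Return the sorted list of six-digit codes the table allows for a name."""
    s = re.sub(r"[^A-Z]", "", name.upper())
    partial = {("", "")}  # pairs of (digits so far, code of previous sound)
    i = 0
    while i < len(s):
        # Longest listed sequence starting at position i (every letter is listed).
        key = next(s[i:i + n] for n in range(MAX_KEY, 0, -1) if s[i:i + n] in TABLE)
        nxt = s[i + len(key):i + len(key) + 1]
        col = 0 if i == 0 else 1 if nxt in VOWELS else 2
        step = set()
        for alt in TABLE[key]:
            code = alt[col]
            for digits, last in partial:
                # Adjacent sounds with the same code are coded once.
                if code and not last.endswith(code):
                    digits += code
                step.add((digits, code))
        partial = step
        i += len(key)
    return sorted({(digits + "000000")[:6] for digits, _ in partial})
```

With these assumptions the sketch gives 874400 for Lipshitz and 486740 for both Szlamavitz and Shlamowicz. For names containing a "Try" sequence it returns both readings, so Auerbach and Ohrbach each yield 097400 as well as the 097500 shown above.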
Metaphone was developed by Lawrence Philips, an artificial intelligence specialist at NAC Reinsurance. His problem was to design a program that would retrieve items that sound like the ones that were entered. Metaphone ignores vowels after the first letter and simplifies thereafter by equating "D" and "T" and so on. It attempts to apply commonplace rules of English pronunciation (e.g. "c" before "i" or "e" is pronounced like "s"). It also reduces the alphabet to 16 consonant sounds, although vowels are kept when they are the first letter. Zero (Ø) is used to represent the "th" sound (it looks a lot like the Greek theta when it has that line through it). "X" is used for the "sh" sound (the Chinese now use it that way when spelling Chinese words for westerners, as in Deng Xiaoping).
The 16 consonant sounds: B X S K J T F H L M N P R Ø W Y
Letter  Code      Comments
B       B         unless at the end of a word after "m" as in "dumb"
C       X (sh)    if "-cia-" or "-ch-"
        S         if "-ci-", "-ce-", or "-cy-"
        silent    if "-sci-", "-sce-", or "-scy-"
        K         otherwise, including in "-sch-"
D       J         if in "-dge-", "-dgy-", or "-dgi-"
        T         otherwise
F       F
G       silent    if in "-gh-" and not at end or before a vowel
                  in "-gn" or "-gned"
                  in "-dge-" etc., as in above rule
        J         if before "i", or "e", or "y" if not double "gg"
        K         otherwise
H       silent    if after vowel and no vowel follows
        H         otherwise
J       J
K       silent    if after "c"
        K         otherwise
L       L
M       M
N       N
P       F         if before "h"
        P         otherwise
Q       K
R       R
S       X (sh)    if before "h" or in "-sio-" or "-sia-"
        S         otherwise
T       X (sh)    if "-tia-" or "-tio-"
        Ø (th)    if before "h"
        silent    if in "-tch-"
        T         otherwise
V       F
W       silent    if not followed by a vowel
        W         if followed by a vowel
X       KS
Y       silent    if not followed by a vowel
        Y         if followed by a vowel
Z       S
Exceptions:
Initial "kn-", "gn-", "pn-", "ae-", "wr-" drop the first letter
Initial "x" change it to "s"
Initial "wh-" change to "w"
For practical purposes, usually only the first four (if there are that many) letters of the phonetic spelling are used, giving, for example SKL for school and XBRT for Shubert. Metaphone can be modified to account for the "SH" sound in casual and so on, according to your intended domain, but English spelling is so weird that after a certain point you run into the contradictory cases ("-sua" in "casual" is "SH" but in "persuade" the "s" sounds like an "S"). Other inconsistencies include common words like "chemical," "technical," and "mechanic," where "ch" is pronounced like "k" since the words are derived from Greek.
Bonner - BNR     Smith - SMØ       Van Hoesen - VNHSN
Boner - BNR      Smythe - SMØ      Van Huizen - VNHSN
Bohner - BNR     Saneed - SNT      Van Housen - VNHSN
Baymore - BMR    Vincenzo - VNSNS
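The letter rules and exceptions above translate fairly directly into code. What follows is a rough sketch of those rules as stated in this article, not Philips's published program. It adds two assumptions the rules do not state but the examples seem to require: doubled letters are coded once, and an "h" directly after c, s, p, t or g belongs to that letter and is not coded again. The name `metaphone_sketch` is hypothetical. Because the rules map V to F, the sketch codes the Van Hoesen examples with a leading F rather than the V shown above.

```python
import re

VOWELS = set("AEIOU")
FRONT = set("IEY")           # letters that soften a preceding C or G
DIGRAPH_LEAD = set("CSPTG")  # assumption: H after these is not coded again


def metaphone_sketch(word, max_len=4):
    w = re.sub(r"[^A-Z]", "", word.upper())
    # Exceptions at the start of the word.
    if w[:2] in ("KN", "GN", "PN", "AE", "WR"):
        w = w[1:]
    elif w[:2] == "WH":
        w = "W" + w[2:]
    elif w[:1] == "X":
        w = "S" + w[1:]
    out = []
    for i, c in enumerate(w):
        prev = w[i - 1] if i else ""
        nxt = w[i + 1:i + 2]  # next letter, or "" at the end
        two = w[i + 1:i + 3]  # next two letters
        if c == prev and c != "C":
            continue  # assumption: doubled letters are coded once
        if c in VOWELS:
            code = c if i == 0 else ""  # vowels kept only as the first letter
        elif c == "B":
            code = "" if prev == "M" and not nxt else "B"
        elif c == "C":
            if nxt == "H":
                code = "K" if prev == "S" else "X"
            elif two == "IA":
                code = "X"
            elif nxt in FRONT:
                code = "" if prev == "S" else "S"
            else:
                code = "K"
        elif c == "D":
            code = "J" if two in ("GE", "GY", "GI") else "T"
        elif c == "G":
            if nxt == "H":
                after_h = w[i + 2:i + 3]
                code = "" if after_h and after_h not in VOWELS else "K"
            elif w[i:] in ("GN", "GNED") or (prev == "D" and nxt in FRONT):
                code = ""
            else:
                code = "J" if nxt in FRONT else "K"
        elif c == "H":
            silent = prev in DIGRAPH_LEAD or (prev in VOWELS and nxt not in VOWELS)
            code = "" if silent else "H"
        elif c == "K":
            code = "" if prev == "C" else "K"
        elif c == "P":
            code = "F" if nxt == "H" else "P"
        elif c == "S":
            code = "X" if nxt == "H" or two in ("IO", "IA") else "S"
        elif c == "T":
            if two in ("IA", "IO"):
                code = "X"
            elif nxt == "H":
                code = "Ø"
            else:
                code = "" if two == "CH" else "T"
        elif c in "WY":
            code = c if nxt in VOWELS else ""
        else:
            # Q, V, X and Z are recoded; F, J, L, M, N and R stand for themselves.
            code = {"Q": "K", "V": "F", "X": "KS", "Z": "S"}.get(c, c)
        out.append(code)
    return "".join(out)[:max_len]
```

The `max_len` parameter reflects the practice of keeping only the first four letters of the phonetic spelling; pass a larger value to see the full code.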
Brian Randell at the University of Newcastle upon Tyne has the last word. All of the discussions above concentrate on spelling variations that would arise from "phonological" reasons such as mis-hearing, changes in spoken accents, etc. However, another obvious cause of spelling errors is transcription from difficult-to-decipher handwriting. For example, in my own family for many years my mother (whose parents had died when she was young) had believed that her mother's maiden name was Buse, rather than Beese, because of the writing on the marriage certificate.
In looking into the literature on computer-based work by historical demographers on "family reconstitution" I came across one algorithm for name matching which is mainly aimed at this sort of problem, namely that by Guth (G.J.A. Guth, "Surname Spellings and Computerized Record Linkage," Historical Methods Newsletter, Vol. 10, pp. 10-19, 1976). This method provides a surname matching algorithm that takes account of letter ordering, rather than being phonetic in character. It was used in a project to link various sets of early 18th century Norwich records: pollbooks, land tax assessments, window-tax assessments and registers of freemen.
(It was also used) in connection with a large-scale parish register and census data linking project in Quebec, for matching data concerning couples (rather than individuals). Five different types of name variation were described: spelling variations, phonetic variations, double-names, double first names, and alternate first names. The article "Name Variations and Computerized Record Linkage" by Bouchard and Pouyez in Historical Methods, Vol. 13, pp. 119-125, 1980, details the techniques used for dealing with two of these. Spelling variations were handled by a specially created scheme of phonetic encoding for the French language, containing 64 rules. The article points out that soundex is not really a phonetic encoding scheme, but rather just a crude sorting device.
The phonetic variations were handled by a technique based mainly on Guth's scheme for assessing the extent to which two words contain the same letters in the same order. Another aspect of the scheme is that it builds up a dictionary of equivalent names, using separate criteria for isolated names, and names in the context of all the name data available for a couple. On one test, involving 2,000 records the scheme succeeded in finding 98.5% of the possible links, whereas only two-thirds of the possible links could be found if one tested simply for name identicality.
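The idea of scoring how far two names share the same letters in the same order can be illustrated with a longest-common-subsequence measure. To be clear, this is not Guth's published algorithm, whose details are not given here. It is a stand-in technique chosen only to show the flavor of order-based, non-phonetic matching, and the function names and the scoring formula are assumptions.

```python
def lcs_length(a, b):
    """Length of the longest run of letters found in both names, in order,
    though not necessarily adjacent (longest common subsequence)."""
    prev = [0] * (len(b) + 1)
    for ch in a:
        cur = [0]
        for j, other in enumerate(b, 1):
            if ch == other:
                cur.append(prev[j - 1] + 1)
            else:
                cur.append(max(prev[j], cur[j - 1]))
        prev = cur
    return prev[-1]


def order_similarity(a, b):
    """Score from 0.0 (nothing shared) to 1.0 (identical spellings)."""
    a, b = a.upper(), b.upper()
    if not a and not b:
        return 1.0
    return 2 * lcs_length(a, b) / (len(a) + len(b))
```

Buse and Beese share B, S and E in that order, so they score 6/9, or about 0.67, even though a misread vowel separates them. A researcher would still have to choose a threshold above which two spellings are treated as candidates for the same name.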
De Brou and Olsen discuss experiments using the Guth algorithm against language-specific surname matching algorithms such as Soundex. "Although relatively successful for the particular projects for which they were designed, each language-specific system suffers from a major drawback. Its effective application is limited to the language for which it was originally created... (Guth's) algorithm does not depend on recognition of phonetic similarity and she is correct to claim that her algorithm has an important advantage over other systems..... it is well-suited to the linking of a multi-ethnic population."
The bottom line is clearly that blind faith in any one system such as Soundex is unwise, particularly when used in circumstances which do not match those for which the system was designed.
Gary Mokotoff, "Introducing the Daitch-Mokotoff Soundex System", Ancestry Newsletter, VII:3, p5-7.
Lavona L. Ness, "Census Schedules - U.S. Federal" Syllabus for the Priesthood Genealogy Seminar, Saratoga, CA 1979.
Lawrence Philips, "Hanging on the Metaphone", Computer Language, December 1990.
Publications of the Corporation of the President of the Church of Jesus Christ of Latter-day Saints. Census information handouts from the Genealogical Library, Salt Lake City, Utah.
Genealogy(SF) electronic contributors to this paper: