TDD Kata for encoding last names into a 4-character string

QueraltSM/Soundex

Soundex

Soundex is a well-known algorithm for encoding last names into a 4-character string: the first letter of the name followed by three digits.
The goal is to encode similar-sounding names to the same representation,
so that a search for a slightly misspelled name still finds the appropriate matches.

The rules for Soundex encoding:

1. Retain the first letter of the name and drop all other occurrences of a, e, i, o, u, y, h, w.

2. Replace consonants with digits as follows (after the first letter):

b, f, p, v → 1
c, g, j, k, q, s, x, z → 2
d, t → 3
l → 4
m, n → 5
r → 6

3. If two or more letters with the same number are adjacent in the original name (before step 1), retain only the first of them. Two letters with the same number separated by ‘h’ or ‘w’ are likewise coded as a single number, whereas two such letters separated by a vowel are coded twice. This rule also applies to the first letter.

4. If the name yields fewer than three numbers, pad with zeros until there are three. If it yields more than three numbers, keep only the first three.
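
The four rules above can be sketched in Python as follows. This is one possible reading of the rules, not the kata's reference solution: the names `CODES` and `soundex` are hypothetical, and returning an empty string for empty input is an assumption, since the rules do not cover that case.

```python
# Digit groups from rule 2. Letters missing from the map
# (a, e, i, o, u, y, h, w) have no code.
CODES = {
    letter: digit
    for digit, letters in {
        "1": "bfpv",
        "2": "cgjkqsxz",
        "3": "dt",
        "4": "l",
        "5": "mn",
        "6": "r",
    }.items()
    for letter in letters
}


def soundex(name: str) -> str:
    """Encode a name as its first letter followed by three digits."""
    letters = [ch for ch in name.lower() if ch.isalpha()]
    if not letters:
        return ""  # assumption: the rules do not define the empty case

    first = letters[0]
    digits = []
    # Rule 3 also applies to the first letter, so start from its code.
    previous = CODES.get(first)

    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not separate letters with equal codes
        code = CODES.get(ch)
        if code is None:
            previous = None  # a vowel (or y) lets the same code repeat
            continue
        if code != previous:
            digits.append(code)
        previous = code

    # Rule 4: keep at most three digits and pad with zeros.
    return first.upper() + "".join(digits)[:3].ljust(3, "0")
```

Tracking the previous code in a single pass handles rules 1 and 3 together: uncoded letters are never emitted, and only vowels and 'y' reset the "same number" check.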


Using this algorithm:

  • both "Robert" and "Rupert" return the same string "R163"
  • "Rubin" yields "R150"
  • "Ashcraft" and "Ashcroft" both yield "A261"
  • "Tymczak" yields "T522" not "T520", because 'z' and 'k' are separated by a vowel and are therefore each coded as 2
  • "Pfister" yields "P236" not "P123", because 'P' and 'f' share the number 1, so the pair is coded once, as the leading 'P'
  • "Honeyman" yields "H555"
