Re: (Long post) Metaphone Algorithm In AWK

Subject: Re: (Long post) Metaphone Algorithm In AWK
From: porkchop (at) *nospam* invalid.foo (Mike Sanders)
Newsgroups: comp.lang.awk
Date: 20 Aug 2024, 06:45:33
Organization: A noiseless patient Spider
Message-ID: <va1aht$3906i$1@dont-email.me>
References: 1 2
User-Agent: tin/2.6.2-20221225 ("Pittyvaich") (NetBSD/9.3 (amd64))
Ben Bacarisse <ben@bsb.me.uk> wrote:

> Using a word list, I found some odd matches.  For example:
>
> $ echo "drunkeness indigestion" | awk -f metaphone.awk -v find=texas
> drunkeness
> indigestion
>
> Are these really metaphone matches for "texas"?  It's possible (I don't
> know the algorithm at all well) but I found it surprising.

Ben, give this a try when you can. I'm finally starting to wrap my mind
around its usage a little more...

# Metaphone Algorithm In AWK v3: Michael Sanders - 2024
#
# tighter, cleaner, better
#
# example invocation: awk -f metaphone.awk -v find=cork < words.txt

BEGIN { find_code = metaphone(find) }

# -----------------------------------------------------------------

# emit metaphone codes only
# { for (x = 1; x <= NF; x++) print $x " : " metaphone($x) }

# tweak the levenshtein distance threshold to widen/constrain results...
{

for (x = 1; x <= NF; x++)
   if (metaphone($x) == find_code && levenshtein($x, find) <= 2)
      print $x " : " find

}

# -----------------------------------------------------------------

function isvowel(char) { return char ~ /[AEIOU]/ }

# -----------------------------------------------------------------

function metaphone(word, m, c, next_c, len, i) {
  word = toupper(word)
  gsub(/[^A-Z]/, "", word)  # strip non-alphabetic characters
  len = length(word)

  # handle initial letters
  if (substr(word, 1, 2) ~ /^(KN|GN|PN|WR|PS)/) {
    word = substr(word, 2)
    len--
  }

  for (i = 1; i <= len; i++) {
    c = substr(word, i, 1)
    next_c = (i < len) ? substr(word, i + 1, 1) : ""

    # skip duplicate letters except for 'C'
    if (i > 1 && c == substr(word, i - 1, 1) && c != "C") continue

    # handle vowels: retain only if it's 1st letter
    if (isvowel(c)) {
      if (i == 1) m = m c
    }
    # consonants
    else if (c == "B") {
      if (!(i == len && substr(word, i - 1, 1) == "M")) m = m "B"
    }
    else if (c == "C") {
      if (substr(word, i, 2) == "CH") {
        m = m "X"
        i++
      } else if (substr(word, i, 2) ~ /^(CI|CE|CY)/) {
        m = m "S"
      } else {
        m = m "K"
      }
    }
    else if (c == "D") {
      if (substr(word, i, 2) == "DG" && substr(word, i + 2, 1) ~ /[IEY]/) {
        m = m "J"
        i += 2
      } else {
        m = m "T"
      }
    }
    else if (c == "G") {
      if (substr(word, i, 2) == "GH" && (i == 1 || !isvowel(substr(word, i - 1, 1)))) {
        i++
      } else if (substr(word, i, 2) == "GN" || (i == len && c == "G")) {
        continue
      } else if (substr(word, i, 3) ~ /^(GIA|GIE|GEY)/) {
        m = m "J"
      } else {
        m = m "K"
      }
    }
    else if (c == "H") {
      # keep 'H' only before a vowel, and not after C, S, P, T, or G
      if (i == 1 || substr(word, i - 1, 1) !~ /[CSPTG]/) {
        if (i < len && isvowel(next_c)) {
          m = m "H"
        }
      }
    }
    else if (c == "K") {
      if (i == 1 || substr(word, i - 1, 1) != "C") m = m "K"
    }
    else if (c == "P") {
      if (substr(word, i, 2) == "PH") {
        m = m "F"
        i++
      } else {
        m = m "P"
      }
    }
    else if (c == "Q") {
      m = m "K"
    }
    else if (c == "S") {
      if (substr(word, i, 2) == "SH") {
        m = m "X"
        i++
      } else if (substr(word, i, 3) == "SIA" || substr(word, i, 3) == "SIO") {
        m = m "X"
        i += 2
      } else {
        m = m "S"
      }
    }
    else if (c == "T") {
      if (substr(word, i, 2) == "TH") {
        m = m "T"
        i++
      } else if (substr(word, i, 3) == "TIA" || substr(word, i, 3) == "TIO") {
        m = m "X"
        i += 2
      } else {
        m = m "T"
      }
    }
    else if (c == "V") {
      m = m "F"
    }
    else if (c == "W" || c == "Y") {
      if (i < len && isvowel(next_c)) m = m c
    }
    else if (c == "X") {
      m = m "KS"
    }
    else if (c == "Z") {
      m = m "S"
    }
    # 'F', 'J', 'L', 'M', 'N', and 'R' map to themselves
    else if (c ~ /[FJLMNR]/) {
      m = m c
    }
  }

  return m
}

# -----------------------------------------------------------------

function levenshtein(word1, word2, l1, l2, cst, i, j, diz) {
  l1 = length(word1)
  l2 = length(word2)

  # initialize distance array
  for (i = 0; i <= l1; i++) diz[i, 0] = i
  for (j = 0; j <= l2; j++) diz[0, j] = j

  # compute distance
  for (i = 1; i <= l1; i++) {
    for (j = 1; j <= l2; j++) {
      cst = (substr(word1, i, 1) == substr(word2, j, 1)) ? 0 : 1
      diz[i, j] = (diz[i-1, j] + 1 < diz[i, j-1] + 1) ? \
                  (diz[i-1, j] + 1 < diz[i-1, j-1] + cst ? \
                   diz[i-1, j] + 1 : diz[i-1, j-1] + cst) : \
                  (diz[i, j-1] + 1 < diz[i-1, j-1] + cst ? \
                   diz[i, j-1] + 1 : diz[i-1, j-1] + cst)
    }
  }

  return diz[l1, l2]
}

# eof
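
For anyone who wants to see what the `levenshtein($x, find) <= 2` filter does to a pair like the one Ben quoted, here is a rough sketch that wraps the same dynamic-programming idea in a shell function. The names `lev` and `min3` are made up for this illustration and are not part of metaphone.awk; `min3` just stands in for the nested ternary.

```shell
# lev A B: print the Levenshtein distance between two words.
# Hypothetical helper for experimenting; not part of metaphone.awk.
lev() {
  awk -v a="$1" -v b="$2" '
    # smallest of three numbers (m is a local)
    function min3(x, y, z,   m) {
      m = (x < y) ? x : y
      return (m < z) ? m : z
    }
    # same table-filling approach as the script above
    function levenshtein(w1, w2,   l1, l2, i, j, cst, d) {
      l1 = length(w1); l2 = length(w2)
      for (i = 0; i <= l1; i++) d[i, 0] = i
      for (j = 0; j <= l2; j++) d[0, j] = j
      for (i = 1; i <= l1; i++)
        for (j = 1; j <= l2; j++) {
          cst = (substr(w1, i, 1) == substr(w2, j, 1)) ? 0 : 1
          d[i, j] = min3(d[i-1, j] + 1, d[i, j-1] + 1, d[i-1, j-1] + cst)
        }
      return d[l1, l2]
    }
    BEGIN { print levenshtein(a, b) }
  '
}

lev cork fork          # a single substitution
lev kitten sitting     # the usual textbook pair
lev texas drunkeness   # the pair from the quoted example
```

Since "texas" and "drunkeness" differ in length by 5, their distance is at least 5, so a threshold of 2 rejects that pair no matter what the metaphone codes are.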

--
:wq
Mike Sanders

