From: Ben Bacarisse <ben@bsb.me.uk>
Newsgroups: comp.lang.awk
Subject: Re: (Long post) Metaphone Algorithm In AWK
Date: Mon, 19 Aug 2024 00:46:46 +0100
Organization: A noiseless patient Spider
Message-ID: <878qwts8bd.fsf@bsb.me.uk>
References: <v9qbgh$1u7qe$1@dont-email.me>
User-Agent: Gnus/5.13 (Gnus v5.13)

porkchop@invalid.foo (Mike Sanders) writes:

> Hi folks, hope you all are doing well.
>
> Please excuse long post, wanted to share this, some might find
> it handy given a certain context. Must run, I'm very behind in
> my work (hey I'm always running behind!)

Using a word list, I found some odd matches.  For example:

  $ echo "drunkeness indigestion" | awk -f metaphone.awk -v find=texas
  drunkeness indigestion

Are these really metaphone matches for "texas"?  It's possible (I don't
know the algorithm at all well) but I found it surprising.

> # metaphone.awk: Michael Sanders - 2024
> #
> # example invocation:
> #
> # echo "texas taxes taxi" | awk -f metaphone.awk -v find=texas
> #
> # notes:
> #
> # ever notice when you search for (say):
> #
> # 'i went to the zu'
> #
> # & your chosen search engine suggests something like:
> #
> # 'did you mean i went to the zoo'
> #
> # the metaphone algorithm handles such cases pretty well actually...
> #
> # Metaphone is a phonetic algorithm, published by Lawrence Philips in
> # 1990, for indexing words by their English pronunciation. It
> # fundamentally improves on the Soundex algorithm by using information
> # about variations and inconsistencies in English spelling and
> # pronunciation to produce a more accurate encoding, which does a
> # better job of matching words and names which sound similar.
> # https://en.wikipedia.org/wiki/Metaphone
> #
> # english only (sorry)
> #
> # not extensively tested, nevertheless a solid start, if you
> # improve this code please share your results
> #
> # other implentations...
> #
> # gist: https://gist.github.com/Rostepher/b688f709587ac145a0b3
> #
> # BASIC: http://aspell.net/metaphone/metaphone.basic
> #
> # C: http://aspell.net/metaphone/metaphone-kuhn.txt

I wanted a "reference" implementation I could try, but this is not a
useful C program.  It's in an odd dialect (it uses void but has K&R
function definitions) and has loads of undefined behaviours (strcpy of
overlapping strings, use of uninitialised variables etc).

> # check if a character is a vowel
> function isvowel(c, is_vowel) {
>     is_vowel = c ~ /[AEIOU]/
>     return is_vowel
> }

I was not going to comment on the code, but this hit me just before I
posted.  Given the odd way AWK functions have to define locals, I tend
to use them only when really needed.  Here I think I would just write

  function isvowel(c) { return c ~ /[AEIOU]/ }

-- 
Ben.