Path: ...!weretis.net!feeder8.news.weretis.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Ben Bacarisse <ben@bsb.me.uk>
Newsgroups: comp.lang.awk
Subject: Re: (Long post) Metaphone Algorithm In AWK
Date: Mon, 19 Aug 2024 02:15:43 +0100
Organization: A noiseless patient Spider
Lines: 77
Message-ID: <87r0alqpmo.fsf@bsb.me.uk>
References: <v9qbgh$1u7qe$1@dont-email.me> <878qwts8bd.fsf@bsb.me.uk>
MIME-Version: 1.0
Content-Type: text/plain
Injection-Date: Mon, 19 Aug 2024 03:15:44 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="153c0803c54c022691c586705843dea6";
	logging-data="2706798"; mail-complaints-to="abuse@eternal-september.org";
	posting-account="U2FsdGVkX1/TS82aeHhQssVf9Gg9ZnInmsDU2xy/cm0="
User-Agent: Gnus/5.13 (Gnus v5.13)
Cancel-Lock: sha1:EHNno+VF9OkB0HZvIZNVTjNcO5c=
	sha1:iIJBzgfqmrBN/C2ewLE0mHZDT40=
X-BSB-Auth: 1.2dec014a08557f7a8ee4.20240819021543BST.87r0alqpmo.fsf@bsb.me.uk
Bytes: 3767

A correction...

Ben Bacarisse <ben@bsb.me.uk> writes:

> porkchop@invalid.foo (Mike Sanders) writes:
>
>> Hi folks, hope you all are doing well.
>>
>> Please excuse long post, wanted to share this, some might find
>> it handy given a certain context. Must run, I'm very behind in
>> my work (hey I'm always running behind!)
>
> Using a word list, I found some odd matches.  For example:
>
>   $ echo "drunkeness indigestion" | awk -f metaphone.awk -v find=texas
>   drunkeness
>   indigestion
>
> Are these really metaphone matches for "texas"?  It's possible (I don't
> know the algorithm at all well) but I found it surprising.

I got the C code to compile and these should not match if the C code is
working correctly.

>> # metaphone.awk: Michael Sanders - 2024
>> #
>> # example invocation:
>> #
>> # echo "texas taxes taxi" | awk -f metaphone.awk -v find=texas
>> #
>> # notes:
>> #
>> # ever notice when you search for (say):
>> #
>> # 'i went to the zu'
>> #
>> # & your chosen search engine suggests something like:
>> #
>> # 'did you mean i went to the zoo'
>> #
>> # the metaphone algorithm handles such cases pretty well actually...
>> #
>> # Metaphone is a phonetic algorithm, published by Lawrence Philips in
>> # 1990, for indexing words by their English pronunciation. It
>> # fundamentally improves on the Soundex algorithm by using information
>> # about variations and inconsistencies in English spelling and
>> # pronunciation to produce a more accurate encoding, which does a
>> # better job of matching words and names which sound similar.
>> # https://en.wikipedia.org/wiki/Metaphone
>> #
>> # english only (sorry)
>> #
>> # not extensively tested, nevertheless a solid start, if you
>> # improve this code please share your results
>> #
>> # other implementations...
>> #
>> # gist: https://gist.github.com/Rostepher/b688f709587ac145a0b3
>> #
>> # BASIC: http://aspell.net/metaphone/metaphone.basic
>> #
>> # C: http://aspell.net/metaphone/metaphone-kuhn.txt
>
> I wanted a "reference" implementation I could try, but this is not a
> useful C program.  It's in an odd dialect (it uses void but has K&R
> function definitions) and has loads of undefined behaviours (strcpy of
> overlapping strings, use of uninitialised variables etc).

The uninitialised variables were due to an undefined function.  Most
likely, that function was intended to initialise the array.

I've mocked up the two undefined functions and can now get the code to
run.  I don't see any uninitialised variables being used now.  The code
still has undefined behaviour in some cases but I think that is limited
to the use of strcpy on overlapping strings.

-- 
Ben.
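
For anyone wondering about the strcpy remark: the C standard leaves
strcpy undefined when source and destination overlap, which is what
happens when a string is shifted left inside its own buffer.  The
following is only a minimal sketch of the usual repair, using memmove.
It is not code from metaphone-kuhn.txt, and drop_char is a hypothetical
helper name made up for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not taken from metaphone-kuhn.txt: remove the
 * character at index i from the NUL-terminated string s, in place.
 * Does nothing when i is at or past the end of the string. */
void drop_char(char *s, size_t i)
{
    size_t len = strlen(s);

    if (i >= len)
        return;

    /* Undefined behaviour, because the two regions overlap:
     *     strcpy(s + i, s + i + 1);
     * memmove is specified to work for overlapping regions.  The count
     * len - i covers the tail after s[i] plus the terminating NUL. */
    memmove(s + i, s + i + 1, len - i);
}
```

With a buffer holding "texas", drop_char(buf, 1) leaves "txas".  The
strcpy form often appears to work, which is presumably why such code
survives, but nothing guarantees it.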