Path: ...!weretis.net!feeder8.news.weretis.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail From: Janis Papanagnou Newsgroups: comp.lang.awk Subject: Re: Preliminary version of new regex matcher for gawk now available Date: Thu, 25 Jul 2024 14:05:27 +0200 Organization: A noiseless patient Spider Lines: 85 Message-ID: References: <66a21e7e$0$710$14726298@news.sunsite.dk> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Injection-Date: Thu, 25 Jul 2024 14:05:30 +0200 (CEST) Injection-Info: dont-email.me; posting-host="4d496ee791b39f18f5b0754cb506a405"; logging-data="2400411"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+c8w8mgbOC157yKl/Z+yc9" User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0 Cancel-Lock: sha1:C1CkBF8uYoaBKJI7GlEGj/LClI0= X-Enigmail-Draft-Status: N1110 In-Reply-To: <66a21e7e$0$710$14726298@news.sunsite.dk> Bytes: 4611 On 25.07.2024 11:44, Aharon Robbins wrote: > Hi All. > > I've been working with Mike Haertel (the original author of GNU grep) > for a number of months now. He is writing a new regexp matcher for > use in gawk (and other places, as people desire). > > The matcher is avalable on Github: https://github.com/mikehaertel/minrx. > I have created a branch in the gawk repo that uses it: feature/minrx. > > MinRX is currently written in C++20. Mike will eventually rewrite it > in C for portability. For the moment, you'll need to use gcc / g++ > to build the branch. I haven't tried to mess with clang / clang++. > > The test suite passes completely. > > The new matcher is the default, so that it will be exercised. The old > matchers are still available. To use them, set GAWK_GNU_MATCHERS in > the environment. I will NOT make a formal release with MinRX as long > as MinRX is still in C++. > > For now, the only way to access the code is via Git: > > git clone https://git.savannah.gnu.org/r/gawk.git > cd gawk > git checkout feature/minrx > ./bootstrap.sh && ./configure && make -j && make check > > If you use gawk, please try this branch out. My system complains about -std=c++20 so I cannot test it. (I think I'll wait for a native C release.) > > Questions, comments, and *bug reports* are welcome. Well, I skimmed through the txt file on Mike's git page to learn about the algorithm; especially the algorithm and its complexity is of interest to me. The document was not quite clear about that (or at least made me doubt) beyond the general and typical O(N*M) characteristics. One thing I was astonished about was why there's a non-deterministic automaton model used (NFSM can be transformed into Deterministic FSM); isn't the non-deterministic tree-search (where every branch is traversed breadth-first) sub-optimal? Then it spoke about a finite number of states, but that is a normal characteristic for a "Finite SM". I'm also astonished that while he spoke about back-references (something a bit more exotic in "RE"s) why he sees problems when applying the ERE based approach on BRE subset; my grep supports back-references with BRE and ERE. Not supporting back-references - if that's all he wanted to say - with his RE algorithm is of course okay (for me, at least, others might complain). So, at the moment, that all appears still a bit obscure to me. But that's just a first impression from an only brief look at it. So, yet, only questions. (No comments. No bugs.) :-) But wait! I do have one comment. In Mike's git-page description he speaks about the goal of this implementation approach. Given that goal I don't think it's yet a good time to incorporate that algorithm in GNU Awk; it adds some inherent uncertainty of new code without providing a gain. Algorithm simplicity is nice but as I understand there's not yet performance comparisons done? Unless it was a deliberate offer to use GNU Awk as a test bed. And "nearly-feature-complete implementation" (section Features) is not quite a fruitful marketing concept. I also wonder why BSD and GNU extensions are supported but not the very useful abbreviations for {some,all} Perl RE shortcuts. Just my 2 cents. If performance comparison numbers are available I'm certainly very interested to hear about them. Thanks for reading this (unintentionally) longish post. Janis > > Thanks, > > Arnold >