Path: news.eternal-september.org!eternal-september.org!feeder3.eternal-september.org!panix!.POSTED.spitfire.i.gajendra.net!not-for-mail
From: cross@spitfire.i.gajendra.net (Dan Cross)
Newsgroups: comp.unix.shell
Subject: Re: Bad performance of back-references
Date: Wed, 11 Jun 2025 13:48:31 -0000 (UTC)
Organization: PANIX Public Access Internet and UNIX, NYC
Message-ID: <102c1ff$27o$1@reader1.panix.com>
References: <101hfq7$22v3c$1@dont-email.me> <1022c30$3cdq3$1@dont-email.me> <1026jdl$g70$1@reader1.panix.com> <102beth$1sdmr$1@dont-email.me>
Injection-Date: Wed, 11 Jun 2025 13:48:31 -0000 (UTC)
Injection-Info: reader1.panix.com; posting-host="spitfire.i.gajendra.net:166.84.136.80"; logging-data="2296"; mail-complaints-to="abuse@panix.com"
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
Originator: cross@spitfire.i.gajendra.net (Dan Cross)

In article <102beth$1sdmr$1@dont-email.me>,
Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
>On 09.06.2025 14:17, Dan Cross wrote:
>> [...]
>>
>>> (Myself I'm rarely using expressions that go beyond the Regular
>>> Expressions formal class, since I usually want to rely on the O(N)
>>> or O(N+K) complexity. But in the current case the simple solution
>>> was just tempting.)
>>
>> Yup. That's why they're there. But they're deceptive with
>> respect to their complexity. Moreover, since authors like
>> Friedl misrepresent the theory in popular books on the subject,
>
>I haven't read his books. Are they in any way (sort of) standard
>literature?

No. He wrote a book called "Mastering Regular Expressions" in
which he mixes up some of the formal terminology. In particular,
he calls things "NDFAs" that are not, in fact, non-deterministic
finite automata. When corrected, he got somewhat indignant; that
is to say, he has made factually incorrect statements (even in
his book!) and refused to correct them.

>> people use what they think are "regular expressions" and get
>> confused by the observed runtimes.
>
>*shrug* Can't tell. (For me it's obvious, and in the past I've
>also regularly pointed that out in the respective discussions.)
>
>But I recall some bad matching behavior in (I think) some Perl
>version; a very primitive RE (with no back-references or such)
>led to exponential runtime behavior.

An easy way to implement even real regular expressions is with
backtracking; building an NDFA simulation is rather more work.
Consequently, a lot of early "regexp" libraries took the
backtracking approach and could suffer from exponential blowup
even on standard regular expressions that would run in linear
time under a DFA implementation. Perl, in particular, did this,
borrowing Henry Spencer's RE library, which used backtracking.
The version of regular expressions described in "The Practice of
Programming" by Kernighan and Pike also uses backtracking; it is
only a page or two of code, but it exhibits exponential behavior
in the worst case.
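That code is short enough to sketch from memory. The following
is along the lines of the book's matcher (same structure and, if
I remember right, the same function names), not the authors'
exact text, so treat it as a sketch. It handles literals, '.',
'*', '^', and '$', and implements '*' by backtracking over every
number of characters the star might consume:

/*
 * Backtracking matcher in the style of the one in "The Practice
 * of Programming" (a sketch from memory, not the book's text).
 * Supports literal characters, '.', '*', '^', and '$'.
 */
#include <stdio.h>

static int matchhere(const char *re, const char *text);

/* matchstar: search for c*re at beginning of text; on failure,
   retry after letting the star consume one more character */
static int matchstar(int c, const char *re, const char *text)
{
	do {	/* a '*' matches zero or more instances */
		if (matchhere(re, text))
			return 1;
	} while (*text != '\0' && (*text++ == c || c == '.'));
	return 0;
}

/* matchhere: search for re at beginning of text */
static int matchhere(const char *re, const char *text)
{
	if (re[0] == '\0')
		return 1;
	if (re[1] == '*')
		return matchstar(re[0], re + 2, text);
	if (re[0] == '$' && re[1] == '\0')
		return *text == '\0';
	if (*text != '\0' && (re[0] == '.' || re[0] == *text))
		return matchhere(re + 1, text + 1);
	return 0;
}

/* match: search for re anywhere in text */
static int match(const char *re, const char *text)
{
	if (re[0] == '^')
		return matchhere(re + 1, text);
	do {	/* must look even if string is empty */
		if (matchhere(re, text))
			return 1;
	} while (*text++ != '\0');
	return 0;
}

int main(void)
{
	/* A failing match must try every way of splitting the
	   a's among the stars, so each extra "a*" multiplies
	   the work done before the matcher can give up. */
	printf("%d\n", match("^a*a*a*a*a*b$", "aaaaaaaaaaaaaaaaaaab"));
	printf("%d\n", match("^a*a*a*a*a*b$", "aaaaaaaaaaaaaaaaaaaa"));
	return 0;
}

The second call in main() is the interesting one: since the 'b'
never appears, the code enumerates every split of the a's among
the stars before failing. Lengthen the run of a's and it slows
to a crawl, even though the pattern is strictly regular.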
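The "rather more work" alternative is to simulate the NDFA
instead of backtracking: keep the set of all pattern positions
that are still viable and advance the whole set one input
character at a time, so nothing ever needs to be retried. Here
is a toy version of the idea for the same restricted pattern
language; to be clear, this one is entirely my own sketch (not
Spencer's code, not anything shipping anywhere), but it shows
why the simulation runs in O(text length * pattern length) even
on the pathological case above:

/*
 * Toy NDFA simulation for patterns built from literals, '.',
 * and '*'.  A state is a position in the token list; a boolean
 * array holds the set of currently active states.
 */
#include <stdio.h>
#include <string.h>

#define MAXPAT 64

struct tok { char ch; int star; };

/* tokenize: turn e.g. "a*b.c" into (char, starred?) pairs */
static int tokenize(const char *re, struct tok *t)
{
	int n = 0;

	while (*re != '\0' && n < MAXPAT) {
		t[n].ch = *re++;
		t[n].star = (*re == '*');
		if (t[n].star)
			re++;
		n++;
	}
	return n;
}

/* addstate: activate state i; a starred token may match zero
   characters, so also activate whatever follows it */
static void addstate(int i, int n, const struct tok *t, char *on)
{
	if (on[i])
		return;
	on[i] = 1;
	if (i < n && t[i].star)
		addstate(i + 1, n, t, on);
}

/* nfamatch: anchored match of text against the token list;
   accepts iff state n (one past the last token) is active
   once the input is exhausted */
static int nfamatch(const struct tok *t, int n, const char *text)
{
	char cur[MAXPAT + 1] = {0}, next[MAXPAT + 1];
	int i;

	addstate(0, n, t, cur);
	for (; *text != '\0'; text++) {
		memset(next, 0, sizeof next);
		for (i = 0; i < n; i++) {
			if (!cur[i])
				continue;
			if (t[i].ch != *text && t[i].ch != '.')
				continue;
			if (t[i].star)
				addstate(i, n, t, next);	/* may match again */
			else
				addstate(i + 1, n, t, next);	/* consumed; advance */
		}
		memcpy(cur, next, sizeof cur);
	}
	return cur[n];
}

int main(void)
{
	struct tok t[MAXPAT];
	int n = tokenize("a*a*a*a*a*b", t);

	/* The same pathological case, now linear: */
	printf("%d\n", nfamatch(t, n, "aaaaaaaaaaaaaaaaaaab"));
	printf("%d\n", nfamatch(t, n, "aaaaaaaaaaaaaaaaaaaa"));
	return 0;
}

A production implementation builds an explicit automaton (e.g.,
via Thompson's construction) and supports the full operator set,
but the position-set loop above is the same idea in miniature.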
>> Traditionally, those tools rolled their own versions. The
>> regular expression libraries people have used over the years
>> came somewhat later, in the evolutionary timeline.
>
>I'm lacking an overview here. One thing I know of GNU Awk - but
>that's of course just one example - is that a standard library
>had been used, and more recently an alternative library has been
>examined. All I remember is that they said it does implement in
>some respect improvements. GNU Awk is relying a lot on existing
>functions as sort of an adapter.
>
>(But Awk doesn't support back-references, so it won't apply to
>the topic of this thread.)

GNU awk appears to have imported a regular expression library
that does DFA construction where possible and falls back to an
NDFA simulation where it is not. The DFA code was written by
Mike Haertel; I assume it came via gnulib or something similar.
It's quite a bit of code, wrapped in a few layers; they don't
appear to link against, say, a system library that already ships
with the base OS.

>> [...]
>>
>>>> I don't know that anyone knows of a better way for the general
>>>> case. As I said, there is no known fast (as in, less than
>>>> exponential) and general solution to matching with back
>>>> references.
>>>
>>> It's certainly a correct observation that if you have a suboptimal
>>> but "general" algorithmic solution that it's tempting to use that.
>>
>> For a general tool, it's unclear to me how one could do anything
>> else.
>
>Well, it's not uncommon to have tools separate function classes.
>In case of Regexps you could certainly tell apart RE expressions
>that use [true] Regular Expression patterns and those beyond.

Tcl does something like that.

>>> To be honest, I haven't to any significant depth studied details
>>> of various "beyond-regular" implementations; with formal strictly
>>> Regular Expressions it's obvious and simple.
>>
>> Mmm, I'm not sure I entirely agree with that. NDFA analysis can
>> be subtle, but the real thing that twists people's noggins is
>> how DFA _creation_ can be exponential.
>
>Well, I agree if you mean that you can formulate non-deterministic
>RE patterns that would lead to NDFAs whose transformation to DFAs
>may become non-trivial - although patterns are mostly short compared
>to the data they are applied to, so even in those rarer cases it may
>not be an issue.
>
>Myself I seem to have generally always defined deterministic Regular
>Expression patterns. The linear complexity follows.

Careful here; many people assume that, but really end up with
something that is simulated as an NDFA, which is superlinear
(though still not exponential). Any DFA can be _simulated_ by
the corresponding NDFA; it's a tradeoff between construction
cost and runtime.

>>> [...]
>>>
>>> If a "general solution" can be so costly then, I think, something
>>> better should be implemented than a "one-backtracking-fits-all".
>>
>> Well, if someone comes up with something better, they will have
>> solved a very important open problem in computer science. :-)
>
>As said, I haven't dived deeper into the "beyond RE" case. But are
>you saying that this class of patterns can *only* be solved by
>expensive backtracking? - I'd doubt that, but I'm too old and too
>lazy to do any fundamental research. ;-)

What I'm saying is that no one actually knows; if there is a
quicker way to do it, no one has found it yet, and a lot of
people have tried over a lot of years. As of a few years ago,
the best known algorithm for supporting back-references had
exponential complexity, and as far as I know nothing about that
has changed since I last looked closely. People _have_ done some
work on fixing the number of back-references and so on, and have
perhaps shown that that restricted problem is in P (it's
somewhat unclear), but that's a) not the general case, and b)
the runtimes are still awful: O(N^(2k)), where k is the number
of capture groups, kind of awful.

Again, I point to Russ Cox's article on the subject, which is a
fairly gentle introduction for the lay reader:
https://swtch.com/~rsc/regexp/regexp1.html

	- Dan C.