Article <101k1dp$96v$1@reader1.panix.com>

Deutsch English Français Italiano
<101k1dp$96v$1@reader1.panix.com>

View for Bookmarking (what is this?)
Look up another Usenet article
Path: news.eternal-september.org!eternal-september.org!feeder3.eternal-september.org!panix!.POSTED.spitfire.i.gajendra.net!not-for-mail
From: cross@spitfire.i.gajendra.net (Dan Cross)
Newsgroups: comp.unix.shell
Subject: Re: Bad performance of back-references
Date: Mon, 2 Jun 2025 11:20:25 -0000 (UTC)
Organization: PANIX Public Access Internet and UNIX, NYC
Message-ID: <101k1dp$96v$1@reader1.panix.com>
References: <101hfq7$22v3c$1@dont-email.me> <101hifa$o50$1@reader1.panix.com> <101hjas$242rp$1@dont-email.me>
Injection-Date: Mon, 2 Jun 2025 11:20:25 -0000 (UTC)
Injection-Info: reader1.panix.com; posting-host="spitfire.i.gajendra.net:166.84.136.80";
	logging-data="9439"; mail-complaints-to="abuse@panix.com"
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
Originator: cross@spitfire.i.gajendra.net (Dan Cross)

In article <101hjas$242rp$1@dont-email.me>,
Janis Papanagnou  <janis_papanagnou+ng@hotmail.com> wrote:
>On 01.06.2025 14:52, Dan Cross wrote:
>> In article <101hfq7$22v3c$1@dont-email.me>,
>> Janis Papanagnou  <janis_papanagnou+ng@hotmail.com> wrote:
>>> In a recent application case I used back-references to find duplicate
>>> strings in large data sets. I tested that with grep as in
>>>
>>>  grep -E -o '(.{42}).*\1'
>>>
>>> While this is obviously a neat way to formulate and solve the task in
>>> principle it is impractical for performance reasons.[*]
>>>
>>> Applied on my MB sized data the task did not terminate and I killed
>>> the process after a day.
>>>
>>> I also implemented the desired function explicitly (using two nested
>>> loops) in a couple of languages (interpreted or compiled). All those
>>> hand-crafted and non-optimized implementations terminated and did
>>> that within minutes or up to only few hours (depending on the pattern
>>> length).
>>>
>>> My astonishment is why the back-reference implementation performs so
>>> badly here with 'grep'.
>>>
>>> Janis
>>>
>>> [*] Note: Back-references are not from the Regular Expression functions
>>> class so they cannot be done in O(N) or O(N+K); so I don't expect this
>>> complexity where I use them. This is not the question here, just to be
>>> clear.
>> 
>> Your results don't surprise me in the the least.
>> 
>> First, "back references" make "regular expressions" not regular,
>> in the formal sense sense that they are no longer isomorphic to
>> deterministic finite automata or their NDFA simulations.
>
>I'm surprised you give me that answer since with my footnote
>above I explicitly intended to exactly prevent focusing on
>or distracting about this already known fact.

Oh, sorry; I missed the footnote.  It's odd, because that really
is the question in some sense.

>> Matching DFA is inherently linear, but _creating_ DFAs can be
>> exponential; NDFAs can be created in reasonable time (I forget
>> the exact complexity, I'm afraid) though executing them may be
>> superlinear; regardless it's much better than exponential.
>> 
>> Second, most implementations that support backreferences use
>> backtracking to do so, which can be exponential in both space
>> and time.
>
>I'd think that most applications that support back-references
>might rely on the same library (and not reinvent the wheel).

I don't see how, when the functionality is implemented in many
different languages.

Historically, regular expressions were one of those things where
library support wasn't super awesome.  This is why Spencer's
library was so popular, but that used backtracking.

>If the accessible implementations will all resort to backtracking
>as the sole implementation for back-references it indeed explains
>something.

I don't know that anyone knows of a better way for the general
case.  As I said, there is no known fast (as in, less than
exponential) and general solution to matching with back
references.

One can probably write something faster for specific use cases,
but that would obviously be specific to those cases.

>> There's some good background information here:
>> https://swtch.com/~rsc/regexp/regexp1.html
>> 
>> The bottom line is that regexp that use back tracking are not
>> actually regular expressions, and there is no known way to make
>> them fast generally.
>
>Thanks.

Happy to help.

	- Dan C.