Path: news.eternal-september.org!eternal-september.org!feeder3.eternal-september.org!panix!.POSTED.spitfire.i.gajendra.net!not-for-mail
From: cross@spitfire.i.gajendra.net (Dan Cross)
Newsgroups: comp.unix.shell
Subject: Re: Bad performance of back-references
Date: Mon, 9 Jun 2025 12:17:57 -0000 (UTC)
Organization: PANIX Public Access Internet and UNIX, NYC
Message-ID: <1026jdl$g70$1@reader1.panix.com>
References: <101hfq7$22v3c$1@dont-email.me> <101hjas$242rp$1@dont-email.me> <101k1dp$96v$1@reader1.panix.com> <1022c30$3cdq3$1@dont-email.me>
Injection-Date: Mon, 9 Jun 2025 12:17:57 -0000 (UTC)
Injection-Info: reader1.panix.com; posting-host="spitfire.i.gajendra.net:166.84.136.80";
	logging-data="16608"; mail-complaints-to="abuse@panix.com"
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
Originator: cross@spitfire.i.gajendra.net (Dan Cross)

In article <1022c30$3cdq3$1@dont-email.me>,
Janis Papanagnou  <janis_papanagnou+ng@hotmail.com> wrote:
>On 02.06.2025 13:20, Dan Cross wrote:
>> In article <101hjas$242rp$1@dont-email.me>,
>> Janis Papanagnou  <janis_papanagnou+ng@hotmail.com> wrote:
>>> On 01.06.2025 14:52, Dan Cross wrote:
>>>> In article <101hfq7$22v3c$1@dont-email.me>,
>>>> Janis Papanagnou  <janis_papanagnou+ng@hotmail.com> wrote:
>>>>> In a recent application case I used back-references to find duplicate
>>>>> strings in large data sets. I tested that with grep as in
>>>>>
>>>>>  grep -E -o '(.{42}).*\1'
>>>>>
>>>>> While this is obviously a neat way to formulate and solve the task in
>>>>> principle it is impractical for performance reasons.[*]
>>>>>
>>>>> Applied on my MB sized data the task did not terminate and I killed
>>>>> the process after a day.
>>>>>
>>>>> I also implemented the desired function explicitly (using two nested
>>>>> loops) in a couple of languages (interpreted or compiled). All those
>>>>> hand-crafted and non-optimized implementations terminated and did
>>>>> that within minutes or up to only few hours (depending on the pattern
>>>>> length).
>>>>>
>>>>> My astonishment is why the back-reference implementation performs so
>>>>> badly here with 'grep'.
>>>>>
>>>>> Janis
>>>>>
>>>>> [*] Note: Back-references are not from the Regular Expression functions
>>>>> class so they cannot be done in O(N) or O(N+K); so I don't expect this
>>>>> complexity where I use them. This is not the question here, just to be
>>>>> clear.
>>>>
>>>> Your results don't surprise me in the least.
>>>>
>>>> First, "back references" make "regular expressions" not regular,
>>>> in the formal sense sense that they are no longer isomorphic to
>>>> deterministic finite automata or their NDFA simulations.
>>>
>>> I'm surprised you give me that answer, since with my footnote
>>> above I explicitly intended to prevent focusing on, or being
>>> distracted by, this already known fact.
>> 
>> Oh, sorry; I missed the footnote.  It's odd, because that really
>> is the question in some sense.
>
>Yes, of course. It's just the magnitude that is so frustrating,
>especially in comparison to other solutions. (See below.)

Well, factorials get big really fast.

6! = 720
12! = 479,001,600

That is, only a factor of two between the arguments (6 vs. 12),
but a factor of more than half a million between the values.
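
To see the scaling concretely, here's a rough sketch (untested;
assumes a shell like bash or ksh with a 'time' keyword and a GNU
userland; file names and sizes are illustrative):

	# Generate roughly 1 MB of random lowercase text.
	head -c 10000000 /dev/urandom | tr -dc 'a-z' |
	    fold -w 72 > sample.txt

	# Time the same back-reference search at increasing lengths;
	# expect the run times to grow super-linearly.  (Keep the
	# lines short, or the larger values of n may run for a very
	# long time.)
	for n in 4 6 8 10 12; do
	    echo "n = $n:"
	    time grep -E -o "(.{$n}).*\1" sample.txt > /dev/null
	done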

>(Myself I'm rarely using expressions that go beyond the Regular
>Expressions formal class, since I usually want to rely on the O(N)
>or O(N+K) complexity. But in the current case the simple solution
>was just tempting.)

Yup.  That's why they're there.  But they're deceptive with
respect to their complexity.  Moreover, since authors like
Friedl misrepresent the theory in popular books on the subject,
people use what they think are "regular expressions" and get
confused by the observed runtimes.
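
The effect is easy to provoke.  A back reference forces even GNU
grep off its DFA-based fast path and into a backtracking matcher,
and a long line with no repeat at the given length is close to
the worst case, since the matcher must try, and fail, at every
starting position.  A sketch (untested; sizes illustrative):

	# One long pseudo-random line; a repeated 12-gram in 5,000
	# random letters is vanishingly unlikely, so the match
	# fails everywhere.
	awk 'BEGIN {
	    srand(1)
	    for (i = 0; i < 5000; i++)
	        printf "%c", 97 + int(rand() * 26)
	    print ""
	}' > oneline.txt

	time grep -E '(.{12}).*\1' oneline.txt    # backtracks: slow
	time grep -E '.{12}.*.{12}' oneline.txt   # no \1: fast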

>>>> Matching with a DFA is inherently linear, but _creating_ DFAs
>>>> can be exponential; NDFAs can be created in reasonable time (I
>>>> forget the exact complexity, I'm afraid) though executing them
>>>> may be superlinear; regardless, it's much better than exponential.
>>>>
>>>> Second, most implementations that support backreferences use
>>>> backtracking to do so, which can be exponential in both space
>>>> and time.
>>>
>>> I'd think that most applications that support back-references
>>> might rely on the same library (and not reinvent the wheel).
>> 
>> I don't see how, when the functionality is implemented in many
>> different languages.
>
>Oh, I wasn't primarily speaking about different languages. Here
>I had tools in mind that are supposedly all written in "C", like
>'grep', 'sed', etc. But of course also other libraries and tools
>written in other languages may access (for example) "C" libraries
>that provide that functionality; many languages support external
>access to functions written in other languages.

Traditionally, those tools rolled their own versions.  The
regular expression libraries people have used over the years came
somewhat later in the evolutionary timeline.

>That said, there are of course also various other implementations
>anyway. (I hope they don't all make the same mistakes.)

They don't.  If you can use something like re2 (written in C++,
which may constrain portability) then you're already better off.
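
Engines in that family simply refuse the construct rather than
risk the blowup.  For example, ripgrep (whose default engine,
like re2, is automaton-based) rejects the pattern outright; the
second command assumes a build with the optional PCRE2 support:

	# Default engine: the back reference is rejected at
	# pattern parse time instead of being attempted.
	rg '(.{12}).*\1' sample.txt

	# Optional PCRE2 engine: accepted, with backtracking costs.
	rg --pcre2 '(.{12}).*\1' sample.txt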

>> I don't know that anyone knows of a better way for the general
>> case.  As I said, there is no known fast (as in, less than
>> exponential) and general solution to matching with back
>> references.
>
>It's certainly a correct observation that if you have a suboptimal
>but "general" algorithmic solution, it's tempting to use it.

For a general tool, it's unclear to me how one could do anything
else.

>> One can probably write something faster for specific use cases,
>> but that would obviously be specific to those cases.
>
>To be honest, I haven't studied the details of the various
>"beyond-regular" implementations to any significant depth; with
>formal, strictly Regular Expressions it's obvious and simple.

Mmm, I'm not sure I entirely agree with that.  NDFA analysis can
be subtle, but the real thing that twists people's noggins is
how DFA _creation_ can be exponential.
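
The classic example is the language "the symbol n+1 places before
the end is an 'a'" over the alphabet {a,b}:

	(a|b)*a(a|b){n}

An NFA for this needs only about n + 2 states, but any DFA has to
remember which of the last n + 1 symbols were 'a', so it needs at
least 2^n states; for n = 20 that's already a million states from
a pattern a few dozen characters long.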

>I've asked that question mainly because the performance measures
>were too far apart. My simple 'grep' based solution '(.{12}).*\1'
>terminated (finally!) after 27.5 hours (more than a day).[*] A
>bulky hand-crafted nested-loops solution required only a fraction
>of that; I don't recall exactly, but I think it was in the low
>minutes(!) range. That was some Awk code that even did a lot of
>costly copying (huge copies that were unavoidable with the given
>functions). With yet another approach I got the task down to 4
>seconds, but that's not my expectation for a standard regexp
>library. Still, 27.5 hours is way off, IMO, compared to the
>straightforward primitive approach at 2 minutes; roughly a
>factor of 1000 slower.
>
>If a "general solution" can be so costly then, I think, something
>better should be implemented than a "one-backtracking-fits-all".

Well, if someone comes up with something better, they will have
solved a very important open problem in computer science.  :-)
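
(Though for this particular task the general machinery isn't
needed: a repeat of length >= 12 implies a repeated 12-character
window, so hashing every window finds it in roughly linear time.
A sketch in awk, untested, with the length as a parameter; note
it also counts overlapping repeats, which the grep pattern does
not:)

	# Print each line that contains some repeated substring of
	# length L, by recording every L-character window in a hash
	# table -- one pass per line instead of backtracking.
	awk -v L=12 '{
	    split("", seen)            # portably clear the array
	    for (i = 1; i + L - 1 <= length($0); i++) {
	        s = substr($0, i, L)
	        if (s in seen) { print; next }
	        seen[s] = 1
	    }
	}' sample.txt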

>[*] Note the comparably small size, 12, of the substring to
>match. (This value is also much smaller than the one in my OP.)
>For sizes of {6} on a MB-sized file it's okay; then it's still
>in the seconds range. But, as opposed to FSMs, the algorithm
>scales badly.

See the factorial example above for a sense of how quickly that
kind of growth compounds.

	- Dan C.