Deutsch English Français Italiano |
<101hfq7$22v3c$1@dont-email.me> View for Bookmarking (what is this?) Look up another Usenet article |
Path: news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail From: Janis Papanagnou <janis_papanagnou+ng@hotmail.com> Newsgroups: comp.unix.shell Subject: Bad performance of back-references Date: Sun, 1 Jun 2025 14:07:33 +0200 Organization: A noiseless patient Spider Lines: 26 Message-ID: <101hfq7$22v3c$1@dont-email.me> MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit Injection-Date: Sun, 01 Jun 2025 14:07:36 +0200 (CEST) Injection-Info: dont-email.me; posting-host="5efe03dbd7af97f43c3764a2772b692a"; logging-data="2194540"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+Jc/EG6Of8yNOIbBIukuCH" User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0 Cancel-Lock: sha1:WmJxYlDvSTCYxcCCfoURo8WKI+4= X-Enigmail-Draft-Status: N1110 X-Mozilla-News-Host: news://news.eternal-september.org:119 In a recent application case I used back-references to find duplicate strings in large data sets. I tested that with grep as in grep -E -o '(.{42}).*\1' While this is obviously a neat way to formulate and solve the task in principle it is impractical for performance reasons.[*] Applied on my MB sized data the task did not terminate and I killed the process after a day. I also implemented the desired function explicitly (using two nested loops) in a couple of languages (interpreted or compiled). All those hand-crafted and non-optimized implementations terminated and did that within minutes or up to only few hours (depending on the pattern length). My astonishment is why the back-reference implementation performs so badly here with 'grep'. Janis [*] Note: Back-references are not from the Regular Expression functions class so they cannot be done in O(N) or O(N+K); so I don't expect this complexity where I use them. This is not the question here, just to be clear.