Deutsch English Français Italiano |
<v7qi8t$1mi8m$1@dont-email.me> View for Bookmarking (what is this?) Look up another Usenet article |
Path: ...!2.eu.feeder.erje.net!feeder.erje.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail From: Janis Papanagnou <janis_papanagnou+ng@hotmail.com> Newsgroups: comp.unix.shell Subject: Re: bash aesthetics question: special characters in reg exp in [[ ... =~~ ... ]] Date: Wed, 24 Jul 2024 11:41:48 +0200 Organization: A noiseless patient Spider Lines: 88 Message-ID: <v7qi8t$1mi8m$1@dont-email.me> References: <v7mknf$3plab$1@news.xmission.com> <v7n9s1$2p39$1@nnrp.usenet.blueworldhosting.com> <v7nfbb$3q3of$1@news.xmission.com> <20240723112050.105@kylheku.com> MIME-Version: 1.0 Content-Type: text/plain; charset=windows-1252 Content-Transfer-Encoding: 7bit Injection-Date: Wed, 24 Jul 2024 11:41:50 +0200 (CEST) Injection-Info: dont-email.me; posting-host="a525e55a33c812c78de3ae166044610c"; logging-data="1788182"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19dyaTAwJe1yNo9vTN8zKrJ" User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101 Thunderbird/45.8.0 Cancel-Lock: sha1:rhhba3/LkXBnzcAdtfxvcSyUHhk= In-Reply-To: <20240723112050.105@kylheku.com> X-Enigmail-Draft-Status: N1110 Bytes: 5444 On 23.07.2024 20:34, Kaz Kylheku wrote: > > [...] > > In the old math/CS papers about regular expressions, regular expressions > are typically represented in terms of some input symbol alphabet > (usually just letters a, b, c ...) and only the operators | and *, > and parentheses (other than when advanced operators are being discussed, > like intersection and complement, whicha re not easily constructed from > these.) > > I think character classes might have been a pragmatic invention in > regex implementations. The theory doesn't require [a-c] because > that can be encoded as (a|b|c). While formally we can restrict to some basic elements it's quite inconvenient in practice. I recall that in Compiler Construction and Automata Theory we regularly used the 'all-but' operator for complementing input symbol sets. And also obvious abbreviations for larger sets of symbols (like 'digits', etc.). Not only in practical regexp implementations, also in education certainly no one wants to waste time. > > The ? operator is not required because (R)? can be written (R)(R)*. ITYM: (R)+ > > Escaping is not required because the oeprators and input symbols are > distinct; the idea that ( could be an input symbol is something that > occurs in implementations, not in the theory. > > Regex implementors take the theory and adjust it to taste, > and add necessary details such as character escape sequences for > control characters, and escaping to allow the oeprator characters > themselves to be matched. Plus character classes, with negation > and ranges and all that. There are (at least) two different types of such adjustments. One are the convenience enhancements (like the '+' or the multiplicity ('{m,n}', Perl's '\d' and '\D' etc.) that, from a complexity perspective, all stay within the same [theoretical] class of the Regular Languages. (There's other types of extensions that we find in implementations that leave that language class.) > > Not all implementations follow solid theory. For instance, the branch > operator | is supposed to be commutative. There is no difference > between R1|R2 and R2|R1. But in many implementations (particularly > backtracking ones like PCRE and similar), there is a difference: these > implementations implement R1|R2|R3 by trying the expressions in left to > right order and stop at the first match. In my book it's not necessary to follow "solid theory"; if, and only if, it's documented, correctly/sensibly implemented, and implications made clear to the programmer. There's two common examples that come to mind. I think it's okay if there's support for, e.g., back-references if it's clearly stated that you should not expect that this code will run in O(N), linear complexity. But there have been implementations - don't recall if it was in Java, Perl, or both - where they implemented "generalizations" of "Regexp-processings" that had the runtime-effect that even for *true* Regular Expressions you had no O(N) guarantee any more; and this is IMO a fatal decision! If the programmer uses only RE elements from the class of Regular Languages there should always be complexity O(N) guaranteed. > > This matters when regexes are used for matching a prefix of the input; > if the regex is interpreted according to the theory should match > the longest possible prefix; it cannot ignore R3, which matches > thousands of symbols, because R2 matched three symbols. I think we should differentiate the matching process (implementation specific syntax for formulas of REs, matching implementation methods) from the Regular Languages theory; the complete strings of text that we match are in general not part of the Regular Language, the Regular Expression only specifies the subset of (matching) strings as part of the respective Regular Language. And, WRT complexity, we should also be aware that the O(N) in Regular Languages is the complexity of the "match", not the length of the data that is skimmed for a match. The various algorithms combine these complexities in supposedly efficient ways. Some RE parsers failed. Janis