From: Kaz Kylheku <643-408-1753@kylheku.com>
Newsgroups: comp.unix.shell
Subject: Re: bash aesthetics question: special characters in reg exp in [[ ... =~~ ... ]]
Date: Tue, 23 Jul 2024 18:34:52 -0000 (UTC)
Organization: A noiseless patient Spider
Message-ID: <20240723112050.105@kylheku.com>

On 2024-07-23, Kenny McCormack wrote:
> Which all kind of echoes back to the other recent thread in this NG about
> regular expressions vs. globs. The cold hard fact is that there really is
> no such thing as "regular expressions" (*), since every language, every
> program, every implementation of them, is quite different.
>
> (*) As an abstract concept, separate from any specific implementation.

Yes, there are regular expressions as an abstract concept. They are
part of the theory of automata. Much of the research went on up through
the 1960's.

The * operator is called the "Kleene star".
https://en.wikipedia.org/wiki/Kleene_star

In the old math/CS papers about regular expressions, regular expressions
are typically represented in terms of some input symbol alphabet
(usually just letters a, b, c ...) and only the operators | and *,
and parentheses (other than when advanced operators are being discussed,
like intersection and complement, which are not easily constructed
from these.)

I think character classes might have been a pragmatic invention in
regex implementations. The theory doesn't require [a-c] because
that can be encoded as (a|b|c). The ?
operator is not required because (R)? can be written (R|ε), where ε
is the empty string; likewise (R)+ can be written (R)(R)*. Escaping is
not required because the operators and input symbols are distinct; the
idea that ( could be an input symbol is something that occurs in
implementations, not in the theory.

Regex implementors take the theory and adjust it to taste, and add
necessary details such as character escape sequences for control
characters, and escaping to allow the operator characters themselves
to be matched. Plus character classes, with negation and ranges and
all that.

Not all implementations follow solid theory. For instance, the branch
operator | is supposed to be commutative. There is no difference
between R1|R2 and R2|R1. But in many implementations (particularly
backtracking ones like PCRE and similar), there is a difference: these
implementations implement R1|R2|R3 by trying the expressions in left
to right order and stopping at the first match.

This matters when regexes are used for matching a prefix of the input;
a regex interpreted according to the theory should match the longest
possible prefix; it cannot ignore R3, which matches thousands of
symbols, because R2 matched three symbols.

-- 
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca
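
To make the point about | concrete, here is a minimal sketch in Python,
whose re module is one of the backtracking implementations with ordered
alternation. The helper names first_match_prefix and longest_prefix are
made up for this illustration and are not part of any library. The second
helper only emulates the "longest prefix" rule for a top-level alternation
by trying each branch separately; it is not how an automaton-based engine
works internally.

```python
import re


def first_match_prefix(branches, text):
    """Prefix matched by Python's backtracking engine for R1|R2|...

    Branches are tried left to right, so with these simple branches the
    first one that matches at the start of the text decides the result.
    """
    pattern = "|".join(f"(?:{b})" for b in branches)
    m = re.match(pattern, text)
    return m.group(0) if m else None


def longest_prefix(branches, text):
    """Emulate the theoretical result for a top-level alternation.

    Each branch is matched on its own and the longest matched prefix is
    kept, so the order of the branches cannot change the answer.
    (Illustrative only: assumes each branch's own match is its longest.)
    """
    best = None
    for branch in branches:
        m = re.match(branch, text)
        if m and (best is None or len(m.group(0)) > len(best)):
            best = m.group(0)
    return best


if __name__ == "__main__":
    text = "abcd"
    forward = ["a", "ab", "abc"]
    backward = list(reversed(forward))

    # Ordered alternation: the result depends on how the branches are listed.
    print(first_match_prefix(forward, text))
    print(first_match_prefix(backward, text))

    # Longest-prefix rule: the result is the same for both orders.
    print(longest_prefix(forward, text))
    print(longest_prefix(backward, text))
```

With the branches a, ab, abc against the input abcd, the ordered version
returns a different prefix depending on the order in which the branches are
written, while the longest-prefix version treats | as commutative.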