From: Ben Bacarisse <ben@bsb.me.uk>
Newsgroups: comp.unix.shell
Subject: Re: bash aesthetics question: special characters in reg exp in [[ ... =~ ... ]]
Date: Wed, 24 Jul 2024 00:51:44 +0100
Organization: A noiseless patient Spider
Message-ID: <87y15r650v.fsf@bsb.me.uk>
References: <v7mknf$3plab$1@news.xmission.com> <v7n9s1$2p39$1@nnrp.usenet.blueworldhosting.com> <v7nfbb$3q3of$1@news.xmission.com> <20240723112050.105@kylheku.com>

Kaz Kylheku <643-408-1753@kylheku.com> writes:

> On 2024-07-23, Kenny McCormack <gazelle@shell.xmission.com> wrote:
>> Which all kind of echoes back to the other recent thread in this NG about
>> regular expressions vs. globs.  The cold hard fact is that there really is
>> no such thing as "regular expressions" (*), since every language, every
>> program, every implementation of them, is quite different.
>>
>> (*) As an abstract concept, separate from any specific implementation.
>
> Yes, there are regular expressions as an abstract concept.  They are part
> of the theory of automata.  Much of the research went on up through the
> 1960's.  The * operator is called the "Kleene star".
> https://en.wikipedia.org/wiki/Kleene_star
>
> In the old math/CS papers about regular expressions, regular expressions
> are typically represented in terms of some input symbol alphabet
> (usually just letters a, b, c ...) and only the operators | and *,
> and parentheses (other than when advanced operators are being discussed,
> like intersection and complement, which are not easily constructed from
> these.)
>
> I think character classes might have been a pragmatic invention in
> regex implementations.  The theory doesn't require [a-c] because
> that can be encoded as (a|b|c).
>
> The ? operator is not required because (R)? can be written (R)(R)*.

(Aside: the choice is arbitrary but + would be a more "Unixy" choice for
that operator.)

> Escaping is not required because the operators and input symbols are
> distinct; the idea that ( could be an input symbol is something that
> occurs in implementations, not in the theory.
>
> Regex implementors take the theory and adjust it to taste,
> and add necessary details such as character escape sequences for
> control characters, and escaping to allow the operator characters
> themselves to be matched.  Plus character classes, with negation
> and ranges and all that.
>
> Not all implementations follow solid theory.  For instance, the branch
> operator | is supposed to be commutative.  There is no difference
> between R1|R2 and R2|R1.  But in many implementations (particularly
> backtracking ones like PCRE and similar), there is a difference: these
> implementations implement R1|R2|R3 by trying the expressions in left to
> right order and stop at the first match.
>
> This matters when regexes are used for matching a prefix of the input;
> if the regex is interpreted according to the theory, it should match
> the longest possible prefix; it cannot ignore R3, which matches
> thousands of symbols, because R2 matched three symbols.

This is more a consequence of the different views.  In the formal
theory there is no notion of "matching".  Regular expressions define
languages (i.e. sets of sequences of symbols) according to a recursive
set of rules.  The whole idea of an RE matching a string is from their
use in practical applications.

-- 
Ben.
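
The quoted point about | not being commutative in backtracking engines is easy to check. What follows is only a minimal sketch in Python, whose re module is a backtracking matcher that tries alternatives left to right. The helper names matched_prefix and longest_prefix are made up for this illustration, and longest_prefix only emulates the "longest prefix in the language" reading under the assumption that each branch is a plain literal.

```python
import re


def matched_prefix(pattern, text):
    """Return the prefix of text that Python's backtracking engine matches."""
    m = re.match(pattern, text)
    return m.group(0) if m else ""


def longest_prefix(alternatives, text):
    """Emulate the language-theoretic reading of R1|R2|...: try every
    branch and keep the longest prefix any of them accepts.
    Assumption: each branch is a simple literal, so the per-branch
    match is already that branch's longest match."""
    best = ""
    for alt in alternatives:
        m = re.match(alt, text)
        if m and len(m.group(0)) > len(best):
            best = m.group(0)
    return best


# Same two branches, different order, different result.
print(matched_prefix("a|ab", "abc"))        # a
print(matched_prefix("ab|a", "abc"))        # ab

# Taking the longest over all branches does not depend on the order.
print(longest_prefix(["a", "ab"], "abc"))   # ab
print(longest_prefix(["ab", "a"], "abc"))   # ab
```

Both patterns denote the same language, {a, ab}; only the engine's notion of "the match" differs. That is the distinction drawn above between an RE as a definition of a set of strings and an RE as something that matches.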