Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Ben Bacarisse <ben@bsb.me.uk>
Newsgroups: comp.unix.shell
Subject: Re: bash aesthetics question: special characters in reg exp in [[ ... =~~ ... ]]
Date: Wed, 24 Jul 2024 00:51:44 +0100
Organization: A noiseless patient Spider
Lines: 61
Message-ID: <87y15r650v.fsf@bsb.me.uk>
References: <v7mknf$3plab$1@news.xmission.com>
	<v7n9s1$2p39$1@nnrp.usenet.blueworldhosting.com>
	<v7nfbb$3q3of$1@news.xmission.com> <20240723112050.105@kylheku.com>
MIME-Version: 1.0
Content-Type: text/plain
Injection-Date: Wed, 24 Jul 2024 01:51:44 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="cecbc031db9d83bdacabd1935f084c00";
	logging-data="1491887"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1+c5SUt13BffVvMg4jsBCb1PGo1A6rnrhs="
User-Agent: Gnus/5.13 (Gnus v5.13)
Cancel-Lock: sha1:f5GPZzSOEjhV5ilsrVf4RYCYVHU=
	sha1:gPAepOMQC9xXgYuj6LXscp3o3hs=
X-BSB-Auth: 1.41579f614674024aba7e.20240724005144BST.87y15r650v.fsf@bsb.me.uk
Bytes: 4109

Kaz Kylheku <643-408-1753@kylheku.com> writes:

> On 2024-07-23, Kenny McCormack <gazelle@shell.xmission.com> wrote:
>> Which all kind of echoes back to the other recent thread in this NG about
>> regular expressions vs. globs.  The cold hard fact is that there really is
>> no such thing as "regular expressions" (*), since every language, every
>> program, every implementation of them, is quite different.
>>
>> (*) As an abstract concept, separate from any specific implementation.
>
> Yes, there are regular expressions as an abstract concept. They are part
> of the theory of automata.  Much of the research went on up through the
> 1960's.  The * operator is called the "Kleene star".
> https://en.wikipedia.org/wiki/Kleene_star
>
> In the old math/CS papers about regular expressions, regular expressions
> are typically represented in terms of some input symbol alphabet
> (usually just letters a, b, c ...) and only the operators | and *,
> and parentheses (other than when advanced operators are being discussed,
> like intersection and complement, which are not easily constructed from
> these.)
>
> I think character classes might have been a pragmatic invention in
> regex implementations. The theory doesn't require [a-c] because
> that can be encoded as (a|b|c).
>
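
The encoding is easy to check mechanically.  A small Python sketch (the
language and the sample strings are assumptions chosen for illustration)
comparing the class [a-c] with the alternation (a|b|c):

```python
import re

# The character class and its spelled-out alternation.
char_class = re.compile(r"[a-c]")
alternation = re.compile(r"(a|b|c)")

# Illustrative sample strings only; not an exhaustive proof.
samples = ["a", "b", "c", "d", "", "ab"]

# For every sample, both patterns either accept or reject the whole string.
agree = all(
    (char_class.fullmatch(s) is None) == (alternation.fullmatch(s) is None)
    for s in samples
)
print(agree)
```

Both patterns accept exactly the one-symbol strings a, b and c from this
sample set.
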
> The ? operator is not required because (R)? can be written (R)(R)*.

(Aside: the choice is arbitrary but + would be a more "Unixy" choice for
that operator.)
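
A quick check of the point, sketched in Python (the language and the
sample sub-expression "ab" are assumptions for illustration): (R)(R)*
accepts the same strings as (R)+, and unlike (R)? it rejects the empty
string.  In the bare theory, an optional R is usually written as R or the
empty string, which Python lets you spell as (R|).

```python
import re

R = "ab"  # arbitrary sample sub-expression, for illustration only

spelled_out = re.compile(f"({R})({R})*")   # the form quoted above
one_or_more = re.compile(f"({R})+")
optional = re.compile(f"({R})?")
r_or_empty = re.compile(f"({R}|)")         # R or the empty string


def accepts(pattern, s):
    """True if the pattern matches the whole of s."""
    return pattern.fullmatch(s) is not None


samples = ["", "ab", "abab", "a", "aba"]

# For each sample: (spelled_out, one_or_more, optional, r_or_empty)
results = {
    s: tuple(accepts(p, s) for p in (spelled_out, one_or_more, optional, r_or_empty))
    for s in samples
}

for s, row in results.items():
    print(repr(s), row)
```

The first two columns agree on every sample, as do the last two; the
pairs differ on the empty string and on "abab".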

> Escaping is not required because the operators and input symbols are
> distinct; the idea that ( could be an input symbol is something that
> occurs in implementations, not in the theory.
>
> Regex implementors take the theory and adjust it to taste,
> and add necessary details such as character escape sequences for
> control characters, and escaping to allow the operator characters
> themselves to be matched. Plus character classes, with negation
> and ranges and all that.
>
> Not all implementations follow solid theory. For instance, the branch
> operator | is supposed to be commutative.  There is no difference
> between R1|R2 and R2|R1.  But in many implementations (particularly
> backtracking ones like PCRE and similar), there is a difference: these
> implementations implement R1|R2|R3  by trying the expressions in left to
> right order and stop at the first match.
>
> This matters when regexes are used for matching a prefix of the input;
> if the regex is interpreted according to the theory, it should match
> the longest possible prefix; it cannot ignore R3, which matches
> thousands of symbols, because R2 matched three symbols.

This is more a consequence of the different views. In the formal
theory there is no notion of "matching".  Regular expressions define
languages (i.e. sets of sequences of symbols) according to a recursive
set of rules.  The whole idea of an RE matching a string is from their
use in practical applications.
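
The two views can be put side by side in a small Python sketch (Python
is an assumed choice here, and the `language` helper is hypothetical: it
just filters every string over a tiny alphabet up to a fixed length).
Seen as languages, a|ab and ab|a denote the same set.  Seen as patterns
for Python's backtracking re module, which tries alternatives left to
right, the prefix reported depends on the order.

```python
import re
from itertools import product


def language(pattern, alphabet="ab", max_len=3):
    """Strings over `alphabet`, up to `max_len` symbols, that `pattern` defines.

    Hypothetical helper for illustration: it brute-forces a finite slice
    of the language by testing whole-string matches.
    """
    compiled = re.compile(pattern)
    words = (
        "".join(symbols)
        for n in range(max_len + 1)
        for symbols in product(alphabet, repeat=n)
    )
    return {w for w in words if compiled.fullmatch(w)}


# As sets of strings, the order of the alternatives makes no difference.
same_language = language("a|ab") == language("ab|a")

# As prefix matchers in Python's re, the first alternative that works wins.
first = re.match("a|ab", "abab").group()
second = re.match("ab|a", "abab").group()

print(same_language, first, second)
```

Here both patterns define the set containing just "a" and "ab", yet the
first reports the prefix "a" and the second reports "ab".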

-- 
Ben.