Path: ...!2.eu.feeder.erje.net!feeder.erje.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Kaz Kylheku <643-408-1753@kylheku.com>
Newsgroups: comp.unix.shell
Subject: Re: bash aesthetics question: special characters in reg exp in [[
 ... =~~ ... ]]
Date: Tue, 23 Jul 2024 18:34:52 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 52
Message-ID: <20240723112050.105@kylheku.com>
References: <v7mknf$3plab$1@news.xmission.com>
 <v7n9s1$2p39$1@nnrp.usenet.blueworldhosting.com>
 <v7nfbb$3q3of$1@news.xmission.com>
Injection-Date: Tue, 23 Jul 2024 20:34:52 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="2e723cea1cdfb5e1d326eb8834436c3e";
	logging-data="1392005"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX19tiyn3NH1ltXqmtLrWJyExCa+YXXFy7jc="
User-Agent: slrn/pre1.0.4-9 (Linux)
Cancel-Lock: sha1:5mRI2wNxzkhy73EnrCIGH0hM4Y0=
Bytes: 3581

On 2024-07-23, Kenny McCormack <gazelle@shell.xmission.com> wrote:
> Which all kind of echoes back to the other recent thread in this NG about
> regular expressions vs. globs.  The cold hard fact is that there really is
> no such thing as "regular expressions" (*), since every language, every
> program, every implementation of them, is quite different.
>
> (*) As an abstract concept, separate from any specific implementation.

Yes, there are regular expressions as an abstract concept. They are part
of the theory of automata.  Much of the research went on up through the
1960's.  The * operator is called the "Kleene star".
https://en.wikipedia.org/wiki/Kleene_star

In the old math/CS papers about regular expressions, regular expressions
are typically represented in terms of some input symbol alphabet
(usually just letters a, b, c ...) and only the operators | and *,
and parentheses (other than when advanced operators are being discussed,
like intersection and complement, which are not easily constructed from
these.)

I think character classes might have been a pragmatic invention in
regex implementations. The theory doesn't require [a-c] because
that can be encoded as (a|b|c).
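
A quick way to convince yourself of that equivalence, as a minimal
sketch using Python's re module as a stand-in for "some regex
implementation". The names CLASS_FORM, ALT_FORM and same_verdict are
made up for this illustration; only the two patterns come from the
paragraph above.

```python
import re

# Hypothetical names for this sketch; the two patterns are the ones
# discussed in the text: a character class and its alternation encoding.
CLASS_FORM = re.compile(r"[a-c]")
ALT_FORM = re.compile(r"(a|b|c)")

def same_verdict(s):
    """True if both patterns agree on whether s matches in full."""
    return bool(CLASS_FORM.fullmatch(s)) == bool(ALT_FORM.fullmatch(s))

samples = ["a", "b", "c", "d", "", "ab"]
print(all(same_verdict(s) for s in samples))
```

This only spot-checks a handful of strings; it is not a proof, just a
sanity check that the two spellings accept the same inputs.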

The + operator is not required because (R)+ can be written (R)(R)*.
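
As a rough check of what (R)(R)* actually accepts, here is a small
sketch that compares it against the one-or-more form (R)+ over every
short string on a two-letter alphabet, taking R = ab as an arbitrary
example. It again assumes Python's re module; all names are invented
for the illustration. For contrast, it also shows that an optional
(R)? is a different language, since it accepts the empty string.

```python
import re
from itertools import product

# Hypothetical example with R = "ab"; any other R would do.
PLUS_FORM = re.compile(r"(ab)+")
STAR_FORM = re.compile(r"(ab)(ab)*")
OPT_FORM = re.compile(r"(ab)?")

def all_strings(alphabet, max_len):
    """Yield every string over alphabet of length 0..max_len."""
    for n in range(max_len + 1):
        for letters in product(alphabet, repeat=n):
            yield "".join(letters)

def forms_agree(max_len=6):
    """True if (R)+ and (R)(R)* accept exactly the same short strings."""
    return all(
        bool(PLUS_FORM.fullmatch(s)) == bool(STAR_FORM.fullmatch(s))
        for s in all_strings("ab", max_len)
    )

print(forms_agree())
# (R)? accepts the empty string; (R)(R)* does not.
print(bool(OPT_FORM.fullmatch("")), bool(STAR_FORM.fullmatch("")))
```

The exhaustive loop is bounded by max_len, so this is evidence rather
than a proof.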

Escaping is not required because the operators and input symbols are
distinct; the idea that ( could be an input symbol is something that
occurs in implementations, not in the theory.

Regex implementors take the theory and adjust it to taste,
and add necessary details such as character escape sequences for
control characters, and escaping to allow the operator characters
themselves to be matched. Plus character classes, with negation
and ranges and all that.

Not all implementations follow solid theory. For instance, the branch
operator | is supposed to be commutative.  There is no difference
between R1|R2 and R2|R1.  But in many implementations (particularly
backtracking ones like PCRE and similar), there is a difference: these
implementations implement R1|R2|R3 by trying the expressions in left to
right order and stopping at the first match.

This matters when regexes are used for matching a prefix of the input.
A regex interpreted according to the theory should match the longest
possible prefix; it cannot ignore R3, which matches thousands of
symbols, just because R2 matched three symbols.
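
To make the difference concrete, here is a minimal sketch in Python,
whose re module is a backtracking engine of the kind described above.
The helper names and the sample branches are assumptions made up for
this example. The first helper just hands R1|R2|R3 to the engine; the
second emulates longest-prefix behavior by trying each branch on its
own and keeping the longest result.

```python
import re

def first_match_prefix(alternatives, text):
    """What a backtracking engine does with R1|R2|R3: leftmost branch wins."""
    m = re.match("|".join(alternatives), text)
    return m.group(0) if m else None

def longest_match_prefix(alternatives, text):
    """Emulate longest-prefix semantics by trying each branch separately."""
    best = None
    for alt in alternatives:
        m = re.match(alt, text)
        if m and (best is None or len(m.group(0)) > len(best)):
            best = m.group(0)
    return best

text = "abcabc"
print(first_match_prefix(["a", "ab", "(abc)*"], text))    # prints: a
print(first_match_prefix(["(abc)*", "ab", "a"], text))    # prints: abcabc
print(longest_match_prefix(["a", "ab", "(abc)*"], text))  # prints: abcabc
```

Reordering the branches changes the first helper's answer but not the
second's. Note the emulation is only as good as each individual branch:
a branch that itself contains | is still matched leftmost-first by the
underlying engine.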

-- 
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca