Article <v7qi8t$1mi8m$1@dont-email.me>

Warning: mysqli::__construct(): (HY000/1203): User howardkn already has more than 'max_user_connections' active connections in D:\Inetpub\vhosts\howardknight.net\al.howardknight.net\includes\artfuncs.php on line 21
Failed to connect to MySQL: (1203) User howardkn already has more than 'max_user_connections' active connections
Warning: mysqli::query(): Couldn't fetch mysqli in D:\Inetpub\vhosts\howardknight.net\al.howardknight.net\index.php on line 66
Article <v7qi8t$1mi8m$1@dont-email.me>

Deutsch English Français Italiano

<v7qi8t$1mi8m$1@dont-email.me>

View for Bookmarking (what is this?)
Look up another Usenet article

Path: ...!2.eu.feeder.erje.net!feeder.erje.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Janis Papanagnou <janis_papanagnou+ng@hotmail.com>
Newsgroups: comp.unix.shell
Subject: Re: bash aesthetics question: special characters in reg exp in [[ ...
 =~~ ... ]]
Date: Wed, 24 Jul 2024 11:41:48 +0200
Organization: A noiseless patient Spider
Lines: 88
Message-ID: <v7qi8t$1mi8m$1@dont-email.me>
References: <v7mknf$3plab$1@news.xmission.com>
 <v7n9s1$2p39$1@nnrp.usenet.blueworldhosting.com>
 <v7nfbb$3q3of$1@news.xmission.com> <20240723112050.105@kylheku.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=windows-1252
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 24 Jul 2024 11:41:50 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="a525e55a33c812c78de3ae166044610c";
	logging-data="1788182"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX19dyaTAwJe1yNo9vTN8zKrJ"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101
 Thunderbird/45.8.0
Cancel-Lock: sha1:rhhba3/LkXBnzcAdtfxvcSyUHhk=
In-Reply-To: <20240723112050.105@kylheku.com>
X-Enigmail-Draft-Status: N1110
Bytes: 5444

On 23.07.2024 20:34, Kaz Kylheku wrote:
> 
> [...]
> 
> In the old math/CS papers about regular expressions, regular expressions
> are typically represented in terms of some input symbol alphabet
> (usually just letters a, b, c ...) and only the operators | and *,
> and parentheses (other than when advanced operators are being discussed,
> like intersection and complement, whicha re not easily constructed from
> these.)
> 
> I think character classes might have been a pragmatic invention in
> regex implementations. The theory doesn't require [a-c] because
> that can be encoded as (a|b|c).

While formally we can restrict to some basic elements it's quite
inconvenient in practice. I recall that in Compiler Construction
and Automata Theory we regularly used the 'all-but' operator for
complementing input symbol sets. And also obvious abbreviations
for larger sets of symbols (like 'digits', etc.). Not only in
practical regexp implementations, also in education certainly no
one wants to waste time.

> 
> The ? operator is not required because (R)? can be written (R)(R)*.

ITYM: (R)+

> 
> Escaping is not required because the oeprators and input symbols are
> distinct; the idea that ( could be an input symbol is something that
> occurs in implementations, not in the theory.
> 
> Regex implementors take the theory and adjust it to taste,
> and add necessary details such as character escape sequences for
> control characters, and escaping to allow the oeprator characters
> themselves to be matched. Plus character classes, with negation
> and ranges and all that.

There are (at least) two different types of such adjustments. One
are the convenience enhancements (like the '+' or the multiplicity
('{m,n}', Perl's '\d' and '\D' etc.) that, from a complexity
perspective, all stay within the same [theoretical] class of the
Regular Languages. (There's other types of extensions that we find
in implementations that leave that language class.)

> 
> Not all implementations follow solid theory. For instance, the branch
> operator | is supposed to be commutative.  There is no difference
> between R1|R2 and R2|R1.  But in many implementations (particularly
> backtracking ones like PCRE and similar), there is a difference: these
> implementations implement R1|R2|R3  by trying the expressions in left to
> right order and stop at the first match.

In my book it's not necessary to follow "solid theory"; if, and only
if, it's documented, correctly/sensibly implemented, and implications
made clear to the programmer.

There's two common examples that come to mind. I think it's okay if
there's support for, e.g., back-references if it's clearly stated
that you should not expect that this code will run in O(N), linear
complexity. But there have been implementations - don't recall if it
was in Java, Perl, or both - where they implemented "generalizations"
of "Regexp-processings" that had the runtime-effect that even for
*true* Regular Expressions you had no O(N) guarantee any more; and
this is IMO a fatal decision! If the programmer uses only RE elements
from the class of Regular Languages there should always be complexity
O(N) guaranteed.

> 
> This matters when regexes are used for matching a prefix of the input;
> if the regex is interpreted according to the theory should match
> the longest possible prefix; it cannot ignore R3, which matches
> thousands of symbols, because R2 matched three symbols.

I think we should differentiate the matching process (implementation
specific syntax for formulas of REs, matching implementation methods)
from the Regular Languages theory; the complete strings of text that
we match are in general not part of the Regular Language, the Regular
Expression only specifies the subset of (matching) strings as part of
the respective Regular Language. And, WRT complexity, we should also
be aware that the O(N) in Regular Languages is the complexity of the
"match", not the length of the data that is skimmed for a match. The
various algorithms combine these complexities in supposedly efficient
ways. Some RE parsers failed.

Janis