Path: ...!3.eu.feeder.erje.net!feeder.erje.net!usenet.goja.nl.eu.org!dotsrc.org!filter.dotsrc.org!news.dotsrc.org!not-for-mail
Newsgroups: comp.lang.awk
Subject: Re: GNU Awk's types of regular expressions
References: <viac5m$l8oh$1@dont-email.me>
Organization: The Friends of Rational Range Interpretation
X-Newsreader: trn 4.0-test77 (Sep 1, 2010)
From: arnold@freefriends.org (Aharon Robbins)
Originator: arnold@freefriends.org (Aharon Robbins)
Date: 01 Dec 2024 20:20:22 GMT
Lines: 92
Message-ID: <674cc506$0$711$14726298@news.sunsite.dk>
NNTP-Posting-Host: 6239825f.news.sunsite.dk
X-Trace: 1733084422 news.sunsite.dk 711 arnold@skeeve.com/198.99.81.75:43838
X-Complaints-To: staff@sunsite.dk
Bytes: 4048

Hi. Mack The Knife pointed me at this question.

This kind of query should go to the bug list (where I'll see it).
I skim the help list occasionally but don't reply to mails there.

In article <viac5m$l8oh$1@dont-email.me> Janis writes:
>In GNU Awk there's currently three types of regular expressions, in
>addition to the standard regexp-constants (/regex/) and the dynamic
>regexps ("regex", or variables containing "regex") there's in newer
>versions also first class regexp objects (@/regex/, "Strongly Typed
>Regexp Constants") supported.
>
>One principal advantage of regexp-constants is that the engine to
>parse the regexp can be created in advance, while a dynamic regexp
>may be constructed dynamically (from strings) and needs an explicit
>runtime-step to create the engine before the matching can be done.

Even for such dynamically created regexps, the regexp is compiled once and
cached, not compiled each time it's used (as long as it doesn't change).

>Now I assumed that  @/regex-const/  would in that respect behave as
> /regex-const/ ... - until I found in the GNU Awk manual this text:
>
>| Thus, if you have something like this:
>|
>|   re = @/don't panic/
>|   sub(/don't/, "do", re)
>|   print typeof(re), re
>|
>| then re retains its type, but now attempts to match the string ‘do
>| panic’. This provides a (very indirect) way to create regexp-typed
>| variables at runtime.
>
>(I'm astonished that first class regexp objects can be dynamically
>changed. But that is not my point here; I'm interested in potential
>pre-compiles of regexp constants...)

Since `re' is a variable, it can be changed, just as when you do

	str = "don't panic"
	sub(/don't/, "do", str)

>This would imply that the first class regexp constants can be changed
>like dynamic regexps and that there's no regexp pre-compile involved.

"Not so, Watson! Not so!"  When you do

	re = @/don't panic/

gawk uses reference-counted pointers to the original object; the
original strongly typed regexp is precompiled and remains that way.

As soon as you go to *change* `re', gawk makes a copy of the string
value of the original regexp, makes the substitution, notes that
it's a strongly typed regexp, and compiles the new regexp. From then
on, the cached compiled regexp is used for matching.

>This would also rise suspicion that the "normal" regexp-constants are
>probably also not precomputed.

Also not true; ordinary regexp constants are precompiled too.

>So constant-regexps (both forms) have (only?) the advantage that the
>regexp-syntax can be (initially during awk parsing) checked, e.g.,
>
> 	re = @/don't panic[/
> 	     ^ unterminated regexp

Incorrect. Both forms are compiled, not merely syntax-checked, when the
program is parsed.

>And dynamic regexps and first class regexps that got changed (e.g.
>by code like
>
>  sub(/don't/, "do[", re)
>
>in above sample snippet) would both create runtime errors, e.g.
>
>  error: Unmatched [, [^, [:, [., or [=: /do[ panic/
>  fatal: could not make typed regex
>
>(as all ill-formed regexp-types will produce a runtime error).

Well, of course.

In short, I jump through a lot of hoops in order to avoid recompiling
regexps if it's not necessary.

Hope this helps,

Arnold
-- 
Aharon (Arnold) Robbins 		arnold AT skeeve DOT com