Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Kaz Kylheku <643-408-1753@kylheku.com>
Newsgroups: comp.lang.awk
Subject: Re: GNU Awk's types of regular expressions
Date: Fri, 29 Nov 2024 04:13:43 -0000 (UTC)
Organization: A noiseless patient Spider
Lines: 94
Message-ID: <20241128200247.439@kylheku.com>
References: <viac5m$l8oh$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 29 Nov 2024 05:13:44 +0100 (CET)
Injection-Info: dont-email.me; posting-host="a2969b641593bbd1632ba0ef9b48b172"; logging-data="999851"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+/SGTvLpvRazMeBegTakeVCL8ClViN6iU="
User-Agent: slrn/pre1.0.4-9 (Linux)
Cancel-Lock: sha1:jtVd+3H/rwvtNgAYiHNh2gMq5xI=
Bytes: 4576

On 2024-11-28, Janis Papanagnou <janis_papanagnou+ng@hotmail.com> wrote:
> In GNU Awk there's currently three types of regular expressions, in
> addition to the standard regexp-constants (/regex/) and the dynamic
> regexps ("regex", or variables containing "regex") there's in newer
> versions also first class regexp objects (@/regex/, "Strongly Typed
> Regexp Constants") supported.
>
> One principal advantage of regexp-constants is that the engine to
> parse the regexp can be created in advance, while a dynamic regexp
> may be constructed dynamically (from strings) and needs an explicit
> runtime-step to create the engine before the matching can be done.
> Now I assumed that @/regex-const/ would in that respect behave as
> /regex-const/ ... - until I found in the GNU Awk manual this text:
>
>|
>| Thus, if you have something like this:
>|
>|     re = @/don't panic/
>|     sub(/don't/, "do", re)
>|     print typeof(re), re
>|
>| then re retains its type, but now attempts to match the string ‘do
>| panic’. This provides a (very indirect) way to create regexp-typed
>| variables at runtime.
>|
>
> (I'm astonished that first class regexp objects can be dynamically
> changed. But that is not my point here; I'm interested in potential
> pre-compiles of regexp constants...)

I would flatly reject a commit to do such a thing. Yikes!

What representation is it working on? If the regex contains
a match for a literal backslash using escaping, does that
count as two backslash characters when you operate on it?
Or is it a single backslash? Can you replace the second backslash
with an 'n' and have the pair turn into a newline?

Is it just tromboning back to printed representation, and then
parsing again?

In TXR Lisp, I provide this:

  1> (regex-source #/a.*b(c|d)/)
  (compound #\a (0+ wild) #\b (or #\c #\d))

You can get the source code of the regex object as a nested list with
symbols, characters and other objects. When you have this, you can
analyze and transform it. Then you can call regex-compile on the result.

For instance, prepend a match for the z character:

  2> (regex-compile ^(compound #\z ,*(cdr *1)))
  #/za.*b(c|d)/

This is robust; you're not dealing with any character-syntax issues
like escapes, because you have the abstract syntax tree of the regex.

> This would imply that the first class regexp constants can be changed
> like dynamic regexps and that there's no regexp pre-compile involved.

Not necessarily; it could be that a new regex is compiled, and put
into the re variable, clobbering the old regex, which is freed
(if it hits a refcount of zero or whatever mem management is used).

It could also (in combination with this) be lazy. So that is to say
@/abc/ will just store the textual source code of the regex into
the regex object, but not compile anything. When it comes time to
use the regex, on first use, it is compiled and then cached into that
object. When the regex is edited, the cache is invalidated.

Someone will undoubtedly chime in confirming or refuting these
hypotheses.
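
The lazy hypothesis above can be sketched roughly as follows. This is
only an illustration in Python, not a description of what gawk actually
does: the class LazyRegex, its methods and the compile_count field are
hypothetical names made up for the sketch, and Python's re module
stands in for whatever regex engine the awk implementation uses.

```python
import re


class LazyRegex:
    """Hypothetical regex object: stores source text, compiles on first use."""

    def __init__(self, source):
        self.source = source    # textual source, as @/abc/ might store it
        self._compiled = None   # cache slot; empty until the first match
        self.compile_count = 0  # instrumentation for this sketch only

    def _engine(self):
        # Compile only when needed; a malformed source raises re.error here.
        if self._compiled is None:
            self._compiled = re.compile(self.source)
            self.compile_count += 1
        return self._compiled

    def search(self, text):
        return self._engine().search(text)

    def edit(self, pattern, replacement):
        # Loose analogue of sub(/pattern/, replacement, re): rewrite the
        # stored source text and throw away the cached compiled form.
        self.source = re.sub(pattern, replacement, self.source, count=1)
        self._compiled = None


rx = LazyRegex("don't panic")
rx.search("please don't panic")  # first use: compiles and caches
rx.search("don't panic at all")  # second use: reuses the cache
rx.edit("don't", "do")           # edit: cache invalidated, nothing compiled yet
```

Under this particular design, an edit that leaves the source malformed
would not fail at edit time; the error would only surface on the next
attempt to match.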

It would be pretty silly if these regex objects didn't cache
a compiled regex across multiple uses.

> And dynamic regexps and first class regexps that got changed (e.g.
> by code like
>
>     sub(/don't/, "do[", re)
>
> in above sample snippet) would both create runtime errors, e.g.

Have you tried this? Do you get an error at sub() time, or when you
later try to use re?

-- 
TXR Programming Language: http://nongnu.org/txr
Cygnal: Cygwin Native Application Library: http://kylheku.com/cygnal
Mastodon: @Kazinator@mstdn.ca