<101hecq$22ab2$1@dont-email.me>

Path: news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: Janis Papanagnou <janis_papanagnou+ng@hotmail.com>
Newsgroups: comp.lang.awk
Subject: Re: substr() - copying or not copying, that is here the question.
Date: Sun, 1 Jun 2025 13:43:21 +0200
Organization: A noiseless patient Spider
Lines: 68
Message-ID: <101hecq$22ab2$1@dont-email.me>
References: <101f9oo$18edp$1@dont-email.me>
 <683b5389$0$683$14726298@news.sunsite.dk> <101fv4s$1g5c8$1@dont-email.me>
 <87h60zrbea.fsf@bsb.me.uk>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit
Injection-Date: Sun, 01 Jun 2025 13:43:23 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="5efe03dbd7af97f43c3764a2772b692a";
	logging-data="2173282"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX19AQs0a3JVTS4S3unslT+6I"
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:45.0) Gecko/20100101
 Thunderbird/45.8.0
Cancel-Lock: sha1:uT3QIitYCcQne9Xg470KHSONVu4=
X-Enigmail-Draft-Status: N1110
In-Reply-To: <87h60zrbea.fsf@bsb.me.uk>

On 01.06.2025 12:42, Ben Bacarisse wrote:
> Janis Papanagnou <janis_papanagnou+ng@hotmail.com> writes:
> 
>> On 31.05.2025 21:07, Mack The Knife wrote:
>>> In article <101f9oo$18edp$1@dont-email.me>,
>>> Janis Papanagnou  <janis_papanagnou+ng@hotmail.com> wrote:
>>>> In the context   p=index(substr(t,s),r)
>>>> it would not be necessary to copy the substr(t,s),
>>>> the index() function could operate on the original
>>>> using some access "descriptor" (say, a pointer and
>>>> a length) in read-only mode.
>>>>
>>>> Will (GNU) Awk do a copy of the data value or does
>>>> it use a read-only descriptor access to the already
>>>> existing substring of variable "t"?
>>>>
>>>> Currently I'm playing with some huge data and copies
>>>> of MB sized data is costly (if it's repeatedly done
>>>> with various substr() subscripts).
>>>
>>> substr() makes a copy. This is clear in the code.
>>
>> Okay. Thanks for checking that!
> ...
>> Okay, maybe I could write an extension to work on memory
>> mapped files - the data originally stems from a file -
>> and seek/read through "C" mechanisms. (But that's huge
>> effort compared to some natively available function. And
>> then I'd probably better implement that straightly in "C"
>> instead of using Awk, in the first place, since I'd have
>> to implement the GNU Awk Extension anyway in "C".)
> 
> An alternative (depending on the context) would be to consider an
> extension that provides an index function with a third argument giving
> the initial offset.  I've not looked at how extensions get access to
> GAWK strings, so this may not be as easy as it sounds, but I would
> guess that it might be relatively simple to do.

This, first of all, sounds like a good idea! It would make it
unnecessary to (mis-)use the substr() function as (sort of) a
costly copying-descriptor.[*]

I'm unsure about using an extension here. Would there be a name
clash between a built-in index(haystack,needle) function and an
extension index(haystack,needle,start) function? Should they be
separate functions in the first place? (I don't think so.)

In the past, extended functionality was added through additional
optional parameters in the core Awk code. (Which seems to be the
best place [for optional controlling arguments].) - There are quite
a few examples where optional parameters in the core functions seem
to have worked well. The changes were evidently local, and the
feared side effects apparently did not arise.

But we've read the recent links (with the interview) or already
know what Arnold thinks about that; and his view is ambivalent. On
the one hand there was a complaint about quality issues of contributed
code and the maintainer's reluctance to add such code - which is very
understandable! On the other hand there's the problem that maintainers
don't want to "jump" whenever arbitrary feature wishes arise.

Janis

[*] N.B.: To be consistent it should probably support a substring,
as in index(haystack,needle [,start[,end]]), since the application
example given above p=index(substr(t,s),r) in its generalized form
would have been p=index(substr(t,s,e),r).
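
To make the "pointer and a length" idea concrete, here is a minimal
sketch in C of such a windowed search. It is only an illustration, not
gawk code: the name index_window() and its parameters are hypothetical,
it works on bytes rather than multibyte characters, and returning 0 for
an empty needle is a choice made for this sketch. Since substr() takes
a length as its third argument, the sketch takes a window length too;
a variant with an end position would differ only in how the window is
computed.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of gawk): search for `needle` inside a
 * window of `hay` that begins at 1-based position `start` and spans at
 * most `win_len` bytes. No bytes are copied; the window is just a
 * pointer plus a length. The result is 1-based and relative to the
 * start of the window, or 0 if the needle does not occur in it.
 */
static size_t index_window(const char *hay, size_t hay_len,
                           const char *needle, size_t needle_len,
                           size_t start, size_t win_len)
{
    assert(hay != NULL && needle != NULL);

    if (start < 1)
        start = 1;
    if (start > hay_len || needle_len == 0)
        return 0;

    /* Clamp the window to what is actually left in the haystack. */
    size_t avail = hay_len - start + 1;
    if (win_len > avail)
        win_len = avail;
    if (needle_len > win_len)
        return 0;

    /* Read-only "descriptor" of the window: base pointer and win_len. */
    const char *base = hay + (start - 1);
    for (size_t i = 0; i + needle_len <= win_len; i++) {
        if (memcmp(base + i, needle, needle_len) == 0)
            return i + 1;
    }
    return 0;
}
```

In Awk terms, index_window(t, length(t), r, length(r), s, n) is meant
to give the same position as index(substr(t,s,n),r), only without the
intermediate copy.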