Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Ed Morton <mortonspam@gmail.com>
Newsgroups: comp.lang.awk
Subject: Re: [gawk] Handling variants of CSV input data formats
Date: Tue, 27 Aug 2024 06:45:32 -0500
Organization: A noiseless patient Spider
Lines: 147
Message-ID: <vake8r$2vjon$1@dont-email.me>
References: <vaeh9m$1pfge$1@dont-email.me> <vahop1$2eavu$1@dont-email.me>
<vahttd$2f666$1@dont-email.me> <vaj7ps$2lph3$1@dont-email.me>
<vajant$2m8em$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 27 Aug 2024 13:45:32 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="0da34e7009e7efa5655ea33214084bce";
logging-data="3133207"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/u87TNOUxxOZTxmRPOtmDb"
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:eQrItMMCdwQXP/m+Vy3O9BUpBxk=
X-Antivirus-Status: Clean
Content-Language: en-US
In-Reply-To: <vajant$2m8em$1@dont-email.me>
X-Antivirus: Avast (VPS 240826-4, 8/26/2024), Outbound message
Bytes: 6256
On 8/26/2024 8:39 PM, Janis Papanagnou wrote:
> On 27.08.2024 02:49, Ed Morton wrote:
>> On 8/26/2024 7:54 AM, Janis Papanagnou wrote:
>>> <snip>
>>> I'd have liked to provide more concrete information here, but I'm at
>>> the moment even unable to reproduce Awk's behavior as documented in
>>> its manual; I've tried the following command with various locales
>>>
>>> $ echo 4,321 | LC_ALL=en_DK.utf-8 gawk '{ print $1 + 1 }'
>>> -| 5,321
>>>
>>> but always got just 5 as result.
>>
>> You need to specifically TELL gawk to use your locale to read input
>> numbers:
>>
>> $ echo 4,321 | LC_ALL=en_DK.utf-8 gawk '{ print $1 + 1 }'
>> 5
>>
>> $ echo 4,321 | POSIXLY_CORRECT=1 LC_ALL=en_DK.utf-8 gawk '{ print $1 + 1 }'
>> 5,321
>>
>> $ echo 4,321 | LC_ALL=en_DK.utf-8 gawk -N '{ print $1 + 1 }'
>> 5,321
>>
>> See
>> https://www.gnu.org/software/gawk/manual/gawk.html#Locale-influences-conversions
>> for more info on that.
>
> Thanks. That's actually where I got above example from.
>
> I've missed that there was an explicit
> $ export POSIXLY_CORRECT=1
> set on the very top of these examples. Gee!
>
> Feels anyway strange that an explicit LC_* setting is ineffective
> without the additional POSIXLY_CORRECT variable. And the page also
> says: "The POSIX standard says that awk always uses the period as
> the decimal point when reading the awk program source code".
> So despite POSIX saying that, you have to use a variable named
> POSIXLY_CORRECT. - Do I need some more coffee to understand that?
POSIXLY_CORRECT=1 (or equivalently `--posix` aka `-P`) affects numbers
in the input your script reads (as shown in the previous post) and
strings being converted to numbers in your code. It doesn't affect
literal numbers in the source code for your script that awk reads.
In the source code the decimal separator for a literal number (as
opposed to a string being converted to a number) is always `.`.
You can't use, say, a comma as the decimal separator in a literal number
because a comma already means something in the awk syntax, e.g. `print
4,321` means the same as `print 4 OFS 321`.
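A quick way to see that for yourself is a minimal sketch like the one below, assuming any POSIX awk is on your PATH. OFS is set to `-` here only so the separator is visible in the output; the shell variable `out` is just for illustration:

```shell
# The comma in `print 4,321` separates two print arguments,
# so the output is 4, then OFS, then 321.
out=$(awk 'BEGIN{ OFS="-"; print 4,321 }')
echo "$out"    # prints 4-321
```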
For example, this code compares a literal number 4.321 to another
literal number 4.1 and prints "bigger" because 4.321 is a bigger number
than 4.1:
$ awk 'BEGIN{ if (4.321 > 4.1) print "bigger"; else print "not" }'
bigger
while this attempt to use `,` as the decimal separator is a syntax error
because `,` has a meaning in awk syntax:
$ awk 'BEGIN{ if (4,321 > 4.1) print "bigger"; else print "not" }'
awk: cmd. line:1: BEGIN{ if (4,321 > 4.1) print "bigger"; else print "not" }
awk: cmd. line:1: ^ syntax error
You can't just write the number as a string because then you're doing a
string comparison and comparing the collation order of `.` vs `,`, not
the values of the numbers:
$ awk 'BEGIN{ if ("4,321" > 4.1) print "bigger"; else print "not" }'
not
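To make the string comparison explicit, here is a small sketch, assuming the C locale so that characters compare in plain byte order (`,` is 0x2C, `.` is 0x2E). The awk variables `a` and `b` and the shell variable `out` are just for illustration:

```shell
# Both comparisons are string comparisons because both operands are strings.
# a: "," sorts before "." in the C locale.
# b: so "4,321" sorts before "4.1", since the strings first differ at , vs .
out=$(LC_ALL=C awk 'BEGIN{ a = ("," < "."); b = ("4,321" < "4.1"); print a, b }')
echo "$out"    # prints 1 1
```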
You might think you can just add `0` to convert the string to a number:
$ awk 'BEGIN{ if (("4,321"+0) > 4.1) print "bigger"; else print "not" }'
not
but by default in gawk that conversion truncates everything from the `,`
on and so you end up comparing the number 4 to the number 4.1.
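You can check the truncation on its own with a sketch like this, assuming the C locale and no `-N` or `-P` option (the shell variable `out` is just for illustration):

```shell
# String-to-number conversion stops at the first character that cannot be
# part of a number, which here is the comma, so only the leading 4 is used.
out=$(LC_ALL=C awk 'BEGIN{ print "4,321" + 0 }')
echo "$out"    # prints 4
```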
To convert `"4,321"` to the number `4.321` in gawk you need to once
again set your locale and tell gawk to use it:
$ LC_ALL=en_DK.utf-8 awk 'BEGIN{ if (("4,321"+0) > 4.1) print "bigger"; else print "not" }'
not
$ LC_ALL=en_DK.utf-8 awk -N 'BEGIN{ if (("4,321"+0) > 4.1) print "bigger"; else print "not" }'
bigger
$ LC_ALL=en_DK.utf-8 awk -P 'BEGIN{ if (("4,321"+0) > 4.1) print "bigger"; else print "not" }'
bigger
Or, in gawk, instead of adding 0 you could use `strtonum()` on a string
containing that value to convert it to a number. For example:
$ LC_ALL=en_DK.utf-8 awk 'BEGIN{ if (strtonum("4,321") > 4.1) print "bigger"; else print "not" }'
not
$ LC_ALL=en_DK.utf-8 awk -N 'BEGIN{ if (strtonum("4,321") > 4.1) print "bigger"; else print "not" }'
bigger
$ LC_ALL=en_DK.utf-8 awk -P 'BEGIN{ if (strtonum("4,321") > 4.1) print "bigger"; else print "not" }'
awk: cmd. line:1: fatal: function `strtonum' not defined
That last example shows why `-N` is better than `-P`: they both do what
you want with the number, but `-P` also disables gawk extensions such as
`strtonum()` while `-N` doesn't.
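If you can't rely on `-N` or on a suitable locale being installed, one locale-independent workaround is to rewrite the separator yourself before converting. This is a different technique from the options above, not something they do for you. A minimal sketch, assuming the value uses `,` purely as a decimal separator and never as a thousands separator (the awk variable `s` and the shell variable `out` are just for illustration):

```shell
# Replace the first comma with a period, then convert with +0 as usual.
# LC_ALL=C makes sure "." is the decimal separator awk expects.
out=$(LC_ALL=C awk 'BEGIN{
    s = "4,321"
    sub(/,/, ".", s)              # s is now "4.321"
    if ((s+0) > 4.1) print "bigger"; else print "not"
}')
echo "$out"    # prints bigger
```

This works in any POSIX awk, but it would misread input such as `1,234,567`, so it only fits data whose format you control.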
>
> And I see there's an additional GNU Awk option '--use-lc-numeric'.
That's just the long form of `-N`, identical in meaning.
> What a mess!
>
> (I suppose current status can only be explained by the mentioned
> forth-and-back during history of various GNU Awk versions.)
Right.
>
> What's worth the LC_* variables if they are ignored (or maybe not).
They have their uses, but applying them to everything by default
apparently isn't how people most frequently want to use awk, so you need
options to tell gawk when to use them in specific situations.
Ed.
>
> Janis
>
>>
>> Regards,
>>
>> Ed
>