Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Keith Thompson <Keith.S.Thompson+u@gmail.com>
Newsgroups: comp.lang.c
Subject: Re: Hex string literals (was Re: C23 thoughts and opinions)
Date: Mon, 17 Jun 2024 18:57:09 -0700
Organization: None to speak of
Lines: 88
Message-ID: <878qz31096.fsf@nosuchdomain.example.com>
References: <v2l828$18v7f$1@dont-email.me>
	<00297443-2fee-48d4-81a0-9ff6ae6481e4@gmail.com>
	<v2lji1$1bbcp$1@dont-email.me>
	<87msoh5uh6.fsf@nosuchdomain.example.com>
	<f08d2c9f-5c2e-495d-b0bd-3f71bd301432@gmail.com>
	<v2nbp4$1o9h6$1@dont-email.me> <v2ng4n$1p3o2$1@dont-email.me>
	<87y18047jk.fsf@nosuchdomain.example.com>
	<87msoe1xxo.fsf@nosuchdomain.example.com>
	<v2sh19$2rle2$2@dont-email.me>
	<87ikz11osy.fsf@nosuchdomain.example.com>
	<v2v59g$3cr0f$1@dont-email.me>
	<87plt8yxgn.fsf@nosuchdomain.example.com> <v31rj5$o20$1@dont-email.me>
	<87cyp6zsen.fsf@nosuchdomain.example.com>
	<v34gi3$j385$1@dont-email.me>
	<874jahznzt.fsf@nosuchdomain.example.com>
	<v36nf9$12bei$1@dont-email.me>
	<87v82b43h6.fsf@nosuchdomain.example.com>
	<87iky830v7.fsf_-_@nosuchdomain.example.com>
	<wwva5jj4zsw.fsf@LkoBDZeT.terraraq.uk>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 18 Jun 2024 03:57:15 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="aed299878570cb32e21d076f9aa05b90";
	logging-data="1211818"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1/m6ojX3SH/AXGAm7CTox2H"
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/27.2 (gnu/linux)
Cancel-Lock: sha1:z/twOvHQaQNhT80wRDiROi7dpNU=
	sha1:KP6/g+9AkgZ7wkx2PUPImFiWkYM=
Bytes: 5706

Richard Kettlewell <invalid@invalid.invalid> writes:
> Keith Thompson <Keith.S.Thompson+u@gmail.com> writes:
>> Inspired by the existing syntax for integer and floating-point
>> hex constants, I propose using a "0x" prefix.  0x"deadbeef" is an
>> expression of type `const unsigned char[4]` (assuming CHAR_BIT==8),
>> with values 0xde, 0xad, 0xbe, 0xef in that order.  Byte order is
>> irrelevant; we're specifying byte values in order, not bytes of
>> the representation of some larger type.  memcpy()ing 0x"deadbeef"
>> to a uint32 might yield either 0xdeadbeef or 0xefbeadde (or other
>> more exotic possibilities).
>
> I like the syntax and I’d find it useful.
>
> There’s more to life than byte arrays, though, so I wonder if there’s
> more to be said here. I find myself dealing a lot with large integers
> generally represented as arrays of some unsigned type (commonly uint32_t
> but other possibilities arise too).
>
> In C as it stands today this requires a translation step before
> constants can be embedded in source code (which is error-prone if
> someone attempts to do it manually).
>
> So being able to say ‘0x8732456872648956348596893765836543 as array of
> uint64_t, LSW first’ (in some suitably C-like syntax) would be a big
> improvement from my perspective, primarily as an accelerator to
> development but also as a small improvement in robustness.

You could use some kind of type punning.  For example, this is currently
legal:

    union {
        unsigned char buf[4];
        uint32_t n;
    } obj = {
        .buf = { 0x01, 0x02, 0x03, 0x04 }
    };

The { 0x01, 0x02, 0x03, 0x04 } could be replaced with 0x"01020304".

Of course you have to deal with endianness.
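
To make the endianness point concrete, here is a minimal sketch, assuming CHAR_BIT==8.  The names `punned`, `read_punned`, and `load_be32` are made up for illustration.  Reading `n` through the union gives a result that depends on the implementation's byte order, while combining the bytes with shifts does not.

```c
#include <assert.h>
#include <limits.h>
#include <stdint.h>

/* Assumption: 8-bit bytes, so that 4 unsigned chars cover a uint32_t. */
static_assert(CHAR_BIT == 8, "this sketch assumes 8-bit bytes");

/* Same shape as the union above; the tag name is illustrative only. */
union punned {
    unsigned char buf[4];
    uint32_t n;
};

/* Initialize via buf, read back via n.  The result depends on the byte
   order of the implementation: 0x01020304 on a big-endian machine,
   0x04030201 on a little-endian one. */
uint32_t read_punned(void)
{
    union punned obj = { .buf = { 0x01, 0x02, 0x03, 0x04 } };
    return obj.n;
}

/* Byte-order-independent alternative: combine the bytes explicitly,
   most significant byte first. */
uint32_t load_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}
```

With the proposed syntax, the initializer in `read_punned` could be written as 0x"01020304"; the byte-order question would be unchanged.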

Since C defines representation in terms of arrays of unsigned char, I'm
inclined to stick to just that.  If there's a *clean* way to extend it
to wider types, I'm ok with that (and I'm not the one who needs to be
convinced).

>> Again, unlike other string literals, there is no implicit terminating
>> null byte.  And I suggest making them const, since there's no
>> existing code to break.
>>
>> If CHAR_BIT==8, each byte is represented by two hex digits.  More
>> generally, each byte is represented by (CHAR_BIT+3)/4 hex digits in
>> the absence of whitespace.  Added whitespace marks the end of a byte;
>> 0x"deadbeef" is 1, 2, 3, or 4 bytes if CHAR_BIT is 32, 16, 12, or 8
>> respectively, but 0x"de ad be ef" is 4 bytes regardless of CHAR_BIT.
>> 0x"" is a syntax error, since C doesn't support zero-length arrays.
>> Anything between the quotes other than hex digits and spaces is a
>> syntax error.
>
> Would "0x1 23 45 67" be a syntax error or { 0x1, 0x23, 0x45, 0x67 }?

The latter.

As you acknowledge in a followup, the 0x goes outside the quotation
marks.

The end of a byte is indicated either by having the right number of
hex digits (2 if CHAR_BIT==8, more otherwise) or by a space character.
0x"1 23 45 67" would be equivalent to 0x"01 23 45 67", or to 0x"01234567"
if CHAR_BIT==8.
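
The splitting rule can be sketched as an ordinary run-time function operating on the text between the quotes.  This is only an illustration of the rule as described here, not a proposed implementation; `hex_literal_bytes` and `DIGITS_PER_BYTE` are hypothetical names, and the sketch does not address values that would not fit in a byte when CHAR_BIT is not a multiple of 4.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stddef.h>

/* Hex digits that make up one full byte: 2 when CHAR_BIT == 8. */
enum { DIGITS_PER_BYTE = (CHAR_BIT + 3) / 4 };

/* Split the text between the quotes into byte values.
   A byte ends after DIGITS_PER_BYTE digits or at a space.
   Returns the number of bytes stored in out, or -1 if the text is
   empty, contains anything other than hex digits and spaces, or
   needs more than cap bytes. */
long hex_literal_bytes(const char *s, unsigned char *out, size_t cap)
{
    size_t count = 0;
    unsigned value = 0;
    int digits = 0;

    assert(s != NULL && out != NULL);

    for (;; s++) {
        int c = (unsigned char)*s;

        if (isxdigit(c)) {
            int d = isdigit(c) ? c - '0' : tolower(c) - 'a' + 10;
            value = value * 16 + (unsigned)d;
            digits++;
            if (digits < DIGITS_PER_BYTE)
                continue;           /* byte not complete yet */
        } else if (c != ' ' && c != '\0') {
            return -1;              /* only hex digits and spaces allowed */
        }

        /* A full digit count, a space, or the end of the text closes
           the current byte, if one has been started. */
        if (digits > 0) {
            if (count == cap)
                return -1;
            out[count++] = (unsigned char)value;
            value = 0;
            digits = 0;
        }
        if (c == '\0')
            break;
    }
    return count > 0 ? (long)count : -1;   /* 0x"" is an error */
}
```

With CHAR_BIT==8, passing "1 23 45 67" stores 0x01, 0x23, 0x45, 0x67 and returns 4, and "deadbeef" also yields 4 bytes.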

>> What I'm trying to design here is a more straightforward way to
>> represent raw (unsigned char[]) data in C code, largely but not
>> exclusively for use by #embed.
>
> Compilers can already implement #embed however they like, there’s no
> need for a standardized way to represent the ‘inside’ of a #embed.

The (draft) standard already specifies what #embed expands to,
a comma-delimited sequence of integer constant expressions.
Compilers must implement it in a way that yields the same behavior,
including in contrived cases like the struct example I posted before.
A compiler might implement it more efficiently if it knows that the
target object is an array of unsigned char.
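
A small sketch of why the expansion matters.  The macro `EMBED_EXPANSION` is a hand-written stand-in for what #embed would produce for a hypothetical three-byte file containing 01 02 03, and `struct mixed` is an illustration of a contrived case, not necessarily the example from my earlier post.

```c
#include <assert.h>

/* Stand-in for the expansion of #embed applied to a hypothetical
   three-byte file: a comma-delimited list of integer constants. */
#define EMBED_EXPANSION 0x01, 0x02, 0x03

/* The usual case: each element of the list initializes one byte. */
const unsigned char raw[] = { EMBED_EXPANSION };

/* A contrived case: the same list initializes members of different
   types, so each value is converted rather than copied as raw bytes. */
struct mixed {
    double d;
    int i;
    unsigned char c;
};
const struct mixed converted = { EMBED_EXPANSION };

static_assert(sizeof raw == 3, "three initializers give three bytes");
```

Here `converted.d` is 1.0, `converted.i` is 2, and `converted.c` is 3, which is what a compiler has to preserve however it implements #embed internally.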

-- 
Keith Thompson (The_Other_Keith) Keith.S.Thompson+u@gmail.com
void Void(void) { Void(); } /* The recursive call of the void */