From: David Brown <david.brown@hesbynett.no>
Newsgroups: comp.lang.c
Subject: Re: A Famous Security Bug
Date: Fri, 22 Mar 2024 13:51:53 +0100
Message-ID: <utjuta$2u6ah$1@dont-email.me>
References: <bug-20240320191736@ram.dialup.fu-berlin.de>
 <20240320114218.151@kylheku.com> <uthirj$29aoc$1@dont-email.me>
 <20240321092738.111@kylheku.com>
In-Reply-To: <20240321092738.111@kylheku.com>

On 21/03/2024 18:41, Kaz Kylheku wrote:
> On 2024-03-21, David Brown <david.brown@hesbynett.no> wrote:
>> On 20/03/2024 19:54, Kaz Kylheku wrote:
>>> On 2024-03-20, Stefan Ram <ram@zedat.fu-berlin.de> wrote:
>>>>     A "famous security bug":
>>>>
>>>> void f( void )
>>>> { char buffer[ MAX ];
>>>>     /* . . . */
>>>>     memset( buffer, 0, sizeof( buffer )); }
>>>>
>>>>     . Can you see what the bug is?
>>>
>>> I don't know about "the bug", but conditions can be identified under
>>> which that would have a problem executing, like MAX being in excess
>>> of available automatic storage.
>>>
>>> If the /*...*/ comment represents the elision of some security sensitive
>>> code, where the memset is intended to obliterate secret information,
>>> of course, that obliteration is not required to work.
>>>
>>> After the memset, the buffer has no next use, so all the assignments
>>> performed by memset to the bytes of buffer are dead assignments that can
>>> be elided.
>>>
>>> To securely clear memory, you have to use a function for that purpose
>>> that is not susceptible to optimization.
>>>
>>> If you're not doing anything stupid, like link time optimization, an
>>> external function in another translation unit (a function that the
>>> compiler doesn't recognize as being an alias or wrapper for memset)
>>> ought to suffice.
>>
>> Using LTO is not "stupid".  Relying on people /not/ using LTO, or not
>> using other valid optimisations, is "stupid".
> 
> LTO is a nonconforming optimization. 

Really?  That is news to me, and I suspect to the folks at gcc and 
clang/llvm that developed LTO for these compilers.  (I have worked with 
embedded compilers that have had LTO-type optimisations for decades, but 
these are not often concerned with the minutiae of the standards.)

> It destroys the concept that
> when a translation unit is translated, the semantic analysis is
> complete, such that the only remaining activity is resolution of
> external references (linkage), and that the semantic analysis of one
> translation unit does not use information about another translation
> unit.

Where is it described in the C standards that semantic information from 
one translation unit cannot be used (for optimisation, for static error 
checking, for other analysis or any other purposes) in another 
translation unit?

What makes you think that LTO, as implemented in compilers like gcc and
clang/llvm, does not generate code according to the "as if" rule?  (That
is, the compiler can generate more efficient code, as long as it
produces the same observable behaviour "as if" it were a strict, dumb
translator of the C abstract machine.)
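
As a minimal sketch of what the "as if" rule permits, consider a filled-in version of the function from the start of the thread.  The name checksum_then_clear, the value of MAX and the checksum loop are hypothetical stand-ins for the elided code, not anything from the original post.

```c
#include <string.h>

enum { MAX = 64 };  /* assumed size; the original post leaves MAX undefined */

/* Hypothetical stand-in for the function quoted at the top of the thread. */
int checksum_then_clear(const char *secret)
{
    char buffer[MAX];
    int sum = 0;

    /* Copy the secret into automatic storage and use it. */
    strncpy(buffer, secret, sizeof buffer - 1);
    buffer[sizeof buffer - 1] = '\0';
    for (size_t i = 0; buffer[i] != '\0'; i++)
        sum += (unsigned char)buffer[i];

    /* buffer is never read after this point, so these stores have no
       effect on the observable behaviour.  A compiler may therefore
       remove the call entirely under the "as if" rule. */
    memset(buffer, 0, sizeof buffer);
    return sum;
}
```

The return value is the same whether or not the memset is executed, which is exactly why the compiler is free to drop it.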

I believe there is very little where the behaviour of a C program
differs depending on whether the code is in one translation unit or
spread over several.  There are utilities that merge multiple C files
into a single C file (for easier deployment, or for better
optimisation).  They have to rename static objects and functions so
that their file-local names do not clash, and remove duplicate type
definitions, but as long as the programmer follows certain reasonable
rules, it all goes fine.  (You could, I suppose, hit complications if
you relied on compatibility of struct or union types whose identifiers
differ, and which are compatible across TUs but not within a single TU,
according to the 6.2.7p1 rules.  But that would be unlikely, and I
expect LTO compilers to handle those cases.)

> 
> This has not yet changed in last April's N3096 draft, where
> translation phases 7 and 8 are:
> 
>    7. White-space characters separating tokens are no longer significant.
>       Each preprocessing token is converted into a token. The resulting
>       tokens are syntactically and semantically analyzed and translated
>       as a translation unit.
> 
>    8. All external object and function references are resolved. Library
>       components are linked to satisfy external references to functions
>       and objects not defined in the current translation. All such
>       translator output is collected into a program image which contains
>       information needed for execution in its execution environment.
> 
> and before that, the Program Structure section says:
> 
>    The separate translation units of a program communicate by (for
>    example) calls to functions whose identifiers have external linkage,
>    manipulation of objects whose identifiers have external linkage, or
>    manipulation of data files. Translation units may be separately
>    translated and then later linked to produce an executable program.
> 

All of that is irrelevant.  It says nothing against sharing other 
information.

> LTO deviates from the model that translation units are separate,
> and the conceptual steps of phases 7 and 8.

No, it does not.  These paragraphs are requirements, not limitations.

> 
> The translation unit that is prepared for LTO is not fully cooked.  You
> have no idea what its code will turn into when the interrupted
> compilation is resumed during linkage, under the influence of other
> translation units it is combined with.

You have as much and as little idea of what the generated code will be 
as you always do during compilation.  Compilers can do all kinds of 
manipulations of the source code you write - as long as the observable 
behaviour of the program is the same as a dumb translation.  They can, 
and do, use all kinds of inter-procedural optimisations for inlining 
code, outlining it, breaking functions into pieces, cloning them, using 
constant propagation, and so on.  LTO lets them do this across 
translation units.
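
A small sketch of the kind of transformation meant here.  The functions scale and triple, and the file names in the comments, are hypothetical; they are shown in one block only so that the example is self-contained.

```c
/* Imagine this definition lives in a hypothetical util.c ... */
int scale(int value, int factor)
{
    return value * factor;
}

/* ... and this caller lives in a hypothetical main.c.
   Within a single translation unit, an optimising compiler may inline
   scale() here and propagate the constant 3, so that no call remains.
   With LTO it may do the same when the two functions are in separate
   files.  Either way the result seen by the caller is unchanged. */
int triple(int value)
{
    return scale(value, 3);
}
```

Whether the call survives in the generated code is not something the program can observe, so it makes no difference to conformance.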

> 
> So in fact, the language allows us to take it for granted that, given
> 
>    my_memset(array, 0, sizeof(array)); }
> 
> at the end of a function, and my_memset is an external definition
> provided by another translation unit, the call may not be elided.
> 

No, the C language standards make no such guarantee.
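
One commonly used alternative that does not depend on how the program is split into translation units is to write the zeros through a volatile-qualified pointer.  The sketch below is an assumption about what such a helper might look like; secure_zero is a hypothetical name, not a standard function.  Compilers in practice do not elide such accesses, though how firmly the standard's wording guarantees this for an object that was not itself defined volatile has been debated.

```c
#include <stddef.h>

/* Hypothetical helper: overwrite n bytes at p with zeros using
   volatile accesses, which compilers treat as observable and so
   do not remove as dead stores. */
void secure_zero(void *p, size_t n)
{
    volatile unsigned char *bytes = p;

    while (n--)
        *bytes++ = 0;
}
```

Where available, a library function intended for this job is preferable: C23 adds memset_explicit, C11's optional Annex K has memset_s, and glibc and the BSDs provide explicit_bzero.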

> The one who may be acting recklessly is he who turns on nonconforming
> optimizations that are not documented as supported by the code base.
> 
> Another example would be something like gcc's -ffast-math.

That is /completely/ different.  That option is clearly documented as 
potentially violating some of the rules of the ISO C standards.  This is 
why it is not enabled by default or by any common optimisation levels 
(except "-Ofast", which is also documented as potentially violating 
standards).

> You wouldn't unleash that on numerical code written by experts,
> and expect the same correct results.
> 

I would not expect identical results to floating point calculations, no.

Depending on the code in question, I would still expect correct results.
I use "-ffast-math" in all my code in order to get correct results a
good deal faster (for my targets, and my type of code) than I would get
without it.
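
A minimal illustration of why the results need not be identical: floating point addition is not associative, and "-ffast-math" allows the compiler to regroup operations as if it were.  The function names below are hypothetical, and the values in the comments assume IEEE 754 doubles compiled without "-ffast-math".

```c
/* Grouping written as (a + b) + c. */
double sum_left(double a, double b, double c)
{
    return (a + b) + c;
}

/* Grouping written as a + (b + c). */
double sum_right(double a, double b, double c)
{
    return a + (b + c);
}

/* With a = 1e20, b = -1e20, c = 1.0 and strict IEEE semantics:
     sum_left  gives 1.0  (the large values cancel first)
     sum_right gives 0.0  (1.0 is lost when added to -1e20)
   Under -ffast-math the compiler may reassociate, so either function
   may be evaluated with the other grouping. */
```

Which of the two answers counts as "correct" depends on the code in question; for well-conditioned calculations the difference is typically far below the precision that matters.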