From: Michael S
Newsgroups: comp.lang.c
Subject: Re: Buffer contents well-defined after fgets() reaches EOF ?
Date: Sat, 15 Feb 2025 20:29:15 +0200

On Fri, 14 Feb 2025 17:22:59 -0000 (UTC)
Kaz Kylheku <643-408-1753@kylheku.com> wrote:

> On 2025-02-14, Michael S wrote:
> > For starter, it looks like designers of fgets() did not believe in
> > their own motto about files being just streams of bytes.
>
> They obviously did, which is exactly why they painstakingly preserved
> the annoying line terminators in the returned data.
>
> > I don't know the history, so, may be, the function was defined this
> > way for portability with systems where text files have special
> > record-based structure?
>
> You are sliding into muddled thinking here.
>
> > Then, everything about it feels inelegant.
> > A return value carries just 1 bit of information, success or
> > failure.
>
> Why would you assert a claim for which the standard library alone
> is replete with counterexamples: getchar, malloc, getenv, pow, sin.
>
> Did you mean /the/ return value (of fgets)?
>
> > So why did they encode this information in baroque way instead of
> > something obvious, 0 and 1?
>
> Because you can express this concept:
>
>   char work_area[SIZE];
>   char *line;
>
>   while ((line = fgets(work_area, sizeof work_area, stream)))
>   {
>     /* process line */
>   }
>
> The work_area just provides storage for the operation: line is the
> returned line.
>
> The loop would work even if fgets sometimes returned pointers that
> are not to the first byte of work_area. It just so happens that
> they always are.
>
> It is meaningful to capture the returned value and work with
> it as if it were distinct from the buffer.
>
> > Appending zero at the end also feels like a hack, but it is
> > necessary because of the main problem.
>
> Appending zero is necessary so that the result meets the definition
> of a C character string, without which it cannot be passed into
> string-manipulating functions like strlen.
>
> Home-grown functions that resemble fgets, but forget to add a null
> byte sometimes, are the subjects of security CVEs.
>
> > And the main problem is: how the user is
> > supposed to figure out how many bytes were read?
>
> Yes, how are they, if you take away the null byte?
>
> > In well-designed API this question should be answered in O(1) time.
> >
>
> In the context of C strings, that buys you almost nothing.
> Even if you know the length, it's going to get measured numerous
> more times.
>
> It would be good if fgets nuked the terminating newline.
>
> Many uses of fgets, after every operation, look for the newline
> and nuke it, before doing anything else.
>
> There is a nice idiom for that, by the way, which avoids a
> temporary variable and if test:
>
>   line[strcspn(line, "\n")] = 0;
>
> strcspn(line, "\n") calculates the length of the prefix of line
> which consists of non-newlines. That value is precisely the
> array index of the first newline, if there is one, or else
> of the terminating null, if there isn't a newline.
> Either way, you can clobber that with a null.
>
> Once you see the above, you will never do this again:
>
>   newline = strchr(line, '\n');
>   if (newline)
>     *newline = 0;
>
> > With fgets(), it can be answered in O(N) time when input is trusted
> > to contain no zeros.
>
> We have decided in the C world that text does not contain zeros.
>

Yes, for internal data. External inputs have to be sanitized.

> This has become so pervasive that the remaining naysayers can safely
> be regarded as part of a lunatic fringe.
>
> Software that tries to support the presence of raw nulls in text is
> actively harmful for security.
>
> For instance, a piece of text with embedded nulls might have valid
> overall syntax which makes it immune to an injection attack.
>
> But when it is sent to another piece of software which interprets
> the null as a terminator, the syntax is chopped in half, allowing
> it to be completed by a malicious actor.
>

I don't quite understand. In particular, I don't understand whether you
are arguing in favor of fgets() or against it.

> > When input is arbitrary, finding out the answer is
> > even harder and requires quirks.
>
> When input is arbitrary, don't use fgets? It's for text.
>
> > The function foo() is more generic than fgets(). For use instead of
> > fgets() it should be accompanied by standard constant EOL_CHAR.
> >
> > I am not completely satisfied with proposed solution. The API is
> > still less obvious than it could be. But it is much better than
> > fgets().
>
> If last_c is '\n', you're still writing the pesky newline that
> the caller will often want to remove.
>
> Adding a terminating null and returning a pointer to that null
> would be better.
>

If the caller wants the newline removed, it can easily do that by
itself. OTOH, if we follow your proposal, we lose information about the
presence or absence of an EOL at the end of the file. I think that for a
generic function it's better not to lose any information, even
information that is not useful for 99.99% of the callers.
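
For illustration only, here is a minimal sketch of the caller-side
approach: strip the newline after the read, but keep the fact that it
was there. The helper name chomp() is hypothetical (it is not a standard
function), and the sketch assumes the buffer was filled by fgets(), so
a newline can only be its last character.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper, not part of the standard library.
 * Removes a trailing '\n' from a string filled in by fgets() and
 * reports whether one was present, so that the "last line had no EOL"
 * information is not lost. */
bool chomp(char *line)
{
    assert(line != NULL);

    /* Index of the first '\n', or of the terminating null if none. */
    size_t len = strcspn(line, "\n");
    bool had_eol = (line[len] == '\n');

    line[len] = '\0';   /* harmless when there was no newline */
    return had_eol;
}
```

A caller that does not care about the final EOL can simply ignore the
return value.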
> You could then call the operation again with the returned dst
> pointer, and it would continue extending the string,
> without obliterating the last character.
>
> I'm sure I've seen a foo-like function in software before:
> reading delimited by an arbitrary byte, with length signaling.
>

I certainly do not claim that I invented anything new here. Nor do I
claim that it's the best possible. Moreover, I'd like it to be even more
mundane. I just can't figure out how to do it without adding one more
[pointer] parameter.
One obvious possibility is to return the number of characters read
instead of a pointer. Then 0 can mean EOF and negative values can mean
I/O errors. But that is also not sufficiently boring.
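
A minimal sketch of that count-returning variant, under stated
assumptions: the name read_until() and its parameter order are
hypothetical, the function does not append a terminating null, and it
stops after storing the delimiter or when the buffer is full. It is not
the foo() discussed earlier in the thread, only one way the idea could
look.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical sketch; name and signature are illustrative only.
 * Reads bytes from fp into dst until delim has been stored, size bytes
 * have been stored, or end of file is reached.
 * Returns the number of bytes stored, 0 at end of file, or -1 on an
 * I/O error (bytes stored before the error are not reported).
 * No terminating null is written. */
long read_until(char *dst, size_t size, int delim, FILE *fp)
{
    assert(dst != NULL && fp != NULL);
    /* size must be non-zero (0 would be indistinguishable from EOF)
     * and must fit in the return type. */
    assert(size > 0 && size <= (size_t)LONG_MAX);

    size_t n = 0;
    while (n < size) {
        int c = getc(fp);
        if (c == EOF) {
            if (ferror(fp))
                return -1;      /* I/O error */
            break;              /* end of file */
        }
        dst[n++] = (char)c;
        if (c == delim)
            break;              /* delimiter is kept in the buffer */
    }
    return (long)n;
}
```

With this shape the caller learns the length in O(1), and can tell
whether the last record ended with the delimiter by checking
dst[n - 1].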