From: Ben Collver <bencollver@tilde.pink>
Newsgroups: comp.misc
Subject: strlcpy and how CPUs can defy common sense
Date: Fri, 26 Jul 2024 15:36:17 -0000 (UTC)
Message-ID: <slrnva7gg6.39g.bencollver@svadhyaya.localdomain>
User-Agent: slrn/1.0.3 (Linux)

strlcpy and how CPUs can defy common sense
==========================================
24 Jul 2024

Recently one of my older posts about strlcpy sparked some discussion
on various forums.  Presumably the recently released POSIX edition had
something to do with it.  One particular counter-argument was raised
by multiple posters - and it's an argument that I've heard before as
well:

* In the common case where the source string fits into the destination
  buffer, strlcpy will traverse the string only once, whereas
  strlen + memcpy will always traverse it twice.

Hidden in this argument is the assumption that traversing the string
once is faster.  Which - to be clear - is not at all an unreasonable
assumption.  But is it actually true?  That's the focus of today's
article.

<https://nrk.neocities.org/articles/not-a-fan-of-strlcpy>

CPU vs common sense
===================

> Computers do not have common sense. Computers are surprising.
> - Tony Hoare to Lomuto

The following is from openbsd, where strlcpy originated - modified a
bit for brevity.

	size_t
	strlcpy(char *dst, const char *src, size_t dsize)
	{
		const char *osrc = src;
		size_t nleft = dsize;

		/* Copy as many bytes as will fit. */
		if (nleft != 0) {
			while (--nleft != 0) {
				if ((*dst++ = *src++) == '\0')
					break;
			}
		}

		/* Not enough room in dst, add NUL and traverse rest of src. */
		if (nleft == 0) {
			if (dsize != 0)
				*dst = '\0';	/* NUL-terminate dst */
			while (*src++)
				;
		}

		return (src - osrc - 1);	/* count does not include NUL */
	}

It starts by copying from src to dst as much as it can, and if it has
to truncate due to insufficient dst size, it then traverses the rest
of src in order to get the strlen(src) value to return.  And so if the
source string fits, it will be traversed only once.

Now if you take a look at the glibc implementation of strlcpy, you'll
immediately notice that the first line is this...

	size_t src_length = strlen (src);

...followed by the rest of the code using memcpy to do the copying.
This already shatters the illusion that strlcpy will traverse the
string once: there's no requirement for that to happen, and as you can
see in practice, one of the major libcs will always traverse the
string twice, once in strlen and once in memcpy.

But before you open a bug report against glibc for being inefficient,
here are some benchmark numbers from copying a 512 byte string
repeatedly in a loop:

	512 byte
	openbsd: 242us
	glibc:    12us

<https://gist.github.com/N-R-K/ebf096448c0a7f3fdd8b93d280747550>

Perhaps the string is so small that the double traversal doesn't
matter?  How about a string of 1MiB?

	1MiB
	openbsd: 501646us
	glibc:    31793us

The situation only gets worse for the openbsd version here, not
better.  To be fair, this huge speedup comes from the fact that glibc
punts all the work over to strlen and memcpy, which on glibc are SIMD
optimized.  But regardless, we can already see that doing something
fast, twice - is faster than doing it once but slowly.
Apples to apples
================
In order to do an apples to apples comparison I've written the
following strlcpy implementation, which is pretty close to the glibc
implementation except with the strlen and memcpy calls written out as
for loops.

	size_t
	bespoke_strlcpy(char *dst, const char *src, size_t size)
	{
		size_t len = 0;
		for (; src[len] != '\0'; ++len) {} // strlen() loop
		if (size > 0) {
			size_t to_copy = len < size ? len : size - 1;
			for (size_t i = 0; i < to_copy; ++i) // memcpy() loop
				dst[i] = src[i];
			dst[to_copy] = '\0';
		}
		return len;
	}

It's important to note that in order to do a truly apples to apples
comparison, you also need to use -fno-builtin when compiling.
Otherwise gcc will realize that the "strlen loop" can be "optimized"
down to a strlen call and emit that.  -fno-builtin prevents that from
happening and keeps the comparison fair.

So how does this version, which traverses src twice, perform against
openbsd's variant, which traverses src only once?

	512 byte
	openbsd: 237us
	bespoke: 139us

It's almost twice as fast.  How about on bigger strings?

	1MiB
	openbsd: 488469us
	bespoke: 277183us

Still roughly twice as fast.  How come?

Dependencies
============
The importance of cache misses (rightfully) gets plenty of spotlight;
dependencies, on the other hand, are not talked about as much.  Your
CPU has multiple cores, and each core has multiple ports (or logic
units) capable of executing instructions.  Which means that if you
have some instructions like this (in pseudo assembly, where an
uppercase letter denotes a register):

	A <- add B, C
	X <- add Y, Z
	E <- add A, X

the computations of A and X are independent, and thus can be executed
in parallel.  But the computation of E requires the results of A and X
and thus cannot be parallelized.  This ability to execute independent
instructions simultaneously is called instruction-level parallelism
(or ILP).  And dependencies are its kryptonite.
If you try to profile the "bespoke" strlcpy version, you'll notice
that nearly 100% of the CPU time is spent in the "strlen loop" while
the copy loop is basically free.  Indeed, if you replace the "strlen
loop" with an actual strlen call (reminder: it's SIMD optimized on
glibc) then the bespoke version starts competing with the glibc
version quite well, even though we aren't using an optimized memcpy.

In order to understand why this is happening, let's look at the
"strlen loop", written in a verbose manner below:

	len = 0;
	while (true) {
		if (src[len] == '\0')
			break; // <- this affects the next iteration
		else
			++len;
	}

In the above loop, whether or not the next iteration will execute
depends on the result of the previous iteration (whether src[len] was
nul or not).  We pay for this in our strlen loop.  But our memcpy loop
is free of such loop-carried dependencies: the current iteration
happens regardless of what happened in the last iteration.

	for (size_t i = 0; i < to_copy; ++i) // memcpy() loop
		dst[i] = src[i]; // <- does NOT depend on previous iteration

In the openbsd version, because the length and copy loops are fused
together, whether or not the next byte will be copied depends on the
byte value seen in the previous iteration.

	while (--nleft != 0) { // openbsd copy loop
		// <- the branch taken here affects the next iteration
		if ((*dst++ = *src++) == '\0')
			break;
	}

========== REMAINDER OF ARTICLE TRUNCATED ==========