Path: ...!fu-berlin.de!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: Michael S <already5chosen@yahoo.com>
Newsgroups: comp.arch
Subject: Re: Cost of handling misaligned access
Date: Sun, 9 Feb 2025 02:57:45 +0200
Organization: A noiseless patient Spider
Lines: 161
Message-ID: <20250209025745.00003df4@yahoo.com>
References: <5lNnP.1313925$2xE6.991023@fx18.iad>
<2025Feb6.115939@mips.complang.tuwien.ac.at>
<20250206152808.0000058f@yahoo.com>
<vo2iqq$30elm$1@dont-email.me>
<vo2p33$31lqn$1@dont-email.me>
<20250206211932.00001022@yahoo.com>
<vo36go$345o3$1@dont-email.me>
<20250206233200.00001fc3@yahoo.com>
<vo4lvl$3eu3c$1@dont-email.me>
<20250207124138.00006c8d@yahoo.com>
<vo551p$3hhbc$1@dont-email.me>
<20250207170423.000023b7@yahoo.com>
<2025Feb8.091104@mips.complang.tuwien.ac.at>
<20250208192119.0000148e@yahoo.com>
<2025Feb8.184632@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Injection-Date: Sun, 09 Feb 2025 01:57:46 +0100 (CET)
Injection-Info: dont-email.me; posting-host="1477f8ca78b756d97f2690a5b818aff1";
logging-data="317284"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18rvWbja8MIrlq1W97kZJvF3BpjOcAlCOs="
Cancel-Lock: sha1:QZqsn3SaUKZ6zAGsizHoYWUtGBk=
X-Newsreader: Claws Mail 4.1.1 (GTK 3.24.34; x86_64-w64-mingw32)
Bytes: 7747
On Sat, 08 Feb 2025 17:46:32 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
> Michael S <already5chosen@yahoo.com> writes:
> >On Sat, 08 Feb 2025 08:11:04 GMT
> >anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
> >Or by my own pasting mistake. I am still not sure whom to blame.
> >The mistake was tiny - the absence of // at the beginning of one
> >line, but enough to not compile. Trying it for a second time:
>
> Now it's worse, it's quoted-printable. E.g.:
>
> > if (li >=3D len || li <=3D 0)
>
> Some newsreaders can decode this, mine does not.
>
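For reference: quoted-printable encodes '=' as =3D and a trailing
space as =20, and marks a soft line break with '=' at the end of a
line. A minimal decoding sketch in C, covering just those rules and
none of the rest of the MIME machinery:

#include <stdio.h>
#include <ctype.h>

static int hexval(int c)
{
  return isdigit(c) ? c - '0' : toupper(c) - 'A' + 10;
}

int main(void)   /* filter: quoted-printable in, plain text out */
{
  int c;
  while ((c = getchar()) != EOF) {
    if (c != '=') { putchar(c); continue; }
    int h = getchar();
    if (h == '\n') continue;            /* soft line break: drop it */
    int l = getchar();
    if (h == EOF || l == EOF) break;
    putchar(hexval(h)*16 + hexval(l));  /* =3D -> '=', =20 -> ' ' */
  }
  return 0;
}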
> >> First cycles (which eliminates worries about turbo modes) and
> >> instructions, then usec/call.
> >>
> >
> >I don't understand that.
> >For the original code optimized by clang I'd expect 22,000 cycles
> >and 5.15 usec per call on Haswell. Your numbers don't even resemble
> >anything like that.
>
> My cycle numbers are for the whole program that calls keylocks()
> 100_000 times.
>
> If you divide the cycles by 100000, you get 21954 for clang
> keylocks1-256, which is what you expect.
>
> >> instructions
> >> 5_779_542_242 gcc avx2 1
> >> 3_484_942_148 gcc avx2 2 8
> >> 5_885_742_164 gcc avx2 3 8
> >> 7_903_138_230 clang avx2 1
> >> 7_743_938_183 clang avx2 2 8?
> >> 3_625_338_104 clang avx2 3 8?
> >> 4_204_442_194 gcc 512 1
> >> 2_564_142_161 gcc 512 2 32
> >> 3_061_042_178 gcc 512 3 16
> >> 7_703_938_205 clang 512 1
> >> 3_402_238_102 clang 512 2 16?
> >> 3_320_455_741 clang 512 3 16?
> >>
> >
> >I don't understand these numbers either. For original clang, I'd
> >expect 25,000 instructions per call.
>
> clang keylocks1-256 performs 79031 instructions per call (divide the
> number given by 100000 calls). If you want to see why that is, you
> need to analyse the code produced by clang, which I did only for
> select cases.
>
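(Dividing one by the other: 79031 instructions over 21954 cycles comes
out to about 3.6 IPC for clang keylocks1-256 on the Haswell.)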
> >Indeed. 2.08 on 4.4 GHz is only 5% slower than my 2.18 on 4.0 GHz.
> >Which could be due to differences in measurement methodology - I
> >reported the median of 11 runs; you seem to report the average.
>
> I just report one run with 100_000 calls, and just hope that the
> variation is small:-) In my last refereed paper I use 30 runs and
> median, but I don't go to these lengths here; the cycles seem pretty
> repeatable.
>
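For what it's worth, a minimal sketch of such a median-of-N
measurement in C (RUNS, CALLS, and the signature, which matches the
foo_tst variant below, are my stand-ins, not either of our harnesses):

#include <stdint.h>
#include <stdlib.h>
#include <time.h>

#define RUNS  11      /* runs to take the median over */
#define CALLS 100000  /* calls per run */

static int cmp_double(const void *a, const void *b)
{
  double x = *(const double *)a, y = *(const double *)b;
  return (x > y) - (x < y);
}

/* Median usec/call over RUNS runs of CALLS calls each, assuming a
   POSIX clock_gettime(). */
static double median_usec_per_call(int (*fn)(const uint32_t*, int, int),
                                   const uint32_t *a, int len, int li)
{
  static volatile int sink;  /* so the calls cannot be optimized away */
  double t[RUNS];
  for (int r = 0; r < RUNS; ++r) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int c = 0; c < CALLS; ++c)
      sink = fn(a, len, li);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec)*1e9 + (t1.tv_nsec - t0.tv_nsec);
    t[r] = ns / 1e3 / CALLS;   /* usec per call for this run */
  }
  qsort(t, RUNS, sizeof t[0], cmp_double);
  return t[RUNS/2];
}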
> >> On the Golden Cove of a Core i3-1315U (compared to the best result
> >> by Terje Mathisen on a Core i7-1365U; the latter can run up to
> >> 5.2GHz according to Intel, whereas the former can supposedly run
> >> up to 4.5GHz; I only ever measured at most 3.8GHz on our NUC, and
> >> this time as well):
> >>
> >
> >I always thought that NUCs have better cooling than all but
> >high-end laptops. Was I wrong? Such slowness is disappointing.
>
> The cooling may be better or not, that does not come into play here,
> as it never reaches higher clocks, even when it's cold; E-cores also
> stay 700MHz below their rated turbo speed, even when it's the only
> loaded core. One theory I have is that one option we set up in the
> BIOS has the effect of limiting turbo speed, but it has not been
> important enough to test.
>
> >> 5.25us Terje Mathisen's Rust code compiled by clang (best on the 1365U)
> >> 4.93us clang keylocks1-256 on a 3.8GHz 1315U
> >> 4.17us gcc keylocks1-256 on a 3.8GHz 1315U
> >> 3.16us gcc keylocks2-256 on a 3.8GHz 1315U
> >> 2.38us clang keylocks2-512 on a 3.8GHz 1315U
> >>
> >
> >So, for the best-performing variant the IPC of Golden Cove is
> >identical to ancient Haswell?
>
> Actually worse:
>
> For clang keylocks2-512 Haswell has 3.73 IPC, Golden Cove 3.63.
>
> >That's very disappointing. Haswell has a 4-wide front end and the
> >majority of AVX2 integer instructions are limited to a throughput of
> >two per clock. Golden Cove has a 5+ wide front end and nearly all
> >AVX2 integer instructions have a throughput of three per clock.
> >Could it be that clang introduced some sort of latency bottleneck?
>
> As far as I looked into the code, I did not see such a bottleneck.
> Also, Zen4 has significantly higher IPC on this variant (5.36 IPC for
> clang keylocks2-256), and I expect that it would suffer from a general
> latency bottleneck, too. Rocket Lake is also faster on this program
> than Haswell and Golden Cove. It seems to be just that this program
> rubs Golden Cove the wrong way.
>
> >> I would have expected the clang keylocks1-256 to run slower,
> >> because the compiler back-end is the same and the 1315U is slower.
> >> Measuring cycles looks more relevant for this benchmark to me
> >> than measuring time, especially on this core where AVX-512 is
> >> disabled and there is no AVX slowdown.
> >>
> >
> >I prefer time, because in the end it's the only thing that matters.
>
> True, and certainly, when stuff like AVX-512 license-based
> downclocking or thermal or power limits come into play (and are
> relevant for the measurement at hand), one has to go there. But then
> you can only compare code running on the same kind of machine,
> configured the same way. Or maybe just running on the same
> machine:-). But then, the generality of the results is questionable.
>
> - anton
Back to the original question of the cost of handling misaligned
access. I modified the original code to force alignment in the inner
loop:
#include <stdint.h>
#include <string.h>

int foo_tst(const uint32_t* keylocks, int len, int li)
{
  if (li <= 0 || len <= li)
    return 0;

  /* Round li up to a multiple of 32 and copy the first li words into
     a 32-byte-aligned buffer, zero-padding the tail. */
  int lix = (li + 31) & -32;
  _Alignas(32) uint32_t tmp[lix];
  memcpy(tmp, keylocks, li*sizeof(*keylocks));
  if (lix > li)
    memset(&tmp[li], 0, (lix-li)*sizeof(*keylocks));

  int res = 0;
  for (int i = li; i < len; ++i) {
    uint32_t lock = keylocks[i];
    for (int k = 0; k < lix; ++k)   /* inner loop now reads aligned data */
      res += (lock & tmp[k])==0;
  }
  /* The zero-padded entries match every lock; subtract that overcount. */
  return res - (lix-li)*(len-li);
}
Compiled with 'clang -O3 -march=haswell'.
On the same Haswell Xeon it runs at 2.841 usec/call, i.e. almost
twice as fast as the original and only 1.3x slower than the
horizontally unrolled variants.
So, at least on Haswell, unaligned 256-bit AVX loads are slow.
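As a cross-check of the overcount correction, here is the
straightforward unaligned variant that foo_tst should agree with
(foo_ref is a name I made up for this post):

#include <stdint.h>

int foo_ref(const uint32_t* keylocks, int len, int li)
{
  /* Count (lock, key) pairs that share no set bits, reading the
     first li words directly from keylocks, unaligned. */
  int res = 0;
  for (int i = li; i < len; ++i)
    for (int k = 0; k < li; ++k)
      res += (keylocks[i] & keylocks[k])==0;
  return res;
}

The two should return the same value for any input: every zero-padded
tmp[k] satisfies (lock & 0)==0 for each of the (len-li) locks, which
is exactly the (lix-li)*(len-li) that foo_tst subtracts at the end.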