From: Michael S
Newsgroups: comp.arch
Subject: Re: Cost of handling misaligned access
Date: Sun, 9 Feb 2025 02:57:45 +0200
Message-ID: <20250209025745.00003df4@yahoo.com>
References: <5lNnP.1313925$2xE6.991023@fx18.iad>
 <2025Feb6.115939@mips.complang.tuwien.ac.at>
 <20250206152808.0000058f@yahoo.com>
 <20250206211932.00001022@yahoo.com>
 <20250206233200.00001fc3@yahoo.com>
 <20250207124138.00006c8d@yahoo.com>
 <20250207170423.000023b7@yahoo.com>
 <2025Feb8.091104@mips.complang.tuwien.ac.at>
 <20250208192119.0000148e@yahoo.com>
 <2025Feb8.184632@mips.complang.tuwien.ac.at>
X-Newsreader: Claws Mail 4.1.1 (GTK 3.24.34; x86_64-w64-mingw32)

On Sat, 08 Feb 2025 17:46:32 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

> Michael S writes:
> >On Sat, 08 Feb 2025 08:11:04 GMT
> >anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
> >Or by my own pasting mistake. I am still not sure whom to blame.
> >The mistake was tiny - the absence of // at the beginning of one
> >line, but enough to not compile. Trying it for a second time:
>
> Now it's worse, it's quoted-printable. E.g.:
>
> > if (li >=3D len || li <=3D 0)
>
> Some newsreaders can decode this, mine does not.
>
> >> First cycles (which eliminates worries about turbo modes) and
> >> instructions, then usec/call.
> >
> >I don't understand that.
> >For the original code optimized by clang I'd expect 22,000 cycles
> >and 5.15 usec per call on Haswell. Your numbers don't even resemble
> >anything like that.
>
> My cycle numbers are for the whole program that calls keylocks()
> 100_000 times.
>
> If you divide the cycles by 100000, you get 21954 for clang
> keylocks1-256, which is what you expect.
>
> >> instructions
> >> 5_779_542_242  gcc   avx2 1
> >> 3_484_942_148  gcc   avx2 2   8
> >> 5_885_742_164  gcc   avx2 3   8
> >> 7_903_138_230  clang avx2 1
> >> 7_743_938_183  clang avx2 2   8?
> >> 3_625_338_104  clang avx2 3   8?
> >> 4_204_442_194  gcc   512  1
> >> 2_564_142_161  gcc   512  2  32
> >> 3_061_042_178  gcc   512  3  16
> >> 7_703_938_205  clang 512  1
> >> 3_402_238_102  clang 512  2  16?
> >> 3_320_455_741  clang 512  3  16?
> >
> >I don't understand these numbers either. For original clang, I'd
> >expect 25,000 instructions per call.
>
> clang keylocks1-256 performs 79031 instructions per call (divide the
> number given by 100000 calls). If you want to see why that is, you
> need to analyse the code produced by clang, which I did only for
> select cases.
>
> >Indeed. 2.08 on 4.4 GHz is only 5% slower than my 2.18 on 4.0 GHz.
> >Which could be due to differences in measurement methodology - I
> >reported the median of 11 runs, you seem to report the average.
>
> I just report one run with 100_000 calls, and just hope that the
> variation is small:-) In my last refereed paper I use 30 runs and
> median, but I don't go to these lengths here; the cycles seem pretty
> repeatable.
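For concreteness, the measurement loop we are both describing is
nothing fancier than this (a sketch, not the actual harness; keylocks()
stands for whichever variant is under test, and the cycle/instruction
counts come from running the whole program under 'perf stat' and
dividing by the same call count):

#include <stdint.h>
#include <time.h>

#define REPS 100000  /* matches the 100_000 calls discussed above */

extern int keylocks(const uint32_t *keys, int len, int li);

static double usec_now(void)
{
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return ts.tv_sec * 1e6 + ts.tv_nsec * 1e-3;
}

volatile int sink;  /* keeps the result live so the calls are not elided */

double usec_per_call(const uint32_t *keys, int len, int li)
{
  double t0 = usec_now();
  for (int r = 0; r < REPS; ++r)
    sink = keylocks(keys, len, li);
  return (usec_now() - t0) / REPS;
}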
> >> On the Golden Cove of a Core i3-1315U (compared to the best result
> >> by Terje Mathisen on a Core i7-1365U; the latter can run up to
> >> 5.2GHz according to Intel, whereas the former can supposedly run
> >> up to 4.5GHz; I only ever measured at most 3.8GHz on our NUC, and
> >> this time as well):
> >
> >I always thought that NUCs have better cooling than all but high-end
> >laptops. Was I wrong? Such slowness is disappointing.
>
> The cooling may be better or not, that does not come into play here,
> as it never reaches higher clocks, even when it's cold; E-cores also
> stay 700MHz below their rated turbo speed, even when it's the only
> loaded core. One theory I have is that one option we set up in the
> BIOS has the effect of limiting turbo speed, but it has not been
> important enough to test.
>
> >> 5.25us Terje Mathisen's Rust code compiled by clang (best on the 1365U)
> >> 4.93us clang keylocks1-256 on a 3.8GHz 1315U
> >> 4.17us gcc keylocks1-256 on a 3.8GHz 1315U
> >> 3.16us gcc keylocks2-256 on a 3.8GHz 1315U
> >> 2.38us clang keylocks2-512 on a 3.8GHz 1315U
> >
> >So, for the best-performing variant the IPC of Golden Cove is
> >identical to that of ancient Haswell?
>
> Actually worse:
>
> For clang keylocks2-512 Haswell has 3.73 IPC, Golden Cove 3.63.
>
> >That's very disappointing. Haswell has a 4-wide front end and the
> >majority of AVX2 integer instructions are limited to a throughput of
> >two per clock. Golden Cove has a 5+ wide front end and nearly all
> >AVX2 integer instructions have a throughput of three per clock.
> >Could it be that clang introduced some sort of latency bottleneck?
>
> As far as I looked into the code, I did not see such a bottleneck.
> Also, Zen4 has significantly higher IPC on this variant (5.36 IPC for
> clang keylocks2-256), and I expect that it would suffer from a general
> latency bottleneck, too. Rocket Lake is also faster on this program
> than Haswell and Golden Cove. It seems to be just that this program
> rubs Golden Cove the wrong way.
>
> >> I would have expected the clang keylocks1-256 to run slower,
> >> because the compiler back-end is the same and the 1315U is slower.
> >> Measuring cycles looks more relevant for this benchmark to me
> >> than measuring time, especially on this core where AVX-512 is
> >> disabled and there is no AVX slowdown.
> >
> >I prefer time, because in the end it's the only thing that matters.
>
> True, and certainly, when stuff like AVX-512 license-based
> downclocking or thermal or power limits come into play (and are
> relevant for the measurement at hand), one has to go there. But then
> you can only compare code running on the same kind of machine,
> configured the same way. Or maybe just running on the same
> machine:-). But then, the generality of the results is questionable.
>
> - anton

Back to the original question of the cost of misaligned access.
I modified the original code to force alignment in the inner loop:

#include <stdint.h>
#include <string.h>

int foo_tst(const uint32_t* keylocks, int len, int li)
{
  if (li <= 0 || len <= li)
    return 0;

  /* copy the first li words into a 32-byte aligned buffer,
     rounded up to a multiple of 32 words and zero-padded */
  int lix = (li + 31) & -32;
  _Alignas(32) uint32_t tmp[lix];
  memcpy(tmp, keylocks, li*sizeof(*keylocks));
  if (lix > li)
    memset(&tmp[li], 0, (lix-li)*sizeof(*keylocks));

  int res = 0;
  for (int i = li; i < len; ++i) {
    uint32_t lock = keylocks[i];
    for (int k = 0; k < lix; ++k)
      res += (lock & tmp[k])==0;
  }
  /* every zero pad word satisfies (lock & 0)==0, so subtract
     its contribution to the count */
  return res - (lix-li)*(len-li);
}

Compiled with 'clang -O3 -march=haswell'.
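I exercised it with a throwaway driver along these lines (a sketch;
the sizes and array contents are made up, not the data set behind the
numbers above, and the 64-bit sum only serves to keep the calls alive):

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

int foo_tst(const uint32_t* keylocks, int len, int li);

int main(void)
{
  enum { LI = 250, LEN = 500 };  /* made-up sizes */
  static uint32_t a[LEN];
  srand(1);
  for (int i = 0; i < LEN; ++i)
    a[i] = (uint32_t)rand();
  long long res = 0;
  for (int r = 0; r < 100000; ++r)  /* same call count as above */
    res += foo_tst(a, LEN, LI);
  printf("%lld\n", res);
  return 0;
}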
On the same Haswell Xeon it runs at 2.841 usec/call, almost twice as
fast as the original and only 1.3x slower than the horizontally
unrolled variants. So, at least on Haswell, unaligned 256-bit AVX
loads are slow.
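The effect is also easy to see in isolation. A minimal sketch (all
names are mine; it sums a buffer with 256-bit loads from a 32-byte
aligned base and then from a base shifted by 4 bytes, so that in the
shifted case half of the loads cross a 64-byte cache-line boundary):

#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <immintrin.h>

#define N    4096    /* words per pass, fits in L1D */
#define REPS 100000

static double usec_now(void)
{
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return ts.tv_sec * 1e6 + ts.tv_nsec * 1e-3;
}

int main(void)
{
  _Alignas(32) static uint32_t buf[N + 8];
  for (int i = 0; i < N + 8; ++i)
    buf[i] = i;
  for (int off = 0; off <= 1; ++off) {  /* 0: aligned, 1: shifted by 4 bytes */
    const uint32_t *p = buf + off;
    __m256i acc0 = _mm256_setzero_si256();
    __m256i acc1 = _mm256_setzero_si256();
    double t0 = usec_now();
    for (int r = 0; r < REPS; ++r) {
      /* two accumulators so load throughput, not add latency, limits the loop */
      for (int i = 0; i < N; i += 16) {
        acc0 = _mm256_add_epi32(acc0, _mm256_loadu_si256((const __m256i*)(p + i)));
        acc1 = _mm256_add_epi32(acc1, _mm256_loadu_si256((const __m256i*)(p + i + 8)));
      }
    }
    double t1 = usec_now();
    uint32_t out[8];
    _mm256_storeu_si256((__m256i*)out, _mm256_add_epi32(acc0, acc1));
    printf("offset %d bytes: %.0f usec (checksum %u)\n", off*4, t1 - t0, out[0]);
  }
  return 0;
}

Compile with the same 'clang -O3 -march=haswell'; if unaligned 256-bit
loads were free, the two reported times would be equal.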