From: Michael S
Newsgroups: comp.arch
Subject: Re: Cost of handling misaligned access
Date: Sun, 9 Feb 2025 02:57:45 +0200
Message-ID: <20250209025745.00003df4@yahoo.com>
References: <5lNnP.1313925$2xE6.991023@fx18.iad>
 <2025Feb6.115939@mips.complang.tuwien.ac.at>
 <20250206152808.0000058f@yahoo.com>
 <20250206211932.00001022@yahoo.com>
 <20250206233200.00001fc3@yahoo.com>
 <20250207124138.00006c8d@yahoo.com>
 <20250207170423.000023b7@yahoo.com>
 <2025Feb8.091104@mips.complang.tuwien.ac.at>
 <20250208192119.0000148e@yahoo.com>
 <2025Feb8.184632@mips.complang.tuwien.ac.at>
X-Newsreader: Claws Mail 4.1.1 (GTK 3.24.34; x86_64-w64-mingw32)

On Sat, 08 Feb 2025 17:46:32 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:

> Michael S writes:
> >On Sat, 08 Feb 2025 08:11:04 GMT
> >anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
> >Or by my own pasting mistake. I am still not sure whom to blame.
> >The mistake was tiny - the absence of // at the beginning of one
> >line, but enough to not compile. Trying it for a second time:
>
> Now it's worse, it's quoted-printable. E.g.:
>
> > if (li >=3D len || li <=3D 0)
>
> Some newsreaders can decode this, mine does not.
>
> >> First cycles (which eliminates worries about turbo modes) and
> >> instructions, then usec/call.
> >
> >I don't understand that.
> >For the original code optimized by clang I'd expect 22,000 cycles
> >and 5.15 usec per call on Haswell. Your numbers don't even resemble
> >anything like that.
>
> My cycle numbers are for the whole program that calls keylocks()
> 100_000 times.
>
> If you divide the cycles by 100000, you get 21954 for clang
> keylocks1-256, which is what you expect.
>
> >> instructions
> >> 5_779_542_242  gcc   avx2 1
> >> 3_484_942_148  gcc   avx2 2   8
> >> 5_885_742_164  gcc   avx2 3   8
> >> 7_903_138_230  clang avx2 1
> >> 7_743_938_183  clang avx2 2   8?
> >> 3_625_338_104  clang avx2 3   8?
> >> 4_204_442_194  gcc   512  1
> >> 2_564_142_161  gcc   512  2  32
> >> 3_061_042_178  gcc   512  3  16
> >> 7_703_938_205  clang 512  1
> >> 3_402_238_102  clang 512  2  16?
> >> 3_320_455_741  clang 512  3  16?
> >
> >I don't understand these numbers either. For original clang, I'd
> >expect 25,000 instructions per call.
>
> clang keylocks1-256 performs 79031 instructions per call (divide the
> number given by 100000 calls). If you want to see why that is, you
> need to analyse the code produced by clang, which I did only for
> select cases.
>
> >Indeed. 2.08 on 4.4 GHz is only 5% slower than my 2.18 on 4.0 GHz.
> >Which could be due to differences in measurement methodology - I
> >reported the median of 11 runs, you seem to report the average.
>
> I just report one run with 100_000 calls, and just hope that the
> variation is small:-) In my last refereed paper I use 30 runs and
> median, but I don't go to these lengths here; the cycles seem pretty
> repeatable.
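For concreteness, the measurement loop we are both describing is
nothing fancier than this (a sketch, not the actual harness; keylocks()
stands for whichever variant is under test, and the cycle/instruction
counts come from running the whole program under 'perf stat' and
dividing by the same call count):

#include <stdint.h>
#include <time.h>

#define REPS 100000  /* matches the 100_000 calls discussed above */

extern int keylocks(const uint32_t *keys, int len, int li);

static double usec_now(void)
{
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return ts.tv_sec * 1e6 + ts.tv_nsec * 1e-3;
}

volatile int sink;  /* keeps the result live so the calls are not elided */

double usec_per_call(const uint32_t *keys, int len, int li)
{
  double t0 = usec_now();
  for (int r = 0; r < REPS; ++r)
    sink = keylocks(keys, len, li);
  return (usec_now() - t0) / REPS;
}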
> >> On the Golden Cove of a Core i3-1315U (compared to the best result
> >> by Terje Mathisen on a Core i7-1365U; the latter can run up to
> >> 5.2GHz according to Intel, whereas the former can supposedly run
> >> up to 4.5GHz; I only ever measured at most 3.8GHz on our NUC, and
> >> this time as well):
> >
> >I always thought that NUCs have better cooling than all but high-end
> >laptops. Was I wrong? Such slowness is disappointing.
>
> The cooling may be better or not, that does not come into play here,
> as it never reaches higher clocks, even when it's cold; E-cores also
> stay 700MHz below their rated turbo speed, even when it's the only
> loaded core. One theory I have is that one option we set up in the
> BIOS has the effect of limiting turbo speed, but it has not been
> important enough to test.
>
> >> 5.25us Terje Mathisen's Rust code compiled by clang (best on the 1365U)
> >> 4.93us clang keylocks1-256 on a 3.8GHz 1315U
> >> 4.17us gcc keylocks1-256 on a 3.8GHz 1315U
> >> 3.16us gcc keylocks2-256 on a 3.8GHz 1315U
> >> 2.38us clang keylocks2-512 on a 3.8GHz 1315U
> >
> >So, for the best-performing variant the IPC of Golden Cove is
> >identical to that of ancient Haswell?
>
> Actually worse:
>
> For clang keylocks2-512 Haswell has 3.73 IPC, Golden Cove 3.63.
>
> >That's very disappointing. Haswell has a 4-wide front end and the
> >majority of AVX2 integer instructions are limited to a throughput of
> >two per clock. Golden Cove has a 5+ wide front end and nearly all
> >AVX2 integer instructions have a throughput of three per clock.
> >Could it be that clang introduced some sort of latency bottleneck?
>
> As far as I looked into the code, I did not see such a bottleneck.
> Also, Zen4 has significantly higher IPC on this variant (5.36 IPC for
> clang keylocks2-256), and I expect that it would suffer from a general
> latency bottleneck, too. Rocket Lake is also faster on this program
> than Haswell and Golden Cove. It seems to be just that this program
> rubs Golden Cove the wrong way.
>
> >> I would have expected the clang keylocks1-256 to run slower,
> >> because the compiler back-end is the same and the 1315U is slower.
> >> Measuring cycles looks more relevant for this benchmark to me
> >> than measuring time, especially on this core where AVX-512 is
> >> disabled and there is no AVX slowdown.
> >
> >I prefer time, because in the end it's the only thing that matters.
>
> True, and certainly, when stuff like AVX-512 license-based
> downclocking or thermal or power limits come into play (and are
> relevant for the measurement at hand), one has to go there. But then
> you can only compare code running on the same kind of machine,
> configured the same way. Or maybe just running on the same
> machine:-). But then, the generality of the results is questionable.
>
> - anton

Back to the original question of the cost of misaligned access.
I modified the original code to force alignment in the inner loop:

#include <stdint.h>
#include <string.h>

int foo_tst(const uint32_t* keylocks, int len, int li)
{
  if (li <= 0 || len <= li)
    return 0;

  /* copy the first li words into a 32-byte aligned buffer,
     rounded up to a multiple of 32 words and zero-padded */
  int lix = (li + 31) & -32;
  _Alignas(32) uint32_t tmp[lix];
  memcpy(tmp, keylocks, li*sizeof(*keylocks));
  if (lix > li)
    memset(&tmp[li], 0, (lix-li)*sizeof(*keylocks));

  int res = 0;
  for (int i = li; i < len; ++i) {
    uint32_t lock = keylocks[i];
    for (int k = 0; k < lix; ++k)
      res += (lock & tmp[k])==0;
  }
  /* every zero pad word satisfies (lock & 0)==0, so subtract
     its contribution to the count */
  return res - (lix-li)*(len-li);
}

Compiled with 'clang -O3 -march=haswell'.
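I exercised it with a throwaway driver along these lines (a sketch;
the sizes and array contents are made up, not the data set behind the
numbers above, and the 64-bit sum only serves to keep the calls alive):

#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

int foo_tst(const uint32_t* keylocks, int len, int li);

int main(void)
{
  enum { LI = 250, LEN = 500 };  /* made-up sizes */
  static uint32_t a[LEN];
  srand(1);
  for (int i = 0; i < LEN; ++i)
    a[i] = (uint32_t)rand();
  long long res = 0;
  for (int r = 0; r < 100000; ++r)  /* same call count as above */
    res += foo_tst(a, LEN, LI);
  printf("%lld\n", res);
  return 0;
}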
On the same Haswell Xeon it runs at 2.841 usec/call, almost twice as
fast as the original and only 1.3x slower than the horizontally
unrolled variants. So, at least on Haswell, unaligned 256-bit AVX
loads are slow.
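The effect is also easy to see in isolation. A minimal sketch (all
names are mine; it sums a buffer with 256-bit loads from a 32-byte
aligned base and then from a base shifted by 4 bytes, so that in the
shifted case half of the loads cross a 64-byte cache-line boundary):

#include <stdio.h>
#include <stdint.h>
#include <time.h>
#include <immintrin.h>

#define N    4096    /* words per pass, fits in L1D */
#define REPS 100000

static double usec_now(void)
{
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return ts.tv_sec * 1e6 + ts.tv_nsec * 1e-3;
}

int main(void)
{
  _Alignas(32) static uint32_t buf[N + 8];
  for (int i = 0; i < N + 8; ++i)
    buf[i] = i;
  for (int off = 0; off <= 1; ++off) {  /* 0: aligned, 1: shifted by 4 bytes */
    const uint32_t *p = buf + off;
    __m256i acc0 = _mm256_setzero_si256();
    __m256i acc1 = _mm256_setzero_si256();
    double t0 = usec_now();
    for (int r = 0; r < REPS; ++r) {
      /* two accumulators so load throughput, not add latency, limits the loop */
      for (int i = 0; i < N; i += 16) {
        acc0 = _mm256_add_epi32(acc0, _mm256_loadu_si256((const __m256i*)(p + i)));
        acc1 = _mm256_add_epi32(acc1, _mm256_loadu_si256((const __m256i*)(p + i + 8)));
      }
    }
    double t1 = usec_now();
    uint32_t out[8];
    _mm256_storeu_si256((__m256i*)out, _mm256_add_epi32(acc0, acc1));
    printf("offset %d bytes: %.0f usec (checksum %u)\n", off*4, t1 - t0, out[0]);
  }
  return 0;
}

Compile with the same 'clang -O3 -march=haswell'; if unaligned 256-bit
loads were free, the two reported times would be equal.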