Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: Michael S <already5chosen@yahoo.com>
Newsgroups: comp.arch
Subject: Re: Cost of handling misaligned access
Date: Thu, 6 Feb 2025 00:01:43 +0200
Organization: A noiseless patient Spider
Lines: 61
Message-ID: <20250206000143.00000dd9@yahoo.com>
References: <5lNnP.1313925$2xE6.991023@fx18.iad>
	<vnosj6$t5o0$1@dont-email.me>
	<2025Feb3.075550@mips.complang.tuwien.ac.at>
	<wi7oP.2208275$FOb4.591154@fx15.iad>
	<2025Feb4.191631@mips.complang.tuwien.ac.at>
	<vo061a$2fiql$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 05 Feb 2025 23:01:47 +0100 (CET)
Injection-Info: dont-email.me; posting-host="69e09d383b4dc14cef98faf4a6fa8e2a";
	logging-data="2702446"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1/tcOid2o8xmijR1916Ku5R7e9HT6xo7Co="
Cancel-Lock: sha1:A/uLEjNMsYUWak+PrsHqgo1NflQ=
X-Newsreader: Claws Mail 4.1.1 (GTK 3.24.34; x86_64-w64-mingw32)
Bytes: 3278

On Wed, 5 Feb 2025 18:10:03 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:

> Anton Ertl wrote:
> > EricP <ThatWouldBeTelling@thevillage.com> writes:
> >   
> >> As SIMD no longer requires alignment, presumably code no longer
> >> does so.  
> > 
> > Yes, if you use AVX/AVX2, you don't encounter this particular Intel
> > stupidity.  
> 
> Recently, on the last day (Dec 25th) of Advent of Code, I had a
> problem which lent itself to using 32-bit bitmaps: The task was to
> check which locks were compatible with which keys, so I ended up with
> code like this:
> 
> 
>      let mut part1 = 0;
>      for l in li..keylocks.len() {
>          let lock = keylocks[l];
>          for k in 0..li {
>              let sum = lock & keylocks[k];
>              if sum == 0 {
>                  part1 += 1;
>              }
>          }
>      }
> 
> Telling the rust compiler to target my AVX2-capable laptop CPU (an
> Intel i7), I got code that simply amazed me: The compiler unrolled
> the inner loop by 32, ANDing 4 x 8 keys by 8 copies of the current
> lock into 4 AVX registers (vpand), then comparing with a zeroed
> register (vpcmpeqd) (generating -1/0 results) before subtracting
> (vpsubd) those from 4 accumulators.
> 
> This resulted in just 12 instructions to handle 32 tests.
> 

That sounds suboptimal.
By unrolling the outer loop by 2 or 3 you can greatly reduce the number
of memory accesses per comparison. The speedup would depend on the
specific microarchitecture, but I would guess that at least a 1.2x
speedup is available here, especially when the data is not aligned.
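
A minimal sketch of unrolling the outer loop by 2, assuming the
`keylocks` slice holds 32-bit bitmaps and `li` is the key/lock split
index, as in the quoted code. The function name `count_compatible` is
made up for illustration. Each key is loaded once and tested against two
locks, so the number of key loads per comparison is roughly halved.
Whether the compiler vectorizes this form as well as the original is
something to check in the generated code.

```rust
/// Count key/lock pairs whose bitmaps do not overlap.
/// Keys occupy `keylocks[..li]`, locks occupy `keylocks[li..]`
/// (layout assumed from the quoted code). Panics if `li > keylocks.len()`.
fn count_compatible(keylocks: &[u32], li: usize) -> u32 {
    let (keys, locks) = keylocks.split_at(li);
    let mut count = 0u32;

    // Outer loop unrolled by 2: take two locks per pass.
    let mut pairs = locks.chunks_exact(2);
    for pair in &mut pairs {
        let (lock0, lock1) = (pair[0], pair[1]);
        for &key in keys {
            // One load of `key` serves two comparisons.
            count += u32::from((lock0 & key) == 0);
            count += u32::from((lock1 & key) == 0);
        }
    }

    // Leftover lock when the number of locks is odd.
    for &lock in pairs.remainder() {
        for &key in keys {
            count += u32::from((lock & key) == 0);
        }
    }
    count
}
```

Unrolling by 3 follows the same pattern with `chunks_exact(3)` and a
third lock variable.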

> The final code, with zero unsafe/asm/intrinsics, took 5.8
> microseconds to run all the needed parsing/setup/initialization and
> then test 62500 combinations, so just 93 ps per key/lock test!
> 
> There was no attempt to check for 32-byte alignment, it all just
> worked. :-)
> 
> The task is of course embarrassingly parallelizable, but I suspect
> the overhead of starting 4 or 8 threads will be higher than what I
> would save? I guess I'll have to test!
> 
> Terje
> 
>