Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: Terje Mathisen <terje.mathisen@tmsw.no>
Newsgroups: comp.arch
Subject: Re: Cost of handling misaligned access
Date: Wed, 5 Feb 2025 18:10:03 +0100
Organization: A noiseless patient Spider
Lines: 49
Message-ID: <vo061a$2fiql$1@dont-email.me>
References: <5lNnP.1313925$2xE6.991023@fx18.iad> <vnosj6$t5o0$1@dont-email.me>
 <2025Feb3.075550@mips.complang.tuwien.ac.at>
 <wi7oP.2208275$FOb4.591154@fx15.iad>
 <2025Feb4.191631@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Wed, 05 Feb 2025 18:10:03 +0100 (CET)
Injection-Info: dont-email.me; posting-host="c737d1c68c78a1658e9668338878ae32";
	logging-data="2607957"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX19AlwME4/vd1TPC4TVL5a+Ngaxf1/7rTAKIjerLEXHCnw=="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101
 Firefox/128.0 SeaMonkey/2.53.20
Cancel-Lock: sha1:/Lz4k9E9gNYZySQ5qR1ipJd/gEM=
In-Reply-To: <2025Feb4.191631@mips.complang.tuwien.ac.at>
Bytes: 3001

Anton Ertl wrote:
> EricP <ThatWouldBeTelling@thevillage.com> writes:
> 
>> As SIMD no longer requires alignment, presumably code no longer does so.
> 
> Yes, if you use AVX/AVX2, you don't encounter this particular Intel
> stupidity.

Recently, on the last day (Dec 25th) of Advent of Code, I had a problem 
which lent itself to using 32-bit bitmaps: The task was to check which 
locks were compatible with which keys, so I ended up with code like this:


     let mut part1 = 0;
     for l in li..keylocks.len() {
         let lock = keylocks[l];
         for k in 0..li {
             let sum = lock & keylocks[k];
             if sum == 0 {
                 part1 += 1;
             }
         }
     }

Telling the rust compiler to target my AVX2-capable laptop CPU (an Intel 
i7), I got code that simply amazed me: The compiler unrolled the inner 
loop by 32, ANDing 4 x 8 keys by 8 copies of the current lock into 4 AVX 
registers (vpand), then comparing with a zeroed register (vpcmpeqd) 
(generating -1/0 results) before subtracting (vpsubd) those from 4 
accumulators.

This resulted in just 12 instructions to handle 32 tests.

The final code, with zero unsafe/asm/intrinsics, took 5.8 microseconds 
to run all the needed parsing/setup/initialization and then test 62500 
combinations, so just 93 ps per key/lock test!

There was no attempt to check for 32-byte algnment, it all just worked. :-)

The task is of course embarrassingly parallelizable, but I suspect the 
overhead of starting 4 or 8 threads will be higher than what I would 
save? I guess I'll have to test!

Terje


-- 
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"