From: Terje Mathisen
Newsgroups: comp.arch
Subject: Re: Cost of handling misaligned access
Date: Wed, 5 Feb 2025 20:26:18 +0100

Anton Ertl wrote:
> Terje Mathisen writes:
>> for k in 0..li {
>>   let sum = lock & keylocks[k];
>>   if sum == 0 {
>>     part1 += 1;
>>   }
>> }
>
> Does Rust only have this roundabout way to express this sequentially?
> In Forth I would express that scalarly as
>
> ( part1 ) li 0 do
>   keylocks i th @ lock and 0= - loop
>
> ["-" because 0= produces all-bits-set (-1) for true]
>
> or in C as
>
> for (k=0; k<li; k++)
>   part1 += (lock & keylocks[k])==0;

I could have written it as

  part1 += ((lock & keylocks[k]) == 0) as u32;

I.e. just like C, except that all casting has to be explicit, and here
the boolean result of the '==' test needs to be expanded into a u32.

>
> which I find much easier to follow. I also expected 0..li to include
> li (based on, I guess, the of .. in Pascal and its descendants), but
> the net tells me that it does not (starting with 0 was the hint that
> made me check my expectations).

:-)

It is similar to "for (k=0;k<li;k++)".

>
>> Telling the rust compiler to target my AVX2-capable laptop CPU (an Intel
>> i7)
>
> I find it deplorable that even knowledgeable people use marketing
> labels like "i7" which do not tell anything technical (and very little
> non-technical) rather than specifying the full model number (e.g, Core
> i7-1270P) or the design (e.g., Alder Lake). But in the present case
> "AVX2-capable CPU" is enough information.
>
>> I got code that simply amazed me: The compiler unrolled the inner
>> loop by 32, ANDing 4 x 8 keys by 8 copies of the current lock into 4 AVX
>> registers (vpand), then comparing with a zeroed register (vpcmpeqd)
>> (generating -1/0 results) before subtracting (vpsubd) those from 4
>> accumulators.
>
> If you have ever learned about vectorization, it's easy to see that
> the inner loop can be vectorized. And obviously auto-vectorization
> has worked in this case, not particularly amazing to me.

I have some (30 years?) experience with auto-vectorization, and usually
I've been (very?) disappointed. As I wrote, this was the best I have ever
seen, and the resulting code actually performed extremely close to the
theoretical speed of light, i.e. 3 clock cycles for each group of 3 AVX
instructions.

[snip]

> clang is somewhat better:
>
> For the avx2 case, 70 lines and 250 bytes.
> For the x86-64-v4 case, 111 lines and 435 bytes.

Rustc sits on top of the same LLVM infrastructure as clang, and even with
that 32-way unroll the code was quite compact. I did not count, but your
70 lines seem to be in the ballpark.

Terje

--
"almost all programming can be viewed as an exercise in caching"
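
For anyone who wants to try the loop variants discussed above, here is a minimal Rust sketch. It is not the original program: the function names, the slice parameter, and the choice of u32 elements are assumptions made for illustration (u32 is only inferred from "8 keys" per AVX register and the `as u32` cast). It shows the branchy form, the explicit-cast form, an iterator form, and the half-open behaviour of `0..li`.

```rust
/// Branchy form, as in the loop quoted at the top of the post.
/// Counts the keys that share no set bits with `lock`.
/// (Function name and u32 element type are assumptions.)
fn count_fits_branchy(lock: u32, keylocks: &[u32]) -> u32 {
    let li = keylocks.len();
    let mut part1 = 0;
    for k in 0..li {
        let sum = lock & keylocks[k];
        if sum == 0 {
            part1 += 1;
        }
    }
    part1
}

/// C-like form: the bool result of `==` is cast explicitly to u32
/// (false -> 0, true -> 1) and added to the counter.
fn count_fits_cast(lock: u32, keylocks: &[u32]) -> u32 {
    let li = keylocks.len();
    let mut part1 = 0;
    for k in 0..li {
        part1 += ((lock & keylocks[k]) == 0) as u32;
    }
    part1
}

/// Iterator form of the same count, without explicit indexing.
fn count_fits_iter(lock: u32, keylocks: &[u32]) -> u32 {
    keylocks.iter().filter(|&&key| (lock & key) == 0).count() as u32
}

/// `0..li` is half-open (excludes li); `0..=li` is inclusive.
/// Returns the number of iterations of each range.
fn range_lengths(li: usize) -> (usize, usize) {
    ((0..li).count(), (0..=li).count())
}
```

All three counting functions should return the same value for the same input. Whether a given build vectorizes them the way described above depends on the compiler version and target options (for example `-C target-cpu=native`), so inspect the generated assembly rather than assuming it.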