Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: Terje Mathisen <terje.mathisen@tmsw.no>
Newsgroups: comp.arch
Subject: Re: Cost of handling misaligned access
Date: Thu, 6 Feb 2025 16:00:42 +0100
Organization: A noiseless patient Spider
Lines: 76
Message-ID: <vo2iqq$30elm$1@dont-email.me>
References: <5lNnP.1313925$2xE6.991023@fx18.iad> <vnosj6$t5o0$1@dont-email.me>
 <2025Feb3.075550@mips.complang.tuwien.ac.at>
 <wi7oP.2208275$FOb4.591154@fx15.iad>
 <2025Feb4.191631@mips.complang.tuwien.ac.at> <vo061a$2fiql$1@dont-email.me>
 <20250206000143.00000dd9@yahoo.com>
 <2025Feb6.115939@mips.complang.tuwien.ac.at>
 <20250206152808.0000058f@yahoo.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Thu, 06 Feb 2025 16:00:42 +0100 (CET)
Injection-Info: dont-email.me; posting-host="ff60327de3885ffac12ab9243e481f58";
	logging-data="3160758"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX18AZ0g3x6BS6ncqoshC2B154qAVmCAG2U5zhlNGZ7NdLQ=="
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101
 Firefox/128.0 SeaMonkey/2.53.20
Cancel-Lock: sha1:/PdeAZSFOIST6E6oepeM6AaNipI=
In-Reply-To: <20250206152808.0000058f@yahoo.com>
Bytes: 4743

Michael S wrote:
> On Thu, 06 Feb 2025 10:59:39 GMT
> anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
> 
>> Michael S <already5chosen@yahoo.com> writes:
>>>> This resulted in just 12 instructions to handle 32 tests.
>>>>    
>>>
>>> That sounds suboptimal.
>>> By unrolling the outer loop by 2 or 3 you can greatly reduce the
>>> number of memory accesses per comparison.
>>
>> Looking at the inner loop code shown in
>> <2025Feb6.113049@mips.complang.tuwien.ac.at>, the 12 instructions do
>> not include the loop overhead and are already unrolled by a factor of
>> 4 (32 for the scalar code).  The loop overhead is 3 instructions, for
>> a total of 15 instructions per iteration.
>>
>>> The speedup would depend on the specific
>>> microarchitecture, but I would guess that at least a 1.2x speedup
>>> is available here.
>>
>> Even if you completely eliminate the loop overhead, the number of
>> instructions is reduced by at most a factor of 1.25, and I expect
>> that the speedup from further unrolling is a factor of at most 1 on
>> most CPUs (a factor <1 can come from handling the remaining elements
>> slowly, which does not seem unlikely for code coming out of gcc and
>> clang).
>>
>> - anton
> 
> The point of my proposal is not a reduction of loop overhead, and not
> a reduction of the # of x86 instructions (in fact, with my proposal
> the # of x86 instructions increases), but a reduction of the # of
> uOps due to reuse of loaded values.
> The theory behind it is that in code with very high IPC, like the one
> above, the main bottleneck is most typically the # of uOps that flow
> through the rename stage.
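A minimal scalar sketch of the reuse idea (this is not the SIMD code
under discussion; the 32-bit bitmask representation and the "fits when
the masks don't overlap" test are assumptions for illustration):

```c
#include <stdint.h>

/* Count fitting key/lock pairs. Unrolling the outer loop by 2 lets each
 * loaded lock value be tested against two keys while it is already in a
 * register, halving the number of lock loads per comparison -- more
 * instructions overall, but fewer load uOps per unit of work. */
int count_fits_2x(const uint32_t *keys, int nkeys,
                  const uint32_t *locks, int nlocks)
{
    int fits = 0, i = 0;
    for (; i + 1 < nkeys; i += 2) {
        uint32_t k0 = keys[i], k1 = keys[i + 1];   /* loaded once */
        for (int j = 0; j < nlocks; j++) {
            uint32_t l = locks[j];                 /* one load, two uses */
            fits += ((k0 & l) == 0);
            fits += ((k1 & l) == 0);
        }
    }
    for (; i < nkeys; i++)                         /* leftover key */
        for (int j = 0; j < nlocks; j++)
            fits += ((keys[i] & locks[j]) == 0);
    return fits;
}
```

The same shape carries over to the vector version: the two AND-and-test
operations against the loaded value become register-register
instructions instead of two separate load-op instructions.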

Aha! I see what you mean: Yes, this would be better if the

   VPAND reg,reg,[mem]

instructions actually took more than one cycle each, but as the size of 
the arrays was just 1000 bytes each (250 keys + 250 locks), everything 
fits easily in $L1. (BTW, I did try to add 6 dummy keys and locks just 
to avoid any loop-end overhead, but that actually ran slower.)
> 
> Not counting loop overhead, the original 1x4 inner loop consists of
> 12 instructions, 16 uOps. Suppose we replace it by a 2x2 inner loop
> that does the same amount of work. The new inner loop contains only
> RISC-like instructions - 14 instructions, 14 uOps.
> With a 3x2 inner loop there are 20 instructions, 20 uOps, and 1.5x
> more work done per iteration.
> 
> Another factor that can contribute to a speedup is the increased
> number of iterations in the inner loop - from 1..7 iterations in the
> original to 1..15 in both of my above-mentioned variants.
> 
> 
> Yet another possibility is to follow the "work harder, not smarter"
> principle, i.e. process the whole square rather than just the
> relevant triangle. The main gain is that the loop detector would be
> able to predict the # of iterations in the inner loop, avoiding the
> mispredicted branch at the end. If we follow this path then it
> probably makes sense not to unroll the inner loop beyond a SIMD
> factor of 8 and instead unroll the outer loop by 4.
> Going by intuition, in this particular application "smarter" wins
> over "harder", but we know that intuition sucks. Including mine :(
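The square-vs-triangle trade-off can be sketched in scalar C (again a
hypothetical illustration with 32-bit masks, not the posted code; the
pairwise "masks don't overlap" test is an assumption):

```c
#include <stdint.h>

/* "Smarter": only the upper triangle (i < j) is relevant, but the inner
 * trip count varies with i, so the loop-end branch changes its pattern
 * every outer iteration and is harder for a loop predictor. */
int count_pairs_triangle(const uint32_t *v, int n)
{
    int fits = 0;
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            fits += ((v[i] & v[j]) == 0);
    return fits;
}

/* "Harder": scan the full square. Every inner loop runs exactly n
 * times, so the trip count is predictable. Each fitting pair is now
 * counted twice (i,j and j,i), and the diagonal i == j only "fits"
 * when v[i] == 0, so both effects are corrected at the end. */
int count_pairs_square(const uint32_t *v, int n)
{
    int fits = 0, zeros = 0;
    for (int i = 0; i < n; i++) {
        zeros += (v[i] == 0);
        for (int j = 0; j < n; j++)
            fits += ((v[i] & v[j]) == 0);
    }
    return (fits - zeros) / 2;   /* drop diagonal, undo double count */
}
```

The square version does roughly twice the arithmetic; whether the
avoided mispredictions pay for that is exactly the "smarter vs. harder"
question being measured here.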
> 
Yeah, it was the actual measured performance which amazed me. :-)

Terje

-- 
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"