From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: auto predicating branches
Date: Tue, 22 Apr 2025 22:32:40 +0000
Organization: Rocksolid Light
Message-ID: <f5e5bf81ac2c7e2066d2a181c5a70baf@www.novabbs.org>
References: <vbgdms$152jq$1@dont-email.me> <vtq6vh$39sli$1@dont-email.me> <Is7MP.2098019$OrR5.521315@fx18.iad> <4f65d9544ad95edc8b07c869f9921a35@www.novabbs.org> <vtsbga$1tu26$1@dont-email.me> <b8859e8d6b909a4505c0f487a6a0fe35@www.novabbs.org> <vu2542$38qev$1@dont-email.me> <vu46su$1170i$1@dont-email.me> <2025Apr21.080532@mips.complang.tuwien.ac.at> <d47cdad26528b4d2309ac9df60120315@www.novabbs.org> <2025Apr22.071010@mips.complang.tuwien.ac.at> <DwONP.2213540$eNx6.1757109@fx14.iad> <2025Apr22.193103@mips.complang.tuwien.ac.at>
On Tue, 22 Apr 2025 17:31:03 +0000, Anton Ertl wrote:
> EricP <ThatWouldBeTelling@thevillage.com> writes:
>>Anton Ertl wrote:
>>> Compilers certainly can perform if-conversion, and there have been
>>> papers about that for at least 30 years, but compilers do not know
>>> when if-conversion is profitable. E.g., "branchless" (if-converted)
>>> binary search can be faster than a branching one when the searched
>>> array is cached at a close-enough level, but slower when most of the
>>> accesses miss the close caches
>>> <https://stackoverflow.com/questions/11360831/about-the-branchless-binary-search>.
>>
>>I'm missing something, because the first answer by BeeOnRope looks
>>wrong. Unfortunately they don't show both pieces of code and the asm,
>>but a branching binary search looks just as load-data-dependent to me,
>>as it recalculates middle each iteration and the array index is
>>array[middle].
>
> Here's the branchless version:
>
> while (n > 1) {
>     int middle = n/2;
>     base += (needle < base[middle]) ? 0 : middle;
>     n -= middle;
> }
loop:
    SRA    Rm,Rn,#1        // Rm = middle = n/2        (Rn in)
    ADD    Rn,Rn,-Rm       // n -= middle              (Rn out)
    LDD    Rl,[Rb,Rm<<3]   // Rl = base[middle]        (Rb in)
    CMP    R7,Rk,Rl        // Rk holds needle
    CMOVLT R7,#0,Rm        // Rm = 0 when needle < base[middle]
    ADD    Rb,Rb,Rm<<3     // base += middle or 0      (Rb out)
    BR     loop
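For reference, here is a self-contained C version of the quoted branchless loop (the function name and the 8-byte `long` element type are my choices, not from the thread); at -O2, GCC and Clang typically compile the ternary to a conditional move:

```c
#include <stddef.h>

/* Branchless binary search over a sorted array.  Returns a pointer to
   the last element <= needle, or to the first element when needle is
   smaller than everything.  The ternary selects 0 or middle without a
   branch, so each iteration's load data-depends on the previous one. */
static const long *bsearch_branchless(const long *base, size_t n,
                                      long needle)
{
    while (n > 1) {
        size_t middle = n / 2;
        base += (needle < base[middle]) ? 0 : middle;
        n -= middle;
    }
    return base;
}
```

The data dependence of each `base` on the previous iteration's load is exactly the 2-cycle recurrence discussed below.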
>
> The branching version would then be:
>
> while (n > 1) {
>     int middle = n/2;
>     if (needle >= base[middle])
>         base += middle;
>     n -= middle;
> }
loop:
    SRA    Rm,Rn,#1        // Rm = middle = n/2        (Rn in)
    ADD    Rn,Rn,-Rm       // n -= middle              (Rn out)
    LDD    Rl,[Rb,Rm<<3]   // Rl = base[middle]        (Rb in)
    CMP    R7,Rk,Rl        // Rk holds needle
    PGE    R7,T            // predicate next instruction on needle >= Rl
    ADD    Rb,Rb,Rm<<3     // base += middle  (Rb conditionally out)
    BR     loop
The ADD of Rn can be freely positioned in the loop, but the recurrence
remains at 2 cycles.
A 7-wide (or wider) machine will be able to insert a whole iteration
of the loop per cycle. The loop has a 2-cycle recurrence, so the
execution window will fill up rapidly and retire at 1 iteration per
2 cycles; the loop is recurrence-bound. This agrees with the second
paragraph below.
> In the branching version we have the following loop recurrences:
>
> 1) middle = n/2; n -= middle
> 2) base += middle;
>
> Recurrence 1 costs a few cycles per iteration, ideally 2. Recurrence
> 2 costs even less. There is no load in the loop recurrences. The
> load is used only for verifying the branch prediction. And in
> particular, the load does not data-depend on any previous load.
A shifter with a latency of 2 would really hurt this loop.
>>If branch prediction is always correct, the branching version builds
>>an instruction queue that is a long chain of scaled indexed loads
>>where the load index is data dependent on the prior iteration load value.
>
> That's certainly not the case here.
>
> Even if we assume that the branch predictor misses half of the time
> (which is realistic), that means that the next load and the load of
> all following correctly predicted ifs just have to wait for the load of
> the mispredicted branch plus the misprediction penalty. Which means
> that on average two loads will happen in parallel, while the
> branchless version only performs dependent loads. If the loads cost
> more than a branch misprediction (because of cache misses), the
> branching version is faster.
I do not see 2 LDDs being performed in parallel unless the execution
width is at least 14 wide. In any event, the loop recurrence restricts
overall retirement to 0.5 LDDs per cycle--it is the recurrence that
feeds the iterations (i.e., retirement). The water hose is fundamentally
restricted by the recurrence.
> One can now improve the performance by using branchless for the first
> few levels (which are cached), and adding prefetching and a mixture of
> branchless and branchy for the rest, but for understanding the
> principle the simple variants are best.
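A sketch of the mixed strategy Anton describes, under assumptions of my own: `CACHED_LEVELS` is a made-up tuning knob for how many top levels are presumed cached, and `__builtin_prefetch` is the GCC/Clang builtin used here to start both candidate next-level loads before the branch resolves:

```c
#include <stddef.h>

#define CACHED_LEVELS 3   /* hypothetical: levels assumed to hit cache */

/* Branchless (cmov-friendly) steps for the first few levels, then a
   branchy loop that prefetches both possible probes of the next level,
   so a mispredict does not serialize the cache misses. */
static const long *search_mixed(const long *base, size_t n, long needle)
{
    int level = 0;
    while (n > 1) {
        size_t middle = n / 2;
        if (level++ < CACHED_LEVELS) {
            /* branchless: the load feeds the recurrence via cmov */
            base += (needle < base[middle]) ? 0 : middle;
        } else {
            /* next level's middle, for either outcome of this step */
            size_t next = (n - middle) / 2;
            __builtin_prefetch(&base[next]);          /* branch not taken */
            __builtin_prefetch(&base[middle + next]); /* branch taken */
            if (needle >= base[middle])               /* predicted branch */
                base += middle;
        }
        n -= middle;
    }
    return base;
}
```

Both halves compute the same result; only the dependence structure of the loads differs.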
Predication adds latency to the delivery of Rb's recurrence in that
the consumer has to be ready for both "it didn't happen and now you are
free" and "here is the data you are looking for", complicating the
instruction queue "a little".
>>[*] I want to see the asm because Intel's CMOV always executes the
>>operand operation, then tosses the result if the predicate is false.
Use a less-stupid ISA.

I took a look at extracting the < or >= outcome and using it as a mask
to select between 0 and middle, but this takes one more instruction and
does nothing about the fundamental loop limiter.
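For concreteness, here is one way the mask idea can look in C (my own sketch, not code from the thread); the extra instruction is the materialization of the all-ones/all-zeros mask:

```c
#include <stddef.h>

/* Mask variant: turn the >= outcome into an all-ones/all-zeros mask
   and AND it with middle.  One more instruction than the cmov form,
   and the load -> compare -> mask -> add recurrence is unchanged. */
static const long *bsearch_mask(const long *base, size_t n, long needle)
{
    while (n > 1) {
        size_t middle = n / 2;
        /* (needle >= base[middle]) is 0 or 1; negating in size_t
           arithmetic yields 0 or ~0 */
        size_t mask = (size_t)0 - (size_t)(needle >= base[middle]);
        base += middle & mask;
        n -= middle;
    }
    return base;
}
```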
> That's the usual thing for conditional execution/predication. But
> here 0 has no dependencies and middle has only cheap dependencies, the
> only expensive part is the load for the condition that turns into a
> data dependency in the branchless version.
>
> - anton