From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: auto predicating branches
Date: Tue, 22 Apr 2025 22:32:40 +0000
Organization: Rocksolid Light
Message-ID: <f5e5bf81ac2c7e2066d2a181c5a70baf@www.novabbs.org>
References: <vbgdms$152jq$1@dont-email.me> <vtq6vh$39sli$1@dont-email.me> <Is7MP.2098019$OrR5.521315@fx18.iad> <4f65d9544ad95edc8b07c869f9921a35@www.novabbs.org> <vtsbga$1tu26$1@dont-email.me> <b8859e8d6b909a4505c0f487a6a0fe35@www.novabbs.org> <vu2542$38qev$1@dont-email.me> <vu46su$1170i$1@dont-email.me> <2025Apr21.080532@mips.complang.tuwien.ac.at> <d47cdad26528b4d2309ac9df60120315@www.novabbs.org> <2025Apr22.071010@mips.complang.tuwien.ac.at> <DwONP.2213540$eNx6.1757109@fx14.iad> <2025Apr22.193103@mips.complang.tuwien.ac.at>
On Tue, 22 Apr 2025 17:31:03 +0000, Anton Ertl wrote:
> EricP <ThatWouldBeTelling@thevillage.com> writes:
>>Anton Ertl wrote:
>>> Compilers certainly can perform if-conversion, and there have been
>>> papers about that for at least 30 years, but compilers do not know
>>> when if-conversion is profitable. E.g., "branchless" (if-converted)
>>> binary search can be faster than a branching one when the searched
>>> array is cached at a close-enough level, but slower when most of the
>>> accesses miss the close caches
>>> <https://stackoverflow.com/questions/11360831/about-the-branchless-binary-search>.
>>
>>I'm missing something, because the first answer by BeeOnRope looks
>>wrong. Unfortunately they don't show both pieces of code and the asm,
>>but a branching binary search looks just as load-data-dependent to me,
>>as it recalculates middle each iteration and the array index is
>>array[middle].
>
> Here's the branchless version:
>
> while (n > 1) {
>     int middle = n/2;
>     base += (needle < base[middle]) ? 0 : middle;
>     n -= middle;
> }
loop:
    SRA    Rm,Rn,#1        // Rm = middle = n/2        (Rn in)
    ADD    Rn,Rn,-Rm       // n -= middle              (Rn out)
    LDD    Rl,[Rb,Rm<<3]   // Rl = base[middle]        (Rb in)
    CMP    R7,Rk,Rl        // Rk holds needle
    CMOVLT R7,#0,Rm        // Rm = 0 when needle < base[middle]
    ADD    Rb,Rb,Rm<<3     // base += middle or 0      (Rb out)
    BR     loop
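For reference, here is a self-contained C version of the quoted branchless loop (the function name and the 8-byte `long` element type are my choices, not from the thread); at -O2, GCC and Clang typically compile the ternary to a conditional move:

```c
#include <stddef.h>

/* Branchless binary search over a sorted array.  Returns a pointer to
   the last element <= needle, or to the first element when needle is
   smaller than everything.  The ternary selects 0 or middle without a
   branch, so each iteration's load data-depends on the previous one. */
static const long *bsearch_branchless(const long *base, size_t n,
                                      long needle)
{
    while (n > 1) {
        size_t middle = n / 2;
        base += (needle < base[middle]) ? 0 : middle;
        n -= middle;
    }
    return base;
}
```

The data dependence of each `base` on the previous iteration's load is exactly the 2-cycle recurrence discussed below.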
>
> The branching version would then be:
>
> while (n > 1) {
>     int middle = n/2;
>     if (needle >= base[middle])
>         base += middle;
>     n -= middle;
> }
loop:
    SRA    Rm,Rn,#1        // Rm = middle = n/2        (Rn in)
    ADD    Rn,Rn,-Rm       // n -= middle              (Rn out)
    LDD    Rl,[Rb,Rm<<3]   // Rl = base[middle]        (Rb in)
    CMP    R7,Rk,Rl        // Rk holds needle
    PGE    R7,T            // predicate next instruction on needle >= Rl
    ADD    Rb,Rb,Rm<<3     // base += middle  (Rb conditionally out)
    BR     loop
The ADD of Rn can be freely positioned in the loop, but the recurrence
remains at 2 cycles.
A 7-wide (or wider) machine will be able to insert a whole iteration
of the loop per cycle. The loop has a 2-cycle recurrence, so the
execution window will fill up rapidly and retire at 1 iteration per
2 cycles; the loop is recurrence-bound. This agrees with the second
paragraph below.
> In the branching version we have the following loop recurrences:
>
> 1) middle = n/2; n -= middle
> 2) base += middle;
>
> Recurrence 1 costs a few cycles per iteration, ideally 2. Recurrence
> 2 costs even less. There is no load in the loop recurrences. The
> load is used only for verifying the branch prediction. And in
> particular, the load does not data-depend on any previous load.
A shifter with a latency of 2 would really hurt this loop.
>>If branch prediction is always correct, the branching version builds
>>an instruction queue that is a long chain of scaled indexed loads
>>where the load index is data dependent on the prior iteration load value.
>
> That's certainly not the case here.
>
> Even if we assume that the branch predictor misses half of the time
> (which is realistic), that means that the next load and the load of
> all following correctly predicted ifs just have to wait for the load of
> the mispredicted branch plus the misprediction penalty. Which means
> that on average two loads will happen in parallel, while the
> branchless version only performs dependent loads. If the loads cost
> more than a branch misprediction (because of cache misses), the
> branching version is faster.
I do not see 2 LDDs being performed in parallel unless the execution
width is at least 14 wide. In any event, the loop recurrence restricts
overall retirement to 0.5 LDDs per cycle--it is the recurrence that
feeds the iterations (i.e., retirement). The water hose is fundamentally
restricted by the recurrence.
> One can now improve the performance by using branchless for the first
> few levels (which are cached), and adding prefetching and a mixture of
> branchless and branchy for the rest, but for understanding the
> principle the simple variants are best.
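A sketch of the mixed strategy Anton describes, under assumptions of my own: `CACHED_LEVELS` is a made-up tuning knob for how many top levels are presumed cached, and `__builtin_prefetch` is the GCC/Clang builtin used here to start both candidate next-level loads before the branch resolves:

```c
#include <stddef.h>

#define CACHED_LEVELS 3   /* hypothetical: levels assumed to hit cache */

/* Branchless (cmov-friendly) steps for the first few levels, then a
   branchy loop that prefetches both possible probes of the next level,
   so a mispredict does not serialize the cache misses. */
static const long *search_mixed(const long *base, size_t n, long needle)
{
    int level = 0;
    while (n > 1) {
        size_t middle = n / 2;
        if (level++ < CACHED_LEVELS) {
            /* branchless: the load feeds the recurrence via cmov */
            base += (needle < base[middle]) ? 0 : middle;
        } else {
            /* next level's middle, for either outcome of this step */
            size_t next = (n - middle) / 2;
            __builtin_prefetch(&base[next]);          /* branch not taken */
            __builtin_prefetch(&base[middle + next]); /* branch taken */
            if (needle >= base[middle])               /* predicted branch */
                base += middle;
        }
        n -= middle;
    }
    return base;
}
```

Both halves compute the same result; only the dependence structure of the loads differs.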
Predication adds latency to the delivery of Rb's recurrence in that
the consumer has to be ready for both "it didn't happen and now you are
free" and "here is the data you are looking for", complicating the
instruction queue "a little".
>>[*] I want to see the asm because Intel's CMOV always executes the
>>operand operation, then tosses the result if the predicate is false.
Use a less-stupid ISA.

I took a look at extracting the < or >= outcome and using it as a mask
to select between 0 and middle, but this takes one more instruction and
does nothing about the fundamental loop limiter.
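For concreteness, here is one way the mask idea can look in C (my own sketch, not code from the thread); the extra instruction is the materialization of the all-ones/all-zeros mask:

```c
#include <stddef.h>

/* Mask variant: turn the >= outcome into an all-ones/all-zeros mask
   and AND it with middle.  One more instruction than the cmov form,
   and the load -> compare -> mask -> add recurrence is unchanged. */
static const long *bsearch_mask(const long *base, size_t n, long needle)
{
    while (n > 1) {
        size_t middle = n / 2;
        /* (needle >= base[middle]) is 0 or 1; negating in size_t
           arithmetic yields 0 or ~0 */
        size_t mask = (size_t)0 - (size_t)(needle >= base[middle]);
        base += middle & mask;
        n -= middle;
    }
    return base;
}
```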
> That's the usual thing for conditional execution/predication. But
> here 0 has no dependencies and middle has only cheap dependencies, the
> only expensive part is the load for the condition that turns into a
> data dependency in the branchless version.
>
> - anton