From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: auto predicating branches
Date: Tue, 22 Apr 2025 22:32:40 +0000
Organization: Rocksolid Light
References: <4f65d9544ad95edc8b07c869f9921a35@www.novabbs.org>
 <2025Apr21.080532@mips.complang.tuwien.ac.at>
 <2025Apr22.071010@mips.complang.tuwien.ac.at>
 <2025Apr22.193103@mips.complang.tuwien.ac.at>

On Tue, 22 Apr 2025 17:31:03 +0000, Anton Ertl wrote:

> EricP writes:
>> Anton Ertl wrote:
>>> Compilers certainly can perform if-conversion, and there have been
>>> papers about that for at least 30 years, but compilers do not know
>>> when if-conversion is profitable. E.g., "branchless" (if-converted)
>>> binary search can be faster than a branching one when the searched
>>> array is cached at a close-enough level, but slower when most of
>>> the accesses miss the close caches.
>>
>> I'm missing something because the first answer by BeeOnRope looks
>> wrong. Unfortunately they don't show both pieces of code and the
>> asm, but a branching binary search looks just as load-data-dependent
>> to me, as it recalculates middle each iteration and the array index
>> is array[middle].
>
> Here's the branchless version:
>
>    while (n > 1) {
>       int middle = n/2;
>       base += (needle < *base[middle]) ? 0 : middle;
>       n -= middle;
>    }

loop:
    SRA    Rn,Rn,#1       // Rn in
    LDD    Rl,[Rb,Rm<<3]  // Rb in
    CMP    R7,Rn,Rl
    CMOVLT R7,#0,Rm
    ADD    Rb,Rb,-Rm      // Rb out
    ADD    Rn,Rn,-Rm      // Rn out
    BR     loop

> The branching version would then be:
>
>    while (n > 1) {
>       int middle = n/2;
>       if (needle >= *base[middle])
>          base += middle;
>       n -= middle;
>    }

loop:
    SRA    Rn,Rn,#1       // Rn in
    LDD    Rl,[Rb,Rm<<3]  // Rb in
    CMP    R7,Rn,Rl
    PGE    R7,T
    ADD    Rb,Rb,-Rm      // Rb conditionally out
    ADD    Rn,Rn,-Rm      // Rn out
    BR     loop

The ADD of Rn can be freely positioned in the loop, but the recurrence
remains at 2 cycles.

A 7-wide (or wider) machine will be able to insert a whole iteration
of the loop per cycle. But the loop has a 2-cycle recurrence, so the
execution window will fill up rapidly and retire at 1 iteration per 2
cycles, being recurrence-bound. This agrees with the second paragraph
below.

> In the branching version we have the following loop recurrences:
>
> 1) middle = n/2; n -= middle
> 2) base += middle;
>
> Recurrence 1 costs a few cycles per iteration, ideally 2. Recurrence
> 2 costs even less. There is no load in the loop recurrences. The
> load is used only for verifying the branch prediction. And in
> particular, the load does not data-depend on any previous load.

A shifter with a latency of 2 would really hurt this loop.

>> If branch prediction is always correct, the branching version builds
>> an instruction queue that is a long chain of scaled indexed loads
>> where the load index is data dependent on the prior iteration load
>> value.
>
> That's certainly not the case here.
>
> Even if we assume that the branch predictor misses half of the time
> (which is realistic), that means that the next load and the loads of
> all following correctly predicted ifs just have to wait for the load
> of the mispredicted branch plus the misprediction penalty. Which
> means that on average two loads will happen in parallel, while the
> branchless version only performs dependent loads.
> If the loads cost more than a branch misprediction (because of cache
> misses), the branching version is faster.

I do not see 2 LDDs being performed in parallel unless the execution
width is at least 14 wide. In any event, the loop recurrence restricts
overall retirement to 0.5 LDDs per cycle--it is the recurrence that
feeds the iterations (i.e., retirement). The water hose is
fundamentally restricted by the recurrence.

> One can now improve the performance by using branchless for the first
> few levels (which are cached), and adding prefetching and a mixture
> of branchless and branchy for the rest, but for understanding the
> principle the simple variants are best.

Predication adds latency on the delivery of Rb's recurrence, in that
the consumer has to be ready for "didn't happen and now you are free"
along with "here is the data you are looking for"; complicating the
instruction queue "a little".

>> [*] I want to see the asm because Intel's CMOV always executes the
>> operand operation, then tosses the result if the predicate is
>> false.

Use a less-stupid ISA.

I took a look at extracting the < or >= result and using it as a mask
(0, middle), but this takes 1 more instruction and does nothing about
the fundamental loop limiter.

> That's the usual thing for conditional execution/predication. But
> here 0 has no dependencies and middle has only cheap dependencies;
> the only expensive part is the load for the condition that turns into
> a data dependency in the branchless version.
>
> - anton