Path: news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: auto predicating branches
Date: Tue, 22 Apr 2025 17:31:03 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 74
Message-ID: <2025Apr22.193103@mips.complang.tuwien.ac.at>
References: <4f65d9544ad95edc8b07c869f9921a35@www.novabbs.org> <2025Apr21.080532@mips.complang.tuwien.ac.at> <2025Apr22.071010@mips.complang.tuwien.ac.at>
Injection-Date: Tue, 22 Apr 2025 19:51:16 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="538ba95c38275f5ed8ab3f554e59756f"; logging-data="1161477"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+2Z4jtIFzFMQyLgsqdHMDM"
Cancel-Lock: sha1:XjoGWTca9bcLhkfRQU92plZjmbE=
X-newsreader: xrn 10.11

EricP writes:
>Anton Ertl wrote:
>> Compilers certainly can perform if-conversion, and there have been
>> papers about that for at least 30 years, but compilers do not know
>> when if-conversion is profitable.  E.g., "branchless" (if-converted)
>> binary search can be faster than a branching one when the searched
>> array is cached at a close-enough level, but slower when most of the
>> accesses miss the close caches
>> .
>
>I'm missing something because the first answer by BeeOnRope looks wrong.
>Unfortunately they don't show both pieces of code and the asm,
>but a branching binary search to me looks just as load data dependent as it
>recalculates middle each iteration and the array index is array[middle].

Here's the branchless version:

  while (n > 1) {
    int middle = n/2;
    base += (needle < *base[middle]) ? 0 : middle;
    n -= middle;
  }

The branching version would then be:

  while (n > 1) {
    int middle = n/2;
    if (needle >= *base[middle])
      base += middle;
    n -= middle;
  }

In the branching version we have the following loop recurrences:

1) middle = n/2; n -= middle
2) base += middle;

Recurrence 1 costs a few cycles per iteration, ideally 2.  Recurrence 2
costs even less.  There is no load in the loop recurrences.  The load is
used only for verifying the branch prediction.  And in particular, the
load does not data-depend on any previous load.

>If branch prediction is always correct, the branching version builds
>an instruction queue that is a long chain of scaled indexed loads
>where the load index is data dependent on the prior iteration load value.

That's certainly not the case here.  Even if we assume that the branch
predictor misses half of the time (which is realistic), that means that
the next load and the loads of all following correctly predicted ifs
just have to wait for the load of the mispredicted branch plus the
misprediction penalty.  Which means that on average two loads will
happen in parallel, while the branchless version only performs dependent
loads.  If the loads cost more than a branch misprediction (because of
cache misses), the branching version is faster.

One can now improve the performance by using branchless for the first
few levels (which are cached), and adding prefetching and a mixture of
branchless and branchy for the rest, but for understanding the principle
the simple variants are best.
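To make that concrete, here is a minimal sketch of such a hybrid; it is
not code from the discussion above.  It assumes a plain sorted array of
ints (so the comparison reads base[middle] rather than the *base[middle]
of the fragments above), an arbitrary CACHED_LEVELS constant for how
many of the top levels one expects to stay cached, and GCC/Clang's
__builtin_prefetch for prefetching both possible midpoints of the next
level:

  #include <stddef.h>

  #define CACHED_LEVELS 4  /* assumption: how many top levels stay cached */

  static const int *hybrid_search(const int *base, size_t n, int needle)
  {
    /* top levels: their midpoints are hot in the cache, so the dependent
       loads of the branchless version are cheap */
    for (int level = 0; level < CACHED_LEVELS && n > 1; level++) {
      size_t middle = n/2;
      base += (needle < base[middle]) ? 0 : middle;
      n -= middle;
    }
    /* deeper levels: mostly cache misses, so let the branch predictor
       speculate ahead, and prefetch both candidate midpoints of the next
       level so that their misses overlap either way */
    while (n > 1) {
      size_t middle = n/2;
      size_t next_middle = (n - middle)/2;
      __builtin_prefetch(&base[next_middle]);          /* needle <  base[middle] */
      __builtin_prefetch(&base[middle + next_middle]); /* needle >= base[middle] */
      if (needle >= base[middle])
        base += middle;
      n -= middle;
    }
    return base;  /* the caller checks whether *base == needle */
  }

Where to split between branchless and branchy (a fixed level count as
here, or a threshold on n) is exactly the kind of thing one would have
to measure on the target machine.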
>[*] I want to see the asm because Intel's CMOV always executes the
>operand operation, then tosses the result if the predicate is false.

That's the usual thing for conditional execution/predication.  But here
0 has no dependencies and middle has only cheap dependencies; the only
expensive part is the load for the condition, which turns into a data
dependency in the branchless version.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup,