From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: auto predicating branches
Date: Tue, 22 Apr 2025 05:10:10 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2025Apr22.071010@mips.complang.tuwien.ac.at>
References: <4f65d9544ad95edc8b07c869f9921a35@www.novabbs.org> <2025Apr21.080532@mips.complang.tuwien.ac.at>

mitchalsup@aol.com (MitchAlsup1) writes:
>On Mon, 21 Apr 2025 6:05:32 +0000, Anton Ertl wrote:
>
>> Robert Finch writes:
>>>Having branches automatically convert into
>>>predicates when they branch forward a short distance <7 instructions.
>>
>> If-conversion in hardware is a good idea, if done well, because it
>> involves issues that tend to be unknown to compilers:
>
>I had little trouble teaching Brian how to put if-conversion into the
>compiler with my PRED instructions. Alleviating HW from having to
>bother other than being able to execute PREDicated clauses.

Compilers certainly can perform if-conversion, and there have been
papers about that for at least 30 years, but compilers do not know
when if-conversion is profitable. E.g., a "branchless" (if-converted)
binary search can be faster than a branching one when the searched
array is cached at a close-enough level, but slower when most of the
accesses miss the close caches.

>> * How predictable is the condition?
>> If the condition is very well
>> predictable, if-conversion is not a good idea, because it turns the
>> control dependency (which does not cost latency when the prediction
>> is correct) into a data dependency. Moreover, in this case the
>> if-conversion increases the resource consumption. Compilers are not
>> good at predicting the predictability AFAIK.
>
>Rather than base the choice on the predictability of the condition,
>it is based on whether FETCH will pass the join-point before the
>condition resolves. On an 8-wide machine this might be "THE next
>cycle".

On a CPU with out-of-order execution, the instruction fetcher often
runs many dozens of instructions ahead of the functional units that
produce the conditions, so your criterion could cover pretty big IFs.
And, given that you want the compiler to do it, the compiler would
have to know about that. Ok, what decision will you take in which
case, and why?

>> * Is the condition available before or after the original data
>> dependencies? And if afterwards, by how many cycles? If it is
>> afterwards and the branch prediction would be correct, the
>> if-conversion means that the result of the instruction is available
>> later, which may reduce IPC.
>
>Generally, it only adds latency--if the execution window is not
>stalled at either end, this does not harm IPC.

If the additional latency is on the critical path and execution is
dependency-limited, this reduces IPC. And yes, this will result in
the buffers (especially the schedulers) filling up and stalling the
front end.

>> OTOH, if the branch prediction would
>> be incorrect, the recovery also depends on when the condition
>> becomes available,
>
>There is no "recovery" from PREDication, just one clause getting
>nullified.

I apparently wrote that in a way that was easy to misunderstand.
Here's another attempt: When comparing the branching variant to the
predicated (if-converted) variant, if the branching variant would be
mispredicted, it is always at a disadvantage wrt.
latency compared to the predicated variant, because the branching
variant restarts from the instruction fetch when the condition becomes
available, while the predicated variant has already been fetched and
decoded and waits in a scheduler for the condition.

Note that in the binary-search case linked-to above, that's also the
case, but in the branchy version the benefit comes from the correct
predictions and the lack of data dependencies between the loads: in
those cases the cache-missing load does not depend on the previous
cache-missing load (unlike in the branchless version), resulting in
an overall shorter latency.

>> and the total latency is higher in the case of no
>> if-conversion. The compiler may do an ok job at predicting whether
>> a condition is available before or after the original data
>> dependencies (I don't know a paper that evaluates that), but without
>> knowing about the prediction accuracy of a specific condition that
>> does not help much.
>>
>> So the hardware should take the predictability of a condition and
>> the availability of the condition into consideration for
>> if-conversion.
>
>My argument is that this is a SW decision (in the compiler) not a
>HW decision (other than providing the PREDs).

That's a position, not an argument. Do you have an argument for your
position?

>Since PREDs are not
>predicted (unless you think they are predicted BOTH ways) they do
>not diminish the performance of the branch predictors.

Nor increase it. But it sounds like you think that the compiler
should choose predication when the condition is not particularly
predictable. How should the compiler know that?

>The compiler chooses PRED because FETCH reaches the join-point prior
>to the branch resolving. PRED is almost always faster--and when
>it has both a then-clause and an else-clause, it always saves a
>branch instruction (jumping over the else-clause).

It appears that you have an underpowered front end in mind.
Even in the wide machines of today and without predication, the front
end normally does not have a problem fetching at least as many
instructions as the rest of the machine can handle, as long as the
predictions are correct. Even if the instruction fetcher cannot fetch
the full width in one cycle, due to having more taken branches than
it can handle or some other hiccup, this is usually made up for by
delivering more instructions in other cycles than the rest of the CPU
can handle. E.g., Skymont fetches from three sequential streams at
3*32 bytes/cycle, decodes 3*3 instructions/cycle, and stores the
resulting uops in three 32-entry uop queues; the renamer consumes 8
uops/cycle from these queues. So as long as the predictions are
correct and the average fetch-and-decode rate is >8 uops/cycle, the
renamer will rarely see fewer than 8 uops available, even if there is
the occasional cycle where the taken branches are so dense that the
instruction fetcher cannot deliver enough relevant bytes to the
decoders.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined
behavior.'
  Mitch Alsup,