From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: auto predicating branches
Date: Tue, 22 Apr 2025 05:10:10 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2025Apr22.071010@mips.complang.tuwien.ac.at>
References: <4f65d9544ad95edc8b07c869f9921a35@www.novabbs.org> <2025Apr21.080532@mips.complang.tuwien.ac.at>

mitchalsup@aol.com (MitchAlsup1) writes:
>On Mon, 21 Apr 2025 6:05:32 +0000, Anton Ertl wrote:
>
>> Robert Finch writes:
>>>Having branches automatically convert into
>>>predicates when they branch forward a short distance <7 instructions.
>>
>> If-conversion in hardware is a good idea, if done well, because it
>> involves issues that tend to be unknown to compilers:
>
>I had little trouble teaching Brian how to put if-conversion into the
>compiler with my PRED instructions. Alleviating HW from having to
>bother other than being able to execute PREDicated clauses.

Compilers certainly can perform if-conversion, and there have been
papers about that for at least 30 years, but compilers do not know
when if-conversion is profitable. E.g., a "branchless" (if-converted)
binary search can be faster than a branching one when the searched
array is cached at a close-enough level, but slower when most of the
accesses miss the close caches.

>> * How predictable is the condition?
>> If the condition is very well
>> predictable, if-conversion is not a good idea, because it turns the
>> control dependency (which does not cost latency when the prediction
>> is correct) into a data dependency. Moreover, in this case the
>> if-conversion increases the resource consumption. Compilers are not
>> good at predicting the predictability AFAIK.
>
>Rather than base the choice on the predictability of the condition,
>it is based on whether FETCH will pass the join-point before the
>condition resolves. On an 8-wide machine this might be "THE next
>cycle".

On a CPU with out-of-order execution, the instruction fetcher often
runs many dozens of instructions ahead of the functional units that
produce the conditions, so your criterion could cover pretty big IFs.
And, given that you want the compiler to do it, the compiler would
have to know about that. Ok, what decision will you take in which
case, and why?

>> * Is the condition available before or after the original data
>> dependencies? And if afterwards, by how many cycles? If it is
>> afterwards and the branch prediction would be correct, the
>> if-conversion means that the result of the instruction is available
>> later, which may reduce IPC.
>
>Generally, it only adds latency--if the execution window is not
>stalled at either end, this does not harm IPC.

If the additional latency is on the critical path and execution is
dependency-limited, this reduces IPC. And yes, this will result in
the buffers (especially the schedulers) filling up and stalling the
front end.

>> OTOH, if the branch prediction would
>> be incorrect, the recovery also depends on when the condition
>> becomes available,
>
>There is no "recovery" from PREDication, just one clause getting
>nullified.

I apparently wrote that in a way that was easy to misunderstand.
Here's another attempt: When comparing the branching variant to the
predicated (if-converted) variant, if the branching variant would be
mispredicted, it is always at a disadvantage wrt.
latency compared to the predicated variant, because the branching
variant restarts from the instruction fetch when the condition becomes
available, while the predicated variant has already been fetched and
decoded and waits in a scheduler for the condition.

Note that in the binary-search case linked-to above, that's also the
case, but in the branchy version the benefit comes from the correct
predictions and the lack of data dependencies between the loads: in
those cases the cache-missing load does not depend on the previous
cache-missing load (unlike in the branchless version), resulting in
an overall shorter latency.

>> and the total latency is higher in the case of no
>> if-conversion. The compiler may do an ok job at predicting whether
>> a condition is available before or after the original data
>> dependencies (I don't know a paper that evaluates that), but without
>> knowing about the prediction accuracy of a specific condition that
>> does not help much.
>>
>> So the hardware should take the predictability of a condition and
>> the availability of the condition into consideration for
>> if-conversion.
>
>My argument is that this is a SW decision (in the compiler) not a
>HW decision (other than providing the PREDs).

That's a position, not an argument. Do you have an argument for your
position?

>Since PREDs are not
>predicted (unless you think they are predicted BOTH ways) they do
>not diminish the performance of the branch predictors.

Nor increase it. But it sounds like you think that the compiler
should choose predication when the condition is not particularly
predictable. How should the compiler know that?

>The compiler chooses PRED because FETCH reaches the join-point prior
>to the branch resolving. PRED is almost always faster--and when
>it has both a then-clause and an else-clause, it always saves a
>branch instruction (jumping over the else-clause).

It appears that you have an underpowered front end in mind.
Even in the wide machines of today and without predication, the front
end normally does not have a problem fetching at least as many
instructions as the rest of the machine can handle, as long as the
predictions are correct. Even if the instruction fetcher cannot fetch
the full width in one cycle, due to having more taken branches than
it can handle or some other hiccup, this is usually made up for by
delivering more instructions in other cycles than the rest of the CPU
can handle. E.g., Skymont fetches from three sequential streams at
3*32 bytes/cycle, decodes 3*3 instructions/cycle, and stores the
resulting uops in three 32-entry uop queues; the renamer consumes 8
uops/cycle from these queues. So as long as the predictions are
correct and the average fetch-and-decode rate is >8 uops/cycle, the
renamer will rarely see fewer than 8 uops available, even if there is
the occasional cycle where the taken branches are so dense that the
instruction fetcher cannot deliver enough relevant bytes to the
decoders.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined
behavior.'
  Mitch Alsup,