From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: auto predicating branches
Date: Wed, 23 Apr 2025 17:44:56 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2025Apr23.194456@mips.complang.tuwien.ac.at>
References: <vbgdms$152jq$1@dont-email.me> <vtsbga$1tu26$1@dont-email.me>
 <b8859e8d6b909a4505c0f487a6a0fe35@www.novabbs.org>
 <vu2542$38qev$1@dont-email.me> <vu46su$1170i$1@dont-email.me>
 <2025Apr21.080532@mips.complang.tuwien.ac.at>
 <d47cdad26528b4d2309ac9df60120315@www.novabbs.org>
 <2025Apr22.071010@mips.complang.tuwien.ac.at>
 <DwONP.2213540$eNx6.1757109@fx14.iad>
 <2025Apr22.193103@mips.complang.tuwien.ac.at>
 <f5e5bf81ac2c7e2066d2a181c5a70baf@www.novabbs.org>

mitchalsup@aol.com (MitchAlsup1) writes:
>I do not see 2 LDDs being performed parallel unless the execution
>width is at least 14-wide. In any event loop recurrence restricts the
>overall retirement to 0.5 LDDs per cycle--it is the recurrence that
>feeds the iterations (i.e., retirement).

Yes.  But with loads that take longer than two cycles (very common in
OoO microarchitectures even for L1 hits), the second load starts
before the first finishes.  And in the case where the branchy version
is profitable (when cache misses make the load latency longer than
the misprediction penalty), many loads will start before the first
finishes (most of them will be canceled due to misprediction, but even
an average of two useful parallel loads produces a good speedup).
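
To make the contrast concrete, here is a minimal C sketch of the two loop shapes, assuming a binary search over a sorted array as the workload. The code under discussion is not shown in this post, so the function names `search_branchy` and `search_branchless` and the choice of a lower-bound search are illustrative assumptions. In the first form the next load address depends on a predicted branch, so an OoO core can start later loads speculatively. In the second form each load address is data-dependent on the previous load, which is the recurrence described above.

```c
#include <assert.h>
#include <stddef.h>

/* Branchy form: the next index is chosen by a conditional branch.  With
   branch prediction, the load of the next a[mid] can start before the
   current load has returned (and is squashed on a misprediction). */
size_t search_branchy(const long *a, size_t n, long key)
{
    assert(a != NULL || n == 0);
    size_t lo = 0, hi = n;               /* answer lies in [lo, hi] */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (a[mid] < key)
            lo = mid + 1;
        else
            hi = mid;
    }
    return lo;                           /* first index with a[i] >= key */
}

/* Branchless form: the select below is written so that a compiler may
   emit a conditional move (not guaranteed).  Each load address then
   depends on the result of the previous load, so loads are serialized. */
size_t search_branchless(const long *a, size_t n, long key)
{
    assert(a != NULL || n == 0);
    const long *base = a;
    size_t len = n;
    while (len > 1) {
        size_t half = len / 2;
        base = (base[half - 1] < key) ? base + half : base;
        len -= half;
    }
    return (size_t)(base - a) + (len == 1 && base[0] < key);
}
```

Both functions return the same index for any sorted input. Which machine code a compiler actually produces for either form depends on the compiler and its options, so checking the generated asm is the only way to know whether a branch or a CMOV was emitted.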

[EricP:]
>>>[*] I want to see the asm because Intel's CMOV always executes the
>>>operand operation, then tosses the result if the predicate is false.
>
>Use a less-stupid ISA

The ISA does not require that.  It could just as well be implemented
as waiting for the condition and only then performing the operation.
And with a more sophisticated implementation one could even do that
for operations that are not part of the CMOV instruction, but produce
one of the source operands of the CMOV instruction.

However, apparently such implementations have enough disadvantages
(probably in performance) that nobody has gone there AFAIK.  AFAIK
everyone, including implementations of different ISAs, implements
CMOV/predication as performing the operation and then conditionally
squashing the result.

- anton
-- 
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
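
The two CMOV implementation strategies discussed above can be modelled at the C level as follows. This is only a sketch of the semantics, not a description of any particular hardware, and the function names `select_after_compute` and `compute_after_condition` are hypothetical. The first models "perform the operation, then conditionally squash the result"; the second models "wait for the condition, and only then perform the operation".

```c
#include <assert.h>
#include <stddef.h>

/* Perform, then squash: the load always executes, and the condition
   only decides whether its result is kept. */
long select_after_compute(int cond, const long *p, long old)
{
    assert(p != NULL);            /* load is unconditional, so p must be valid */
    long loaded = *p;             /* operand operation always executes */
    return cond ? loaded : old;   /* result kept or discarded */
}

/* Wait, then perform: the load executes only once the condition is
   known to be true. */
long compute_after_condition(int cond, const long *p, long old)
{
    if (cond) {
        assert(p != NULL);
        return *p;                /* load happens only on this path */
    }
    return old;                   /* no load at all */
}
```

The visible difference in this model is that the first form touches memory even when the condition is false, which is why it needs a valid pointer in both cases.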