From: Stefan Monnier
Newsgroups: comp.arch
Subject: Re: auto predicating branches
Date: Tue, 22 Apr 2025 22:59:10 -0400

> I do not see 2 LDDs being performed parallel unless the execution
> width is at least 14-wide. In any event loop recurrence restricts the

IIUC you'll get multiple loads in parallel if the loads take a long time
because of cache misses. Say each load takes 100 cycles, then there is
plenty of time during one load to predict many more iterations of the
loop and hence issue many more loads.

With branching code, the addresses of those loads depend mostly on the
branch predictions, so the branch predictions end up performing a kind
of "value prediction" (where the value that's predicted is the address
of the next lookup).

With predication your load address will conceptually depend on
3 inputs: the computation of `base + middle`, the computation of
`base + 0`, and the computation of the previous `needle < *base[middle]`
test to choose between the first two. If the LD of `*base[middle]`
takes 100 cycles, that means a delay of 100 cycles before the next LD
can be issued.
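
To make the two shapes concrete, here is a minimal C sketch of the loop
under discussion. It assumes a sorted array of plain `int` (not the
pointer array implied by `*base[middle]`), and the names `search_branchy`
and `search_select` are made up for illustration. Whether a compiler
really emits a branch for the first and a conditional move for the second
is not guaranteed; the source only suggests the intent.

```c
#include <stddef.h>

/* Branching form: the next load address is known as soon as the branch
 * is *predicted*, so a core can speculatively issue loads for several
 * iterations while earlier loads are still missing in the cache. */
size_t search_branchy(const int *base, size_t n, int needle)
{
    const int *p = base;
    while (n > 1) {
        size_t middle = n / 2;
        if (needle < p[middle]) {
            /* stay in the lower part: p is unchanged (base + 0) */
        } else {
            p += middle;            /* move to the upper part */
        }
        n -= middle;
    }
    return (size_t)(p - base);      /* index of last element <= needle, else 0 */
}

/* Predicated form: the next address is a select between p + 0 and
 * p + middle, and the select needs the loaded value p[middle]. Each
 * load therefore waits for the previous one (a serial dependency). */
size_t search_select(const int *base, size_t n, int needle)
{
    const int *p = base;
    while (n > 1) {
        size_t middle = n / 2;
        /* often compiled to a conditional move, but not guaranteed */
        p = (needle < p[middle]) ? p + 0 : p + middle;
        n -= middle;
    }
    return (size_t)(p - base);
}
```

Both functions compute the same result; only the dependency structure
differs. With the 100-cycle loads assumed above and roughly 20
iterations (about a million elements), the select form spends on the
order of 20 * 100 = 2000 cycles on serialized misses, while the
branching form can overlap them as long as the predictions are right.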

Of course, nothing prevents a CPU from doing "predicate prediction":
instead of waiting for an answer to `needle < *base[middle]`, it could
try and guess whether it will be true or false and thus choose to send
one of the two addresses (or both) to the memory (and later check the
prediction and rollback, just like we do with normal branches).

        Stefan