Article <jwvcyd338k2.fsf-monnier+comp.arch@gnu.org>

Deutsch English Français Italiano
<jwvcyd338k2.fsf-monnier+comp.arch@gnu.org>

View for Bookmarking (what is this?)
Look up another Usenet article
Path: ...!weretis.net!feeder9.news.weretis.net!news.quux.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: Stefan Monnier <monnier@iro.umontreal.ca>
Newsgroups: comp.arch
Subject: Re: auto predicating branches
Date: Tue, 22 Apr 2025 22:59:10 -0400
Organization: A noiseless patient Spider
Lines: 27
Message-ID: <jwvcyd338k2.fsf-monnier+comp.arch@gnu.org>
References: <vbgdms$152jq$1@dont-email.me> <vtq6vh$39sli$1@dont-email.me>
	<Is7MP.2098019$OrR5.521315@fx18.iad>
	<4f65d9544ad95edc8b07c869f9921a35@www.novabbs.org>
	<vtsbga$1tu26$1@dont-email.me>
	<b8859e8d6b909a4505c0f487a6a0fe35@www.novabbs.org>
	<vu2542$38qev$1@dont-email.me> <vu46su$1170i$1@dont-email.me>
	<2025Apr21.080532@mips.complang.tuwien.ac.at>
	<d47cdad26528b4d2309ac9df60120315@www.novabbs.org>
	<2025Apr22.071010@mips.complang.tuwien.ac.at>
	<DwONP.2213540$eNx6.1757109@fx14.iad>
	<2025Apr22.193103@mips.complang.tuwien.ac.at>
	<f5e5bf81ac2c7e2066d2a181c5a70baf@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain
Injection-Date: Wed, 23 Apr 2025 04:59:11 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="43118405bc7ac250caa7415fe4e2a694";
	logging-data="2175243"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1/0BPRaJAe9RXl9ECJT5y+6hUpv2vvsqoM="
User-Agent: Gnus/5.13 (Gnus v5.13)
Cancel-Lock: sha1:mDPGXPkDswHuZfmYA2A+4je2ls4=
	sha1:JjLo0taWWpqu5q2v8sTjjXBk5es=
Bytes: 2914

> I do not see 2 LDDs being performed parallel unless the execution
> width is at least 14-wide. In any event loop recurrence restricts the

IIUC you'll get multiple loads in parallel if the loads take a long time
because of cache misses.  Say each load takes 100 cycles, then there is
plenty of time during one load to predict many more iterations of the
loop and hence issue many more loads.

With a branching code, the addresses of those loads depend mostly on the
branch predictions, so the branch predictions end up performing a kind
of "value prediction" (where the value that's predicted is the address
of the next lookup).

With predication your load address will conceptually depend on 3 inputs:
the computation of `base + middle`, the computation of  `base + 0`, and
the computation of the previous `needle < *base[middle]` test to choose
between the first two.  If the LD of `*base[middle]` takes 100 cycle,
that means a delay of 100 cycles before the next LD can be issued.

Of course, nothing prevents a CPU from doing "predicate prediction":
instead of waiting for an answer to `needle < *base[middle]`, it could
try and guess whether it will be true or false and thus choose to send
one of the two addresses (or both) to the memory (and later check the
prediction and rollback, just like we do with normal branches).


        Stefan