Article <2024Aug15.123928@mips.complang.tuwien.ac.at>

Path: ...!news.nobody.at!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Decrement And Branch
Date: Thu, 15 Aug 2024 10:39:28 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 100
Message-ID: <2024Aug15.123928@mips.complang.tuwien.ac.at>
References: <v9f7b9$3qj3c$1@dont-email.me> <v9gl1b$30as$7@dont-email.me> <2024Aug14.111001@mips.complang.tuwien.ac.at> <c6653232ff022a7f991a061bfbf46ec3@www.novabbs.org>
Injection-Date: Thu, 15 Aug 2024 13:25:54 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="e4273ddcc5b5dd90fe3856bf5f2abacf";
	logging-data="999814"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX185jy772L2AVd+ubNWuZOM4"
Cancel-Lock: sha1:j8TOJO29EgbFbu6fJ0cPVV5E8gg=
X-newsreader: xrn 10.11
Bytes: 5611

mitchalsup@aol.com (MitchAlsup1) writes:
>On Wed, 14 Aug 2024 9:10:01 +0000, Anton Ertl wrote:
>
>> Lawrence D'Oliveiro <ldo@nz.invalid> writes:
>>>Like I said, I wondered why this sort of thing wasn't more common ...
>>
>> For the early RISCs, the pipeline was designed for early branch
>> execution.  Performing an ALU op before the branch did not fit that
>> kind of pipeline.
>
>MIPS would disagree.

In nearly all of the MIPS history, there is no ALU op before the
branch, only a comparison of two registers for equality.  They revised
the branches significantly in 2014, but that's not early MIPS, and by
that time branch predictors were so good that resolving the branch one
cycle later was not a big issue.

>MIPS pipeline performed Branch Target Calculation by pasting bits
>from the instruction onto bits vacated from IP.

Conditional branches in MIPS are relative.  Only J and JAL have this
misfeature.
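
To make the difference concrete, here is a minimal sketch of the two
target computations as I understand them for classic (pre-2014) MIPS32.
The function names are made up for illustration; the field widths
(16-bit offset, 26-bit index) are those of the instruction formats.

```python
MASK32 = 0xFFFFFFFF

def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    value &= (1 << bits) - 1
    return (value ^ sign) - sign

def branch_target(pc, offset16):
    """Conditional branch (e.g. BEQ/BNE): relative to the delay-slot address."""
    return (pc + 4 + (sign_extend(offset16, 16) << 2)) & MASK32

def jump_target(pc, index26):
    """J/JAL: the top 4 bits of PC+4 are pasted onto the shifted 26-bit index."""
    return ((pc + 4) & 0xF0000000) | ((index26 & 0x03FFFFFF) << 2)
```

The branch needs an adder, the jump only needs wiring, but the jump
cannot leave the current 256MB region.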

>> For over a decade, Intel decoders have decoded many sequences of ALU
>> and branch instructions into one uop, so they can do at a
>> microarchitectural level what you are asking about at the architecture
>> level.  Other microarchitectures have followed this pattern, and
>> RISC-V seems to make a philosophy out of this.
>
>On the Intel side they mostly depend on prediction.

Every high-performance CPU depends on prediction.  Your point is what?

>On the RISC-V side they mostly depend on fusion. As far as I understand,
>They only fuse pairs not ADD-CMP-BCs.

RISC-V has compare-and-branch instructions; I don't know if any
implementations fuse that with a preceding addition/subtraction, but
if so, it's a fusion of a pair of instructions.

As for only fusing pairs: in a section called "Fusion Pair
Candidates", Celio et al. <https://arxiv.org/pdf/1607.02318> give,
as one of the patterns, the sequence

slli rd, rs1, {1,2,3}
add rd, rd, rs2
ld rd, 0(rd)

However, as they point out, this may be the result of first pairing
the first two instructions and then pairing the result with the third
instruction.
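
As a rough illustration of what such a fusion has to preserve, the
following sketch models the three-instruction sequence and a single
fused indexed load on a toy register file and memory (both plain
dicts).  The function names and the rd != rs2 restriction are my
assumptions for this sketch, not something taken from the paper.

```python
MASK64 = (1 << 64) - 1

def run_unfused(regs, mem, rd, rs1, rs2, shamt):
    """Execute slli/add/ld one instruction at a time."""
    regs = dict(regs)
    regs[rd] = (regs[rs1] << shamt) & MASK64      # slli rd, rs1, shamt
    regs[rd] = (regs[rd] + regs[rs2]) & MASK64    # add  rd, rd, rs2
    regs[rd] = mem[regs[rd]]                      # ld   rd, 0(rd)
    return regs

def run_fused(regs, mem, rd, rs1, rs2, shamt):
    """Execute the same sequence as one indexed load (hypothetical fused op)."""
    # If rd == rs2, the slli would clobber rs2 before the add reads it,
    # so this sketch simply refuses that case.
    assert rd != rs2
    regs = dict(regs)
    regs[rd] = mem[((regs[rs1] << shamt) + regs[rs2]) & MASK64]
    return regs
```

Because rd is overwritten by every instruction of the sequence, only
the final loaded value is architecturally visible, which is what makes
the sequence a candidate for fusion in the first place.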

The paper does not describe any implementation that actually performs
such instruction fusions, so any real implementation may perform the
fusions shown there, or more or fewer fusion patterns.

>> ARM A64 OTOH seems to put everything into an instruction that fits in
>> 32 bits, and while they have instructions (TBNZ and TBZ) that tests a
>> specific bit in a register and branch if the bit is set or clear, they
>> have not added a subtract-and-branch or branch-and-subtract
>> instruction.  Apparently the uses for such an instruction are not that
>> frequent.
>
>My 66000 finds use cases all the time, and I also have Branch on bit
>instructions and have my CMP instructions build bit-vectors of outcomes.

If an architecture has the 88000-style treatment of comparison results
(fill a GPR with conditions, one bit per condition), instructions like
TBNZ and TBZ certainly are useful, but ARM A64 uses a condition code
register with NZCV flags for dealing with conditions, so what are TBNZ
and TBZ used for on this architecture?  Looking at a binary I have at
hand, I see a lot of checking bit #63 and some checking of #31, #15,
#7, i.e., checking for whether a 64-bit, ... 8-bit number is negative.
There are also a number of uses coming from libgcc, e.g.,

   6f0a8:       37e001c3        tbnz    w3, #28, 6f0e0 <__aarch64_sync_cache_range+0x50>
   6f0e8:       37e801e2        tbnz    w2, #29, 6f124 <__aarch64_sync_cache_range+0x94>
   6f6dc:       b7980b84        tbnz    x4, #51, 6f84c <__addtf3+0x71c>
   6fb28:       b79000a3        tbnz    x3, #50, 6fb3c <__addtf3+0xa0c>
   6fc30:       b79000a3        tbnz    x3, #50, 6fc44 <__addtf3+0xb14>
   70248:       b7980d02        tbnz    x2, #51, 703e8 <__multf3+0x728>
   7036c:       b79809a2        tbnz    x2, #51, 704a0 <__multf3+0x7e0>
   70430:       b77801a2        tbnz    x2, #47, 70464 <__multf3+0x7a4>
   7048c:       b79ffae2        tbnz    x2, #51, 703e8 <__multf3+0x728>
   70498:       b79ffa82        tbnz    x2, #51, 703e8 <__multf3+0x728>
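
For readers who want to check such lines by hand, here is a small
sketch of a decoder for the TBZ/TBNZ encoding as I read it from the
Arm documentation: b5 in bit 31, op in bit 24, b40 in bits 23..19, a
signed 14-bit word offset in bits 18..5, and Rt in bits 4..0.  The
function name is made up.

```python
def decode_test_branch(addr, word):
    """Decode an A64 TBZ/TBNZ word into (mnemonic, register, bit, target)."""
    if (word >> 25) & 0x3F != 0b011011:
        raise ValueError("not a TBZ/TBNZ encoding")
    b5 = (word >> 31) & 1          # high bit of the tested bit number
    op = (word >> 24) & 1          # 0 = TBZ, 1 = TBNZ
    b40 = (word >> 19) & 0x1F      # low five bits of the bit number
    imm14 = (word >> 5) & 0x3FFF   # branch offset in words
    rt = word & 0x1F
    if imm14 & 0x2000:             # sign-extend the 14-bit offset
        imm14 -= 0x4000
    bit = (b5 << 5) | b40
    reg = ("x" if b5 else "w") + str(rt)
    return ("tbnz" if op else "tbz", reg, bit, addr + 4 * imm14)

print(decode_test_branch(0x6f0a8, 0x37e001c3))
print(decode_test_branch(0x7048c, 0xb79ffae2))
```

The 14-bit offset gives these instructions a range of only +-32KB,
which is part of the price for fitting the bit number into the
instruction.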

The tf3 stuff probably is the implementation of long doubles.  In any
case, in this binary with 26473 instructions, there are 30 occurrences
of tbnz and 41 of tbz, for a total of 71 (0.3% of static instruction
count).

Apparently the usefulness of decrement-and-branch is even lower.

Certainly in my code most loops count upwards.

- anton
-- 
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>