From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Decrement And Branch
Date: Thu, 15 Aug 2024 10:39:28 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2024Aug15.123928@mips.complang.tuwien.ac.at>
References: <v9f7b9$3qj3c$1@dont-email.me> <v9gl1b$30as$7@dont-email.me> <2024Aug14.111001@mips.complang.tuwien.ac.at> <c6653232ff022a7f991a061bfbf46ec3@www.novabbs.org>

mitchalsup@aol.com (MitchAlsup1) writes:
>On Wed, 14 Aug 2024 9:10:01 +0000, Anton Ertl wrote:
>
>> Lawrence D'Oliveiro <ldo@nz.invalid> writes:
>>>Like I said, I wondered why this sort of thing wasn't more common ...
>>
>> For the early RISCs, the pipeline was designed for early branch
>> execution. Performing an ALU op before the branch did not fit that
>> kind of pipeline.
>
>MIPS would disagree.

In nearly all of the MIPS history, there is no ALU op before the
branch, only a comparison of two registers for equality. They revised
the branches significantly in 2014, but that's not early MIPS, and by
that time branch predictors were so good that resolving the branch one
cycle later was not a big issue.

>MIPS pipeline performed Branch Target Calculation by pasting bits
>from the instruction onto bits vacated from IP.

Conditional branches in MIPS are relative. Only J and JAL have this
misfeature.
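
To make the difference concrete, here is a minimal Python sketch of the
two target calculations as I understand them from the classic MIPS32
encodings. The function names are made up for illustration; the upper
address bits for J/JAL are taken from the address of the delay-slot
instruction (pc + 4).

```python
MASK32 = 0xFFFFFFFF

def sign_extend(value, bits):
    """Interpret the low `bits` bits of value as a two's-complement number."""
    sign = 1 << (bits - 1)
    return (value & (sign - 1)) - (value & sign)

def branch_target(pc, imm16):
    """Conditional branch (e.g. BEQ/BNE): offset relative to pc + 4."""
    return (pc + 4 + (sign_extend(imm16, 16) << 2)) & MASK32

def jump_target(pc, index26):
    """J/JAL: the 26-bit index, shifted left by 2, is pasted into the
    low 28 bits; only the top 4 bits come from pc + 4."""
    upper = (pc + 4) & 0xF0000000
    return upper | ((index26 & 0x03FFFFFF) << 2)
```

With this, branch_target(0x00400000, 0xFFFF) gives 0x00400000 (a branch
to itself), while jump_target can never leave the 256MB region selected
by the top four bits of pc + 4.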

>> For over a decade, Intel decoders have decoded many sequences of ALU
>> and branch instructions into one uop, so they can do at a
>> microarchitectural level what you are asking about at the architecture
>> level. Other microarchitectures have followed this pattern, and
>> RISC-V seems to make a philosophy out of this.
>
>On the Intel side they mostly depend on prediction.

Every high-performance CPU depends on prediction. Your point is what?

>On the RISC-V side they mostly depend on fusion. As far as I understand,
>They only fuse pairs not ADD-CMP-BCs.

RISC-V has compare-and-branch instructions; I don't know if any
implementations fuse that with a preceding addition/subtraction, but
if so, it's a fusion of a pair of instructions.

As for only fusing pairs: in a section called "Fusion Pair
Candidates", Celio et al. <https://arxiv.org/pdf/1607.02318> give, as
one of the patterns, the sequence

slli rd, rs1, {1,2,3}
add rd, rd, rs2
ld rd, 0(rd)

However, as they point out, this may be the result of first pairing
the first two instructions and then pairing the result with the third
instruction.

The paper does not describe any implementation that actually performs
such instruction fusions, so any real implementation may perform the
fusions shown there, or more or fewer fusion patterns.

>> ARM A64 OTOH seems to put everything into an instruction that fits in
>> 32 bits, and while they have instructions (TBNZ and TBZ) that tests a
>> specific bit in a register and branch if the bit is set or clear, they
>> have not added a subtract-and-branch or branch-and-subtract
>> instruction. Apparently the uses for such an instruction are not that
>> frequent.
>
>My 66000 finds use cases all the time, and I also have Branch on bit
>instructions and have my CMP instructions build bit-vectors of outcomes.

If an architecture has the 88000-style treatment of comparison results
(fill a GPR with conditions, one bit per condition), instructions like
TBNZ and TBZ certainly are useful, but ARM A64 uses a condition code
register with NZCV flags for dealing with conditions, so what are TBNZ
and TBZ used for on this architecture?

Looking at a binary I have at hand, I see a lot of checking bit #63
and some checking of #31, #15, #7, i.e., checking for whether a
64-bit, ... 8-bit number is negative.

There are also a number of uses coming from libgcc, e.g.,

   6f0a8:       37e001c3        tbnz    w3, #28, 6f0e0 <__aarch64_sync_cache_range+0x50>
   6f0e8:       37e801e2        tbnz    w2, #29, 6f124 <__aarch64_sync_cache_range+0x94>
   6f6dc:       b7980b84        tbnz    x4, #51, 6f84c <__addtf3+0x71c>
   6fb28:       b79000a3        tbnz    x3, #50, 6fb3c <__addtf3+0xa0c>
   6fc30:       b79000a3        tbnz    x3, #50, 6fc44 <__addtf3+0xb14>
   70248:       b7980d02        tbnz    x2, #51, 703e8 <__multf3+0x728>
   7036c:       b79809a2        tbnz    x2, #51, 704a0 <__multf3+0x7e0>
   70430:       b77801a2        tbnz    x2, #47, 70464 <__multf3+0x7a4>
   7048c:       b79ffae2        tbnz    x2, #51, 703e8 <__multf3+0x728>
   70498:       b79ffa82        tbnz    x2, #51, 703e8 <__multf3+0x728>

The tf3 stuff probably is the implementation of long doubles.

In any case, in this binary with 26473 instructions, there are 30
occurrences of tbnz and 41 of tbz, for a total of 71 (0.3% of static
instruction count).

Apparently the usefulness of decrement-and-branch is even lower.
Certainly in my code most loops count upwards.

- anton
-- 
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>
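
For readers who want to check the tbnz lines in the listing above
against their instruction words, here is a small Python sketch of a
TBZ/TBNZ decoder. It assumes the A64 field layout as I read it from the
Arm documentation (b5 in bit 31, op in bit 24, b40 in bits 23-19, imm14
in bits 18-5, Rt in bits 4-0); the function name is made up, and
register 31 (the zero register) is not treated specially.

```python
def decode_test_branch(address, word):
    """Decode an A64 TBZ/TBNZ word located at `address`.

    Returns (mnemonic, register, bit_number, target), or None if the
    word is not a test-bit-and-branch instruction.
    """
    if (word >> 25) & 0x3F != 0b011011:
        return None
    b5 = (word >> 31) & 1        # high bit of the tested bit number
    op = (word >> 24) & 1        # 0 = TBZ, 1 = TBNZ
    b40 = (word >> 19) & 0x1F    # low five bits of the tested bit number
    imm14 = (word >> 5) & 0x3FFF # branch offset in instruction words
    rt = word & 0x1F
    if imm14 & 0x2000:           # sign-extend the 14-bit offset
        imm14 -= 0x4000
    bit = (b5 << 5) | b40
    reg = ("x" if b5 else "w") + str(rt)
    return ("tbnz" if op else "tbz", reg, bit, address + 4 * imm14)
```

For example, decode_test_branch(0x6f0a8, 0x37e001c3) yields
("tbnz", "w3", 28, 0x6f0e0), matching the first line of the listing;
the word b79ffae2 at 7048c decodes to a backward branch to 703e8.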