Article <vdp343$9d38$1@dont-email.me>

Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: Robert Finch <robfi680@gmail.com>
Newsgroups: comp.arch
Subject: Re: Tonights Tradeoff - Background Execution Buffers
Date: Fri, 4 Oct 2024 11:54:40 -0400
Organization: A noiseless patient Spider
Lines: 117
Message-ID: <vdp343$9d38$1@dont-email.me>
References: <vbgdms$152jq$1@dont-email.me> <vbj5af$1puhu$1@dont-email.me>
 <a37e9bd652d7674493750ccc04674759@www.novabbs.org>
 <vbog6d$2p2rc$1@dont-email.me>
 <f2d99c60ba76af28c8b63b9628fb56fa@www.novabbs.org>
 <vc61e6$21skv$1@dont-email.me> <vc8gl4$2m5tp$1@dont-email.me>
 <vcv5uj$3arh6$1@dont-email.me>
 <37067f65c5982e4d03825b997b23c128@www.novabbs.org>
 <vd352q$3s1e$1@dont-email.me>
 <5f8ee3d3b2321ffa7e6c570882686b57@www.novabbs.org>
 <vd6a5e$o0aj$2@dont-email.me> <vdnpg4$3c9e$2@dont-email.me>
 <2024Oct4.081931@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Fri, 04 Oct 2024 17:54:43 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="8239decc76384eb48cf0e5f094ee0a60";
	logging-data="308328"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX18DlwKANtr16cPK1+VE16u2ITgX9BADxuU="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:RDkAR4fmYPwosGn0PVNAF47eNis=
In-Reply-To: <2024Oct4.081931@mips.complang.tuwien.ac.at>
Content-Language: en-US
Bytes: 7435

On 2024-10-04 2:19 a.m., Anton Ertl wrote:
> Robert Finch <robfi680@gmail.com> writes:
>> Today I am wondering how many predicate registers are enough. Scanning
>> webpages reveals a variety. The Itanium has 64-predicates, but they are
>> used for modulo loops and rotated. Rotating register is Itaniums method
>> of register renaming, so it needs more visible registers. In a classic
>> superscalar design with a RAT where registers are renamed, it seems like
>> 64 would be far too many.
> 
> Would it?  Zen5 has 192 flags registers
> <https://i0.wp.com/chipsandcheese.com/wp-content/uploads/2024/09/hc2024_zen5_spec_uplift.png?ssl=1>,
> and I assume that means it has 192 C, 192 V, and 192 NZP registers
> (physical), for one architectural flags register.
> 
>> I cannot see the compiler making use of very many predicate registers
>> simultaneously.
> 
> Maybe not, but what are the alternatives:
> 
> 1) Have one flags register, like AMD64 and ARM A32, T32, and A64, or
> the carry flag of Power and 88K, and the flags result of most Power
> instructions.  Then the compilers typically only know that other
> instructions will overwrite that register, and is forced to consume
> the flag right away.  This leads to bad code generation, as shown in
> <2021Mar15.104123@mips.complang.tuwien.ac.at>:
> 
> |E.g., in
> |<2016May24.093059@mips.complang.tuwien.ac.at> we see that gcc-5.3.0
> |compiles
> |
> |   cf = _addcarry_u64(cf, src1[1], src2[1], &dst[1]);
> |   cf = _addcarry_u64(cf, src1[2], src2[2], &dst[2]);
> |
> |into
> |
> | d:	48 8b 42 08          	mov    0x8(%rdx),%rax
> |11:	41 80 c1 ff          	add    $0xff,%r9b
> |15:	49 13 40 08          	adc    0x8(%r8),%rax
> |19:	41 0f 92 c1          	setb   %r9b
> |1d:	48 89 41 08          	mov    %rax,0x8(%rcx)
> |21:	48 8b 42 10          	mov    0x10(%rdx),%rax
> |25:	41 80 c1 ff          	add    $0xff,%r9b
> |29:	49 13 40 10          	adc    0x10(%r8),%rax
> |2d:	41 0f 92 c1          	setb   %r9b
> |31:	48 89 41 10          	mov    %rax,0x10(%rcx)
> |
> |Here gcc reifies the carry bit in a GPR (r9b) with the instructions at
> |19 and 2d, and also converts it from a GPR into a carry flag in 11 and
> |25.  This shows that the compiler does not trust itself to preserve
> |the carry flag from one adc to the next.
> 
> 2) Have multiple flags registers, like IA-64.  The compiler will
> certainly be able to deal with that, but extra instructions are needed
> for generating the flags.
> 
> 3) Use the GPRs for flags.  This also often requires additional
> instructions for generating the flags, as in MIPS, 88K, or RISC-V
> (with quite a bit of difference between the MIPS/Alpha/RISC-V
> approach and the 88K approach).  This disadvantage is often mitigated
> by having compare-and-branch instructions or instructions that branch
> on certain properties of a register's content.
> 
> 4) Keep the flags results along with GPRs: have carry and overflow as
> bit 64 and 65, N is bit 63, and Z tells something about bits 0-63.
> The advantage is that you do not have to track the flags separately
> (and, in case of AMD64, track each of C, O, and NZP separately), but
> instead can use the RAT that is already there for the GPRs.  You can
> find a preliminary paper on that on
> <https://www.complang.tuwien.ac.at/anton/tmp/carry.pdf>.
> 
>> Since they are not used simultaneously, and register
>> renaming is in effect, there should not be a great need for predicate
>> registers.
> 
> You need to preserve one instance for every recovery point, i.e.,
> every instruction that branches or can trap, and that have not yet
> been committed.  You also need to preserve one instance if there is
> any consumer that has not yet proceeded through execution.  The
> simplest way to satisfy both requirements is to just preserve any
> flags result until the generating instruction retires.  And if most
> instructions generate flags, that means a lot of instances of the
> flags.  There is a reason why Zen5 has 192.
> 
> - anton

I was thinking more along the lines of architectural predicate
registers, and of reserving bits in the instruction for them. The 192
flags registers of Zen5 are physical registers. Q+ has the predicate
registers as a subset of the GPRs, and there are 512 physical
registers, so there are potentially loads of registers for renaming
predicates. In other words, alternative #3 is in use: GPRs are used
for general flag storage.

Q+ has a three-input add instruction to help support multi-precision
arithmetic. The idea was that the carry input could be calculated
separately and fed in through the third register. The carry value would
be generated by an add instruction (addgc) that produces just the carry
bit, given the same argument registers as the add the carry is needed
for. But that is ugly and takes an extra instruction.
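
To make the addgc scheme concrete, here is a rough C model of a
multi-precision add built from the two instructions. This is only a
sketch: the function names add3, addgc and mp_add are made up for
illustration, the operand order is an assumption, and the carry input
is assumed to be 0 or 1. It is not the actual Q+ encoding or semantics.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical model of a three-input add: rd = ra + rb + rc (mod 2^64). */
static uint64_t add3(uint64_t a, uint64_t b, uint64_t c)
{
    return a + b + c;
}

/* Hypothetical model of addgc: returns only the carry out of
 * a + b + cin. Assumes cin is 0 or 1. */
static uint64_t addgc(uint64_t a, uint64_t b, uint64_t cin)
{
    uint64_t s  = a + b;
    uint64_t c1 = s < a;            /* carry out of a + b */
    uint64_t c2 = (s + cin) < s;    /* carry out of adding cin */
    return c1 | c2;                 /* the two cannot both be set */
}

/* dst = x + y over n limbs, least significant limb first.
 * Returns the carry out of the most significant limb. */
static uint64_t mp_add(uint64_t *dst, const uint64_t *x,
                       const uint64_t *y, size_t n)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* The extra instruction: same operands as the add below. */
        uint64_t next = addgc(x[i], y[i], carry);
        dst[i] = add3(x[i], y[i], carry);
        carry = next;
    }
    return carry;
}
```

Each limb costs two instructions with the same three source operands,
which is the extra instruction complained about above. The carry lives
in an ordinary GPR, so nothing special is needed to rename or spill it.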

One solution, not mentioned in your article, is to support arithmetic
with two bits fewer than a register can hold, so that the carry and
overflow can be stored in the register itself. On a 64-bit machine,
all operations would use only 62 bits. That would solve the issue of
how to load or store the carry and overflow bits associated with a
register. Arithmetic is sometimes performed with fewer bits anyway, as
for pointer representation. I wonder if pointer masking could somehow
be involved. It may be useful to have a bit indicating the presence of
a pointer. I am also thinking about how to track a binary point
position for fixed-point arithmetic. Perhaps using the whole upper byte
of a register for status/control bits would work.
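
A rough C model of the 62-bit idea, to show how the status would
travel with the value. Everything here is an assumption made for
illustration: the names add62 and adc62, the placement of carry in
bit 62 and overflow in bit 63, and the choice to ignore the status
bits of the source operands.

```c
#include <stdint.h>

#define VAL_BITS  62
#define VAL_MASK  ((UINT64_C(1) << VAL_BITS) - 1)
#define CARRY_BIT (UINT64_C(1) << 62)   /* assumed position */
#define OVF_BIT   (UINT64_C(1) << 63)   /* assumed position */

/* Add two 62-bit values plus a carry-in of 0 or 1; pack the carry and
 * signed overflow of the 62-bit result into the top two bits. */
static uint64_t add62_cin(uint64_t a, uint64_t b, uint64_t cin)
{
    uint64_t va  = a & VAL_MASK;        /* drop incoming status bits */
    uint64_t vb  = b & VAL_MASK;
    uint64_t sum = va + vb + cin;       /* fits in 64 bits, no wrap */
    uint64_t res = sum & VAL_MASK;
    uint64_t carry = (sum >> VAL_BITS) & 1;
    /* Signed overflow: operands agree in sign (bit 61) and the
     * result's sign differs from theirs. */
    uint64_t ovf = ((~(va ^ vb) & (va ^ res)) >> (VAL_BITS - 1)) & 1;
    return res | (carry << 62) | (ovf << 63);
}

/* Plain 62-bit add. */
static uint64_t add62(uint64_t a, uint64_t b)
{
    return add62_cin(a, b, 0);
}

/* Add with carry, where the carry comes from bit 62 of an earlier
 * result register rather than from a separate flags register. */
static uint64_t adc62(uint64_t a, uint64_t b, uint64_t prev)
{
    return add62_cin(a, b, (prev >> 62) & 1);
}
```

Since the status sits in the same register as the result, an ordinary
load or store saves and restores it, and the existing GPR renaming
covers it. The cost is that a multi-precision number needs more limbs
for the same number of bits.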

It may be possible with Q+ to support a second destination register
which is in a subset of the GPRs. For example, one of eight registers
could be specified to hold the carry/overflow status. That effectively
ties up a second ALU though, as an extra write port is needed for the
instruction.