Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Tonights Tradeoff
Date: Tue, 10 Sep 2024 02:00:00 -0500
Organization: A noiseless patient Spider
Lines: 176
Message-ID: <vboqpp$2r5v4$1@dont-email.me>
References: <vbgdms$152jq$1@dont-email.me>
<17537125c53e616e22f772e5bcd61943@www.novabbs.org>
<vbj5af$1puhu$1@dont-email.me>
<a37e9bd652d7674493750ccc04674759@www.novabbs.org>
<vbog6d$2p2rc$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 10 Sep 2024 09:00:10 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="8642d80dc83a1f0de3aeee9315299589";
logging-data="2988004"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/uBS8P4RECk5/UGHohpMJo3e2VP6ONbtQ="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:f6n9qkuAbu36jin2ORcICSRx+nA=
In-Reply-To: <vbog6d$2p2rc$1@dont-email.me>
Content-Language: en-US
Bytes: 8758
On 9/9/2024 10:59 PM, Robert Finch wrote:
> On 2024-09-08 2:06 p.m., MitchAlsup1 wrote:
>> On Sun, 8 Sep 2024 3:22:55 +0000, Robert Finch wrote:
>>
>>> On 2024-09-07 10:41 a.m., MitchAlsup1 wrote:
>>>> On Sat, 7 Sep 2024 2:27:40 +0000, Robert Finch wrote:
>>>>
>>>>> Making the scalar register file a subset of the vector register file.
>>>>> And renaming only vector elements.
>>>>>
>>>>> There are eight elements in a vector register and each element is
>>>>> 128-bits wide. (Corresponding to the size of a GPR). Vector register
>>>>> file elements are subject to register renaming to allow the full power
>>>>> of the OoO machine to be used to process vectors. The issue is that
>>>>> with
>>>>> both the vector and scalar registers present for renaming there are a
>>>>> lot of registers to rename. It is desirable to keep the number of
>>>>> renamed registers (including vector elements) <= 256 total. So, the 64
>>>>> scalar registers are aliased with the first eight vector registers.
>>>>> Leaving only 24 truly available vector registers. Hm. There are 1024
>>>>> physical registers, so maybe going up to about 300 renamable register
>>>>> would not hurt.
>>>>
>>>> Why do you think a vector register file is the way to go ??
>>>
>>> I think vector use is somewhat dubious, but they have some uses. In many
>>> cases data can be processed just fine without vector registers. In the
>>> current project vector instructions use the scalar functional units to
>>> compute, making them no faster than scalar calcs. But vectors have a lot
>>> of code density where parallel computation on multiple data items using
>>> a single instruction is desirable. I do not know why people use vector
>>> registers in general, but they are present in some modern architectures.
>>
>> There is no doubt that much code can utilize vector arrangements, and
>> that a processor should be very efficient in performing these work
>> loads.
>>
>> The problem I see is that CRAY-like vectors vectorize instructions
>> instead of vectorizing loops. Any kind of flow control within the
>> loop becomes tedious at best.
>>
>> On the other hand, the Virtual Vector Method vectorizes loops and
>> can be implemented such that it performs as well as CRAY-like
>> vector machines without the overhead of a vector register file.
>> In actuality there are only 6-bits of HW flip-flops governing
>> VVM--compared to 4 KBytes for CRAY-1.
>>
>>> Qupls vector registers are 512 bits wide (8 64-bit elements). Bigfoot’s
>>> vector registers are 1024 bits wide (8 128-bit elements).
>>
>> When properly abstracted, one can dedicate as many or few HW
>> flip-flops as staging buffers for vector work loads to suit
>> the implementation at hand. A GBOoO may utilize that 4KB
>> file of CRAY-1 while the little low power core 3-cache lines.
>> Both run the same ASM code and both are efficient in their own
>> sense of "efficient".
>>
>> So, instead of having ~500 vector instructions and ~1000 SIMD
>> instructions one has 2 instructions and a medium scale state
>> machine.
>>
>
>
> Still trying to grasp the virtual vector method. Been wondering if it
> can be implemented using renamed registers.
>
I haven't really understood how it could be implemented.
But, granted, my pipeline design is relatively simplistic, and my
priority had usually been trying to make a "fast but cheap and simple"
pipeline, rather than a "clever" pipeline.
Still not as cheap or simple as I would want.
> Qupls has RISC-V style vector / SIMD registers. For Q+ every instruction
> can be a vector instruction, as there are bits indicating which
> registers are vector registers in the instruction. All the scalar
> instructions become vector. This cuts down on some of the bloat in the
> ISA. There is only a handful of vector specific instructions (about
> eight I think). The drawback is that the ISA is 48-bits wide. However,
> the code bloat is less than 50% as some instructions have dual-
> operations. Branches can increment or decrement and loop. Bigfoot uses a
> postfix word to indicate to use the vector form of the instruction.
> Bigfoot’s code density is a lot better being variable length, but I
> suspect it will not run as fast. Bigfoot and Q+ share a lot of the same
> code. Trying to make the guts of the cores generic.
>
In my case, the core ended up generic enough that it can support both
BJX2 and RISC-V. Could almost make sense to lean more heavily into this
(trying to consolidate more things and better optimize costs).
Did also recently get around to more-or-less implementing support for
the 'C' extension, even as much as it is kinda dog-chewed and does not
efficiently utilize the encoding space.
It burns a lot of encoding space on 6 and 8 bit immediate fields (with
11 bit branch displacements), more 5-bit register fields than ideal, ...
so, has relatively few unique instructions, but:
Many of the instructions it does have are left with 3 bit register fields;
Has way too many immediate-field layouts, as it just sort of
shoe-horns immediate fields into whatever bits are left.
Though, turns out I could skip a few things due to them being N/E in
RV64 (RV32, RV64, and RV128 get a slightly different selection of ops in
the C extension).
Like, many things in RV land make "annoying and kinda poor" design choices.
Then again, if one assumes that the role of 'C' is mostly:
Does SP-relative loads/stores and MOV-RR.
Well, it does do this at least...
Nevermind if you want to use any of the ALU ops (besides ADD), or
non-stack-relative Load/Store, well then, enjoy the 3 bit register fields.
And, still way too many immediate-field encodings for what is
effectively load/store and a few ALU ops.
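For what it's worth, the 3-bit register fields in 'C' only reach x8..x15,
so a decoder expands the compressed specifier by adding 8. A minimal
sketch in C (function name is mine, not from any spec):

```c
#include <stdint.h>

/* RV 'C' extension: 3-bit register fields (rd', rs1', rs2') can only
   name x8..x15, so decode expands the field by adding 8. */
static uint8_t rvc_expand_reg(uint8_t reg3)
{
    return (uint8_t)(8 + (reg3 & 0x7));  /* 0..7 -> x8..x15 */
}
```

So, compilers that want 'C' to pay off have to keep the hot values in
x8..x15, which is part of why the 3-bit fields feel so cramped.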
I am not as much a fan of RISC-V's 'V' extension mostly in that it would
require essentially doubling the size of the register file.
And, if I were to do something like 'V' I would likely do some things
differently:
Rather than having an instruction to load vector control state into
CSR's, it would make more sense IMO to use bigger 64-bit instructions
and encode the vector state directly into these instructions.
While this would be worse for code density, it would avoid needing to
burn instructions setting up vector state, and would have less penalty
(in terms of clock-cycles) if working with heterogeneous vectors.
Say, one possibility could be a combo-SIMD op with a control field:
   2b vector size:
     64 / 128 / resv / resv
   2b element size:
     8 / 16 / 32 / 64
   2b category:
     wrap/modulo, float, signed saturate, unsigned saturate
   6b operator:
     add, sub, mul, mac, mulhi, ...
Though, with not every combination necessarily being allowed.
Say, for example, if the implementation limits FP-SIMD to 4 or 8 vector
elements.
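As a rough sketch of decoding such a control field (the field order and
widths follow the list above, but the exact bit packing here is my own
invention, not anything defined):

```c
#include <stdint.h>

/* Hypothetical packing of the 12-bit control field sketched above:
   [11:10] vector size, [9:8] element size, [7:6] category, [5:0] operator. */
typedef struct {
    int vec_bits;   /* 64 or 128, -1 if a reserved encoding */
    int elem_bits;  /* 8 / 16 / 32 / 64 */
    int category;   /* 0=wrap/modulo, 1=float, 2=signed sat, 3=unsigned sat */
    int op;         /* add, sub, mul, mac, mulhi, ... */
} SimdCtl;

static SimdCtl decode_ctl(uint16_t ctl)
{
    static const int vsz[4] = { 64, 128, -1, -1 };  /* two encodings reserved */
    static const int esz[4] = { 8, 16, 32, 64 };
    SimdCtl c;
    c.vec_bits  = vsz[(ctl >> 10) & 3];
    c.elem_bits = esz[(ctl >> 8) & 3];
    c.category  = (ctl >> 6) & 3;
    c.op        = ctl & 0x3F;
    return c;
}
```

An implementation would then reject (or trap on) the combinations it
doesn't support, rather than needing a separate opcode per combination.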
Though, it may make sense to be asymmetric as well:
2-wide vectors can support Binary64
4-wide can support Binary32
8-wide can support Binary16 ( + 4x FP16 units)
16 can support FP8 ( + 8x FP8 units)
Whereas, say, 16x Binary32 capable units would be infeasible.
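The point of the asymmetry is that total FP datapath width stays roughly
constant; a trivial sketch (the mapping is just the list above restated):

```c
/* Sketch of the asymmetric scheme above: wider elements get fewer lanes,
   so the FP hardware cost stays roughly flat across element sizes. */
static int max_fp_lanes(int elem_bits)
{
    switch (elem_bits) {
    case 64: return 2;   /*  2x Binary64 */
    case 32: return 4;   /*  4x Binary32 */
    case 16: return 8;   /*  8x Binary16 */
    case 8:  return 16;  /* 16x FP8      */
    default: return 0;   /* unsupported element size */
    }
}
```

Note that lanes * element-width is 128 bits for each supported case,
which is the constraint that makes 16x Binary32 infeasible here.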
Well, as opposed to defining encodings one-at-a-time in the 32-bit
encoding space.
It could be tempting to possibly consider using pipelining and
multi-stage decoding to allow some ops as well. Say, possibly handling
8-wide vectors internally as 2x 4-wide operations, or maybe allowing
256-bit vector ops in the absence of 256-bit vectors in hardware.
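The 2x 4-wide cracking could look roughly like this (plain C standing in
for what would really be micro-op sequencing in the decoder; the function
and its shape are purely illustrative):

```c
#include <stdint.h>

/* Illustrative only: an 8-lane add "cracked" into two 4-lane halves,
   the way a multi-stage decoder might sequence it onto 4-wide hardware. */
static void vadd8_as_2x4(const int32_t *a, const int32_t *b, int32_t *out)
{
    for (int half = 0; half < 2; half++) {      /* two internal 4-wide ops */
        const int base = half * 4;
        for (int lane = 0; lane < 4; lane++)
            out[base + lane] = a[base + lane] + b[base + lane];
    }
}
```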
....
========== REMAINDER OF ARTICLE TRUNCATED ==========