Path: ...!news.mixmin.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Tonights Tradeoff
Date: Tue, 10 Sep 2024 16:07:00 -0500
Organization: A noiseless patient Spider
Lines: 242
Message-ID: <vbqcds$35l1q$2@dont-email.me>
References: <vbgdms$152jq$1@dont-email.me>
 <17537125c53e616e22f772e5bcd61943@www.novabbs.org>
 <vbj5af$1puhu$1@dont-email.me>
 <a37e9bd652d7674493750ccc04674759@www.novabbs.org>
 <vbog6d$2p2rc$1@dont-email.me> <vboqpp$2r5v4$1@dont-email.me>
 <vbpmqr$30vto$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 10 Sep 2024 23:07:09 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="8642d80dc83a1f0de3aeee9315299589";
	logging-data="3331130"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1/fgK/ySqIUtMXUxP51oyw9dZmxGYM23Ho="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:HIqSJyTXbGY8r5RkxtIyZ0zcmAw=
Content-Language: en-US
In-Reply-To: <vbpmqr$30vto$1@dont-email.me>
Bytes: 11002

On 9/10/2024 9:58 AM, Robert Finch wrote:
> On 2024-09-10 3:00 a.m., BGB wrote:
>>
>> I haven't really understood how it could be implemented.
>> But, granted, my pipeline design is relatively simplistic, and my 
>> priority had usually been trying to make a "fast but cheap and simple" 
>> pipeline, rather than a "clever" pipeline.
>>
>> Still not as cheap or simple as I would want.
>>
>>
>>
>>> Qupls has RISC-V style vector / SIMD registers. For Q+ every 
>>> instruction can be a vector instruction, as there are bits indicating 
>>> which registers are vector registers in the instruction. All the 
>>> scalar instructions become vector. This cuts down on some of the 
>>> bloat in the ISA. There are only a handful of vector-specific 
>>> instructions (about eight, I think). The drawback is that the ISA is 
>>> 48 bits wide. However, the code bloat is less than 50%, as some 
>>> instructions have dual-operations. Branches can increment or 
>>> decrement and loop. Bigfoot uses a postfix word to indicate use of 
>>> the vector form of the instruction. Bigfoot’s code density is a lot 
>>> better, being variable length, but I suspect it will not run as fast. 
>>> Bigfoot and Q+ share a lot of the same code. Trying to make the guts 
>>> of the cores generic.
>>>
>>
>> In my case, the core ended up generic enough that it can support both 
>> BJX2 and RISC-V. Could almost make sense to lean more heavily into 
>> this (trying to consolidate more things and better optimize costs).
>>
>> Did also recently get around to more-or-less implementing support for 
>> the 'C' extension, even if it is kinda dog-chewed and does not 
>> efficiently utilize the encoding space.
>>
>>
>> It burns a lot of encoding space on 6- and 8-bit immediate fields (with 
>> 11-bit branch displacements), and more 5-bit register fields than 
>> ideal, ... so, it has relatively few unique instructions, but:
>> Many of the instructions it does have are left with 3-bit register 
>> fields;
>> It has way too many immediate-field layouts, as it just sort of 
>> shoehorns immediate fields into whatever bits are left.
>>
>> Though, turns out I could skip a few things due to them being N/E in 
>> RV64 (RV32, RV64, and RV128 get a slightly different selection of ops 
>> in the C extension).
>>
>> Like, many things in RV land make "annoying and kinda poor" design 
>> choices.
>>
>> Then again, if one assumes that the role of 'C' is mostly:
>>    Does SP-relative loads/stores and MOV-RR.
>>
>> Well, it does do this at least...
>>
>> Never mind if you want to use any of the ALU ops (besides ADD) or 
>> non-stack-relative Load/Store; well then, enjoy the 3-bit register fields.
>>
>> And, still way too many immediate-field encodings for what is 
>> effectively load/store and a few ALU ops.
>>
>>
>>
>>
>> I am not as much a fan of RISC-V's 'V' extension mostly in that it 
>> would require essentially doubling the size of the register file.
> 
> The register file in Q+ is huge; one of the drawbacks of supporting 
> vectors. There were 1024 physical registers to support it. I reduced it 
> to 512, and that may still be too many. There was a 4kb-wide mapping RAM, 
> resulting in a warning message. I may have to split up components into 
> multiple copies to get the desired size to work.
> 

I am dealing with 64 registers.
In RV Mode, it is split between the GPRs and FPRs; in BJX2 it is a unified 
GPR space.
'V' would mean either extending the register set to 128, or adding a 
separate 32*128-bit register file, which is AFAICT the effective minimum.

Neither option would be good for resource cost.
   Cheaper would have been a SIMD system based on paired FPRs or similar.
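
Rough back-of-envelope numbers for the options above (raw architectural 
register bits only, ignoring ports and renaming; my own figures, not from 
an actual synthesis run):

  #include <stdio.h>

  int main(void)
  {
      int cur = 64 * 64;    /* current: 64 x 64-bit GPRs          = 4096 bits */
      int ext = 128 * 64;   /* extend register set to 128 x 64    = 8192 bits */
      int sep = 32 * 128;   /* separate 32 x 128-bit vector file  = 4096 bits */
      printf("current=%d, extend=+%d, separate=+%d bits\n",
             cur, ext - cur, sep);
      return 0;             /* either option roughly doubles the storage */
  }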


>>
>> And, if I were to do something like 'V' I would likely do some things 
>> differently:
>> Rather than having an instruction to load vector control state into 
>> CSR's, it would make more sense IMO to use bigger 64-bit instructions 
>> and encode the vector state directly into these instructions.
>>
>> While this would be worse for code density, it would avoid needing to 
>> burn instructions setting up vector state, and would have less penalty 
>> (in terms of clock-cycles) if working with heterogeneous vectors.
>>
>>
>> Say, one possibility could be a combo-SIMD op with a control field:
>>    2b vector size
>>      64 / 128 / resv / resv
>>    2b element size
>>      8 / 16 / 32 / 64
>>    2b category
>>      wrap / modulo
>>      float
>>      signed saturate
>>      unsigned saturate
>>    6b operator
>>      add, sub, mul, mac, mulhi, ...
>>
> 
> Q+ is set up almost that way. It uses 48b instructions. There is a 2b 
> precision field in instructions that determines the lane/sub-element 
> size (8/16/32/64). The precision field also applies to scalar registers. 
> The category is wrapped up in the opcode, which is seven bits. One can do 
> a float add on a vector register, then a bitwise operation on the same 
> register. The vector registers work the same way as the scalar ones. 
> There is no type state associated with them, unlike RISC-V. To control 
> the length (which lanes are active) there is a global mask register 
> instead of a vector length register.
> 
> Sign control plus a vector indicator for each register spec results in a 
> seven-bit spec, and there are four registers encoded in an instruction, 
> which uses 28 bits; combined with a seven-bit opcode that is 35 bits. There 
> was just no way the instruction set was fitting in 32b. For a while the 
> ISA was 40-bit, but I figured it was better to go 48-bit and then add some 
> additional functionality to make up for the wider ISA.
> 

My leaning toward 64-bit was mostly so that it does not break 32-bit 
alignment for the 32-bit instructions.

In this case:
WEX currently needs 32-bit alignment;
If I added superscalar to BJX2, it would likely also require 32-bit 
alignment.
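
To make the combo-SIMD control field sketched above a bit more concrete, 
something like the following packing could work (a rough sketch only; the 
field positions, names, and 12-bit width are placeholder assumptions, not 
an actual BJX2 encoding):

  #include <stdint.h>

  enum vec_size  { VSZ_64 = 0, VSZ_128 = 1 };                     /* 2b: 64/128/resv/resv */
  enum elem_size { ESZ_8 = 0, ESZ_16, ESZ_32, ESZ_64 };           /* 2b element size      */
  enum vec_cat   { CAT_WRAP = 0, CAT_FLOAT, CAT_SSAT, CAT_USAT }; /* 2b category          */

  /* Pack the 12-bit control block: [11:10]=vector size, [9:8]=element
     size, [7:6]=category, [5:0]=operator. */
  static inline uint16_t simd_ctl(unsigned vsz, unsigned esz,
                                  unsigned cat, unsigned op)
  {
      return (uint16_t)(((vsz & 3u) << 10) | ((esz & 3u) << 8) |
                        ((cat & 3u) << 6)  |  (op  & 63u));
  }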


> 
> 
>> Though, with not every combination necessarily being allowed.
>> Say, for example, if the implementation limits FP-SIMD to 4 or 8 
>> vector elements.
>>
>> Though, it may make sense to be asymmetric as well:
>>    2-wide vectors can support Binary64
>>    4-wide can support Binary32
>>    8-wide can support Binary16 ( + 4x FP16 units)
>>    16 can support FP8 ( + 8x FP8 units)
>>
>> Whereas, say, 16x Binary32 capable units would be infeasible.
>>
>> Well, as opposed to defining encodings one-at-a-time in the 32-bit 
>> encoding space.
>>
>>
>> It could be tempting to consider using pipelining and multi-stage 
>> decoding to allow some ops as well. Say, possibly handling 8-wide 
>> vectors internally as 2x 4-wide operations, or maybe allowing 
>> 256-bit vector ops in the absence of 256-bit vectors in hardware.
>>
>> ...
>>
> Q+ has two ALUs, which may, at some point, be expanded by two more ALUs 
> with reduced functionality.
> 

I have 3x 64-bit ALUs.
The first 2 may combine for 128-bit operations.
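
In effect the pairing just chains the carry between the two lanes; roughly 
the equivalent of the following (a C sketch of the semantics, not the 
actual hardware):

  #include <stdint.h>

  typedef struct { uint64_t lo, hi; } u128;

  /* 128-bit add as two chained 64-bit adds. */
  static inline u128 add128(u128 a, u128 b)
  {
      u128 r;
      r.lo = a.lo + b.lo;              /* ALU 0: low 64 bits             */
      uint64_t carry = (r.lo < a.lo);  /* carry out of the low half      */
      r.hi = a.hi + b.hi + carry;      /* ALU 1: high 64 bits plus carry */
      return r;
  }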


> It sounds great, but I cannot seem to get Q+ to synthesize correctly. It 
> reports the size as 45kLUTs, but I know the size is about double that, 
> based on previous synthesis. A bunch of the components are showing up as 
========== REMAINDER OF ARTICLE TRUNCATED ==========