Path: ...!news.mixmin.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Tonights Tradeoff
Date: Tue, 10 Sep 2024 16:07:00 -0500
Organization: A noiseless patient Spider
Lines: 242
Message-ID: <vbqcds$35l1q$2@dont-email.me>
References: <vbgdms$152jq$1@dont-email.me>
 <17537125c53e616e22f772e5bcd61943@www.novabbs.org>
 <vbj5af$1puhu$1@dont-email.me>
 <a37e9bd652d7674493750ccc04674759@www.novabbs.org>
 <vbog6d$2p2rc$1@dont-email.me>
 <vboqpp$2r5v4$1@dont-email.me>
 <vbpmqr$30vto$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 10 Sep 2024 23:07:09 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="8642d80dc83a1f0de3aeee9315299589";
 logging-data="3331130"; mail-complaints-to="abuse@eternal-september.org";
 posting-account="U2FsdGVkX1/fgK/ySqIUtMXUxP51oyw9dZmxGYM23Ho="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:HIqSJyTXbGY8r5RkxtIyZ0zcmAw=
Content-Language: en-US
In-Reply-To: <vbpmqr$30vto$1@dont-email.me>
Bytes: 11002

On 9/10/2024 9:58 AM, Robert Finch wrote:
> On 2024-09-10 3:00 a.m., BGB wrote:
>>
>> I haven't really understood how it could be implemented.
>> But, granted, my pipeline design is relatively simplistic, and my
>> priority had usually been trying to make a "fast but cheap and simple"
>> pipeline, rather than a "clever" pipeline.
>>
>> Still not as cheap or simple as I would want.
>>
>>> Qupls has RISC-V style vector / SIMD registers. For Q+ every
>>> instruction can be a vector instruction, as there are bits indicating
>>> which registers are vector registers in the instruction. All the
>>> scalar instructions become vector. This cuts down on some of the
>>> bloat in the ISA. There is only a handful of vector specific
>>> instructions (about eight I think). The drawback is that the ISA is
>>> 48-bits wide.
>>> However, the code bloat is less than 50% as some
>>> instructions have dual-operations. Branches can increment or
>>> decrement and loop. Bigfoot uses a postfix word to indicate to use
>>> the vector form of the instruction. Bigfoot’s code density is a lot
>>> better being variable length, but I suspect it will not run as fast.
>>> Bigfoot and Q+ share a lot of the same code. Trying to make the guts
>>> of the cores generic.
>>>
>>
>> In my case, the core ended up generic enough that it can support both
>> BJX2 and RISC-V. Could almost make sense to lean more heavily into
>> this (trying to consolidate more things and better optimize costs).
>>
>> Did also recently get around to more-or-less implementing support for
>> the 'C' extension, even as much as it is kinda dog-chewed and does not
>> efficiently utilize the encoding space.
>>
>> It burns a lot of encoding space on 6 and 8 bit immediate fields (with
>> 11 bit branch displacements), more 5-bit register fields than
>> ideal, ... so, has relatively few unique instructions, but:
>>   Many of the instructions it does have are left with 3 bit register
>>   fields;
>>   Has way too many immediate-field layouts as it just sort of
>>   shoe-horns immediate fields into whatever bits are left.
>>
>> Though, turns out I could skip a few things due to them being N/E in
>> RV64 (RV32, RV64, and RV128 get a slightly different selection of ops
>> in the C extension).
>>
>> Like, many things in RV land make "annoying and kinda poor" design
>> choices.
>>
>> Then again, if one assumes that the role of 'C' is mostly:
>>   Does SP-relative loads/stores and MOV-RR.
>>
>> Well, it does do this at least...
>>
>> Nevermind if you want to use any of the ALU ops (besides ADD), or
>> non-stack-relative Load/Store, well then, enjoy the 3 bit register
>> fields.
>>
>> And, still way too many immediate-field encodings for what is
>> effectively load/store and a few ALU ops.
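
To make the point about 3 bit register fields and shoe-horned immediates a bit more concrete, here is a minimal sketch of decoding one 'C' instruction, C.LW. The field positions follow the RISC-V 'C' extension encoding as I understand it (the 3-bit register fields select x8..x15, and the load offset is split across bits 12:10, 6 and 5). The names `decode_c_lw` and `struct c_lw` are made up for this example and do not come from any real decoder.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical result record for this sketch. */
struct c_lw {
    unsigned rd;      /* destination register number, x8..x15 */
    unsigned rs1;     /* base register number, x8..x15 */
    unsigned offset;  /* zero-extended byte offset, multiple of 4 */
};

/* Decode a 16-bit C.LW (funct3 = 010, quadrant op = 00).
 * Returns 1 and fills *out on a match, 0 otherwise. */
int decode_c_lw(uint16_t insn, struct c_lw *out)
{
    unsigned w = insn;

    assert(out != NULL);

    /* Check funct3 (bits 15:13) and the quadrant (bits 1:0). */
    if ((w & 0xE003u) != 0x4000u)
        return 0;

    /* 3-bit register fields can only name x8..x15. */
    out->rd  = 8u + ((w >> 2) & 7u);
    out->rs1 = 8u + ((w >> 7) & 7u);

    /* The offset bits are scattered: uimm[5:3] in bits 12:10,
     * uimm[2] in bit 6, uimm[6] in bit 5. */
    out->offset = (((w >> 10) & 7u) << 3)
                | (((w >> 6) & 1u) << 2)
                | (((w >> 5) & 1u) << 6);
    return 1;
}
```

With this sketch, the halfword 0x41C8 decodes as rd = x10, rs1 = x11, offset = 4. Other 'C' instructions use different immediate scrambles again, which is the complaint above.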
>>
>> I am not as much a fan of RISC-V's 'V' extension mostly in that it
>> would require essentially doubling the size of the register file.
>
> The register file in Q+ is huge. One of the drawbacks of supporting
> vectors. There were 1024 physical registers for support. Reduced it to
> 512 and that still may be too many. There was a 4kb wide mapping ram,
> resulting in a warning message. I may have to split up components into
> multiple copies to get the desired size to work.
>

I am dealing with 64 registers.
In RV Mode, it is split between the GPRs and FPRs, in BJX2 a unified
GPR space;

V would mean either extending the register set to 128, or adding a
separate 32*128 bit register file, which is AFAICT the effective minimum.

Neither option would be good for resource cost.

Cheaper would have been a SIMD system based on paired FPRs or similar.

>>
>> And, if I were to do something like 'V' I would likely do some things
>> differently:
>> Rather than having an instruction to load vector control state into
>> CSRs, it would make more sense IMO to use bigger 64-bit instructions
>> and encode the vector state directly into these instructions.
>>
>> While this would be worse for code density, it would avoid needing to
>> burn instructions setting up vector state, and would have less penalty
>> (in terms of clock-cycles) if working with heterogeneous vectors.
>>
>> Say, one possibility could be a combo-SIMD op with a control field:
>>   2b vector size
>>     64 / 128 / resv / resv
>>   2b element size
>>     8 / 16 / 32 / 64
>>   2b category
>>     wrap / modulo
>>     float
>>     signed saturate
>>     unsigned saturate
>>   6b operator
>>     add, sub, mul, mac, mulhi, ...
>>
>
> Q+ is set up almost that way. It uses 48b instructions. There is a 2b
> precision field in instructions that determines the lane/sub element
> size 8/16/32/64. The precision field also applies to scalar registers.
> The category is wrapped up in the opcode which is seven bits.
> One can do a float add on a vector register, then a bitwise operation
> on the same register. The vector registers work the same way as the
> scalar ones. There is no type state associated with them, unlike
> RISC-V. To control the length (which lanes are active) there is a
> global mask register instead of a vector length register.
>
> Sign control plus a vector indicator for each register spec results in a
> seven-bit spec, and there are four registers encoded in an instruction,
> which uses 28 bits; combined with a seven-bit opcode, that is 35 bits.
> There was just no way the instruction set was fitting in 32b. For a
> while the ISA was 40-bit, but I figured it was better to go 48-bit and
> then add some additional functionality to make up for the wider ISA.
>

My leaning for 64 bit was mostly so that it does not break 32-bit
alignment for the 32-bit instructions.

In this case:
  WEX currently needs 32-bit alignment;
  If I added superscalar to BJX2, it would likely also require 32-bit
  alignment.

>> Though, with not every combination necessarily being allowed.
>> Say, for example, if the implementation limits FP-SIMD to 4 or 8
>> vector elements.
>>
>> Though, it may make sense to be asymmetric as well:
>>   2-wide vectors can support Binary64
>>   4-wide can support Binary32
>>   8-wide can support Binary16 ( + 4x FP16 units)
>>   16 can support FP8 ( + 8x FP8 units)
>>
>> Whereas, say, 16x Binary32 capable units would be infeasible.
>>
>> Well, as opposed to defining encodings one-at-a-time in the 32-bit
>> encoding space.
>>
>> It could be tempting to possibly consider using pipelining and
>> multi-stage decoding to allow some ops as well. Say, possibly handling
>> 8-wide vectors internally as 2x 4-wide operations, or maybe allowing
>> 256-bit vector ops in the absence of 256-bit vectors in hardware.
>>
>> ...
>>
> Q+ has two ALUs, which may, at some point, be expanded by two more ALUs
> with reduced functionality.
>

I have 3x 64-bit ALUs.
The first 2 may combine for 128-bit operations.

> It sounds great, but I cannot seem to get Q+ to synthesize correctly. It
> reports the size as 45kLUTs, but I know the size is about double that,
> based on previous synthesis. A bunch of the components are showing up as

========== REMAINDER OF ARTICLE TRUNCATED ==========
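
For reference, the combo-SIMD control field quoted earlier in the post (2b vector size, 2b element size, 2b category, 6b operator) could be packed roughly as follows. This is only a sketch: the 12-bit total follows from the listed widths, but the bit positions, the numeric values assigned to each category and operator, and all of the names (`simd_ctl_pack`, `simd_ctl_lanes`, the enums) are assumptions made for illustration, not an actual BJX2 or Q+ encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Field values in the order listed in the post; the numbering itself
 * is an assumption of this sketch. */
enum vec_size  { VSZ_64 = 0, VSZ_128 = 1 };  /* values 2 and 3 reserved */
enum elem_size { ESZ_8 = 0, ESZ_16 = 1, ESZ_32 = 2, ESZ_64 = 3 };
enum category  { CAT_WRAP = 0, CAT_FLOAT = 1, CAT_SSAT = 2, CAT_USAT = 3 };
enum simd_op   { OP_ADD = 0, OP_SUB = 1, OP_MUL = 2, OP_MAC = 3, OP_MULHI = 4 };

/* Pack the four fields into a 12-bit control word (assumed layout):
 * [11:10] vector size, [9:8] element size, [7:6] category, [5:0] op. */
uint16_t simd_ctl_pack(unsigned vsz, unsigned esz, unsigned cat, unsigned op)
{
    assert(vsz < 4 && esz < 4 && cat < 4 && op < 64);
    return (uint16_t)((vsz << 10) | (esz << 8) | (cat << 6) | op);
}

unsigned simd_ctl_vsz(uint16_t c) { return (c >> 10) & 3u; }
unsigned simd_ctl_esz(uint16_t c) { return (c >> 8) & 3u; }
unsigned simd_ctl_cat(uint16_t c) { return (c >> 6) & 3u; }
unsigned simd_ctl_op(uint16_t c)  { return c & 63u; }

/* Number of lanes implied by the size fields, or 0 when the vector
 * size is one of the reserved encodings. */
unsigned simd_ctl_lanes(uint16_t c)
{
    unsigned vsz = simd_ctl_vsz(c);
    if (vsz > VSZ_128)
        return 0;
    unsigned vbits = 64u << vsz;               /* 64 or 128 */
    unsigned ebits = 8u << simd_ctl_esz(c);    /* 8, 16, 32 or 64 */
    return vbits / ebits;
}
```

Under this assumed layout, a 128-bit float add on 32-bit elements packs to 0x640 and implies 4 lanes. A real implementation would also reject combinations it does not support (for example float with 8-bit elements), in line with the "not every combination necessarily being allowed" remark.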