Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Why VAX Was the Ultimate CISC and Not RISC
Date: Sun, 2 Mar 2025 02:56:49 -0600
Organization: A noiseless patient Spider
Lines: 181
Message-ID: <vq16gp$n467$1@dont-email.me>
References: <vpufbv$4qc5$1@dont-email.me>
 <2025Mar1.125817@mips.complang.tuwien.ac.at> <vq01oh$dq4s$1@dont-email.me>
 <525e0ecc62c0275aba6ba75e6515929f@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 02 Mar 2025 09:56:59 +0100 (CET)
Injection-Info: dont-email.me; posting-host="a4cd612affcc1148eabb86759e7b6b28";
	logging-data="757959"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1/eMxLef1KjFM63DdZcUI21Jmdquoh6VgI="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:jX47KeuKzwvid6ZKXbqyDfqp/LQ=
In-Reply-To: <525e0ecc62c0275aba6ba75e6515929f@www.novabbs.org>
Content-Language: en-US
Bytes: 7340

On 3/1/2025 7:02 PM, MitchAlsup1 wrote:
> On Sat, 1 Mar 2025 22:29:27 +0000, BGB wrote:
> 
>> On 3/1/2025 5:58 AM, Anton Ertl wrote:
>>> Lawrence D'Oliveiro <ldo@nz.invalid> writes:
> ------------------------------
>> Would likely need some new internal operators to deal with bit-array
>> operations and similar, with bit-ranges allowed as a pseudo-value type
>> (may exist in constant expressions but will not necessarily exist as an
>> actual value type at runtime).
>>    Say:
>>      val[63:32]
>> Has the (63:32) as a BitRange type, which then has special semantics
>> when used as an array index on an integer type, ...
> 
> Mc 88K and My 66000 both have bit-vector operations.
> 

OK.

I didn't previously.

But, use-cases have started to appear.


>> The previous idea for bitfield extract/insert had turned into a
>> composite BITMOV instruction that could potentially do both operations
>> in a single instruction (along with moving a bitfield directly between
>> two instructions).
> 
> Using CARRY and extract + insert, one can extract a field spanning
> a doubleword and then insert it into another pair of doublewords.
> 1 pseudo-instruction, 2 actual instructions.
> 
>> Idea here is that it may do, essentially a combination of a shift and a
>> masked bit-select, say:
>>    Low 8 bits of immediate encode a shift in the usual format:
>>      Signed 8-bit shift amount, negative is right shift.
>>    High bits give a pair of bit-offsets used to compose a bit-mask.
>>      These will MUX between the shifted value and another input value.
> 
> You want the offset (a 6-bit number) and the size (another 6-bit number)
> in order to identify the field in question.

It is 8 bits partly because this is what the existing shifter uses.
This can also deal with up to 128 bits (-128 .. 127).
Don't necessarily want to have different encodings for 64 and 128 bit 
variants.


It would represent a shift of (DestOffset-SrcOffset), where:
   For insert it will be positive, for extract, negative.

Keeping this part as-is means that the operation doesn't need to 
fundamentally change the behavior of the SHAD unit (it will just do the 
shift as normal).
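
As a rough C sketch of the signed-shift convention (illustrative only, not taken from the actual Verilog; the function names are made up here, and a logical rather than arithmetic right shift is assumed):

```c
#include <stdint.h>

/* Shift a 64-bit value by a signed amount: positive shifts left,
 * negative shifts right (logical, an assumption of this sketch).
 * Counts of 64 or more return 0, which also avoids undefined
 * behavior in C for oversized shifts. */
static uint64_t shift_signed64(uint64_t value, int amount)
{
    if (amount >= 0)
        return (amount < 64) ? (value << amount) : 0;
    int right = -amount;
    return (right < 64) ? (value >> right) : 0;
}

/* Shift amount needed to move a field from bit src_offset to bit
 * dst_offset: positive for insert-style moves, negative for extract. */
static int bitmov_shift(int src_offset, int dst_offset)
{
    return dst_offset - src_offset;
}
```

So moving a field from bit 8 down to bit 0 (extract) gives a shift of -8, and moving it from bit 0 up to bit 8 (insert) gives +8; the shifter itself does not need to know which case it is.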


If I used bare 6 bit fields:
I couldn't do both extract and insert using the same operation;
It couldn't directly perform 128-bit extract or insert, which needs at 
least 7 bits.

Granted, a full 8 bits for all the fields is possibly overkill.
   Though, this possibly leaves 8+7+7.

Or, 7+6+6, which limits it to 64 bits only.
But, would need to special-case the handling, as Bit(7) is effectively 
used as the shift direction:
   00..3F: Left, 0..63 bits
   40..7F: Also Left, 0..63 bits.
   80..BF: Right, 64..1 bits.
   C0..FF: Also Right, 64..1 bits.
Except for 128 bit:
   00..7F: Left, 0..127 bits
   80..FF: Right, 128..1 bits.
And, 32 bits:
   00..1F: Left, 0..31 bits
   20..3F: Also Left, 0..31 bits
   ...
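
The tables above can be modeled roughly as follows in C. This is a sketch under assumptions: the struct and function names are hypothetical, and the 32-bit right-shift rows (elided above) are simply extrapolated from the 64-bit and 128-bit pattern rather than confirmed.

```c
#include <stdint.h>

typedef struct {
    int right;       /* 1 = right shift, 0 = left shift */
    unsigned count;  /* number of bit positions to shift */
} ShiftDecode;

/* Decode an 8-bit shift immediate for a given operand width.
 * width is assumed to be a power of two (32, 64, or 128). */
static ShiftDecode decode_shift_imm8(uint8_t imm, unsigned width)
{
    ShiftDecode d;
    unsigned low = imm & (width - 1);  /* bits below the width give the count */
    d.right = (imm & 0x80) != 0;       /* Bit(7) gives the direction */
    /* Right shifts count down: 0x80 means "right by width", 0xFF "right by 1". */
    d.count = d.right ? (width - low) : low;
    return d;
}
```

For example, with a 64-bit width, 0x3F and 0x7F both decode as a left shift of 63, while 0xBF and 0xFF both decode as a right shift of 1.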

For the right shift operators, the sign is inverted in hardware (these 
existed initially mostly to save a need to negate the input for variable 
right shift).

For RISC-V mode, it still uses this behavior, but generally code doesn't 
notice (a more strict interpretation of the RV spec would require 
masking off Bit(7) for the shift amount, such that giving them negative 
amounts wouldn't flip the shift direction).

Though, AFAIK, my existing behavior is closer to the original PDP/VAX 
shift operators...


As-is, decoding rules would have:
   JumboImm+3RI: Gives Imm33s with XG3, imm29s with XG1/XG2
   JumboOp +3RI: Gives 4RI Imm11, or 3RI Imm17s

One consideration was to special-case SHLR.L and similar, such that 
JumboImm+SHLR could instead encode:
   BITMOV  Rs, Rp, Rn, Imm24

With SHLR.Q encoding a 128-bit BITMOVX.

But, debatable if this would be "actually a good idea".
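
For illustration, splitting an Imm24 into its three fields might look like the C sketch below. The field order here is purely an assumption: the shift in the low 8 bits follows the description quoted earlier, but putting L in bits 15:8 and H in bits 23:16 is a hypothetical layout, as are the names.

```c
#include <stdint.h>

typedef struct {
    int shift;    /* signed shift amount; negative is right shift */
    unsigned lo;  /* mask bound L (hypothetical position: bits 15:8) */
    unsigned hi;  /* mask bound H (hypothetical position: bits 23:16) */
} BitmovImm;

/* Unpack a 24-bit BITMOV immediate as 8+8+8 fields. */
static BitmovImm bitmov_unpack_imm24(uint32_t imm24)
{
    BitmovImm f;
    int s = (int)(imm24 & 0xFF);
    if (s >= 128)
        s -= 256;                 /* sign-extend the low 8 bits */
    f.shift = s;
    f.lo = (imm24 >> 8) & 0xFF;
    f.hi = (imm24 >> 16) & 0xFF;
    return f;
}
```

With this layout, 0x1008F8 would unpack to a shift of -8 with L=8 and H=16.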


> 
>> I am still not sure whether this would make sense in hardware, but is
>> not entirely implausible to implement in the Verilog.
> 
> In the extract case, you have the shifter before the masker
> In the insert case, you have the masker before the shifter
> followed by a merge (OR). Both maskers use the size. Offset
> goes only to the shifter.
> 

I was thinking:
   tmp=Rm<<Ro
   mask=MASKGEN(H, L)
   Rn=(Rp&(~mask))|(tmp&mask);

With a signed shift amount, this can do both insert and extract with the 
same logic.

Though, extract will require feeding a 0 into Rp.


MASKGEN(H, L):
   H>L:
     ((1<<H)-1) & (~((1<<L)-1))
   H<=L:
     ((1<<H)-1) | (~((1<<L)-1))

The H<=L could encode some other cases that don't directly correlate to 
bitfields, such as shifting most of the bits left or right but then 
inserting something non-moving in between (possibly from a different 
bitfield).
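
The shift, MASKGEN, and MUX steps above can be sketched in C for the 64-bit case as below. This is only a software model under assumptions: the helper names are made up, the right shift is taken as logical, and H is treated as an exclusive upper bound (so H=64 covers bit 63), which is what the MASKGEN formulas imply.

```c
#include <stdint.h>

/* n low bits set; n >= 64 gives all ones (avoids an undefined 1<<64). */
static uint64_t low_ones(unsigned n)
{
    return (n >= 64) ? ~UINT64_C(0) : ((UINT64_C(1) << n) - 1);
}

/* MASKGEN(H, L) as given above. */
static uint64_t maskgen(unsigned h, unsigned l)
{
    if (h > l)
        return low_ones(h) & ~low_ones(l);  /* bits L..H-1 set */
    return low_ones(h) | ~low_ones(l);      /* bits below H and from L up */
}

/* Signed shift: positive is left, negative is (logical) right. */
static uint64_t shift_signed64(uint64_t v, int amount)
{
    if (amount >= 0)
        return (amount < 64) ? (v << amount) : 0;
    return (-amount < 64) ? (v >> -amount) : 0;
}

/* Rn = (Rp & ~mask) | (tmp & mask), with tmp = Rm shifted by a signed amount. */
static uint64_t bitmov(uint64_t rm, uint64_t rp, int shift,
                       unsigned h, unsigned l)
{
    uint64_t tmp = shift_signed64(rm, shift);
    uint64_t mask = maskgen(h, l);
    return (rp & ~mask) | (tmp & mask);
}
```

Extracting bits 15:8 of 0x12345678 is then bitmov(0x12345678, 0, -8, 8, 0), which gives 0x56, and inserting 0xAB into bits 23:16 of 0x11111111 is bitmov(0xAB, 0x11111111, 16, 24, 16), which gives 0x11AB1111. For the H<=L case, maskgen(8, 16) selects everything except bits 15:8.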

Generating H dynamically as L+W was considered, and could possibly save 
bits, but would increase the cost of the mask generation logic in this case.


>> Would likely be a 2 or 3 cycle operation, say:
>>    EX1: Do a Shift and Mask Generation;
>>      May reuse the normal SHAD unit for the shift;
>>      Mask-Gen will be specialized logic;
>>    EX2:
>>      Do the MUX.
>>    EX3:
>>      Present MUX result as output (passed over from EX2).
> 
> I have done these in 1 cycle ...


This is pushing it though.


As-is, Shift is a 2 cycle operation. Mostly to keep timing from being tight.

Granted, I have noted that at present, timing is made a lot tighter 
(throughout most of the core), due to the SIMD unit.


If I disable the SIMD unit, I am suddenly left with around 2.5ns of 
slack... (vs otherwise sitting at around 0.4ns of slack).

But, then, OpenGL is slower (as, without the SIMD unit, FPU SIMD 
operators go from pipelined 3 cycles to stalling 10 cycles; though it 
does save around 2 kLUT).



But, thinking about it, since the MUX is only a single level of LUTs, it 
could probably fit onto the end of the shift stage without too much issue.

The MaskGen operation can be done with actually lower latency than a 
========== REMAINDER OF ARTICLE TRUNCATED ==========