Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Arguments for a sane ISA 6-years later
Date: Tue, 30 Jul 2024 17:47:44 -0500
Organization: A noiseless patient Spider
Lines: 164
Message-ID: <v8bqik$17qhc$1@dont-email.me>
References: <b5d4a172469485e9799de44f5f120c73@www.novabbs.org>
 <v7ubd4$2e8dr$1@dont-email.me> <v7uc71$2ec3f$1@dont-email.me>
 <2024Jul26.190007@mips.complang.tuwien.ac.at> <v811ub$309dk$1@dont-email.me>
 <2024Jul29.145933@mips.complang.tuwien.ac.at> <v88gru$ij11$1@dont-email.me>
 <2024Jul30.114424@mips.complang.tuwien.ac.at> <v8bi3e$16ahe$1@dont-email.me>
 <v8bk13$15rb6$7@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 31 Jul 2024 00:47:48 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="c2768dd02059df2193d7bfbb2f7883e4";
	logging-data="1305132"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1/Q+U1kUAXuc4dmlxOWXMJU3rdVQN2KGGk="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:HxN+WjZoQAg2JDKEh/CgKJaNlrg=
In-Reply-To: <v8bk13$15rb6$7@dont-email.me>
Content-Language: en-US
Bytes: 6555

On 7/30/2024 3:56 PM, Chris M. Thomasson wrote:
> On 7/30/2024 1:23 PM, BGB wrote:
>> On 7/30/2024 4:44 AM, Anton Ertl wrote:
>>> BGB <cr88192@gmail.com> writes:
>>>> Otherwise, stuff isn't going to fit into the FPGAs.
>>>>
>>>> Something like TSO is a lot of complexity for not much gain.
>>>
>>> Given that you are so constrained, the easiest corner to cut is to
>>> have only one core.  And then even sequential consistency is trivial
>>> to implement.
>>>
>>
>> On the XC7A100T, this is what I am doing...
>>
>> With the current feature-set, don't have enough resource budget to go 
>> dual core at present.
>>
>> I can go dual core on the XC7A200T though.
>>
>>
>>
>> Granted, one could argue that maybe one should not do such an 
>> elaborate CPU. Say, a case could be made for just doing a RISC-V 
>> implementation.
>>
>> There is an RV32GC implementation (dual-issue superscalar) that can 
>> run on the XC7A100T that, ironically, still takes most of the FPGA and 
>> can only run at ~ 25 or 33 MHz. Its IPC is pretty good, but it runs at 
>> a low clock-speed and is 32-bit.
>>
>> Only real way to make small/fast cores though is to make them 
>> single-issue and limit the feature-set (only doing a basic integer ISA).
> [...]
> 
> Have you ever messed around with a Cell processor? Think of its vector 
> processing units, or Synergistic Processing Elements (SPE) iirc. Also, 
> iirc it was not that easy to program for. buffered DMA wrt the SPE's, 
> again iirc. So, some games only used the "single" PPE unit. Iirc, they 
> wanted more PPE units but that was not realized...
> 


No real first-hand experience programming for it, but was early 20s when 
the PlayStation3 came out, and wasn't really messing with much of 
anything beyond normal desktop PCs at the time.


I had a few times considered trying to pair a bigger core (such as one 
running BJX2 main profile) with smaller cores (running a smaller profile 
for BJX2), but couldn't really get the smaller core small enough while 
still being useful for what I wanted to do with it.


While a moderately smaller core is possible by using a single-issue 
integer-only design, this is rather limited...

And, sticking two more feature-limited cores on an FPGA isn't terribly 
useful.

Nor is going tri-core or quad-core with minimalist cores.

Say, one core of my current configuration is more useful than either:
   Two cores that do basic Integer+FPU+TLB;
   Four cores that only do Integer.

Like, say, an RV64I or RV32IM quad-core would not necessarily be all that 
useful.



Trying to fit the BJX2 core on an XC7S50, I needed to drop to 2-wide in 
order to fit it in with the fast SIMD unit.
It was a tradeoff between:
   3-wide, but 10 cycle SIMD ops;
   2-wide, with 3 cycle SIMD ops.


On the XC7S25 or XC7A35T, I have not really managed to fit much beyond 
simple integer cores. But, these FPGAs are small enough that it is 
generally better to drop to 32-bit.

So, for example, an RV32IM is about what makes sense on an XC7S25 or 
XC7A35T.

Where the number in the part name is loosely correlated to total LUT count:
   XC7A100T has ~ 3x the LUTs of the XC7A35T.

But it is not exactly 1:1 between Artix and Spartan.

For Spartan, the number is closer to the number of kLUTs, but Artix has 
slightly fewer LUTs relative to the part number; so the XC7S25 and 
XC7A35T are fairly comparable.




As for whether to use A-Law or FP8 if I add SIMD ops for an 8-bit 
multiply that widens to Binary16, FP8 currently seems to be ahead:
   More popular (NVIDIA is also using FP8);
   More dynamic range;
   Will be slightly cheaper to implement;
   ...
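
To put a number on the dynamic-range point for FP8, here is a minimal Python sketch that decodes every FP8 bit pattern and reports the smallest and largest nonzero magnitudes. It assumes an S.E4.M3 layout with an exponent bias of 7, with only the all-zero pattern treated as zero (no denormal, Inf, or NaN handling); those assumptions are inferred from the bit slicing used later in this post and are not stated here. The names `fp8_to_float` and `FP8_BIAS` are illustrative only.

```python
# Sketch only: ASSUMES FP8 is S.E4.M3 with bias 7, and that only the
# all-zero magnitude pattern encodes zero (no denormals, Inf, or NaN).

FP8_BIAS = 7  # assumed exponent bias


def fp8_to_float(v: int) -> float:
    """Decode one FP8 byte to a Python float under the assumptions above."""
    if (v & 0x7F) == 0:
        return 0.0                      # zero, sign ignored
    sign = -1.0 if (v >> 7) & 1 else 1.0
    exp = (v >> 3) & 0xF                # 4-bit biased exponent
    frac = v & 0x7                      # 3-bit fraction, hidden bit implied
    return sign * (1.0 + frac / 8.0) * 2.0 ** (exp - FP8_BIAS)


# Scan all positive nonzero patterns to find the representable range.
magnitudes = [fp8_to_float(v) for v in range(0x01, 0x80)]
smallest = min(magnitudes)
largest = max(magnitudes)

if __name__ == "__main__":
    print("smallest nonzero:", smallest)
    print("largest:", largest)
    print("ratio:", largest / smallest)
```

This only characterizes FP8 under the stated layout; it makes no claim about how a particular A-Law variant would compare.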

Also torn between the more expensive route:
   Trying for a 3 cycle MAC operation;
   Would likely glue it onto the low-precision SIMD unit.
Or, the cheaper route:
   Trying for a 2 cycle PMUL;
   Likely via the CONV2 path.

Not likely worthwhile to put it in the 3-cycle MUL path:
   Would gain little performance-wise over converter ops;
   This was mostly used for more complex converters:
     Index-Color Packing;
     Color-Cell Encode;
     ...

The operation logic is likely fast enough that it could be put in a 
2-cycle path.
Though, trying to shove it onto the front-end of a SIMD FADD is likely 
pushing it.

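Multiplier logic would likely be something like:
   // split each FP8 input into sign, exponent, and fraction
   tSgnA=valA[7];
   tSgnB=valB[7];
   // widen the 4-bit exponent to 5 bits; this bit pattern adds 8,
   // i.e. rebiases from 7 to the Binary16 bias of 15
   tExpA={ valA[6], !valA[6], valA[5:3] };
   tExpB={ valB[6], !valB[6], valB[5:3] };
   tFraA=valA[2:0];
   tFraB=valB[2:0];
   tZeroA=(valA[6:0]==7'h00);
   tZeroB=(valB[6:0]==7'h00);
   tSgnC=tSgnA^tSgnB;
   // both exponents carry a bias of 15, so one bias has to be removed
   // from the sum (5-bit arithmetic; no overflow handling here)
   tExpC0=tExpA+tExpB-5'd15;
   tExpC1=tExpA+tExpB-5'd14;
   tZeroC=tZeroA|tZeroB;
   // 64-entry table: tFraC0=(8+tFraA)*(8+tFraB)
   case({tFraA, tFraB})
     6'b000_000: tFraC0=8'h40;
     6'b000_001: tFraC0=8'h48;
     ...
     6'b001_001: tFraC0=8'h51;
     ...
     6'b111_111: tFraC0=8'hE1;
   endcase
   // product mantissa is in [1,4); renormalize if it reached [2,4)
   // tFraC is 11 bits; bit 10 is the hidden bit and gets dropped below
   if(tFraC0[7])
   begin
     tExpC=tExpC1;
     tFraC={tFraC0[7:0], 3'h0};
   end
   else
   begin
     tExpC=tExpC0;
     tFraC={tFraC0[6:0], 4'h0};
   end
   tValC={tSgnC, tExpC, tFraC[9:0]};
   // either input zero gives a zero result
   if(tZeroC)
     tValC=16'h0000;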


Which can most likely fit in a 2-cycle operation...
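
As a sanity check on the datapath, here is a bit-level software model in Python (a sketch, not the actual hardware). It assumes FP8 is S.E4.M3 with bias 7 and that the result is IEEE Binary16 (bias 15), so the sum of the two rebiased exponents has one bias of 15 subtracted. The function names and the generated `FRA_TABLE` are illustrative. Like the logic above, it does nothing about overflow, denormals, Inf, or NaN.

```python
import struct

# ASSUMPTIONS: FP8 = S.E4.M3, bias 7; result = IEEE Binary16, bias 15.

def fp8_to_float(v: int) -> float:
    """Decode an FP8 byte (all-zero magnitude bits mean zero)."""
    if (v & 0x7F) == 0:
        return 0.0
    sign = -1.0 if (v >> 7) & 1 else 1.0
    return sign * (1.0 + (v & 7) / 8.0) * 2.0 ** (((v >> 3) & 0xF) - 7)


def half_bits_to_float(bits: int) -> float:
    """Reinterpret a 16-bit pattern as an IEEE Binary16 value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def rebias(e4: int) -> int:
    """{e[3], !e[3], e[2:0]}: widen a 4-bit exponent to 5 bits, adding 8."""
    return ((e4 & 8) << 1) | (~e4 & 8) | (e4 & 7)


# 64-entry mantissa product table, indexed by {fraA, fraB}.
# Each entry is (8+fraA)*(8+fraB), which lies in 0x40..0xE1.
FRA_TABLE = [(8 + (i >> 3)) * (8 + (i & 7)) for i in range(64)]


def fp8_mul_to_half_bits(val_a: int, val_b: int) -> int:
    """Multiply two FP8 bytes, returning the Binary16 bit pattern."""
    if (val_a & 0x7F) == 0 or (val_b & 0x7F) == 0:
        return 0x0000                        # any zero input gives +0
    sgn_c = ((val_a >> 7) ^ (val_b >> 7)) & 1
    exp_a = rebias((val_a >> 3) & 0xF)
    exp_b = rebias((val_b >> 3) & 0xF)
    fra_c0 = FRA_TABLE[((val_a & 7) << 3) | (val_b & 7)]
    exp_c = exp_a + exp_b - 15               # remove the doubled bias
    if fra_c0 & 0x80:                        # mantissa product in [2,4)
        exp_c += 1
        fra_c = (fra_c0 & 0x7F) << 3         # drop hidden bit, pad to 10 bits
    else:                                    # mantissa product in [1,2)
        fra_c = (fra_c0 & 0x3F) << 4
    return (sgn_c << 15) | ((exp_c & 0x1F) << 10) | fra_c


if __name__ == "__main__":
    # 1.0 * 1.0: FP8 0x38 is 1.0 under the assumed layout.
    print(hex(fp8_mul_to_half_bits(0x38, 0x38)))   # 0x3c00, Binary16 1.0
```

Since an 8-bit mantissa product always fits in Binary16's 10 fraction bits, the result is exact whenever the product stays below 2^16; above that, the 5-bit exponent field reaches 31 or wraps, which this sketch leaves unhandled.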

....