Path: ...!3.eu.feeder.erje.net!feeder.erje.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Arguments for a sane ISA 6-years later
Date: Tue, 30 Jul 2024 15:23:07 -0500
Organization: A noiseless patient Spider
Lines: 241
Message-ID: <v8bi3e$16ahe$1@dont-email.me>
References: <b5d4a172469485e9799de44f5f120c73@www.novabbs.org>
 <v7ubd4$2e8dr$1@dont-email.me> <v7uc71$2ec3f$1@dont-email.me>
 <2024Jul26.190007@mips.complang.tuwien.ac.at> <v811ub$309dk$1@dont-email.me>
 <2024Jul29.145933@mips.complang.tuwien.ac.at> <v88gru$ij11$1@dont-email.me>
 <2024Jul30.114424@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 30 Jul 2024 22:23:11 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="ecf92ebdb9bfec0c842ce7f3fa23e571";
	logging-data="1255982"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX192RSqYEwspR63aGtV5x31yOIEO95LEN7Q="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:uCnMN35qYwP6xebtAiC9Wc05Pq0=
In-Reply-To: <2024Jul30.114424@mips.complang.tuwien.ac.at>
Content-Language: en-US
Bytes: 11236

On 7/30/2024 4:44 AM, Anton Ertl wrote:
> BGB <cr88192@gmail.com> writes:
>> Otherwise, stuff isn't going to fit into the FPGAs.
>>
>> Something like TSO is a lot of complexity for not much gain.
> 
> Given that you are so constrained, the easiest corner to cut is to
> have only one core.  And then even sequential consistency is trivial
> to implement.
> 

On the XC7A100T, this is what I am doing...

With the current feature-set, I don't have enough resource budget to 
go dual-core at present.

I can go dual-core on the XC7A200T though.



Granted, one could argue that maybe one should not do such an elaborate 
CPU. Say, a case could be made for just doing a RISC-V implementation.

There is an RV32GC implementation (dual-issue superscalar) that can run 
on the XC7A100T that, ironically, still takes most of the FPGA and can 
only run at ~ 25 or 33 MHz. Its IPC is pretty good, but it runs at a low 
clock-speed and is 32-bit.

The only real way to make small/fast cores, though, is to make them 
single-issue and limit the feature-set (only doing a basic integer ISA).



Some of the cases where consistency issues have come up for me have to 
do with RAM-backed hardware devices, like the rasterizer module. It has 
its own internal caches that need to be flushed, and not flushing the 
caches (between this module and the CPU) when trying to "transfer" 
control over things like the framebuffer or Z-buffer can result in 
obvious graphical issues (and texture corruption doesn't necessarily 
look good either).

At present, the implementation is based mostly on drawing to a backing 
buffer which (at least once per frame, often more) needs to be reclaimed 
by the main CPU, so that its contents can be drawn to the screen or into 
the window buffer (in GUI mode).
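
As a rough sketch of that handoff (purely illustrative; the names 
rz_wait_idle(), rz_flush_caches(), and cpu_invalidate_dcache_range() 
are hypothetical stand-ins for whatever the real hardware interface 
provides):

   /* Hand the backing buffer from the rasterizer module back to the
      CPU.  The point is the ordering: the module's internal caches
      must be written back, and the CPU's stale lines invalidated,
      before the CPU touches the buffer. */
   void reclaim_backing_buffer(void *buf, size_t size)
   {
       rz_wait_idle();              /* wait for pending rasterization */
       rz_flush_caches();           /* write back the module's caches */
       cpu_invalidate_dcache_range(buf, size);
   }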


Currently, this module is relatively fast; it is the CPU side of things 
that generally isn't fast enough to keep it busy.

At present, the CPU still does transform. It is looking like, if one 
wants speed, one might also need a module that is able to do things like 
3D transform (and/or figure out ways to try to make the front-end stages 
faster).

Ironically, despite its seeming levels of suck, I am apparently getting 
(technical) 3D performance stats on par with the original PlayStation, 
but a lot of the PS1 games had arguably comically low geometric complexity.

Sadly, AFAIK, no one has open-sourced any of the PS1 games (and Quake 
1/2/3 don't have quite the same level of geometric minimalism).


It might be nice if the front-end stages could have been done using 
fixed-point math, but OpenGL is built around floating point, and 
generally stuff doesn't work correctly unless one has more or less 
full-precision Binary32 in the transform stages.


Also, using "glBegin()"/"glEnd()" and doing math per-vertex is not ideal 
for a CPU-bound use-case.

It is generally better in this case to try to prebuild vertex arrays and 
use "glDrawArrays()" or "glDrawElements()" or similar. But, this isn't 
really how the Quake engines work. If anything, Quake3 seems to lean a 
little more into it, seemingly doing much of its rendering with 
GL_TRIANGLE_FAN and GL_TRIANGLE_STRIP.

Ironically, this is in contrast to Quake 1, which really liked using 
GL_POLYGON.

With the current implementation, the likely fastest case would be to use 
vertex arrays and GL_QUADS (with an occasional collinear vertex when one 
needs a triangle).
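
As a rough sketch of the difference, using standard GL 1.x calls (the 
array layout here is just illustrative):

   #include <GL/gl.h>

   /* Immediate mode: several function calls per vertex, which is the
      expensive path when the implementation is CPU-bound. */
   void draw_quad_immediate(const float v[4][3], const float t[4][2])
   {
       int i;
       glBegin(GL_QUADS);
       for (i = 0; i < 4; i++) {
           glTexCoord2fv(t[i]);
           glVertex3fv(v[i]);
       }
       glEnd();
   }

   /* Vertex arrays (GL 1.1): build the arrays up front, then submit
      the whole batch with a single glDrawArrays() call. */
   void draw_quads_arrays(const float *xyz, const float *st, int n)
   {
       glEnableClientState(GL_VERTEX_ARRAY);
       glEnableClientState(GL_TEXTURE_COORD_ARRAY);
       glVertexPointer(3, GL_FLOAT, 0, xyz);
       glTexCoordPointer(2, GL_FLOAT, 0, st);
       glDrawArrays(GL_QUADS, 0, n);
       glDisableClientState(GL_TEXTURE_COORD_ARRAY);
       glDisableClientState(GL_VERTEX_ARRAY);
   }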

Though, it seems like GL 1.x assumed costs are per-vertex rather than 
per-primitive (triangle or quad in this case).

Actually, I am almost left to wonder if an API design like Direct3D 
might have fared better here.


Granted, a case could be made for trying to make an implementation which 
does most of its front-end work in homogeneous coordinates (AKA: 4D XYZW 
space) rather than world-space (but this would require a non-trivial 
rewrite of the front-end stages). It could, however, somewhat reduce the 
number of times I need to send vertices through the transformation matrix.
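
As a minimal sketch of the per-vertex work in that scheme (column-major 
matrix layout as in OpenGL; the names are just for illustration):

   typedef struct { float x, y, z, w; } vec4;

   /* One 4x4 matrix * vector multiply per vertex, producing a
      homogeneous (XYZW) clip-space position; the perspective divide
      happens later. */
   vec4 xform_vertex(const float m[16], vec4 v)
   {
       vec4 r;
       r.x = m[0]*v.x + m[4]*v.y + m[ 8]*v.z + m[12]*v.w;
       r.y = m[1]*v.x + m[5]*v.y + m[ 9]*v.z + m[13]*v.w;
       r.z = m[2]*v.x + m[6]*v.y + m[10]*v.z + m[14]*v.w;
       r.w = m[3]*v.x + m[7]*v.y + m[11]*v.z + m[15]*v.w;
       return r;
   }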

Based on past experiments with software-rasterized GL though, I had 
assumed that most of the time was going to be eaten up by the backend 
work (edge walking and span drawing), which is where I had put most of 
my optimization attention in TKRA-GL.


OTOH, there are possibly other uses for a rasterizer module, such as 
potentially using it for 2D rendering tasks (without otherwise sending 
everything through the OpenGL API or similar).

Though, its use-cases are partially limited in that it only supports 
squarish power-of-2 textures in Morton Order (which are atypical in 
things like UI drawing, where images are often NPOT and raster order).

Well, technically the texture images and buffers also need to be at a 
physical address and with a 16-byte-aligned base address, ..., but 
never mind this part...
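
For reference, the usual bit-interleave for Morton order looks 
something like this (whether X or Y lands in the even bit positions is 
an assumption here; the module's actual convention may be swapped):

   #include <stdint.h>

   /* Spread the low 16 bits of v so they occupy the even bit
      positions (the standard "part 1 by 1" bit trick). */
   static uint32_t morton_part1by1(uint32_t v)
   {
       v &= 0x0000FFFF;
       v = (v | (v << 8)) & 0x00FF00FF;
       v = (v | (v << 4)) & 0x0F0F0F0F;
       v = (v | (v << 2)) & 0x33333333;
       v = (v | (v << 1)) & 0x55555555;
       return v;
   }

   /* Texel index within a square power-of-2 texture: X in the even
      bits, Y in the odd bits. */
   uint32_t morton_index(uint32_t x, uint32_t y)
   {
       return morton_part1by1(x) | (morton_part1by1(y) << 1);
   }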



>> Contrast, floating point and precise exceptions are a lot more relevant
>> to software.
> 
> John von Neumann (IIRC) argued against floating point, with similar
> arguments that are now used to defend weak ordering.
> 

Floating point has a lot of obvious use-cases though (and is already in 
widespread use). It would be a hard sell to have a processor without any 
floating point support.


Like, we can use fixed point where it makes sense to do so, but there 
are also a lot of cases where fixed-point doesn't really work for the 
problem.

Granted, there are also cases of people using floating point where maybe 
they shouldn't.


Though, there are cases where one could argue for precision-reduced 
floating point, say:
   S.E8.F16.Z7
   S.E11.F32.Z20
Where the Z bits are ignored and filled with 0s (and the other low-order 
bits are not necessarily accurate).

The argument being, the former is Binary32 but using logic similar to 
what one might use for a proper Binary16 unit, and the latter is 
Binary64 with logic similar to what one may use for a proper Binary32 unit.
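
A rough way to picture the former case on a conventional machine is to 
just mask a normal Binary32 (the function name is illustrative; a real 
truncated unit would simply never compute those bits in the first place):

   #include <stdint.h>
   #include <string.h>

   /* Keep the sign, the 8 exponent bits, and the top 16 fraction
      bits of a Binary32; force the 7 low-order ("Z") fraction bits
      to zero. */
   float trunc_to_s_e8_f16_z7(float f)
   {
       uint32_t bits;
       memcpy(&bits, &f, sizeof(bits));
       bits &= ~(uint32_t)0x7F;
       memcpy(&f, &bits, sizeof(f));
       return f;
   }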

FWIW: The latter is what one may get in my case if using the 
FADDA/FSUBA/FMULA instructions, with no guarantees though about format 
other than that it is "equivalent to or slightly better than Binary32".

The former existed for SIMD, but got largely displaced by proper 
Binary32 as I actually needed fast Binary32 SIMD (and the truncated case 
only makes sense if the hardware is natively doing Binary16 or similar).

Granted, both cases assume that one is doing the internal math for FMUL 
using DSP48s (hard logic) or something similar.

I don't currently have any "native" floating point smaller than Binary16, 
though several 8-bit formats (including A-Law) are supported via 
converter ops.


Despite going and defining dedicated FP8 formats (E4.F3.S, E4.F4, 
S.E4.F3), I have more often ended up using A-Law (S.E3.F4), sometimes 
adding an exponent bias (generally because it has both a sign and is 
more accurate in these cases).

Generally, A-Law can't be used directly for NNs, because it seems one 
needs around 8 bits or so for the intermediate accumulator (mostly 
requiring Binary16 or similar). But FP8 or biased A-Law would make a 
sensible format for weights and inputs/outputs.
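
As a hedged sketch of what expanding such an FP8 value might look like 
(the S.E4.F3 layout matches the above, but the bias of 7 and the 
zero/denormal handling are assumptions; the actual converter ops may 
differ):

   #include <math.h>
   #include <stdint.h>

   /* Expand an S.E4.F3 value to float: 1 sign bit, 4 exponent bits
      (assumed bias 7), 3 fraction bits.  Exponent 0 is treated as
      the denormal range here; special encodings are ignored. */
   float fp8_s_e4_f3_to_float(uint8_t v)
   {
       int   sign = (v >> 7) & 1;
       int   e    = (v >> 3) & 0x0F;
       int   frac =  v       & 0x07;
       float mag;

       if (e == 0)
           mag = (frac / 8.0f) * (1.0f / 64.0f);
       else
           mag = (1.0f + frac / 8.0f) * ldexpf(1.0f, e - 7);

       return sign ? -mag : mag;
   }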

I guess, if one wants, they could try to make a case for a SIMD op that 
========== REMAINDER OF ARTICLE TRUNCATED ==========