Path: ...!feeds.phibee-telecom.net!3.eu.feeder.erje.net!feeder.erje.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Instruction Tracing
Date: Sat, 10 Aug 2024 16:34:47 -0500
Organization: A noiseless patient Spider
Lines: 289
Message-ID: <v98mdq$tucp$1@dont-email.me>
References: <v970s3$flpo$1@dont-email.me>
 <2024Aug10.121802@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 10 Aug 2024 23:34:51 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="5cc651a5fc60f56c841dd20083265ec1";
	logging-data="981401"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX18WR+hMz+zLSGaIHueZg42phYw/QKZ6oZU="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:EUfbl+0JUU+9hpm0teVXfCqK27w=
Content-Language: en-US
In-Reply-To: <2024Aug10.121802@mips.complang.tuwien.ac.at>
Bytes: 12520

On 8/10/2024 5:18 AM, Anton Ertl wrote:
> Lawrence D'Oliveiro <ldo@nz.invalid> writes:
>> One thing these instruction traces would frequently report is that integer
>> multiply and divide instructions were not so common, and so could be
>> omitted and emulated in software, with minimal impact on overall
>> performance. We saw this design decision taken in the early versions of
>> Sun’s SPARC for example, and also IBM’s ROMP as used in the RT PC.
> 
> Alpha and IA-64 have no integer division.  IIRC IA-64 has no FP
> division.
> 
> One interesting aspect of RISC-V is that they put multiplication and
> division in the same extension (which is included in RV64G, i.e., the
> General version of RISC-V).
> 

Initially, BJX2 didn't have division, but now it does.
I will put the blame on trying to support a RISC-V decoder as well...

One can leave both out and the performance effect is fairly modest.
Some programs need faster integer division, but the usual workaround is 
to look up a fixed-point reciprocal in a table and multiply (a trick 
often needed anyway, since even on many machines with a divide 
instruction, division was still slow).


There is an optimization that can reduce the common case of integer 
divide to around 3 cycles, but if the code already uses lookup tables 
(to sidestep a slow or absent integer divide), this doesn't save much. 
It only really pays off if the code was written assuming a fast integer 
divide in the first place.

So, seemingly, in practice the only thing that really notices or cares 
is Dhrystone (which significantly over-represents the value of integer 
divide).



As-is, leaving out hardware divide would save ~ 2K LUTs, as this is 
mostly the cost of the Shift-Add unit.

But, this unit also implements another (arguably slightly more useful) 
feature: 64-bit multiply. I was also able to route floating-point 
divide through this unit.


Theoretically, I could extend the Shift-Add unit to 128 bits, and 
potentially add:
   128-bit integer multiply and divide;
   Binary128 FMUL and FDIV.

For Binary128, FADD/FSUB and FCMP are cheaper than FMUL, so this 
approach could potentially make Binary128 support in hardware "viable".

But, debatable if worth the LUTs.


The BJX2 core is already expensive...

Though, a few big/expensive features being:
   The FP-SIMD unit (supports 4x Binary32 with a 3-cycle latency);
   The stuff needed for the LDOP / RISC-V 'A' extension (*);
   The main FPU (Binary64);
   ...

A little over 1/4 of the LUT cost of the core goes into the L1 caches.


*: The LDOP extension adds x86 style Load-Op and Op-Store instructions 
for basic ALU instructions, because the RISC-V 'A' extension already 
requires one to pay most of the cost of doing so (even if the 'A' 
extension has slightly different behavior, most of the difference is in 
the decoder).

The "cheaper option" would have been:
   Don't bother with doing ALU ops against memory;
   Don't bother with LL/SC or CAS;
   Just add a SWAP/XCHG instruction, with non-caching variants.
     Non-Caching XCHG is sufficient to implement a Spinlock/Mutex.


I guess the selling point of atomic operations is that one only sees the 
before or after of an operation, but I am not sold on its merits.

The design of the 'A' extension also seems to assume a memory subsystem 
where a core can 'reserve' a cache line in a single-writer sense. I just 
sort of "winged it", as my memory subsystem was designed around 
volatile/non-volatile access:
   Volatile:
     The L1 cache flushes the line (if needed);
     Fetches the line;
     Does the operation;
     Flushes the line back to memory shortly afterwards.
   Non-volatile:
     Default; fetch the line and keep it around;
     Memory may become stale if not flushed.
So, the AQ/RL flags mostly just serve as hints for whether to use 
Volatile access.

As-is, the RV LL/SC instructions won't actually work as described.
   Also FENCE.I will just trap, ...

I did see some comments online where people were saying that the 'A' 
semantics and LL/SC can't be implemented on an AXI bus, but I haven't 
really looked into this.



Outside the main CPU, there is the rasterizer module, which uses about 
as many LUTs as a small 32-bit CPU core (and roughly the same number of 
DSP48's as the main CPU core).


There is a feature that is "kinda expensive", namely the "LDTEX" 
instruction (Load Texture / Texel Load), which is less needed with the 
rasterizer module. It was mostly relevant to software-rasterization 
performance in TKRA-GL. But, annoyingly, it would become relevant again 
if I ever get an ARB-ASM or GLSL compiler implemented.


But, then I would also need to figure out how to approach GL's behavior 
towards Nearest vs Linear fetch in shaders. Don't necessarily want to 
handle it dynamically in software.

Though, I guess one option would be to have a multi-stage compiler:
First stage, compile to an IR, likely similar to a modified/extended 
form of ARB-ASM (ARB-ASM would still require basic translation, mostly 
to map symbolic names to internal register numbers and likely to unpack 
some complex instructions into simpler ones);
As needed, JIT to machine code, with some variation based on (among 
other things) the parameters of the bound textures.



>> Later, it seems, the CPU designers realized that instruction traces were
>> not the final word on performance measurements, and started to include
>> hardware integer multiply and divide instructions.
> 
> When you invest more hardware to increase performance per cycle, at
> one point the best return on investment is to have multiplication and
> division instructions.  What is interesting is that the multipliers
> have than soon been fully pipelined.  Or, as Mitch Alsup reports, in
> cases where that was cheaper, have two half-pipelined multipliers.
> Apparently there are enough applications that require a huge number of
> multiplications; my guess is that the NSA won't tell us what they are.
> 

Multiply is probably 1 or 2 orders of magnitude more common than divide.


My rough ranking of instruction probabilities (descending probability, *):
   Load/Store (Constant Displacement, ~30%);
   Branch (~ 14% of ops);
   ALU, ADD/SUB/AND/OR (~ 13%);
   Load/Store (Register Indexed, ~10%);
   Compare and Test (~ 6%);
   Integer Shift (~ 4%);
   Register Move (~ 3%);
   Sign/Zero Extension (~ 3%);
   ALU, XOR (~ 2%);
   Multiply (~ 2%);
   ...

*: Crude estimate based on categorizing the dynamic execution 
probabilities (which are per-instruction rather than by category).

Meanwhile, DIV and friends are generally closer to 0.05% or so...
   You can leave them out and hardly anyone will notice.


For the most part, something like RISC-V makes sense, except that 
omitting indexed Load/Store is basically akin to shooting oneself in 
the foot (and does result in a significant increase in the number of 
Shift and ADD instructions used).


With RISC-V, one may see ~ 25% Load/Store followed by ~ 20% ADD and 15% 
Shift, ...

========== REMAINDER OF ARTICLE TRUNCATED ==========