Article <vpasaa$3itge$1@dont-email.me>

Path: ...!weretis.net!feeder9.news.weretis.net!news.quux.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Cost of handling misaligned access
Date: Fri, 21 Feb 2025 15:47:50 -0600
Organization: A noiseless patient Spider
Lines: 205
Message-ID: <vpasaa$3itge$1@dont-email.me>
References: <5lNnP.1313925$2xE6.991023@fx18.iad> <vnosj6$t5o0$1@dont-email.me>
 <2025Feb3.075550@mips.complang.tuwien.ac.at> <volg1m$31ca1$1@dont-email.me>
 <voobnc$3l2dl$1@dont-email.me>
 <0fc4cc997441e25330ff5c8735247b54@www.novabbs.org>
 <vp0m3f$1cth6$1@dont-email.me>
 <74142fbdc017bc560d75541f3f3c5118@www.novabbs.org>
 <20250218150739.0000192a@yahoo.com>
 <0357b097bbbf6b87de9bc91dd16757e3@www.novabbs.org>
 <vp2sv2$1skve$1@dont-email.me>
 <a34ce3b43fab761d13b2432f9e255fab@www.novabbs.org>
 <vp518t$2bhib$1@dont-email.me>
 <a56e446b2e2df9f01eb558aa68279d35@www.novabbs.org>
 <vp5mnu$2fjhi$1@dont-email.me> <BP4uP.273689$6Mub.167898@fx45.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 21 Feb 2025 22:47:54 +0100 (CET)
Injection-Info: dont-email.me; posting-host="98b20ea62ba459119821c932ae14e520";
	logging-data="3765774"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1/dZbO1vPoZVZq7avjkxd37mDfFPQNA20Y="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:emJVrpGkHDBBZI2m0ToXufHFnho=
In-Reply-To: <BP4uP.273689$6Mub.167898@fx45.iad>
Content-Language: en-US
Bytes: 9371

On 2/21/2025 1:51 PM, EricP wrote:
> BGB wrote:
>>
>> Can note that the latency of carry-select adders is a little weird:
>>   16/32/64: Latency goes up steadily;
>>     But, still less than linear;
>>   128-bit: Only slightly more latency than 64-bit.
>>
>> The best I could find in past testing was seemingly 16-bit chunks for 
>> normal adding. Where, 16-bits seemed to be around the break-even 
>> between the chained CARRY4's and the Carry-Select (CS being slower 
>> below 16 bits).
>>
>> But, for a 64-bit adder, still basically need to give it a clock-cycle 
>> to do its thing. Though, not like 32 is particularly fast either; 
>> hence part of the whole 2 cycle latency on ALU ops thing. Mostly has 
>> to do with ADD/SUB (and CMP, which is based on SUB).
>>
>>
>> Admittedly part of why I have such mixed feelings on full 
>> compare-and-branch:
>>   Pro: It can offer a performance advantage (in terms of per-clock);
>>   Con: Branch is now beholden to the latency of a Subtract.
> 
> IIRC your cpu clock speed is about 75 MHz (13.3 ns)
> and you are saying it takes 2 clocks for a 64-bit ADD.
> 

The 75MHz was mostly experimental; mostly I am running at 50MHz because 
it is easier (a whole lot of corners need to be cut to reach 75MHz, so 
overall performance often ended up being worse).
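
To put rough numbers on that trade-off, the sketch below compares cycle
time and throughput at the two clock speeds. The CPI (cycles per
instruction) values are made up purely for illustration and are not
measurements from the BJX2 core; the only point is that 75MHz wins only
if the average CPI grows by less than 1.5x.

```python
def period_ns(freq_mhz: float) -> float:
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz


def mips(freq_mhz: float, cpi: float) -> float:
    """Millions of instructions per second for a clock and an average CPI."""
    return freq_mhz / cpi


# Hypothetical CPI values (assumptions, not measured data).
CPI_AT_50 = 1.6
CPI_AT_75 = 2.6  # assumed worse, standing in for whatever corners get cut

if __name__ == "__main__":
    for freq, cpi in ((50, CPI_AT_50), (75, CPI_AT_75)):
        print(f"{freq} MHz: {period_ns(freq):.1f} ns/cycle, "
              f"{mips(freq, cpi):.1f} MIPS at CPI {cpi}")
    # 75MHz only pays off while CPI_AT_75 / CPI_AT_50 stays below this ratio.
    print(f"break-even CPI ratio: {75 / 50:.2f}")
```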


The ADD goes via the main ALU, which also shares its logic with SUB, CMP, 
and similar...

Generally, I give more or less a full cycle for the ADD to do its thing, 
with the result presented to the outside world on the second cycle, 
where it can go through the register forwarding chains and similar.

This gives it a 2 cycle latency.

Operations with a 1 cycle latency need to feed their output directly 
into the register forwarding logic.
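
As a toy illustration of what the extra cycle costs, here is a small
model of a hypothetical in-order, single-issue pipeline (an assumption
made for simplicity; it is not a model of the actual core). Each
instruction may name one earlier instruction whose result it consumes,
and cannot issue until that result is available for forwarding.

```python
def issue_cycles(deps, latency):
    """Return the issue cycle of each instruction.

    deps[i] is the index of the earlier instruction whose result
    instruction i consumes, or None if it has no dependency.
    All instructions are assumed to have the same result latency.
    """
    issue = []
    for dep in deps:
        # In-order, one instruction per cycle at most.
        earliest = issue[-1] + 1 if issue else 0
        if dep is not None:
            # Wait until the producer's result can be forwarded.
            earliest = max(earliest, issue[dep] + latency)
        issue.append(earliest)
    return issue


CHAIN = [None, 0, 1, 2]                 # each ADD uses the previous ADD's result
INDEPENDENT = [None, None, None, None]  # no ADD depends on another

if __name__ == "__main__":
    print("chain, 1-cycle latency:      ", issue_cycles(CHAIN, 1))
    print("chain, 2-cycle latency:      ", issue_cycles(CHAIN, 2))
    print("independent, 2-cycle latency:", issue_cycles(INDEPENDENT, 2))
```

In this model the 2-cycle latency only hurts back-to-back dependent
operations; independent operations still issue every cycle.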


In a pseudocode sense, something like:
   // SUB is done as A + ~B + carry-in.
   tValB = IsSUB ? ~valB : valB;
   // Add each 16-bit chunk twice (17-bit results): carry-in 0 and carry-in 1.
   tAddA0 = { 1'b0, valA[15:0] } + { 1'b0, tValB[15:0] } + 0;
   tAddA1 = { 1'b0, valA[15:0] } + { 1'b0, tValB[15:0] } + 1;
   tAddB0 = { 1'b0, valA[31:16] } + { 1'b0, tValB[31:16] } + 0;
   tAddB1 = { 1'b0, valA[31:16] } + { 1'b0, tValB[31:16] } + 1;
   tAddC0 = ...
   ...
   // Select chain: bit 16 (carry-out) of the chosen sum selects the next chunk.
   tAddSbA = tCarryIn;
   tAddSbB = tAddSbA ? tAddA1[16] : tAddA0[16];
   tAddSbC = tAddSbB ? tAddB1[16] : tAddB0[16];
   ...
   // Assemble the result from the low 16 bits of each selected sum.
   tAddRes = {
      tAddSbD ? tAddD1[15:0] : tAddD0[15:0],
      tAddSbC ? tAddC1[15:0] : tAddC0[15:0],
      tAddSbB ? tAddB1[15:0] : tAddB0[15:0],
      tAddSbA ? tAddA1[15:0] : tAddA0[15:0]
   };


This works, but ideally it still needs to be given a full clock cycle to 
do its work.
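
For anyone wanting to sanity-check the scheme in software, the
pseudocode above can be modeled roughly as follows. This is only a
behavioral sketch: the function name is made up, and it assumes the
carry-in is 1 for SUB and 0 for ADD (the pseudocode leaves tCarryIn
unspecified). In hardware all eight chunk sums are computed in
parallel; the loop here only mimics the select chain.

```python
MASK16 = 0xFFFF
MASK64 = (1 << 64) - 1


def carry_select_add64(a, b, is_sub=False):
    """Model a 64-bit ADD/SUB built from four 16-bit carry-select chunks.

    Returns (result, carry_out).
    """
    a &= MASK64
    b &= MASK64
    if is_sub:
        b = ~b & MASK64          # SUB is A + ~B + 1
    carry = 1 if is_sub else 0   # assumption: tCarryIn = IsSUB

    result = 0
    for chunk in range(4):
        shift = 16 * chunk
        xa = (a >> shift) & MASK16
        xb = (b >> shift) & MASK16
        sum0 = xa + xb           # 17-bit sum assuming carry-in 0
        sum1 = xa + xb + 1       # 17-bit sum assuming carry-in 1
        sel = sum1 if carry else sum0   # mux on the incoming carry
        result |= (sel & MASK16) << shift
        carry = sel >> 16        # bit 16 is this chunk's carry-out
    return result, carry


if __name__ == "__main__":
    total, cout = carry_select_add64(0x0000_FFFF_FFFF_FFFF, 1)
    print(hex(total), cout)
    diff, cout = carry_select_add64(5, 7, is_sub=True)
    print(hex(diff), cout)
```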



Note that one has to be careful with logic coupling: if too many 
things are tied together, one may get a "routing congestion" warning 
message, and timing generally fails in this case...

Also, the "inferring latch" warning is one of those "you really gotta go 
fix this" issues (it generally indicates Verilog bugs, and also 
negatively affects timing).


> I don't remember what Xilinx chip you are using but this paper describes
> how to do a 64-bit ADD at between 350 Mhz (2.8 ns) to 400 MHz (2.5 ns)
> on a Virtex-5:
> 
> A Fast Carry Chain Adder for Virtex-5 FPGAs, 2010
> https://scholar.archive.org/work/tz6fy2zm4fcobc6k7khsbwskh4/access/wayback/http://ece.gmu.edu:80/coursewebpages/ECE/ECE645/S11/projects/project_1_resources/Adders_MELECON_2010.pdf
> 

As for Virtex: I am not made of money...

Virtex parts tend to be absurdly expensive high-end FPGAs.
   Even the older Virtex chips are still absurdly expensive.


Kintex is considered mid range, but still too expensive, and mostly not 
usable in the free versions of Vivado (and there are no real viable FOSS 
alternatives to Vivado). When I tried looking at some of the "open 
source" tools for targeting Xilinx chips, they were doing the hacky 
thing of basically invoking Xilinx's tools in the background (which, if 
used to target a Kintex, is essentially piracy).

Where, a valid FOSS tool would need to be able to do everything and 
generate the bitstream itself.



Mostly I am using Spartan-7 and Artix-7.
   Generally at the -1 speed grade (slowest, but cheapest).

These are mostly considered low-end and consumer-electronics oriented 
FPGAs by Xilinx.


Or, by "car analogies":
You can't expect a "VW Jetta" to perform like a "Ferrari Enzo" even if 
the "Jetta" is a newer model year...



Cheapest FPGA dev-boards I have gotten a (minimal) BJX2 core onto were 
around $70 (XC7S25). Most expensive dev-board I have is the Nexys A7 
(XC7A100T), but it has gone up in price (IIRC, it was around $290 at the 
time; right now seems like $350, but was IIRC a bit more in 2021/2022).


There was the temptation to get a "Nexys Video", which has an XC7A200T-2, 
but it is very expensive (around $600 IIRC). However, this chip *could* 
pass 75MHz a bit more easily (though still not enough to easily reach 
100MHz).


I have a QMTech board with an XC7A200T at -1, but generally, it seems to 
actually have a slightly harder time passing timing constraints than the 
XC7A100T in the Nexys A7 (possibly some sort of Vivado magic here).


Generally, also hard to find FPGA boards much under $100 that "aren't crap".

A lot of the ICE40 boards fail on both counts: often not really any 
cheaper than XC7S25 or XC7A35T based boards, but much worse in comparison.

There was rumor of cheaper boards (in the $30-$50 range), but I have not 
seen anything that seems worth bothering with in this range.


Had noted that I could get higher clock speeds (or "fmax" in their 
terms) on Intel/Altera chips (according to Quartus), but didn't buy any 
of these:
The DE10 was expensive, and at the time, even a less feature-rich 
version of the BJX2 core basically ate the entire resource budget of the 
DE10 (but, IIRC, was otherwise an fmax of around 85MHz or something).


IIRC, something had seemingly gone horribly wrong with attempts to use 
LUTRAM arrays, and it was needing to fall back to trying to make them 
out of Flip-Flops...

IIRC, it was something like they didn't have LUTRAMs of the same sort 
as Xilinx, but rather smaller and larger Block RAMs; and, possibly, the 
register file would need to be reworked to fit BRAM-like access patterns 
(namely, that reads are only performed on a clock edge rather than 
combinationally). Granted, this could be done in theory, but it means I 
would need to feed in the register-port inputs on the ID1/ID2 edge, 
rather than in the ID2 stage itself.
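
The difference between the two read styles can be sketched with a pair
of toy register-file models. The class names, the 64-entry size, and the
read-before-write ordering are all assumptions picked for illustration;
neither class describes a specific vendor primitive.

```python
class AsyncReadRegFile:
    """LUTRAM-style: read data follows the address within the same cycle."""

    def __init__(self, size=64):
        self.regs = [0] * size

    def read(self, addr):
        return self.regs[addr]          # combinational read, no clock needed

    def clock(self, wr_en=False, wr_addr=0, wr_data=0):
        if wr_en:
            self.regs[wr_addr] = wr_data


class SyncReadRegFile:
    """BRAM-style: the read address is sampled on the clock edge."""

    def __init__(self, size=64):
        self.regs = [0] * size
        self.rd_data = 0                # registered read output

    def clock(self, rd_addr, wr_en=False, wr_addr=0, wr_data=0):
        # Assumed read-before-write ordering on a same-address collision.
        self.rd_data = self.regs[rd_addr]
        if wr_en:
            self.regs[wr_addr] = wr_data


if __name__ == "__main__":
    a = AsyncReadRegFile()
    a.clock(wr_en=True, wr_addr=5, wr_data=42)
    print("async read, same cycle:", a.read(5))

    s = SyncReadRegFile()
    s.clock(rd_addr=0, wr_en=True, wr_addr=5, wr_data=42)
    print("sync read before address edge:", s.rd_data)
    s.clock(rd_addr=5)                  # address presented one edge early
    print("sync read after address edge: ", s.rd_data)
```

With the synchronous version, the read address has to be known one
clock edge before the data is needed, which is why the register-port
inputs would have to move to the ID1/ID2 edge.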

I didn't really mess with it much at the time to figure it out...

========== REMAINDER OF ARTICLE TRUNCATED ==========