Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: Robert Finch <robfi680@gmail.com>
Newsgroups: comp.arch
Subject: Re: Cost of handling misaligned access
Date: Sat, 22 Feb 2025 14:25:18 -0500
Organization: A noiseless patient Spider
Lines: 210
Message-ID: <vpd8b0$3afn$1@dont-email.me>
References: <5lNnP.1313925$2xE6.991023@fx18.iad> <vnosj6$t5o0$1@dont-email.me>
 <2025Feb3.075550@mips.complang.tuwien.ac.at> <volg1m$31ca1$1@dont-email.me>
 <voobnc$3l2dl$1@dont-email.me>
 <0fc4cc997441e25330ff5c8735247b54@www.novabbs.org>
 <vp0m3f$1cth6$1@dont-email.me>
 <74142fbdc017bc560d75541f3f3c5118@www.novabbs.org>
 <20250218150739.0000192a@yahoo.com>
 <0357b097bbbf6b87de9bc91dd16757e3@www.novabbs.org>
 <vp2sv2$1skve$1@dont-email.me>
 <a34ce3b43fab761d13b2432f9e255fab@www.novabbs.org>
 <vp518t$2bhib$1@dont-email.me>
 <a56e446b2e2df9f01eb558aa68279d35@www.novabbs.org>
 <vp5mnu$2fjhi$1@dont-email.me> <BP4uP.273689$6Mub.167898@fx45.iad>
 <vpasaa$3itge$1@dont-email.me> <OTluP.690994$rHoc.634573@fx17.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 22 Feb 2025 20:25:20 +0100 (CET)
Injection-Info: dont-email.me; posting-host="5d7472d8ddba9eb3c4ffd70ed86a1efb";
	logging-data="109047"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX19jvtYzTQnuZLLZztbgYd/ZRWIQ8LukCiI="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:L5PzqWLFJob0WkfRVFK7zW2HCho=
Content-Language: en-US
In-Reply-To: <OTluP.690994$rHoc.634573@fx17.iad>
Bytes: 10263

On 2025-02-22 10:16 a.m., EricP wrote:
> BGB wrote:
>> On 2/21/2025 1:51 PM, EricP wrote:
>>> BGB wrote:
>>>>
>>>> Can note that the latency of carry-select adders is a little weird:
>>>>   16/32/64: Latency goes up steadily;
>>>>     But, still less than linear;
>>>>   128-bit: Only slightly more latency than 64-bit.
>>>>
>>>> The best I could find in past testing was seemingly 16-bit chunks 
>>>> for normal adding. Where, 16-bits seemed to be around the break-even 
>>>> between the chained CARRY4's and the Carry-Select (CS being slower 
>>>> below 16 bits).
>>>>
>>>> But, for a 64-bit adder, still basically need to give it a clock
>>>> cycle to do its thing. Though, not like 32 is particularly fast
>>>> either; hence part of the whole 2 cycle latency on ALU ops thing.
>>>> Mostly has to do with ADD/SUB (and CMP, which is based on SUB).
>>>>
>>>>
>>>> Admittedly part of why I have such mixed feelings on full
>>>> compare-and-branch:
>>>>   Pro: It can offer a performance advantage (in terms of per-clock);
>>>>   Con: Branch is now beholden to the latency of a Subtract.
>>>
>>> IIRC your cpu clock speed is about 75 MHz (13.3 ns)
>>> and you are saying it takes 2 clocks for a 64-bit ADD.
>>>
>>
>> The 75MHz was mostly experimental, mostly I am running at 50MHz 
>> because it is easier (a whole lot of corners need to be cut for 75MHz, 
>> so often overall performance ended up being worse).
>>
>>
>> Via the main ALU, which also shares the logic for SUB and CMP and 
>> similar...
>>
>> Generally, I give more or less a full cycle for the ADD to do its 
>> thing, with the result presented to the outside world on the second 
>> cycle, where it can go through the register forwarding chains and 
>> similar.
>>
>> This gives it a 2 cycle latency.
>>
>> Operations with a 1 cycle latency need to feed their output directly 
>> into the register forwarding logic.
>>
>>
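>> In a pseudocode sense, something like:
>>   // For SUB, B is inverted (A - B = A + ~B + 1).
>>   tValB = IsSUB ? ~valB : valB;
>>   // Each 16-bit chunk is added twice, once per possible carry-in.
>>   // Results are 17 bits wide; bit 16 is the chunk's carry-out.
>>   tAddA0 = { 1'b0, valA[15:0] } + { 1'b0, tValB[15:0] } + 0;
>>   tAddA1 = { 1'b0, valA[15:0] } + { 1'b0, tValB[15:0] } + 1;
>>   tAddB0 = { 1'b0, valA[31:16] } + { 1'b0, tValB[31:16] } + 0;
>>   tAddB1 = { 1'b0, valA[31:16] } + { 1'b0, tValB[31:16] } + 1;
>>   tAddC0 = ...
>>   ...
>>   // Select chain: each chunk's carry-in is the carry-out of the
>>   // version of the previous chunk that was actually selected.
>>   tAddSbA = tCarryIn;
>>   tAddSbB = tAddSbA ? tAddA1[16] : tAddA0[16];
>>   tAddSbC = tAddSbB ? tAddB1[16] : tAddB0[16];
>>   ...
>>   // Assemble the result from the selected low 16 bits of each chunk.
>>   tAddRes = {
>>      tAddSbD ? tAddD1[15:0] : tAddD0[15:0],
>>      tAddSbC ? tAddC1[15:0] : tAddC0[15:0],
>>      tAddSbB ? tAddB1[15:0] : tAddB0[15:0],
>>      tAddSbA ? tAddA1[15:0] : tAddA0[15:0]
>>   };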
>>
>>
>> This works, but still need to ideally give it a full clock-cycle to do 
>> its work.
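
For anyone who wants to poke at the scheme quoted above, here is a minimal behavioural sketch in Python. It is a software model only, not BGB's actual RTL: the function name, the 4 x 16-bit split, and tying the carry-in to the subtract flag are my assumptions for illustration. In hardware the two candidate sums for every chunk are computed in parallel; the loop below only models the select chain.

```python
MASK16 = 0xFFFF
MASK64 = 0xFFFFFFFFFFFFFFFF

def carry_select_add64(val_a, val_b, is_sub=False):
    """Model a 64-bit carry-select adder built from four 16-bit chunks.

    Returns (64-bit result, carry-out). Hypothetical helper, not from
    the original post.
    """
    # Subtraction is A + ~B + 1: invert B and force the carry-in to 1.
    # (Assumption: the carry-in is driven by the subtract flag.)
    t_val_b = (~val_b if is_sub else val_b) & MASK64
    carry = 1 if is_sub else 0
    result = 0
    for chunk in range(4):
        shift = 16 * chunk
        a = (val_a >> shift) & MASK16
        b = (t_val_b >> shift) & MASK16
        # Two 17-bit candidate sums; bit 16 is the chunk's carry-out.
        sum0 = a + b        # as if carry-in were 0
        sum1 = a + b + 1    # as if carry-in were 1
        # The carry from the previous chunk picks one candidate.
        chosen = sum1 if carry else sum0
        result |= (chosen & MASK16) << shift
        carry = chosen >> 16
    return result, carry
```

Calling carry_select_add64(a, b) should agree with (a + b) & MASK64 for any 64-bit inputs, and is_sub=True should agree with (a - b) & MASK64, which makes it easy to check against random values.
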
>>
>>
>>
>> Note that one has to be careful with logic coupling, as if too many 
>> things are tied together, one may get a "routing congestion" warning 
>> message, and generally timing fails in this case...
>>
>> Also, an "inferring latch" warning is one of those "you really gotta
>> go fix this" issues (it generally indicates a Verilog bug, and also
>> negatively affects timing).
>>
>>
>>> I don't remember what Xilinx chip you are using but this paper describes
>>> how to do a 64-bit ADD at between 350 MHz (2.9 ns) and 400 MHz (2.5 ns)
>>> on a Virtex-5:
>>>
>>> A Fast Carry Chain Adder for Virtex-5 FPGAs, 2010
>>> https://scholar.archive.org/work/tz6fy2zm4fcobc6k7khsbwskh4/access/wayback/http://ece.gmu.edu:80/coursewebpages/ECE/ECE645/S11/projects/project_1_resources/Adders_MELECON_2010.pdf
>>>
>>
>> As for Virtex: I am not made of money...
>>
>> Virtex tends to be absurdly expensive high-end FPGAs.
>>   Even the older Virtex chips are still absurdly expensive.
>>
>>
>> Kintex is considered mid range, but still too expensive, and mostly 
>> not usable in the free versions of Vivado (and there are no real 
>> viable FOSS alternatives to Vivado). When I tried looking at some of 
>> the "open source" tools for targeting Xilinx chips, they were doing 
>> the hacky thing of basically invoking Xilinx's tools in the background 
>> (which, if used to target a Kintex, is essentially piracy).
> 
> I don't think that it is copyright infringement to have a script or code
> generator output drive a compiler or tool instead of your hands.
> 
>> Where, a valid FOSS tool would need to be able to do everything and 
>> generate the bitstream itself.
>>
>>
>>
>> Mostly I am using Spartan-7 and Artix-7.
>>   Generally at the -1 speed grade (slowest, but cheapest).
> 
> The second paper also covers Spartan-6 and says it has the same
> LUT architecture as Virtex-5 and -6. Their speed testing was done on
> Virtex-6 but the design should apply.
> 
> Anyway it was the concepts of how to optimize the carry that were
> important. I would expect to have to write code to port the ideas.
> 
>> These are mostly considered low-end and consumer-electronics oriented 
>> FPGAs by Xilinx.
> 
> <snip>
> 
>> I have a QMTech board with an XC7A200T at -1, but generally, it seems 
>> to actually have a slightly harder time passing timing constraints 
>> than the XC7A100T in the Nexys A7 (possibly some sort of Vivado magic 
>> here).
>>
>>
>>> and this does 64-bit ADD up to 428 MHz (2.3 ns) on a Virtex-6:
>>>
>>> Fast and Area Efficient Adder for Wide Data in Recent Xilinx FPGAs, 2016
>>> http://www.diva-portal.org/smash/get/diva2:967655/FULLTEXT02.pdf
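
To put the clock figures in this thread side by side, here is a small Python sketch that converts them to periods and add latencies. The helper names are made up for illustration, and treating the paper adders as completing in a single cycle at their quoted clock is my assumption; the papers may pipeline differently.

```python
def period_ns(freq_mhz):
    """Clock period in nanoseconds for a frequency given in MHz."""
    return 1000.0 / freq_mhz

def latency_ns(freq_mhz, cycles):
    """Total latency in nanoseconds for a number of clock cycles."""
    return cycles * period_ns(freq_mhz)

# (label, clock in MHz, cycles per 64-bit ADD); figures quoted in the thread.
# The one-cycle entries for the papers are an assumption, see above.
cases = [
    ("soft core at 50 MHz, 2-cycle ALU", 50.0, 2),
    ("soft core at 75 MHz, 2-cycle ALU", 75.0, 2),
    ("Virtex-5 paper, low end", 350.0, 1),
    ("Virtex-5 paper, high end", 400.0, 1),
    ("Virtex-6 paper", 428.0, 1),
]

for label, mhz, cycles in cases:
    print(f"{label}: period {period_ns(mhz):.2f} ns, "
          f"add latency {latency_ns(mhz, cycles):.2f} ns")
```

The comparison is not apples to apples, since a -1 speed grade Artix-7 running a full CPU pipeline is a very different thing from a hand-placed adder benchmarked on its own.
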
>>>
>>
>> Errm, skim, this doesn't really look like something you can pull off 
>> in normal Verilog.
> 
> Well that's what I'm trying to figure out because it's not just this paper
> but a lot, like many hundreds, of papers I've read from commercial or
> academic sources that seem to be able to control the FPGA results
> to a fine degree.
> 
>> Generally, one doesn't have control over how the components hook
>> together; one can only influence what happens through how one writes
>> the Verilog.
> 
> That paper mentions in section III
> "In order to reduce uncontrollable routing delays in the comparisons,
> everything was manually placed, according to the floorplan in Fig. 7."
> 
> Is that the key - manually place things adjacent and hope the
> wire router does the right thing?
> 
> That sounds too flaky. You need to be able to reliably construct optimized
> modules and then attach to them.
> 
>> You can just write:
>>   reg[63:0] tValA;
========== REMAINDER OF ARTICLE TRUNCATED ==========