Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: Robert Finch <robfi680@gmail.com>
Newsgroups: comp.arch
Subject: Re: Cost of handling misaligned access
Date: Sat, 22 Feb 2025 14:25:18 -0500
Organization: A noiseless patient Spider
Lines: 210
Message-ID: <vpd8b0$3afn$1@dont-email.me>
References: <5lNnP.1313925$2xE6.991023@fx18.iad> <vnosj6$t5o0$1@dont-email.me>
<2025Feb3.075550@mips.complang.tuwien.ac.at> <volg1m$31ca1$1@dont-email.me>
<voobnc$3l2dl$1@dont-email.me>
<0fc4cc997441e25330ff5c8735247b54@www.novabbs.org>
<vp0m3f$1cth6$1@dont-email.me>
<74142fbdc017bc560d75541f3f3c5118@www.novabbs.org>
<20250218150739.0000192a@yahoo.com>
<0357b097bbbf6b87de9bc91dd16757e3@www.novabbs.org>
<vp2sv2$1skve$1@dont-email.me>
<a34ce3b43fab761d13b2432f9e255fab@www.novabbs.org>
<vp518t$2bhib$1@dont-email.me>
<a56e446b2e2df9f01eb558aa68279d35@www.novabbs.org>
<vp5mnu$2fjhi$1@dont-email.me> <BP4uP.273689$6Mub.167898@fx45.iad>
<vpasaa$3itge$1@dont-email.me> <OTluP.690994$rHoc.634573@fx17.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sat, 22 Feb 2025 20:25:20 +0100 (CET)
Injection-Info: dont-email.me; posting-host="5d7472d8ddba9eb3c4ffd70ed86a1efb";
logging-data="109047"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19jvtYzTQnuZLLZztbgYd/ZRWIQ8LukCiI="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:L5PzqWLFJob0WkfRVFK7zW2HCho=
Content-Language: en-US
In-Reply-To: <OTluP.690994$rHoc.634573@fx17.iad>
Bytes: 10263
On 2025-02-22 10:16 a.m., EricP wrote:
> BGB wrote:
>> On 2/21/2025 1:51 PM, EricP wrote:
>>> BGB wrote:
>>>>
>>>> Can note that the latency of carry-select adders is a little weird:
>>>> 16/32/64: Latency goes up steadily;
>>>> But, still less than linear;
>>>> 128-bit: Only slightly more latency than 64-bit.
>>>>
>>>> The best I could find in past testing was seemingly 16-bit chunks
>>>> for normal adding. Where, 16-bits seemed to be around the break-even
>>>> between the chained CARRY4's and the Carry-Select (CS being slower
>>>> below 16 bits).
>>>>
>>>> But, for a 64-bit adder, still basically need to give it a clock-
>>>> cycle to do its thing. Though, not like 32 is particularly fast
>>>> either; hence part of the whole 2 cycle latency on ALU ops thing.
>>>> Mostly has to do with ADD/SUB (and CMP, which is based on SUB).
>>>>
>>>>
>>>> Admittedly part of why I have such mixed feelings on full
>>>> compare-and-branch:
>>>> Pro: It can offer a performance advantage (in terms of per-clock);
>>>> Con: Branch is now beholden to the latency of a Subtract.
>>>
>>> IIRC your cpu clock speed is about 75 MHz (13.3 ns)
>>> and you are saying it takes 2 clocks for a 64-bit ADD.
>>>
>>
>> The 75MHz was mostly experimental, mostly I am running at 50MHz
>> because it is easier (a whole lot of corners need to be cut for 75MHz,
>> so often overall performance ended up being worse).
>>
>>
>> Via the main ALU, which also shares the logic for SUB and CMP and
>> similar...
>>
>> Generally, I give more or less a full cycle for the ADD to do its
>> thing, with the result presented to the outside world on the second
>> cycle, where it can go through the register forwarding chains and
>> similar.
>>
>> This gives it a 2 cycle latency.
>>
>> Operations with a 1 cycle latency need to feed their output directly
>> into the register forwarding logic.
>>
>>
>> In a pseudocode sense, something like:
>> tValB = IsSUB ? ~valB : valB;
>> tAddA0={ 1'b0, valA[15:0] } + { 1'b0, tValB[15:0] } + 0;
>> tAddA1={ 1'b0, valA[15:0] } + { 1'b0, tValB[15:0] } + 1;
>> tAddB0={ 1'b0, valA[31:16] } + { 1'b0, tValB[31:16] } + 0;
>> tAddB1={ 1'b0, valA[31:16] } + { 1'b0, tValB[31:16] } + 1;
>> tAddC0=...
>> ...
>> tAddSbA = tCarryIn;
>> tAddSbB = tAddSbA ? tAddA1[16] : tAddA0[16];
>> tAddSbC = tAddSbB ? tAddB1[16] : tAddB0[16];
>> ...
>> tAddRes = {
>> tAddSbD ? tAddD1[15:0] : tAddD0[15:0],
>> tAddSbC ? tAddC1[15:0] : tAddC0[15:0],
>> tAddSbB ? tAddB1[15:0] : tAddB0[15:0],
>> tAddSbA ? tAddA1[15:0] : tAddA0[15:0]
>> };
>>
>>
>> This works, but still need to ideally give it a full clock-cycle to do
>> its work.
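
The carry-select scheme in the quoted pseudocode can be checked with a small behavioral model. The sketch below is in Python rather than Verilog and only models the arithmetic, not the timing. It assumes four 16-bit chunks for the 64-bit case and assumes SUB is done as A + ~B + 1 (carry-in of 1), which the pseudocode leaves to tCarryIn; the function name `carry_select_add64` is made up for illustration.

```python
MASK16 = 0xFFFF
MASK64 = (1 << 64) - 1

def carry_select_add64(val_a, val_b, is_sub=False):
    """Model a 64-bit add/sub built from four 16-bit carry-select chunks."""
    # Invert B for subtraction, as in the quoted tValB line.
    t_val_b = (~val_b if is_sub else val_b) & MASK64
    # Assumption: the carry-in is 1 for SUB (two's complement), else 0.
    carry = 1 if is_sub else 0
    result = 0
    for i in range(4):
        a = (val_a >> (16 * i)) & MASK16
        b = (t_val_b >> (16 * i)) & MASK16
        # In hardware both candidate sums of every chunk are computed in
        # parallel; only the select chain below is sequential.
        sum0 = a + b        # 17-bit sum assuming carry-in = 0
        sum1 = a + b + 1    # 17-bit sum assuming carry-in = 1
        chosen = sum1 if carry else sum0
        result |= (chosen & MASK16) << (16 * i)
        carry = chosen >> 16  # bit 16 selects the next chunk
    return result, carry

# The model should agree with ordinary modular arithmetic.
assert carry_select_add64(0xFFFF, 1) == (0x10000, 0)
assert carry_select_add64(5, 3, is_sub=True)[0] == 2
```

The loop makes the select chain explicit: the adders themselves do not wait on each other, so the critical path is one 16-bit add plus a short chain of muxes, which fits the observation that latency grows slower than linearly with width.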
>>
>>
>>
>> Note that one has to be careful with logic coupling, as if too many
>> things are tied together, one may get a "routing congestion" warning
>> message, and generally timing fails in this case...
>>
>> Also, the "inferring latch" warning is one of those "you really gotta
>> go fix this" issues (it generally indicates Verilog bugs, and also
>> negatively affects timing).
>>
>>
>>> I don't remember what Xilinx chip you are using but this paper describes
>>> how to do a 64-bit ADD at between 350 MHz (2.8 ns) and 400 MHz (2.5 ns)
>>> on a Virtex-5:
>>>
>>> A Fast Carry Chain Adder for Virtex-5 FPGAs, 2010
>>> https://scholar.archive.org/work/tz6fy2zm4fcobc6k7khsbwskh4/access/wayback/http://ece.gmu.edu:80/coursewebpages/ECE/ECE645/S11/projects/project_1_resources/Adders_MELECON_2010.pdf
>>>
>>
>> As for Virtex: I am not made of money...
>>
>> Virtex tends to be absurdly expensive high-end FPGAs.
>> Even the older Virtex chips are still absurdly expensive.
>>
>>
>> Kintex is considered mid range, but still too expensive, and mostly
>> not usable in the free versions of Vivado (and there are no real
>> viable FOSS alternatives to Vivado). When I tried looking at some of
>> the "open source" tools for targeting Xilinx chips, they were doing
>> the hacky thing of basically invoking Xilinx's tools in the background
>> (which, if used to target a Kintex, is essentially piracy).
>
> I don't think that it is copyright infringement to have a script or code
> generator output drive a compiler or tool instead of your hands.
>
>> Where, a valid FOSS tool would need to be able to do everything and
>> generate the bitstream itself.
>>
>>
>>
>> Mostly I am using Spartan-7 and Artix-7.
>> Generally at the -1 speed grade (slowest, but cheapest).
>
> The second paper also covers Spartan-6 and says it has the same
> LUT architecture as Virtex-5 and -6. Their speed testing was done on
> Virtex-6, but the design should apply.
>
> Anyway it was the concepts of how to optimize the carry that were
> important.
> I would expect to have to write code to port the ideas.
>
>> These are mostly considered low-end and consumer-electronics oriented
>> FPGAs by Xilinx.
>
> <snip>
>
>> I have a QMTech board with an XC7A200T at -1, but generally, it seems
>> to actually have a slightly harder time passing timing constraints
>> than the XC7A100T in the Nexys A7 (possibly some sort of Vivado magic
>> here).
>>
>>
>>> and this does 64-bit ADD up to 428 MHz (2.3 ns) on a Virtex-6:
>>>
>>> Fast and Area Efficient Adder for Wide Data in Recent Xilinx FPGAs, 2016
>>> http://www.diva-portal.org/smash/get/diva2:967655/FULLTEXT02.pdf
>>>
>>
>> Errm, skim, this doesn't really look like something you can pull off
>> in normal Verilog.
>
> Well, that's what I'm trying to figure out, because it's not just this
> paper but a lot, like many hundreds, of papers I've read from commercial
> or academic sources that seem to be able to control the FPGA results
> to a fine degree.
>
>> Generally, one doesn't have control over how the components hook
>> together; one can only influence what happens based on how they write
>> their Verilog.
>
> That paper mentions in section III
> "In order to reduce uncontrollable routing delays in the comparisons,
> everything was manually placed, according to the floorplan in Fig. 7."
>
> Is that the key - manually place things adjacent and hope the
> wire router does the right thing?
>
> That sounds too flaky. You need to be able to reliably construct optimized
> modules and then attach to them.
>
>> You can just write:
>> reg[63:0] tValA;
========== REMAINDER OF ARTICLE TRUNCATED ==========