Path: ...!weretis.net!feeder9.news.weretis.net!news.quux.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Cost of handling misaligned access
Date: Fri, 21 Feb 2025 15:47:50 -0600
Organization: A noiseless patient Spider
Lines: 205
Message-ID: <vpasaa$3itge$1@dont-email.me>
References: <5lNnP.1313925$2xE6.991023@fx18.iad> <vnosj6$t5o0$1@dont-email.me>
<2025Feb3.075550@mips.complang.tuwien.ac.at> <volg1m$31ca1$1@dont-email.me>
<voobnc$3l2dl$1@dont-email.me>
<0fc4cc997441e25330ff5c8735247b54@www.novabbs.org>
<vp0m3f$1cth6$1@dont-email.me>
<74142fbdc017bc560d75541f3f3c5118@www.novabbs.org>
<20250218150739.0000192a@yahoo.com>
<0357b097bbbf6b87de9bc91dd16757e3@www.novabbs.org>
<vp2sv2$1skve$1@dont-email.me>
<a34ce3b43fab761d13b2432f9e255fab@www.novabbs.org>
<vp518t$2bhib$1@dont-email.me>
<a56e446b2e2df9f01eb558aa68279d35@www.novabbs.org>
<vp5mnu$2fjhi$1@dont-email.me> <BP4uP.273689$6Mub.167898@fx45.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Fri, 21 Feb 2025 22:47:54 +0100 (CET)
Injection-Info: dont-email.me; posting-host="98b20ea62ba459119821c932ae14e520";
logging-data="3765774"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/dZbO1vPoZVZq7avjkxd37mDfFPQNA20Y="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:emJVrpGkHDBBZI2m0ToXufHFnho=
In-Reply-To: <BP4uP.273689$6Mub.167898@fx45.iad>
Content-Language: en-US
Bytes: 9371
On 2/21/2025 1:51 PM, EricP wrote:
> BGB wrote:
>>
>> Can note that the latency of carry-select adders is a little weird:
>> 16/32/64: Latency goes up steadily;
>> But, still less than linear;
>> 128-bit: Only slightly more latency than 64-bit.
>>
>> The best I could find in past testing was seemingly 16-bit chunks for
>> normal adding. Where, 16-bits seemed to be around the break-even
>> between the chained CARRY4's and the Carry-Select (CS being slower
>> below 16 bits).
>>
>> But, for a 64-bit adder, still basically need to give it a clock-cycle
>> to do its thing. Though, not like 32 is particularly fast either;
>> hence part of the whole 2 cycle latency on ALU ops thing. Mostly has
>> to do with ADD/SUB (and CMP, which is based on SUB).
>>
>>
>> Admittedly part of why I have such mixed feelings on full
>> compare-and-branch:
>> Pro: It can offer a performance advantage (in terms of per-clock);
>> Con: Branch is now beholden to the latency of a Subtract.
>
> IIRC your cpu clock speed is about 75 MHz (13.3 ns)
> and you are saying it takes 2 clocks for a 64-bit ADD.
>
The 75MHz was mostly experimental; mostly I am running at 50MHz because
it is easier (a whole lot of corners need to be cut for 75MHz, so
overall performance often ended up being worse).

The 2 clocks are via the main ALU, which also shares the logic for SUB,
CMP, and similar...

Generally, I give more or less a full cycle for the ADD to do its thing,
with the result presented to the outside world on the second cycle,
where it can go through the register forwarding chains and similar.
This gives it a 2 cycle latency.

Operations with a 1 cycle latency need to feed their output directly
into the register forwarding logic.

In a pseudocode sense, something like:
  // Invert B for SUB (two's complement: A-B = A+~B+1)
  tValB = IsSUB ? ~valB : valB;

  // Add each 16-bit chunk twice: once assuming carry-in=0, once carry-in=1
  tAddA0 = { 1'b0, valA[15: 0] } + { 1'b0, tValB[15: 0] } + 0;
  tAddA1 = { 1'b0, valA[15: 0] } + { 1'b0, tValB[15: 0] } + 1;
  tAddB0 = { 1'b0, valA[31:16] } + { 1'b0, tValB[31:16] } + 0;
  tAddB1 = { 1'b0, valA[31:16] } + { 1'b0, tValB[31:16] } + 1;
  tAddC0 = ...
  ...

  // Carry into each chunk, selected from the previous chunk's bit 16
  tAddSbA = tCarryIn;
  tAddSbB = tAddSbA ? tAddA1[16] : tAddA0[16];
  tAddSbC = tAddSbB ? tAddB1[16] : tAddB0[16];
  ...

  // Pick the matching 16-bit result for each chunk
  tAddRes = {
    tAddSbD ? tAddD1[15:0] : tAddD0[15:0],
    tAddSbC ? tAddC1[15:0] : tAddC0[15:0],
    tAddSbB ? tAddB1[15:0] : tAddB0[15:0],
    tAddSbA ? tAddA1[15:0] : tAddA0[15:0]
  };
This works, but still need to ideally give it a full clock-cycle to do
its work.
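
For anyone wanting to sanity-check the chunking logic outside of a simulator, here is a rough behavioral model of the pseudocode above in Python. It is only a sketch: the function name and the `MASK16`/`MASK64` constants are made up for illustration, and it assumes that the carry-in is simply set to 1 for SUB (the pseudocode only says `tCarryIn`). The loop runs the chunks in sequence, whereas in hardware both candidate sums of every chunk are computed in parallel and only the mux selects are chained.

```python
MASK16 = 0xFFFF
MASK64 = (1 << 64) - 1


def carry_select_add64(val_a, val_b, is_sub=False):
    """Model a 64-bit ADD/SUB built from four 16-bit carry-select chunks.

    Returns (result, carry_out). Inputs are treated as unsigned 64-bit values.
    """
    # A - B is computed as A + ~B + 1: invert B and (assumption) set carry-in.
    t_val_b = (~val_b if is_sub else val_b) & MASK64
    carry = 1 if is_sub else 0

    result = 0
    for chunk in range(4):
        shift = 16 * chunk
        a = (val_a >> shift) & MASK16
        b = (t_val_b >> shift) & MASK16

        # Two 17-bit candidate sums, one per possible carry-in.
        sum0 = a + b
        sum1 = a + b + 1

        # The incoming carry only selects between the precomputed sums.
        chosen = sum1 if carry else sum0
        result |= (chosen & MASK16) << shift
        carry = chosen >> 16  # bit 16 is the carry into the next chunk

    return result, carry
```

For example, `carry_select_add64(0xFFFF, 1)` gives `(0x10000, 0)`: chunk A overflows, and its carry-out selects the "+1" sum for chunk B.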
Note that one has to be careful with logic coupling, as if too many
things are tied together, one may get a "routing congestion" warning
message, and generally timing fails in this case...
Also, an "inferring latch" warning is one of those "you really gotta go
fix this" issues (it generally indicates a Verilog bug, and also
negatively affects timing).
> I don't remember what Xilinx chip you are using but this paper describes
> how to do a 64-bit ADD at between 350 Mhz (2.8 ns) to 400 MHz (2.5 ns)
> on a Virtex-5:
>
> A Fast Carry Chain Adder for Virtex-5 FPGAs, 2010
> https://scholar.archive.org/work/tz6fy2zm4fcobc6k7khsbwskh4/access/wayback/http://ece.gmu.edu:80/coursewebpages/ECE/ECE645/S11/projects/project_1_resources/Adders_MELECON_2010.pdf
>
As for Virtex: I am not made of money...

Virtex parts tend to be absurdly expensive high-end FPGAs.
Even the older Virtex chips are still absurdly expensive.

Kintex is considered mid-range, but is still too expensive, and mostly
not usable in the free versions of Vivado (and there are no real viable
FOSS alternatives to Vivado). When I tried looking at some of the "open
source" tools for targeting Xilinx chips, they were doing the hacky
thing of basically invoking Xilinx's tools in the background (which, if
used to target a Kintex, is essentially piracy).

A valid FOSS tool would need to be able to do everything and generate
the bitstream itself.

Mostly I am using Spartan-7 and Artix-7.
Generally at the -1 speed grade (slowest, but cheapest).
These are mostly considered low-end and consumer-electronics oriented
FPGAs by Xilinx.
Or, by "car analogies":
You can't expect a "VW Jetta" to perform like a "Ferrari Enzo" even if
the "Jetta" is a newer model year...
The cheapest FPGA dev-boards I have gotten a (minimal) BJX2 core onto
were around $70 (XC7S25). The most expensive dev-board I have is the
Nexys A7 (XC7A100T), but it has gone up in price (IIRC, it was around
$290 at the time; right now it seems to be around $350, but was IIRC a
bit more in 2021/2022).

There was the temptation to get a "Nexys Video", which has an
XC7A200T-2, but it is very expensive (around $600 IIRC). However, this
chip *could* pass 75MHz a bit more easily (though still not enough to
easily reach 100MHz).

I have a QMTech board with an XC7A200T at -1, but generally it seems to
actually have a slightly harder time passing timing constraints than the
XC7A100T in the Nexys A7 (possibly some sort of Vivado magic here).

Generally, it is also hard to find FPGA boards much under $100 that
"aren't crap".

A lot of the iCE40 boards fail on both counts: often not really any
cheaper than XC7S25 or XC7A35T based boards, but much worse in
comparison.

There was rumor of cheaper boards (in the $30-$50 range), but I have not
seen anything that seems worth bothering with in this range.

Had noted that I could get higher clock speeds (or "fmax" in their
terms) on Intel/Altera chips (according to Quartus), but didn't buy any
of these:
The DE10 was expensive, and at the time, even a less feature-rich
version of the BJX2 core basically ate the entire resource budget of the
DE10 (but, IIRC, it otherwise had an fmax of around 85MHz or something).

IIRC, something had seemingly gone horribly wrong with attempts to use
LUTRAM arrays, and it was needing to fall back to trying to make them
out of flip-flops...

IIRC, it was something like they didn't have LUTRAMs of the same sort
as Xilinx, but rather smaller and larger Block RAMs; and, possibly, the
register file would need to be reworked to fit BRAM-like access patterns
(namely, that reads are only performed on a clock-edge rather than
combinatorially). Granted, this could be done in theory, but means I
would need to feed in the register-port inputs on the ID1/ID2 edge,
rather than in the ID2 stage itself.

I didn't really mess with it much at the time to figure it out...
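
To illustrate the difference in access pattern, here is a rough Python model of the two read styles. It is a sketch, not the BJX2 register file: the class names, the 32-entry size, and the read-before-write behavior of the clocked version are all assumptions made for illustration (real Block RAM write modes vary by vendor and configuration).

```python
class AsyncReadRegFile:
    """LUTRAM-style: read data follows the index within the same cycle."""

    def __init__(self, n_regs=32):  # size is arbitrary
        self.regs = [0] * n_regs

    def read(self, idx):
        # Combinatorial read: no clock edge involved.
        return self.regs[idx]

    def write(self, idx, value):
        self.regs[idx] = value


class SyncReadRegFile:
    """BRAM-style: the read index is sampled on a clock edge."""

    def __init__(self, n_regs=32):  # size is arbitrary
        self.regs = [0] * n_regs
        self.read_data = 0  # registered read output

    def clock(self, read_idx, write_idx=None, write_value=None):
        # Assumption: read-before-write, so a read of the register being
        # written on the same edge still returns the old contents.
        self.read_data = self.regs[read_idx]
        if write_idx is not None:
            self.regs[write_idx] = write_value
```

With the clocked version, `read_data` only changes when `clock()` is called, so the index has to be known before the edge at which the value is wanted. That is the reason the register-port inputs would need to be presented on the ID1/ID2 edge rather than during ID2.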
========== REMAINDER OF ARTICLE TRUNCATED ==========