Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Cost of handling misaligned access
Date: Sat, 22 Feb 2025 17:37:53 -0600
Organization: A noiseless patient Spider
Lines: 332
Message-ID: <vpdn4k$6130$1@dont-email.me>
References: <5lNnP.1313925$2xE6.991023@fx18.iad> <vnosj6$t5o0$1@dont-email.me>
 <2025Feb3.075550@mips.complang.tuwien.ac.at> <volg1m$31ca1$1@dont-email.me>
 <voobnc$3l2dl$1@dont-email.me>
 <0fc4cc997441e25330ff5c8735247b54@www.novabbs.org>
 <vp0m3f$1cth6$1@dont-email.me>
 <74142fbdc017bc560d75541f3f3c5118@www.novabbs.org>
 <20250218150739.0000192a@yahoo.com>
 <0357b097bbbf6b87de9bc91dd16757e3@www.novabbs.org>
 <vp2sv2$1skve$1@dont-email.me>
 <a34ce3b43fab761d13b2432f9e255fab@www.novabbs.org>
 <vp518t$2bhib$1@dont-email.me>
 <a56e446b2e2df9f01eb558aa68279d35@www.novabbs.org>
 <vp5mnu$2fjhi$1@dont-email.me> <BP4uP.273689$6Mub.167898@fx45.iad>
 <vpasaa$3itge$1@dont-email.me> <OTluP.690994$rHoc.634573@fx17.iad>
 <vpd8b0$3afn$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 23 Feb 2025 00:37:57 +0100 (CET)
Injection-Info: dont-email.me; posting-host="0de476d50361b2dca6cbf57666f38050";
	logging-data="197728"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX18MbeYBaX3nAbLDBb4ItVsIIBwW78pjRp4="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:zB1QsYQmgzlb8Bd6Ygh+5dyfzeY=
Content-Language: en-US
In-Reply-To: <vpd8b0$3afn$1@dont-email.me>
Bytes: 15378

On 2/22/2025 1:25 PM, Robert Finch wrote:
> On 2025-02-22 10:16 a.m., EricP wrote:
>> BGB wrote:
>>> On 2/21/2025 1:51 PM, EricP wrote:
>>>> BGB wrote:
>>>>>
>>>>> Can note that the latency of carry-select adders is a little weird:
>>>>>   16/32/64: Latency goes up steadily;
>>>>>     But, still less than linear;
>>>>>   128-bit: Only slightly more latency than 64-bit.
>>>>>
>>>>> The best I could find in past testing was seemingly 16-bit chunks
>>>>> for normal adding, where 16 bits seemed to be around the
>>>>> break-even point between the chained CARRY4s and carry-select
>>>>> (CS being slower below 16 bits).
>>>>>
>>>>> But, for a 64-bit adder, I still basically need to give it a
>>>>> clock cycle to do its thing. Though, it is not like 32 bits is
>>>>> particularly fast either; hence part of the whole 2-cycle latency
>>>>> on ALU ops thing. This mostly has to do with ADD/SUB (and CMP,
>>>>> which is based on SUB).
>>>>>
>>>>>
>>>>> Admittedly part of why I have such mixed feelings on full
>>>>> compare-and-branch:
>>>>>   Pro: It can offer a performance advantage (in terms of per-clock);
>>>>>   Con: Branch is now beholden to the latency of a Subtract.
>>>>
>>>> IIRC your cpu clock speed is about 75 MHz (13.3 ns)
>>>> and you are saying it takes 2 clocks for a 64-bit ADD.
>>>>
>>>
>>> The 75MHz was mostly experimental; mostly I am running at 50MHz
>>> because it is easier (a whole lot of corners need to be cut for
>>> 75MHz, so overall performance often ended up being worse).
>>>
>>>
>>> Via the main ALU, which also shares the logic for SUB and CMP and 
>>> similar...
>>>
>>> Generally, I give more or less a full cycle for the ADD to do its 
>>> thing, with the result presented to the outside world on the second 
>>> cycle, where it can go through the register forwarding chains and 
>>> similar.
>>>
>>> This gives it a 2 cycle latency.
>>>
>>> Operations with a 1 cycle latency need to feed their output directly 
>>> into the register forwarding logic.
>>>
>>>
>>> In a pseudocode sense, something like:
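>>>   // Invert B for SUB (A - B == A + ~B + 1).
>>>   tValB = IsSUB ? ~valB : valB;
>>>   // Each 16-bit chunk is added twice: once per possible carry-in.
>>>   // Bit 16 of each 17-bit sum is that chunk's carry-out.
>>>   tAddA0 = { 1'b0, valA[15:0] } + { 1'b0, tValB[15:0] } + 0;
>>>   tAddA1 = { 1'b0, valA[15:0] } + { 1'b0, tValB[15:0] } + 1;
>>>   tAddB0 = { 1'b0, valA[31:16] } + { 1'b0, tValB[31:16] } + 0;
>>>   tAddB1 = { 1'b0, valA[31:16] } + { 1'b0, tValB[31:16] } + 1;
>>>   tAddC0 = ...
>>>   ...
>>>   // Select chain: each chunk's carry-out picks the next chunk's sum.
>>>   tAddSbA = tCarryIn;
>>>   tAddSbB = tAddSbA ? tAddA1[16] : tAddA0[16];
>>>   tAddSbC = tAddSbB ? tAddB1[16] : tAddB0[16];
>>>   ...
>>>   // Assemble the 64-bit result from the selected chunks.
>>>   tAddRes = {
>>>      tAddSbD ? tAddD1[15:0] : tAddD0[15:0],
>>>      tAddSbC ? tAddC1[15:0] : tAddC0[15:0],
>>>      tAddSbB ? tAddB1[15:0] : tAddB0[15:0],
>>>      tAddSbA ? tAddA1[15:0] : tAddA0[15:0]
>>>   };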
>>>
>>>
>>> This works, but still need to ideally give it a full clock-cycle to 
>>> do its work.
>>>
>>>
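
For anyone wanting to sanity-check the select logic in software, the
carry-select pseudocode quoted above can be modeled roughly as follows.
This is only a sketch in Python, not the actual Verilog. The name
carry_select_add64 is made up for illustration, and it assumes the
carry-in is 1 for a plain SUB and 0 for a plain ADD (the pseudocode does
not say how tCarryIn is driven). In hardware all of the per-chunk sums
are computed in parallel and only the select chain is serial; the loop
here just walks that chain.

```python
MASK64 = (1 << 64) - 1


def carry_select_add64(val_a, val_b, is_sub=False, chunk_bits=16):
    """Model a 64-bit carry-select add/subtract; returns (result, carry_out)."""
    chunk_mask = (1 << chunk_bits) - 1
    val_a &= MASK64
    # SUB is done as A + ~B + 1, so invert B here.
    t_val_b = (~val_b & MASK64) if is_sub else (val_b & MASK64)
    # Assumption: the initial carry-in supplies the "+1" for SUB.
    select = 1 if is_sub else 0

    result = 0
    for shift in range(0, 64, chunk_bits):
        a = (val_a >> shift) & chunk_mask
        b = (t_val_b >> shift) & chunk_mask
        # Both speculative sums exist for every chunk (like tAddA0/tAddA1).
        sum0 = a + b        # assumes carry-in of 0
        sum1 = a + b + 1    # assumes carry-in of 1
        chosen = sum1 if select else sum0
        result |= (chosen & chunk_mask) << shift
        # The extra top bit of the chosen sum is the next chunk's select bit.
        select = chosen >> chunk_bits
    return result, select


if __name__ == "__main__":
    res, cout = carry_select_add64(0x0000FFFFFFFFFFFF, 1)
    print(hex(res), cout)
```

The point of the structure is that the only serial part is the chain of
select bits (one mux per chunk), rather than a carry rippling through
all 64 bits.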
>>>
>>> Note that one has to be careful with logic coupling, as if too many 
>>> things are tied together, one may get a "routing congestion" warning 
>>> message, and generally timing fails in this case...
>>>
>>> Also, the "inferring latch" warning is one of those "you really
>>> gotta go fix this" issues (it generally indicates a Verilog bug,
>>> and also negatively affects timing).
>>>
>>>
>>>> I don't remember what Xilinx chip you are using, but this paper
>>>> describes how to do a 64-bit ADD at between 350 MHz (2.86 ns) and
>>>> 400 MHz (2.5 ns) on a Virtex-5:
>>>>
>>>> A Fast Carry Chain Adder for Virtex-5 FPGAs, 2010
>>>> https://scholar.archive.org/work/tz6fy2zm4fcobc6k7khsbwskh4/access/wayback/http://ece.gmu.edu:80/coursewebpages/ECE/ECE645/S11/projects/project_1_resources/Adders_MELECON_2010.pdf
>>>>
>>>
>>> As for Virtex: I am not made of money...
>>>
>>> Virtex parts tend to be absurdly expensive high-end FPGAs.
>>>   Even the older Virtex chips are still absurdly expensive.
>>>
>>>
>>> Kintex is considered mid range, but still too expensive, and mostly 
>>> not usable in the free versions of Vivado (and there are no real 
>>> viable FOSS alternatives to Vivado). When I tried looking at some of 
>>> the "open source" tools for targeting Xilinx chips, they were doing 
>>> the hacky thing of basically invoking Xilinx's tools in the 
>>> background (which, if used to target a Kintex, is essentially piracy).
>>
>> I don't think that it is copyright infringement to have a script or code
>> generator output drive a compiler or tool instead of your hands.

It would be, however, if the script is used to sidestep Vivado's
licensing in order to target a Kintex by using the tools in unorthodox
ways...


As I see it, hacking the existing tools to sidestep licensing fees is 
essentially piracy.

Whereas, writing one's own tools is fair game, provided a "clean room"
strategy is used (basically, one party reverse engineers and documents
the bitstream format, and some other party writes the tools based on
that documentation). In this case, Xilinx would only be entitled to
profits from the FPGA itself.


But, that said, I also have the opinion that when a user buys a piece of 
hardware, they are entitled to ownership over said hardware. OEM 
restrictions on the use of said hardware (outside of copyright on any 
software running on said hardware) are invalid as far as I am concerned.


Similarly, selling devices as "loss leaders" with the intent to regain
profits via advertising or selling services is also not really
defensible (and any losses due to customers circumventing the hardware
are the fault of the seller, not of the customer).

Well, much as companies selling things like cellphones and game
consoles would try to disagree (they see selling the hardware at a loss,
and making it up in licensed game sales or similar, as a business
strategy).

Though, this does still leave things like firmware as a gray area. But,
realistically, since the firmware is coupled to the hardware, "sale"
would also imply the right of users to treat any dealings with the
firmware as if it were part of the hardware (the main exception being
if the user separates the firmware from the hardware, in which case
copyright would apply; say, by ripping the ROMs and uploading them to
the internet).


This would not apply to Vivado though, which is more solidly in the 
"software" camp (and the bare FPGA is pretty solidly in the "hardware" 
camp).


>>
>>> Where, a valid FOSS tool would need to be able to do everything and 
>>> generate the bitstream itself.
>>>
>>>
>>>
>>> Mostly I am using Spartan-7 and Artix-7.
>>>   Generally at the -1 speed grade (slowest, but cheapest).
>>
>> The second paper was also on both Spartan-6 and says it has the same
========== REMAINDER OF ARTICLE TRUNCATED ==========