Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com>
Newsgroups: comp.arch
Subject: Re: Cost of handling misaligned access
Date: Sun, 2 Feb 2025 14:44:13 -0800
Organization: A noiseless patient Spider
Lines: 70
Message-ID: <vnosfu$t4ra$1@dont-email.me>
References: <5lNnP.1313925$2xE6.991023@fx18.iad>
 <b50b6b125cc92f7711d420a746941f7e@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 02 Feb 2025 23:44:15 +0100 (CET)
Injection-Info: dont-email.me; posting-host="fac76d2357a12f4c3a94748a8c888ab1";
	logging-data="955242"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX19xMnymeCv8gSBE6OHHThRhod4zt4n6iUs="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:Z0nmLeERFxeAfGHd9I+3BOxWXDU=
Content-Language: en-US
In-Reply-To: <b50b6b125cc92f7711d420a746941f7e@www.novabbs.org>
Bytes: 4278

On 2/2/2025 10:51 AM, MitchAlsup1 wrote:
> On Sun, 2 Feb 2025 16:45:19 +0000, EricP wrote:
> 
>> As you can see in the article below, the cost of NOT handling misaligned
>> accesses in hardware is quite high in cpu clocks.
>>
>> To my eye, the incremental cost of adding hardware support for
>> misaligned to the AGU and cache data path should be quite low. The
>> alignment shifter is basically the same: assuming a 64-byte cache line,
>> LD still has to shift any of the 64 bytes into position 0, and reverse
>> for ST.
> 
> A handful of gates to detect misalignedness and recognize the line and
> page crossing misalignments.
> 
> The alignment shifters are twice as big.
> 
> Now, while I accept these costs, I accept that others may not. I accept
> these costs because of the performance issues when I don't.
> 
>> The incremental cost is in a sequencer in the AGU for handling cache
>> line and possibly virtual page straddles, and a small byte shifter to
>> left shift the high order bytes. The AGU sequencer needs to know if the
>> line straddles a page boundary, if not then increment the 6-bit physical
>> line number within the 4 kB physical frame number, if yes then increment
>> virtual page number and TLB lookup again and access the first line.
>> (Slightly more if multiple page sizes are supported, but same idea.)
>> For a load AGU merges the low and high fragments and forwards.
>>
>> I don't think there are line straddle consequences for coherence because
>> there are no ordering guarantees for misaligned accesses.
> 
> Generally stated as:: Misaligned accesses cannot be considered ATOMIC.

Try it on x86/x64: make the operand of a LOCK'ed RMW straddle an L2 cache
line. It should assert the bus lock (a split lock), so the access stays
atomic, just at a very high cost.
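
Which operands end up split is plain address arithmetic. A minimal sketch in C, assuming 64-byte lines and 4 kB pages; the helper names are made up for illustration, and it only classifies addresses, it does not execute a LOCK'ed instruction:

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed geometry: 64-byte cache lines, 4 kB pages. */
enum { LINE_BYTES = 64, PAGE_BYTES = 4096 };

/* True if addr is not naturally aligned for an access of `size` bytes.
   `size` must be a power of two. */
static bool is_misaligned(uintptr_t addr, size_t size)
{
    return (addr & (size - 1)) != 0;
}

/* True if the first and last byte of the access fall into different
   blocks of `block_bytes` (a cache line or a page). */
static bool crosses(uintptr_t addr, size_t size, size_t block_bytes)
{
    return (addr / block_bytes) != ((addr + size - 1) / block_bytes);
}
```

A 4-byte operand at line offset 62 gives crosses(addr, 4, LINE_BYTES) == true, which is the case that would need the bus lock on a LOCK'ed RMW. A 2-byte operand at offset 61 is misaligned but stays inside one line.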



> 
>> The hardware cost appears trivial, especially within an OoO core.
>> So there doesn't appear to be any reason to not handle this.
>> Am I missing something?
>>
>> https://old.chipsandcheese.com/2025/01/26/inside-sifives-p550-microarchitecture/
>>
>> [about half way down]
>>
>> "Before accessing cache, load addresses have to be checked against
>> older stores (and vice versa) to ensure proper ordering. If there is a
>> dependency, P550 can only do fast store forwarding if the load and store
>> addresses match exactly and both accesses are naturally aligned.
>> Any unaligned access, dependent or not, confuses P550 for hundreds of
>> cycles. Worse, the unaligned loads and stores don’t proceed in parallel.
>> An unaligned load takes 1062 cycles, an unaligned store takes
>> 741 cycles, and the two together take over 1800 cycles.
>>
>> This terrible unaligned access behavior is atypical even for low power
>> cores. Arm’s Cortex A75 only takes 15 cycles in the worst case of
>> dependent accesses that are both misaligned.
>>
>> Digging deeper with performance counters reveals executing each
>> unaligned load instruction results in ~505 executed instructions. P550
>> almost certainly doesn’t have hardware support for unaligned accesses.
>> Rather, it’s likely raising a fault and letting an operating system
>> handler emulate it in software."
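
For reference, the data-path part of such an emulation is small: two aligned loads, two shifts and an OR, the same low/high fragment merge EricP describes for the AGU. A rough sketch in C, assuming a little-endian byte order and a 32-bit access; load32_via_aligned is a hypothetical name, not taken from any real trap handler:

```c
#include <stddef.h>
#include <stdint.h>

/* Return the 32-bit little-endian value that starts at byte offset `off`
   of the memory image held in `words`, using only aligned word reads.
   The caller must make sure words[off / 4 + 1] exists whenever `off`
   is not a multiple of 4. */
static uint32_t load32_via_aligned(const uint32_t *words, size_t off)
{
    size_t   idx   = off / 4;               /* word holding the first byte */
    unsigned shift = (unsigned)(off % 4) * 8;
    uint32_t lo    = words[idx];            /* low fragment */

    if (shift == 0)
        return lo;                          /* aligned: a single access */

    uint32_t hi = words[idx + 1];           /* high fragment */
    return (lo >> shift) | (hi << (32 - shift));
}
```

The merge itself is a handful of instructions, so most of the ~505 executed instructions presumably go to trap entry and exit, instruction decode and register file access in the handler rather than to the shifting.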