Path: ...!news.misty.com!weretis.net!feeder9.news.weretis.net!news.nk.ca!rocksolid2!i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: Cost of handling misaligned access
Date: Mon, 3 Feb 2025 01:51:09 +0000
Organization: Rocksolid Light
Message-ID: <539a9c6f3a8c0d461f9cd7cb8b2cda49@www.novabbs.org>
References: <5lNnP.1313925$2xE6.991023@fx18.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
Bytes: 5913
Lines: 126

On Sun, 2 Feb 2025 22:45:53 +0000, BGB wrote:

> On 2/2/2025 10:45 AM, EricP wrote:
>> As you can see in the article below, the cost of NOT handling
>> misaligned accesses in hardware is quite high in cpu clocks.
>>
>> To my eye, the incremental cost of adding hardware support for
>> misaligned accesses to the AGU and cache data path should be quite
>> low. The alignment shifter is basically the same: assuming a 64-byte
>> cache line, LD still has to shift any of the 64 bytes into position 0,
>> and reverse for ST.
>>
>> The incremental cost is in a sequencer in the AGU for handling cache
>> line and possibly virtual page straddles, and a small byte shifter to
>> left shift the high order bytes. The AGU sequencer needs to know if
>> the access straddles a page boundary: if not, it increments the 6-bit
>> physical line number within the 4 kB physical frame; if so, it
>> increments the virtual page number, does the TLB lookup again, and
>> accesses the first line of the new page. (Slightly more if multiple
>> page sizes are supported, but same idea.) For a load, the AGU merges
>> the low and high fragments and forwards the result.
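[As an aside, not part of EricP's post: the fragment merge he describes can be sketched in C. This is a toy model, assuming a little-endian host and a hypothetical byte array `mem` standing in for two adjacent 64-byte cache lines; `misaligned_load8` is an invented name.]

```c
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64

/* Hypothetical backing store: two adjacent 64-byte cache lines. */
static uint8_t mem[2 * LINE_BYTES];

/* Model of an 8-byte load that may straddle a line boundary,
 * assuming a little-endian byte order. */
static uint64_t misaligned_load8(unsigned addr)
{
    unsigned off = addr % LINE_BYTES;   /* offset within the line */
    if (off <= LINE_BYTES - 8) {
        /* No straddle: one line access, bytes shifted into position 0. */
        uint64_t v;
        memcpy(&v, &mem[addr], 8);
        return v;
    }
    /* Straddle: low fragment from the tail of line N, high fragment
     * from the head of line N+1; the AGU sequencer issues the second
     * line access and merges the two fragments. */
    unsigned lo_bytes = LINE_BYTES - off;              /* 1..7 */
    uint64_t lo = 0, hi = 0;
    memcpy(&lo, &mem[addr], lo_bytes);                 /* tail of line N   */
    memcpy(&hi, &mem[addr + lo_bytes], 8 - lo_bytes);  /* head of line N+1 */
    return lo | (hi << (8 * lo_bytes));
}
```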
>>
>> I don't think there are line straddle consequences for coherence
>> because there are no ordering guarantees for misaligned accesses.
>
> IMO, the main costs of unaligned access in hardware:
>   the cache may need two banks of cache lines,
>     let's call them "even" and "odd";
>   an access crossing a line boundary may need both an even and an
>   odd line;
>   slightly more expensive extract and insert logic.
>
> The main costs of not having unaligned access in hardware:
>   code either faults or performs like dog crap;
>   some pieces of code need convoluted workarounds;
>   some algorithms have no choice other than to perform like crap.
>
> Even if most of the code doesn't need unaligned access, the parts
> that do need it significantly need it to perform well.
>
> Well, at least excluding wonk in the ISA, say:
>   a load/store pair that discards the low-order bits;
>   an extract/insert instruction that operates on a register pair
>   using the low-order bits of the pointer.
>
> In effect, something vaguely akin (AFAIK) to what existed on the DEC
> Alpha.
>
>> The hardware cost appears trivial, especially within an OoO core.
>> So there doesn't appear to be any reason to not handle this.
>> Am I missing something?
>
> For an OoO core, any cost difference in the L1 cache here is likely
> to be negligible.
>
> For anything much bigger than a small microcontroller, I would assume
> designing a core that handles unaligned access effectively.
>
>> https://old.chipsandcheese.com/2025/01/26/inside-sifives-p550-microarchitecture/
>>
>> [about half way down]
>>
>> "Before accessing cache, load addresses have to be checked against
>> older stores (and vice versa) to ensure proper ordering. If there is
>> a dependency, P550 can only do fast store forwarding if the load and
>> store addresses match exactly and both accesses are naturally
>> aligned. Any unaligned access, dependent or not, confuses P550 for
>> hundreds of cycles.
>> Worse, the unaligned loads and stores don’t proceed in parallel. An
>> unaligned load takes 1062 cycles, an unaligned store takes 741
>> cycles, and the two together take over 1800 cycles.
>>
>> This terrible unaligned access behavior is atypical even for low
>> power cores. Arm’s Cortex A75 only takes 15 cycles in the worst case
>> of dependent accesses that are both misaligned.
>>
>> Digging deeper with performance counters reveals that executing each
>> unaligned load instruction results in ~505 executed instructions.
>> P550 almost certainly doesn’t have hardware support for unaligned
>> accesses. Rather, it’s likely raising a fault and letting an
>> operating system handler emulate it in software."
>
> An emulation fault, or something similarly nasty...
>
> At that point, even turning any potentially unaligned load or store
> into a runtime call is likely to be a lot cheaper.
>
> Say:
>   __mem_ld_unaligned:
>     ANDI  X15, X10, 7        # offset of the address within an 8B word
>     BEQ   X15, X0, .aligned  # offset 0: already aligned
>     SUB   X14, X10, X15      # round the address down to aligned base
>     LD    X12, 0(X14)        # low 64-bit word
>     LD    X13, 8(X14)        # high 64-bit word
>     SLLI  X14, X15, 3        # shift amount in bits (offset * 8)
>     LI    X17, 64
>     SUB   X16, X17, X14      # complementary shift amount
>     SRL   X12, X12, X14      # drop unwanted low bytes
>     SLL   X13, X13, X16      # position the bytes from the high word
>     OR    X10, X12, X13      # merge the two fragments
>     RET
>   .aligned:
>     LD    X10, 0(X10)
>     RET
>
> The separate aligned path is needed because an SLL with a shift
> amount of 64 simply returns its input (since (64 & 63) == 0), which
> would break the merge when the offset is zero.
>
> Though not supported by GCC or similar, dedicated __aligned and
> __unaligned keywords could help here, to specify which pointers are
> aligned (no function call), unaligned (needs function call), or
> default (probably aligned).

All of which vanish when the HW does misaligned accesses. (It also
makes the job of the programmer easier.)

> ....
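[Editorial aside, not part of the thread: BGB's helper above can be mirrored in C to show the same two-word merge and why the aligned special case exists. This is my sketch under the same assumptions as the RV64 code: a little-endian host, 8-byte words, and it may read up to 8 bytes past the requested address, just as the LD of the high word does. `mem_ld_unaligned` here is simply a C rendition, not BGB's actual code.]

```c
#include <stdint.h>
#include <string.h>

/* Load 8 bytes from a possibly misaligned address by reading the two
 * enclosing aligned 64-bit words and merging, little-endian assumed. */
uint64_t mem_ld_unaligned(const void *p)
{
    uintptr_t a   = (uintptr_t)p;
    unsigned  off = (unsigned)(a & 7);          /* ANDI X15, X10, 7   */
    const uint64_t *w = (const uint64_t *)(a - off); /* aligned base  */
    if (off == 0)
        return *w;      /* aligned fast path: a shift count of 64     */
                        /* would wrap to 0 (and is UB in C), so the   */
                        /* merge below cannot cover this case         */
    uint64_t lo = w[0];                         /* LD X12, 0(X14)     */
    uint64_t hi = w[1];                         /* LD X13, 8(X14)     */
    unsigned sh = off * 8;                      /* 8..56: well defined */
    return (lo >> sh) | (hi << (64 - sh));      /* SRL / SLL / OR     */
}
```

Either way, the branch (or the runtime call around it) disappears entirely once the hardware handles misaligned accesses itself.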