Path: ...!news.misty.com!weretis.net!feeder9.news.weretis.net!news.nk.ca!rocksolid2!i2pn2.org!.POSTED!not-for-mail
From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: Cost of handling misaligned access
Date: Mon, 3 Feb 2025 01:51:09 +0000
Organization: Rocksolid Light
Message-ID: <539a9c6f3a8c0d461f9cd7cb8b2cda49@www.novabbs.org>
References: <5lNnP.1313925$2xE6.991023@fx18.iad>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8; format=flowed
Content-Transfer-Encoding: 8bit
User-Agent: Rocksolid Light
Bytes: 5913
Lines: 126

On Sun, 2 Feb 2025 22:45:53 +0000, BGB wrote:

> On 2/2/2025 10:45 AM, EricP wrote:
>> As you can see in the article below, the cost of NOT handling
>> misaligned accesses in hardware is quite high in cpu clocks.
>>
>> To my eye, the incremental cost of adding hardware support for
>> misaligned accesses to the AGU and cache data path should be quite
>> low. The alignment shifter is basically the same: assuming a 64-byte
>> cache line, LD still has to shift any of the 64 bytes into position 0,
>> and reverse for ST.
>>
>> The incremental cost is in a sequencer in the AGU for handling cache
>> line and possibly virtual page straddles, and a small byte shifter to
>> left shift the high order bytes. The AGU sequencer needs to know if
>> the access straddles a page boundary: if not, it increments the 6-bit
>> physical line number within the 4 kB physical frame; if so, it
>> increments the virtual page number, does the TLB lookup again, and
>> accesses the first line of the new page. (Slightly more if multiple
>> page sizes are supported, but same idea.) For a load, the AGU merges
>> the low and high fragments and forwards the result.
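[As an aside, not part of EricP's post: the fragment merge he describes can be sketched in C. This is a toy model, assuming a little-endian host and a hypothetical byte array `mem` standing in for two adjacent 64-byte cache lines; `misaligned_load8` is an invented name.]

```c
#include <stdint.h>
#include <string.h>

#define LINE_BYTES 64

/* Hypothetical backing store: two adjacent 64-byte cache lines. */
static uint8_t mem[2 * LINE_BYTES];

/* Model of an 8-byte load that may straddle a line boundary,
 * assuming a little-endian byte order. */
static uint64_t misaligned_load8(unsigned addr)
{
    unsigned off = addr % LINE_BYTES;   /* offset within the line */
    if (off <= LINE_BYTES - 8) {
        /* No straddle: one line access, bytes shifted into position 0. */
        uint64_t v;
        memcpy(&v, &mem[addr], 8);
        return v;
    }
    /* Straddle: low fragment from the tail of line N, high fragment
     * from the head of line N+1; the AGU sequencer issues the second
     * line access and merges the two fragments. */
    unsigned lo_bytes = LINE_BYTES - off;              /* 1..7 */
    uint64_t lo = 0, hi = 0;
    memcpy(&lo, &mem[addr], lo_bytes);                 /* tail of line N   */
    memcpy(&hi, &mem[addr + lo_bytes], 8 - lo_bytes);  /* head of line N+1 */
    return lo | (hi << (8 * lo_bytes));
}
```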
>>
>> I don't think there are line straddle consequences for coherence
>> because there are no ordering guarantees for misaligned accesses.
>
> IMO, the main costs of unaligned access in hardware:
>   the cache may need two banks of cache lines,
>     let's call them "even" and "odd";
>   an access crossing a line boundary may need both an even and an
>   odd line;
>   slightly more expensive extract and insert logic.
>
> The main costs of not having unaligned access in hardware:
>   code either faults or performs like dog crap;
>   some pieces of code need convoluted workarounds;
>   some algorithms have no choice other than to perform like crap.
>
> Even if most of the code doesn't need unaligned access, the parts
> that do need it significantly need it to perform well.
>
> Well, at least excluding wonk in the ISA, say:
>   a load/store pair that discards the low-order bits;
>   an extract/insert instruction that operates on a register pair
>   using the low-order bits of the pointer.
>
> In effect, something vaguely akin (AFAIK) to what existed on the DEC
> Alpha.
>
>> The hardware cost appears trivial, especially within an OoO core.
>> So there doesn't appear to be any reason to not handle this.
>> Am I missing something?
>
> For an OoO core, any cost difference in the L1 cache here is likely
> to be negligible.
>
> For anything much bigger than a small microcontroller, I would assume
> designing a core that handles unaligned access effectively.
>
>> https://old.chipsandcheese.com/2025/01/26/inside-sifives-p550-microarchitecture/
>>
>> [about half way down]
>>
>> "Before accessing cache, load addresses have to be checked against
>> older stores (and vice versa) to ensure proper ordering. If there is
>> a dependency, P550 can only do fast store forwarding if the load and
>> store addresses match exactly and both accesses are naturally
>> aligned. Any unaligned access, dependent or not, confuses P550 for
>> hundreds of cycles.
>> Worse, the unaligned loads and stores don’t proceed in parallel. An
>> unaligned load takes 1062 cycles, an unaligned store takes 741
>> cycles, and the two together take over 1800 cycles.
>>
>> This terrible unaligned access behavior is atypical even for low
>> power cores. Arm’s Cortex A75 only takes 15 cycles in the worst case
>> of dependent accesses that are both misaligned.
>>
>> Digging deeper with performance counters reveals that executing each
>> unaligned load instruction results in ~505 executed instructions.
>> P550 almost certainly doesn’t have hardware support for unaligned
>> accesses. Rather, it’s likely raising a fault and letting an
>> operating system handler emulate it in software."
>
> An emulation fault, or something similarly nasty...
>
> At that point, even turning any potentially unaligned load or store
> into a runtime call is likely to be a lot cheaper.
>
> Say:
>   __mem_ld_unaligned:
>     ANDI  X15, X10, 7        # offset of the address within an 8B word
>     BEQ   X15, X0, .aligned  # offset 0: already aligned
>     SUB   X14, X10, X15      # round the address down to aligned base
>     LD    X12, 0(X14)        # low 64-bit word
>     LD    X13, 8(X14)        # high 64-bit word
>     SLLI  X14, X15, 3        # shift amount in bits (offset * 8)
>     LI    X17, 64
>     SUB   X16, X17, X14      # complementary shift amount
>     SRL   X12, X12, X14      # drop unwanted low bytes
>     SLL   X13, X13, X16      # position the bytes from the high word
>     OR    X10, X12, X13      # merge the two fragments
>     RET
>   .aligned:
>     LD    X10, 0(X10)
>     RET
>
> The separate aligned path is needed because an SLL with a shift
> amount of 64 simply returns its input (since (64 & 63) == 0), which
> would break the merge when the offset is zero.
>
> Though not supported by GCC or similar, dedicated __aligned and
> __unaligned keywords could help here, to specify which pointers are
> aligned (no function call), unaligned (needs function call), or
> default (probably aligned).

All of which vanish when the HW does misaligned accesses. (It also
makes the job of the programmer easier.)

> ....
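[Editorial aside, not part of the thread: BGB's helper above can be mirrored in C to show the same two-word merge and why the aligned special case exists. This is my sketch under the same assumptions as the RV64 code: a little-endian host, 8-byte words, and it may read up to 8 bytes past the requested address, just as the LD of the high word does. `mem_ld_unaligned` here is simply a C rendition, not BGB's actual code.]

```c
#include <stdint.h>
#include <string.h>

/* Load 8 bytes from a possibly misaligned address by reading the two
 * enclosing aligned 64-bit words and merging, little-endian assumed. */
uint64_t mem_ld_unaligned(const void *p)
{
    uintptr_t a   = (uintptr_t)p;
    unsigned  off = (unsigned)(a & 7);          /* ANDI X15, X10, 7   */
    const uint64_t *w = (const uint64_t *)(a - off); /* aligned base  */
    if (off == 0)
        return *w;      /* aligned fast path: a shift count of 64     */
                        /* would wrap to 0 (and is UB in C), so the   */
                        /* merge below cannot cover this case         */
    uint64_t lo = w[0];                         /* LD X12, 0(X14)     */
    uint64_t hi = w[1];                         /* LD X13, 8(X14)     */
    unsigned sh = off * 8;                      /* 8..56: well defined */
    return (lo >> sh) | (hi << (64 - sh));      /* SRL / SLL / OR     */
}
```

Either way, the branch (or the runtime call around it) disappears entirely once the hardware handles misaligned accesses itself.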