From: "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com>
Newsgroups: comp.arch
Subject: Re: Cost of handling misaligned access
Date: Sun, 2 Feb 2025 14:44:13 -0800
Organization: A noiseless patient Spider
Message-ID: <vnosfu$t4ra$1@dont-email.me>
References: <5lNnP.1313925$2xE6.991023@fx18.iad> <b50b6b125cc92f7711d420a746941f7e@www.novabbs.org>
In-Reply-To: <b50b6b125cc92f7711d420a746941f7e@www.novabbs.org>

On 2/2/2025 10:51 AM, MitchAlsup1 wrote:
> On Sun, 2 Feb 2025 16:45:19 +0000, EricP wrote:
>
>> As you can see in the article below, the cost of NOT handling misaligned
>> accesses in hardware is quite high in cpu clocks.
>>
>> To my eye, the incremental cost of adding hardware support for misaligned
>> to the AGU and cache data path should be quite low. The alignment shifter
>> is basically the same: assuming a 64-byte cache line, LD still has to
>> shift any of the 64 bytes into position 0, and reverse for ST.
>
> A handful of gates to detect misalignedness and recognize the line and
> page crossing misalignments.
>
> The alignment shifters are twice as big.
>
> Now, while I accept these costs, I accept that others may not. I accept
> these costs because of the performance issues when I don't.
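
For what it's worth, the "handful of gates" check comes down to a few
comparisons on the address bits. A minimal C sketch of that logic,
assuming the 64-byte line and 4 kB page from EricP's example. The
function names are made up for illustration and do not come from any
real core or library:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LINE_BYTES 64u    /* assumed cache line size */
#define PAGE_BYTES 4096u  /* assumed page size */

/* Naturally aligned means the address is a multiple of the access size.
   The access size must be a nonzero power of two. */
static bool is_misaligned(uintptr_t addr, size_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    return (addr & (size - 1)) != 0;
}

/* True if bytes [addr, addr + size) fall into more than one block. */
static bool straddles(uintptr_t addr, size_t size, uintptr_t block)
{
    return (addr / block) != ((addr + size - 1) / block);
}

static bool crosses_line(uintptr_t addr, size_t size)
{
    return straddles(addr, size, LINE_BYTES);
}

static bool crosses_page(uintptr_t addr, size_t size)
{
    return straddles(addr, size, PAGE_BYTES);
}
```

Only an access for which crosses_line() is true needs the second cache
access, and only one for which crosses_page() is true needs the second
TLB lookup.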
>
>> The incremental cost is in a sequencer in the AGU for handling cache
>> line and possibly virtual page straddles, and a small byte shifter to
>> left shift the high order bytes. The AGU sequencer needs to know if the
>> line straddles a page boundary, if not then increment the 6-bit physical
>> line number within the 4 kB physical frame number, if yes then increment
>> virtual page number and TLB lookup again and access the first line.
>> (Slightly more if multiple page sizes are supported, but same idea.)
>> For a load AGU merges the low and high fragments and forwards.
>>
>> I don't think there are line straddle consequences for coherence because
>> there is no ordering guarantees for misaligned accesses.
>
> Generally stated as:: Misaligned accesses cannot be considered ATOMIC.

Try it on an x86/x64. Make the operand straddle an L2 cache line and
hit it with a LOCK'ed RMW. It should assert the bus lock.

>
>> The hardware cost appears trivial, especially within an OoO core.
>> So there doesn't appear to be any reason to not handle this.
>> Am I missing something?
>>
>> https://old.chipsandcheese.com/2025/01/26/inside-sifives-p550-microarchitecture/
>>
>> [about half way down]
>>
>> "Before accessing cache, load addresses have to be checked against
>> older stores (and vice versa) to ensure proper ordering. If there is a
>> dependency, P550 can only do fast store forwarding if the load and store
>> addresses match exactly and both accesses are naturally aligned.
>> Any unaligned access, dependent or not, confuses P550 for hundreds of
>> cycles. Worse, the unaligned loads and stores don’t proceed in parallel.
>> An unaligned load takes 1062 cycles, an unaligned store takes
>> 741 cycles, and the two together take over 1800 cycles.
>>
>> This terrible unaligned access behavior is atypical even for low power
>> cores. Arm’s Cortex A75 only takes 15 cycles in the worst case of
>> dependent accesses that are both misaligned.
>>
>> Digging deeper with performance counters reveals executing each unaligned
>> load instruction results in ~505 executed instructions. P550 almost
>> certainly doesn’t have hardware support for unaligned accesses.
>> Rather, it’s likely raising a fault and letting an operating system
>> handler emulate it in software."
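
To make the "merge the low and high fragments" step concrete, here is a
rough C model of a misaligned 64-bit load built from two naturally
aligned loads plus a shift and an OR. This is only a sketch under stated
assumptions: memory is modeled as an array of aligned 64-bit words with
little-endian byte numbering inside each word, and the function name is
hypothetical. It is not the P550 trap handler or any real kernel's
emulation code.

```c
#include <stddef.h>
#include <stdint.h>

/* Model of a 64-bit load at an arbitrary byte address.
   mem[]     : memory as naturally aligned 64-bit words (assumption)
   byte_addr : byte offset into that memory; byte 0 of a word is its
               least significant byte (little-endian numbering)
   The caller must make sure mem[word + 1] exists when the address is
   not a multiple of 8. */
static uint64_t load_u64_misaligned(const uint64_t *mem, size_t byte_addr)
{
    size_t   word  = byte_addr / 8;                   /* aligned word index    */
    unsigned shift = (unsigned)(byte_addr % 8) * 8u;  /* bit offset in word    */

    uint64_t lo = mem[word];                          /* low fragment          */
    if (shift == 0)
        return lo;               /* aligned: one access, and no shift by 64   */

    uint64_t hi = mem[word + 1];                      /* high fragment         */

    /* Shift the low fragment down to position 0, left shift the high
       order bytes into place, then merge. */
    return (lo >> shift) | (hi << (64u - shift));
}
```

With mem[0] = 0x0706050403020100 and mem[1] = 0x0F0E0D0C0B0A0908, a load
at byte address 3 picks up bytes 3 through 10 and returns
0x0A09080706050403. Hardware does the same thing in the data path in a
cycle or two. A software handler has to take the trap, decode the
faulting instruction, do this, and return, which is consistent with the
~505 executed instructions quoted above.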