From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Cost of handling misaligned access
Date: Sun, 02 Feb 2025 17:44:58 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2025Feb2.184458@mips.complang.tuwien.ac.at>
References: <5lNnP.1313925$2xE6.991023@fx18.iad>

EricP writes:
>The incremental cost is in a sequencer in the AGU for handling cache
>line and possibly virtual page straddles, and a small byte shifter to
>left shift the high order bytes. The AGU sequencer needs to know if the
>line straddles a page boundary, if not then increment the 6-bit physical
>line number within the 4 kB physical frame number, if yes then increment
>virtual page number and TLB lookup again and access the first line.
>(Slightly more if multiple page sizes are supported, but same idea.)
>For a load AGU merges the low and high fragments and forwards.
....
>The hardware cost appears trivial, especially within an OoO core.
>So there doesn't appear to be any reason to not handle this.
>Am I missing something?

The OS must also be able to keep both pages in physical memory until
the access is complete, or there will be no progress.  This should not
be a problem these days, but the 48 pages or so potentially needed by
a single VAX instruction complicated the OS.

Yes, the hardware is not hard, there is software that benefits, and as
a result, modern architectures (including RISC-V) now support
unaligned accesses (except for atomic accesses).
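A rough behavioural model of the sequencer described in the quoted
text, as a Python sketch.  The 64-byte line size follows from the
quoted 6-bit line number within a 4 kB page; the names split_access
and load, and the flat byte array standing in for memory, are invented
for illustration and ignore address translation entirely.

```python
LINE_BYTES = 64    # assumed: 6-bit line number within a 4 kB page
PAGE_BYTES = 4096  # page size from the quoted text


def split_access(vaddr, size):
    """Split one access into per-line fragments.

    Returns a list of (vaddr, nbytes, needs_tlb_lookup) tuples.
    """
    assert 0 < size <= LINE_BYTES
    # Bytes that still fit in the first cache line.
    first_len = min(size, LINE_BYTES - vaddr % LINE_BYTES)
    frags = [(vaddr, first_len, True)]  # the first fragment is always translated
    if first_len < size:
        hi = vaddr + first_len          # start of the next line
        # Same page: only the line number is incremented.
        # New page: the virtual page number changes, so translate again.
        crosses_page = hi % PAGE_BYTES == 0
        frags.append((hi, size - first_len, crosses_page))
    return frags


def load(mem, vaddr, size):
    """Little-endian load that merges the low and high fragments."""
    value = 0
    shift = 0
    for addr, nbytes, _ in split_access(vaddr, size):
        frag = int.from_bytes(mem[addr:addr + nbytes], "little")
        value |= frag << shift  # shift the high-order bytes into place
        shift += 8 * nbytes
    return value
```

For example, split_access(60, 8) yields two fragments within one page,
while split_access(4094, 4) yields a second fragment that needs its
own translation, which is the case where both pages have to stay
resident.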
>https://old.chipsandcheese.com/2025/01/26/inside-sifives-p550-microarchitecture/
....
>This terrible unaligned access behavior is atypical even for low power
>cores. Arm's Cortex A75 only takes 15 cycles in the worst case of
>dependent accesses that are both misaligned.
>
>Digging deeper with performance counters reveals executing each unaligned
>load instruction results in ~505 executed instructions.

This is similar to what I measured on a U74 core from SiFive
<2024May14.073553@mips.complang.tuwien.ac.at>, so they probably use
the same solution.

>P550 almost
>certainly doesn’t have hardware support for unaligned accesses.
>Rather, it’s likely raising a fault and letting an operating system
>handler emulate it in software."

The architecture guarantees that unaligned accesses work, so the OS
might not have support for such emulation.  Another option would be to
trap into some kind of firmware-supplied fixup code, along the lines
of Alpha's PALcode.

- anton
-- 
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup,
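For illustration, the core of a software fixup for an unaligned load
can be modelled in a few lines of Python.  This is only a sketch of
the general idea (assemble the value from accesses that are always
aligned); it is not the handler of any particular OS or firmware, and
aligned_load and emulate_unaligned_load are hypothetical names.

```python
def aligned_load(mem, addr, size):
    """Stand-in for a hardware load that only accepts aligned addresses."""
    assert addr % size == 0, "this model rejects misaligned loads"
    return int.from_bytes(mem[addr:addr + size], "little")


def emulate_unaligned_load(mem, addr, size):
    """Build a little-endian value of `size` bytes from byte loads.

    Single-byte loads are always aligned, so they never trap again.
    """
    value = 0
    for i in range(size):
        value |= aligned_load(mem, addr + i, 1) << (8 * i)
    return value
```

The loop itself is short; in a real trap-and-emulate path, additional
work such as trap entry, decoding the faulting instruction and
returning would presumably account for much of the instruction count.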