From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Cost of handling misaligned access
Date: Tue, 18 Feb 2025 01:50:33 -0600
Organization: A noiseless patient Spider
Message-ID: <vp1e4m$1jv4i$1@dont-email.me>
References: <5lNnP.1313925$2xE6.991023@fx18.iad> <vnosj6$t5o0$1@dont-email.me> <2025Feb3.075550@mips.complang.tuwien.ac.at> <volg1m$31ca1$1@dont-email.me> <voobnc$3l2dl$1@dont-email.me> <0fc4cc997441e25330ff5c8735247b54@www.novabbs.org> <vp0m3f$1cth6$1@dont-email.me> <74142fbdc017bc560d75541f3f3c5118@www.novabbs.org>
User-Agent: Mozilla Thunderbird
In-Reply-To: <74142fbdc017bc560d75541f3f3c5118@www.novabbs.org>

On 2/17/2025 8:55 PM, MitchAlsup1 wrote:
> On Tue, 18 Feb 2025 1:00:18 +0000, BGB wrote:
>
>> On 2/14/2025 3:52 PM, MitchAlsup1 wrote:
> ------------
>>> It would take LESS total man-power world-wide and over-time to
>>> simply make HW perform misaligned accesses.
>>
>> I think the usual issue is that on low-end hardware, it is seen as
>> "better" to skip out on misaligned access in order to save some cost
>> in the L1 cache.
>>
>> Though, not sure how this mixes with 16/32 ISAs, given if one allows
>> misaligned 32-bit instructions, and a misaligned 32-bit instruction
>> to cross a cache-line boundary, one still has to deal with
>> essentially the same issues.
>
> Strategy for low end processors::
> a) detect misalignment in AGEN
> b) when misaligned, AGEN takes 2 cycles for the two addresses
> c) when misaligned, DC is accessed twice
> d) When misaligned, LD align is performed twice to merge data
>

Possibly. I had done it at basically full speed with sets of even- and
odd-addressed cache lines, but some mechanism to crack the Load/Store
into two parts internally could be a different strategy. The cracking
might only be needed, though, if the misaligned access also crosses a
line boundary.

>> Another related thing I can note is internal store-forwarding within
>> the L1 D$ to avoid RAW and WAW penalties for multiple accesses to
>> the same cache line.
>
> IMHO:: Low end processors should not be doing ST->LD forwarding.
>

Possibly true. This feature adds a bit of cost, and is one of the
things I ended up needing to turn off in attempts to boost the clock
speed to 75MHz. But, my existing core is currently a little too bulky
to try pushing to 75MHz.

Using staggered stores in prologs and memcpy does significantly reduce
the performance impact of disabling this forwarding (though disabling
it does still put some hurt on the speed of LZ4 and RP2 decoding).

I am left half-thinking it might make sense to try doing something
lighter, but would need to decide on specifics. A full soft-reboot is
unlikely; it might make more sense to design a core for a subset of my
current design.

One possibility could be to design a 2-wide core around a subset of
XG3, and possibly try aiming for a 75MHz target. It may drop to
32/64-bit instructions and 64-bit fetch. It may not try for RV64G, as
some things in RV64G add too much complexity and would likely make a
75MHz target harder.

Some things would be TBD, like whether to stay with full
compare-and-branch, or drop back to the cheaper
compare-with-zero-and-branch. It would likely (once again) axe some
things that needed to be added for RV64G support (but which remain
debatable in terms of hardware cost, 1).
1: Say, for example, 64-bit integer multiply and divide. It can be
cheaper to do a 64-bit CPU but only provide a 32-bit multiplier
(falling back to software for 64-bit multiply).

XG2 is also possible, but arguably, XG3 does have a cleaner encoding
scheme. Currently, either can be decoded in terms of the other, but
there are a number of special cases (and it might be cleaner to switch
to XG3 as the native encoding scheme).

I guess another open question is whether there is a way to make my
Binary64 FPU cheaper and with less timing impact. Not sure; it was
already a bit of an exercise in corner cutting.

There is also an idle thought of trying to lengthen the pipeline
enough to allow fully pipelined FPU ops. But, the issue is doing so
cheaply (and without negatively affecting the cost of branch-predictor
misses). Say:
  PF IF ID RF E1 E2 E3 E4 E5 E6 WB
This would have steeper cost and increased branch latency. Though, one
could possibly only allow forwarding from certain stages, say: E2, E3,
and E5. Whereas, if the result is in E1 or E4, it generates an
interlock stall, and E6 stalls until WB completes (may or may not
allow forwarding from WB).

Though, possibly, there could be "pseudo-forwarding" from E4/E5/E6,
where if an instruction completed in a prior stage, these stages can
still forward the result, but no new results may "arrive" at these
stages (dunno how much difference this would make for forwarding cost;
it could still be expensive to have this many EX stages).

Dropping EX1, as-is, mostly affects the performance of Reg-Reg and
Imm-Reg MOV (pretty much everything else of note already has a 2-cycle
latency), but these instructions are more sensitive to latency (so, a
2-cycle MOV is not ideal).

With 6 pipeline stages, this could be enough to allow pipelining a
Binary64 FMUL or FADD, or a Binary32 FMAC. But, it would mean a
13-cycle branch miss, ... And possibly also turn the CPU into a turd.
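As a minimal sketch of the forwarding rules above (taps only at E2,
E3, and E5; a result sitting in E1 or E4 forces an interlock until it
reaches a tapped stage): a toy C model, where the function names and
the one-stage-advance-per-cycle assumption are mine, not anything
measured from the actual core.

```c
#include <assert.h>
#include <stdbool.h>

/* Stages E1..E6 of the hypothetical lengthened pipeline. */
enum { E1 = 1, E2, E3, E4, E5, E6 };

/* Forwarding taps exist only at E2, E3, and E5. */
static bool can_forward(int stage) {
    return stage == E2 || stage == E3 || stage == E5;
}

/* Interlock stalls seen by a dependent instruction in RF, assuming
 * the producer advances one stage per cycle; past E6 it must wait
 * for WB anyway, so we stop counting there. */
static int stall_cycles(int stage) {
    int stalls = 0;
    while (!can_forward(stage)) {
        stage++;      /* producer advances; consumer waits in RF */
        stalls++;
        if (stage > E6)
            return stalls;  /* result now only reachable via WB */
    }
    return stalls;
}
```

Under this model, a result in E2/E3/E5 forwards with no stall, while
E1 and E4 each cost one interlock cycle; it also makes visible why a
MOV with its result only ready in E2 (after dropping EX1) picks up an
extra cycle on back-to-back dependent uses.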
Another option could be to keep 3 primary EX stages, but have a
mechanism for registers to be marked as "not yet available", and then
allow longer-latency operations to finish at some later stage.

Some cores I had looked at had done this (for things like memory
accesses, which were put into a FIFO), but this leaves the issue of
how to best get results back into the register file (one doesn't want
to be handing out register-file write ports to function units, and
there is a high probability of multiple FUs wanting to submit results
at the same time, which would need to be dealt with).

The best option I can think of is that these FUs have a mechanism to
hold 1 or 2 values, and a mechanism exists to MUX these over a shared
write port, generating pipeline stalls if the port gets backlogged.
But, this seems like it would suck.

Moving instructions along one stage at a time, and then having the
final value appear on the pipeline (to be forwarded back to RF, or
eventually reach WB), is cleaner and simpler. Never mind the issue of
needing to stall the pipeline whenever the L1 cache misses or similar.

...

But, I guess the more immediate question would be more one of coming
up with something that has a decent/fast ISA, can run at 75MHz, and
fits more easily onto an XC7S50 or similar. Though, the most
conservative option is to keep a design similar to my existing core,
and just try to strip it down a fair bit.

> ---------------------
>>
>> Say, it is less convoluted to do, say:
>>   MOV.X R24, (SP, 0)
>>   MOV.X R26, (SP, 16)
>>   MOV.X R28, (SP, 32)
>>   MOV.X R30, (SP, 48)
>
> These still look like LDs to me.
>

My ASM notation is "OP Src, Dst". Which is, granted, backwards from
Intel and RV notation.

========== REMAINDER OF ARTICLE TRUNCATED ==========