From: antispam@fricas.org (Waldek Hebisch)
Newsgroups: comp.arch
Subject: Re: Cost of handling misaligned access
Date: Thu, 6 Feb 2025 15:58:07 -0000 (UTC)
Organization: To protect and to server
References: <5lNnP.1313925$2xE6.991023@fx18.iad> <2025Feb3.075550@mips.complang.tuwien.ac.at> <2025Feb6.122124@mips.complang.tuwien.ac.at>

Anton Ertl wrote:
> antispam@fricas.org (Waldek Hebisch) writes:
>>Concerning SIMD: trouble here is increasing vector length and
>>consequently increasing alignment requirements.
>
> That is not a necessary consequence, on the contrary: alignment
> requirements based on SIMD granularity is hardware designer laziness,
> but means that SIMD cannot be used for many of the applications where
> SIMD without that limitation can be used.
>
> If you want to have alignment checks, then a SIMD instruction should
> check for element alignment, not for SIMD alignment.
>
> But the computer architecture trend is clear: General-purpose
> computers do not have alignment restrictions; all that had them have
> been discontinued; the last one that had them was SPARC.

The trend is clear, but there is a question: is it a good trend?
You wrote about lazy hardware designers, but there are many more
lazy programmers.  There are situations where unaligned access is
needed, but a significant proportion of unaligned accesses are not
needed at all.  At best such unaligned accesses lead to a small
performance loss, but they may also be latent bugs.
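
To make the "needed versus accidental" distinction concrete, here is a
minimal C sketch (an illustration, not something from the thread).
The helper names is_aligned, load_u32_unaligned and load_u32_aligned
are made up for this example.  The assumption is that code which
really needs an unaligned access says so explicitly, while code that
expects alignment checks it, so that an accidental misalignment shows
up in a debug build instead of staying latent.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: nonzero if p is a multiple of align.
   align must be a power of two. */
static int is_aligned(const void *p, size_t align)
{
    return ((uintptr_t)p & (align - 1)) == 0;
}

/* Hypothetical helper: a deliberate unaligned 32-bit load.
   memcpy is well defined for any address; compilers commonly
   reduce it to one load on targets that permit unaligned access. */
static uint32_t load_u32_unaligned(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Hypothetical helper: a load that is supposed to be aligned.
   The assert flags accidental misalignment in debug builds
   (it is compiled out when NDEBUG is defined). */
static uint32_t load_u32_aligned(const uint32_t *p)
{
    assert(is_aligned(p, sizeof *p));
    return *p;
}
```

With this split, every call to load_u32_unaligned marks a place where
the misalignment is intended, and everything else can be checked.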
There are cases where unaligned accesses are better than aligned
ones; for those the architecture should have appropriate
instructions.

>>A lot of SIMD
>>code is memory-bound and current way of doing misaligned
>>access leads to worse performance. So really no good way
>>to solve this. In principle set of buffers for 2 cache lines
>>each and appropriate shifters could give optimal throughput,
>>but probably would lead to increased latency.
>
> AFAIK that's what current microarchitectures do, and in many cases
> with small penalties for unaligned accesses; see
> https://www.complang.tuwien.ac.at/anton/unaligned-stores/

You call doubling the store time a 'small penalty'.  For me, in a
performance-critical loop 10% matters, and it is worth aligning
things to avoid such a loss.  And what you present does not look
like what I wrote above: AFAICS what Intel does works within a
single cache line, and there is a penalty when crossing lines (with
2-cache-line buffers there would be no penalty for line crossing).

For me loads are much more important.  First, there are more of
them.  Second, stores can be buffered, so the latency of the store
itself is of little importance (the latency from store to load
matters).  For loads, extra logic in the load path increases
latency, and that may limit program speed.

-- 
                              Waldek Hebisch