Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Cost of handling misaligned access
Date: Thu, 06 Feb 2025 10:59:39 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 27
Message-ID: <2025Feb6.115939@mips.complang.tuwien.ac.at>
References: <5lNnP.1313925$2xE6.991023@fx18.iad> <vnosj6$t5o0$1@dont-email.me>
 <2025Feb3.075550@mips.complang.tuwien.ac.at> <wi7oP.2208275$FOb4.591154@fx15.iad>
 <2025Feb4.191631@mips.complang.tuwien.ac.at> <vo061a$2fiql$1@dont-email.me>
 <20250206000143.00000dd9@yahoo.com>
Injection-Date: Thu, 06 Feb 2025 12:17:55 +0100 (CET)
Injection-Info: dont-email.me; posting-host="8c722ff18d197a536e9146b6a5660175";
 logging-data="3067456"; mail-complaints-to="abuse@eternal-september.org";
 posting-account="U2FsdGVkX1/D+E7F3devayEpK9Luodoc"
Cancel-Lock: sha1:1brScvHQj93Zm4UlSln5lGiFyIs=
X-newsreader: xrn 10.11
Bytes: 2272

Michael S <already5chosen@yahoo.com> writes:
>> This resulted in just 12 instructions to handle 32 tests.
>>
>
>That sounds suboptimal.
>By unrolling outer loop by 2 or 3 you can greatly reduce the number of
>memory accesses per comparison.

Looking at the inner loop code shown in
<2025Feb6.113049@mips.complang.tuwien.ac.at>, the 12 instructions do
not include the loop overhead and are already unrolled by a factor of
4 (32 for the scalar code).  The loop overhead is 3 instructions, for
a total of 15 instructions per iteration.

>The speed up would depend on specific
>microarchitecture, but I would guess that at least 1.2x speedup is here.

Even if you completely eliminate the loop overhead, the number of
instructions is reduced by at most a factor of 1.25 (15/12), and I
expect that the speedup from further unrolling is a factor of at most
1 on most CPUs (factor <1 can come from handling the remaining
elements slowly, which does not seem unlikely for code coming out of
gcc and clang).

- anton
-- 
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>