Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Cost of handling misaligned access
Date: Thu, 06 Feb 2025 10:59:39 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 27
Message-ID: <2025Feb6.115939@mips.complang.tuwien.ac.at>
References: <5lNnP.1313925$2xE6.991023@fx18.iad> <vnosj6$t5o0$1@dont-email.me> <2025Feb3.075550@mips.complang.tuwien.ac.at> <wi7oP.2208275$FOb4.591154@fx15.iad> <2025Feb4.191631@mips.complang.tuwien.ac.at> <vo061a$2fiql$1@dont-email.me> <20250206000143.00000dd9@yahoo.com>
Injection-Date: Thu, 06 Feb 2025 12:17:55 +0100 (CET)
Injection-Info: dont-email.me; posting-host="8c722ff18d197a536e9146b6a5660175";
	logging-data="3067456"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1/D+E7F3devayEpK9Luodoc"
Cancel-Lock: sha1:1brScvHQj93Zm4UlSln5lGiFyIs=
X-newsreader: xrn 10.11
Bytes: 2272

Michael S <already5chosen@yahoo.com> writes:
>> This resulted in just 12 instructions to handle 32 tests.
>> 
>
>That sounds suboptimal.
>By unrolling outer loop by 2 or 3 you can greatly reduce the number of 
>memory accesses per comparison.

Looking at the inner loop code shown in
<2025Feb6.113049@mips.complang.tuwien.ac.at>, the 12 instructions do
not include the loop overhead and are already unrolled by a factor of
4 (32 for the scalar code).  The loop overhead is 3 instructions, for
a total of 15 instructions per iteration.

>The speed up would depend on specific
>microarchitecture, but I would guess that at least 1.2x speedup is here.

Even if you completely eliminate the loop overhead, the number of
instructions is reduced by at most a factor of 15/12 = 1.25, and I
expect that the speedup from further unrolling is a factor of at most
1 on most CPUs (a factor <1 can come from handling the remaining
elements slowly, which does not seem unlikely for code coming out of
gcc and clang).
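
The instruction-count bound can be checked with a few lines of Python.
This is only a sketch of the arithmetic: the function names are made up
for illustration, and it assumes that further unrolling leaves the 12
body instructions per iteration unchanged and only amortizes the 3
overhead instructions. It counts instructions, not cycles, so it says
nothing about the actual speedup on a given CPU.

```python
from fractions import Fraction

BODY = 12      # instructions per iteration excluding loop overhead (from the post)
OVERHEAD = 3   # loop-overhead instructions per iteration (from the post)

def max_reduction(body: int, overhead: int) -> Fraction:
    """Upper bound on the instruction-count reduction factor,
    reached only if the loop overhead disappears completely."""
    return Fraction(body + overhead, body)

def reduction_with_unroll(body: int, overhead: int, extra_unroll: int) -> Fraction:
    """Reduction factor when the existing loop is unrolled further by
    `extra_unroll`, assuming (hypothetically) that the body cost per
    original iteration stays the same and only the overhead is shared."""
    before = Fraction(body + overhead)
    after = body + Fraction(overhead, extra_unroll)
    return before / after

print("upper bound:", float(max_reduction(BODY, OVERHEAD)))
for u in (2, 3):
    print(f"extra unroll by {u}:", float(reduction_with_unroll(BODY, OVERHEAD, u)))
```

Under these assumptions, unrolling by a further factor of 2 or 3 gives
instruction-count reductions of 10/9 and 15/13, both below the 1.25
bound.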

- anton
-- 
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>