From: Terje Mathisen <terje.mathisen@tmsw.no>
Newsgroups: comp.arch
Subject: Re: Cost of handling misaligned access
Date: Thu, 6 Feb 2025 16:00:42 +0100
Organization: A noiseless patient Spider
Lines: 76
Message-ID: <vo2iqq$30elm$1@dont-email.me>
References: <5lNnP.1313925$2xE6.991023@fx18.iad>
 <vnosj6$t5o0$1@dont-email.me>
 <2025Feb3.075550@mips.complang.tuwien.ac.at>
 <wi7oP.2208275$FOb4.591154@fx15.iad>
 <2025Feb4.191631@mips.complang.tuwien.ac.at>
 <vo061a$2fiql$1@dont-email.me>
 <20250206000143.00000dd9@yahoo.com>
 <2025Feb6.115939@mips.complang.tuwien.ac.at>
 <20250206152808.0000058f@yahoo.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101 Firefox/128.0 SeaMonkey/2.53.20
In-Reply-To: <20250206152808.0000058f@yahoo.com>

Michael S wrote:
> On Thu, 06 Feb 2025 10:59:39 GMT
> anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>
>> Michael S <already5chosen@yahoo.com> writes:
>>>> This resulted in just 12 instructions to handle 32 tests.
>>>
>>> That sounds suboptimal.
>>> By unrolling the outer loop by 2 or 3 you can greatly reduce the
>>> number of memory accesses per comparison.
>>
>> Looking at the inner loop code shown in
>> <2025Feb6.113049@mips.complang.tuwien.ac.at>, the 12 instructions do
>> not include the loop overhead and are already unrolled by a factor
>> of 4 (32 for the scalar code).
>> The loop overhead is 3 instructions, for a total of 15 instructions
>> per iteration.
>>
>>> The speedup would depend on the specific microarchitecture, but I
>>> would guess that at least a 1.2x speedup is there.
>>
>> Even if you completely eliminate the loop overhead, the number of
>> instructions is reduced by at most a factor of 1.25, and I expect
>> that the speedup from further unrolling is a factor of at most 1 on
>> most CPUs (a factor <1 can come from handling the remaining elements
>> slowly, which does not seem unlikely for code coming out of gcc and
>> clang).
>>
>> - anton
>
> The point of my proposal is not reduction of loop overhead, and not
> reduction of the # of x86 instructions (in fact, with my proposal the
> # of x86 instructions is increased), but reduction of the # of uOps
> due to reuse of loaded values.
> The theory behind it is that, most typically, in code with very high
> IPC like the one above, the main bottleneck is the # of uOps that
> flow through the rename stage.

Aha! I see what you mean:

Yes, this would be better if the VPAND reg,reg,[mem] instructions
actually took more than one cycle each, but as the arrays were just
1000 bytes each (250 keys + 250 locks), everything fits easily in $L1.

(BTW, I did try to add 6 dummy keys and locks just to avoid any
loop-end overhead, but that actually ran slower.)

> Not counting loop overhead, the original 1x4 inner loop consists of
> 12 instructions, 16 uOps. Suppose we replace it with a 2x2 inner loop
> that does the same amount of work. The new inner loop contains only
> RISC-like instructions - 14 instructions, 14 uOps.
> With a 3x2 inner loop there are 20 instructions, 20 uOps, and 1.5x
> more work done per iteration.
>
> Another factor that can contribute to a speedup is the increased
> number of iterations in the inner loop - from 1..7 iterations in the
> original to 1..15 in both of my above-mentioned variants.
>
> Yet another possibility is to follow the "work harder, not smarter"
> principle, i.e. process the whole square rather than just the
> relevant triangle. The main gain is that the loop detector would be
> able to predict the # of iterations in the inner loop, avoiding the
> mispredicted branch at the end. If we follow this path, then it
> probably makes sense not to unroll the inner loop beyond a SIMD
> factor of 8, and instead unroll the outer loop by 4.
> Going by intuition, in this particular application "smarter" wins
> over "harder", but we know that intuition sucks. Including mine :(

Yeah, it was the actual measured performance which amazed me. :-)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"