Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Cost of handling misaligned access
Date: Mon, 3 Feb 2025 04:13:44 -0600
Organization: A noiseless patient Spider
Lines: 231
Message-ID: <vnq4st$176b4$1@dont-email.me>
References: <5lNnP.1313925$2xE6.991023@fx18.iad> <vnosj6$t5o0$1@dont-email.me> <2025Feb3.075550@mips.complang.tuwien.ac.at> <vnptl6$15pgm$1@dont-email.me> <2025Feb3.093413@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 03 Feb 2025 11:13:50 +0100 (CET)
Injection-Info: dont-email.me; posting-host="f4fe680962b51b8bca49b43582a45572"; logging-data="1284452"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+va6853EPboXB02nuw2JZxL3A+kl+oW4o="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:8olmSpUA2xSNvSDlpyeAt3fconw=
In-Reply-To: <2025Feb3.093413@mips.complang.tuwien.ac.at>
Content-Language: en-US
Bytes: 8614

On 2/3/2025 2:34 AM, Anton Ertl wrote:
> BGB <cr88192@gmail.com> writes:
>> On 2/3/2025 12:55 AM, Anton Ertl wrote:
>> Rather, have something like an explicit "__unaligned" keyword or
>> similar, and then use the runtime call for these pointers.
>
> There are people who think that it is ok to compile *p to anything if
> p is not aligned, even on architectures that support unaligned
> accesses.  At least one of those people recommended the use of
> memcpy(..., ..., sizeof(...)).  Let's see what gcc produces on
> rv64gc (where unaligned accesses are guaranteed to work):
>
> [fedora-starfive:/tmp:111378] cat x.c
> #include <string.h>
>
> long uload(long *p)
> {
>   long x;
>   memcpy(&x,p,sizeof(long));
>   return x;
> }
> [fedora-starfive:/tmp:111379] gcc -O -S x.c
> [fedora-starfive:/tmp:111380] cat x.s
>         .file   "x.c"
>         .option nopic
>         .text
>         .align  1
>         .globl  uload
>         .type   uload, @function
> uload:
>         addi    sp,sp,-16
>         lbu     t1,0(a0)
>         lbu     a7,1(a0)
>         lbu     a6,2(a0)
>         lbu     a1,3(a0)
>         lbu     a2,4(a0)
>         lbu     a3,5(a0)
>         lbu     a4,6(a0)
>         lbu     a5,7(a0)
>         sb      t1,8(sp)
>         sb      a7,9(sp)
>         sb      a6,10(sp)
>         sb      a1,11(sp)
>         sb      a2,12(sp)
>         sb      a3,13(sp)
>         sb      a4,14(sp)
>         sb      a5,15(sp)
>         ld      a0,8(sp)
>         addi    sp,sp,16
>         jr      ra
>         .size   uload, .-uload
>         .ident  "GCC: (GNU) 10.3.1 20210422 (Red Hat 10.3.1-1)"
>         .section        .note.GNU-stack,"",@progbits
>
> Oh boy.  Godbolt tells me that gcc-14.2.0 still does it the same way,

This isn't really the way I would do it, but, granted, it is the way
GCC does it...

I guess one can at least be happy it isn't a call into a copy-slide,
say:

__memcpy_8:
    lb x13, 7(x11)
    sb x13, 7(x10)
__memcpy_7:
    lb x13, 6(x11)
    sb x13, 6(x10)
__memcpy_6:
    lb x13, 5(x11)
    sb x13, 5(x10)
__memcpy_5:
    lb x13, 4(x11)
    sb x13, 4(x10)
__memcpy_4:
    lb x13, 3(x11)
    sb x13, 3(x10)
__memcpy_3:
    lb x13, 2(x11)
    sb x13, 2(x10)
__memcpy_2:
    lb x13, 1(x11)
    sb x13, 1(x10)
__memcpy_1:
    lb x13, 0(x11)
    sb x13, 0(x10)
__memcpy_0:
    jr ra

But... then again...

In BGBCC, for fixed-size "memcpy()":

memcpy 0..64: will often generate inline.
  Direct loads/stores, up to 64 bits at a time;
  uses smaller accesses for any tail bytes.
memcpy 96..512: will call into an auto-generated slide
  (for multiples of 32 bytes).
  For sizes that are not a multiple of 32 bytes, it auto-generates a
  tail copy for that specific size, which then branches into the slide.

So: __memcpy_512 and __memcpy_480 will go directly to the slide.
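As a rough illustration of the size-based selection just described (a
minimal sketch, not BGBCC's actual code), the dispatch might look
something like the following C: inline expansion for small copies, a
direct jump into the slide for multiples of 32 bytes up to 512, a
per-size tail stub for the other sizes up to 512, and a call to the
generic routine beyond that. All names here are made up, and the post
does not say how the 65..95 byte range is handled, so the sketch lumps
it in with the slide/stub path.

    /* Hypothetical sketch of the fixed-size memcpy dispatch. */
    #include <stdio.h>

    typedef enum {
        EMIT_INLINE,       /* expand loads/stores at the call site */
        EMIT_SLIDE_ENTRY,  /* call __memcpy_<N>, an entry into the slide */
        EMIT_TAIL_STUB,    /* call a generated stub: tail copy, then slide */
        EMIT_GENERIC_CALL  /* call the generic memcpy routine */
    } CopyStrategy;

    static CopyStrategy pick_copy_strategy(size_t n) {
        if (n <= 64)
            return EMIT_INLINE;
        if (n <= 512)
            return (n % 32 == 0) ? EMIT_SLIDE_ENTRY : EMIT_TAIL_STUB;
        return EMIT_GENERIC_CALL;
    }

    int main(void) {
        static const char *names[] = {
            "inline", "slide entry", "tail stub + slide", "generic call"
        };
        size_t tests[] = { 16, 480, 488, 492, 512, 4096 };
        for (size_t i = 0; i < sizeof(tests) / sizeof(tests[0]); i++)
            printf("memcpy of %4zu bytes -> %s\n",
                   tests[i], names[pick_copy_strategy(tests[i])]);
        return 0;
    }

Under these assumptions, a 488-byte copy would report "tail stub +
slide", matching the __memcpy_488 case discussed next.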
Whereas, say, __memcpy_488 will generate a more specialized function
that copies 8 bytes and then branches into the slide.

The reason for using multiples of 32 bytes is that this is the minimum
copy size that does not suffer interlock penalties. Say, in XG2:

__memcpy_512:
    MOV.Q (R5, 480), X20
    MOV.Q (R5, 488), X21
    MOV.Q (R5, 496), X22
    MOV.Q (R5, 504), X23
    MOV.Q X20, (R4, 480)
    MOV.Q X21, (R4, 488)
    MOV.Q X22, (R4, 496)
    MOV.Q X23, (R4, 504)
__memcpy_480:
    ...

Then, say:

__memcpy_488:
    MOV.Q (R5, 480), X20
    MOV.Q X20, (R4, 480)
    BRA __memcpy_480
__memcpy_496:
    MOV.Q (R5, 480), X20
    MOV.Q (R5, 488), X21
    MOV.Q X20, (R4, 480)
    MOV.Q X21, (R4, 488)
    BRA __memcpy_480
...

Then:

__memcpy_492:
    MOV.L (R5, 488), X20
    MOV.L X20, (R4, 488)
    BRA __memcpy_488
...

For these latter cases, it keeps track of them via bitmaps (with a bit
for each size), so that it knows which sizes need to be generated.

In this case, these special functions and slides were the fastest
option that also doesn't waste excessive amounts of space (beyond the
cost of the slide itself, which is why it only handles copies of up to
512 bytes).

I ended up not bothering with special aligned cases, as the cost of
detecting whether the copy was aligned was generally more than what was
saved by having a separate aligned version.

If the copy is bigger than 512 bytes, it calls the generic memcpy
function... which generally copies larger chunks of memory (say, 512
bytes at a time) and then uses some smaller loops to clean up whatever
is left. Say, chunk sizes (bytes): 512, 128, 32, 16, 8, 4, 1.

> whereas clang 9.0.0 and following produce
>
> [fedora-starfive:/tmp:111383] clang -O -S x.c
> [fedora-starfive:/tmp:111384] cat x.s
>         .text
>         .attribute      4, 16
>         .attribute      5, "rv64i2p0_m2p0_a2p0_f2p0_d2p0_c2p0"
>         .file   "x.c"
>         .globl  uload                   # -- Begin function uload
>         .p2align        1
>         .type   uload,@function
> uload:                                  # @uload
>         .cfi_startproc
> # %bb.0:
>         ld      a0, 0(a0)
>         ret
> .Lfunc_end0:
>         .size   uload, .Lfunc_end0-uload
>         .cfi_endproc
>                                         # -- End function

========== REMAINDER OF ARTICLE TRUNCATED ==========
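Going back to the large-copy fallback mentioned above (copies bigger
than 512 bytes going to the generic routine), a minimal sketch in C
might look like the following, assuming the chunk ladder of 512, 128,
32, 16, 8, 4, 1 bytes described in the post. The name generic_memcpy is
made up, and a real implementation would copy each chunk with wide
loads/stores (or by branching into a copy slide) rather than the
byte-wise memcpy used here as a stand-in for an unrolled chunk copy.

    #include <stddef.h>
    #include <string.h>

    /* Hypothetical sketch: copy in 512-byte chunks, then clean up the
       remainder with progressively smaller chunk sizes. */
    void *generic_memcpy(void *dst, const void *src, size_t n) {
        static const size_t chunks[] = { 512, 128, 32, 16, 8, 4, 1 };
        unsigned char *d = dst;
        const unsigned char *s = src;

        for (size_t i = 0; i < sizeof(chunks) / sizeof(chunks[0]); i++) {
            size_t c = chunks[i];
            while (n >= c) {         /* copy as many c-byte chunks as fit */
                memcpy(d, s, c);     /* stand-in for an unrolled c-byte copy */
                d += c;
                s += c;
                n -= c;
            }
        }
        return dst;
    }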