Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Cost of handling misaligned access
Date: Mon, 3 Feb 2025 04:13:44 -0600
Organization: A noiseless patient Spider
Lines: 231
Message-ID: <vnq4st$176b4$1@dont-email.me>
References: <5lNnP.1313925$2xE6.991023@fx18.iad> <vnosj6$t5o0$1@dont-email.me> <2025Feb3.075550@mips.complang.tuwien.ac.at> <vnptl6$15pgm$1@dont-email.me> <2025Feb3.093413@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Mon, 03 Feb 2025 11:13:50 +0100 (CET)
Injection-Info: dont-email.me; posting-host="f4fe680962b51b8bca49b43582a45572"; logging-data="1284452"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+va6853EPboXB02nuw2JZxL3A+kl+oW4o="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:8olmSpUA2xSNvSDlpyeAt3fconw=
In-Reply-To: <2025Feb3.093413@mips.complang.tuwien.ac.at>
Content-Language: en-US
Bytes: 8614

On 2/3/2025 2:34 AM, Anton Ertl wrote:
> BGB <cr88192@gmail.com> writes:
>> On 2/3/2025 12:55 AM, Anton Ertl wrote:
>> Rather, have something like an explicit "__unaligned" keyword or
>> similar, and then use the runtime call for these pointers.
>
> There are people who think that it is ok to compile *p to anything if
> p is not aligned, even on architectures that support unaligned
> accesses.  At least one of those people recommended the use of
> memcpy(..., ..., sizeof(...)).  Let's see what gcc produces on
> rv64gc (where unaligned accesses are guaranteed to work):
>
> [fedora-starfive:/tmp:111378] cat x.c
> #include <string.h>
>
> long uload(long *p)
> {
>   long x;
>   memcpy(&x,p,sizeof(long));
>   return x;
> }
> [fedora-starfive:/tmp:111379] gcc -O -S x.c
> [fedora-starfive:/tmp:111380] cat x.s
>         .file   "x.c"
>         .option nopic
>         .text
>         .align  1
>         .globl  uload
>         .type   uload, @function
> uload:
>         addi    sp,sp,-16
>         lbu     t1,0(a0)
>         lbu     a7,1(a0)
>         lbu     a6,2(a0)
>         lbu     a1,3(a0)
>         lbu     a2,4(a0)
>         lbu     a3,5(a0)
>         lbu     a4,6(a0)
>         lbu     a5,7(a0)
>         sb      t1,8(sp)
>         sb      a7,9(sp)
>         sb      a6,10(sp)
>         sb      a1,11(sp)
>         sb      a2,12(sp)
>         sb      a3,13(sp)
>         sb      a4,14(sp)
>         sb      a5,15(sp)
>         ld      a0,8(sp)
>         addi    sp,sp,16
>         jr      ra
>         .size   uload, .-uload
>         .ident  "GCC: (GNU) 10.3.1 20210422 (Red Hat 10.3.1-1)"
>         .section        .note.GNU-stack,"",@progbits
>
> Oh boy.  Godbolt tells me that gcc-14.2.0 still does it the same way,

This isn't really the way I would do it, but, granted, it is the way
GCC does it...

I guess one can at least be happy it isn't a call into a copy-slide,
say:

__memcpy_8:
    lb x13, 7(x11)
    sb x13, 7(x10)
__memcpy_7:
    lb x13, 6(x11)
    sb x13, 6(x10)
__memcpy_6:
    lb x13, 5(x11)
    sb x13, 5(x10)
__memcpy_5:
    lb x13, 4(x11)
    sb x13, 4(x10)
__memcpy_4:
    lb x13, 3(x11)
    sb x13, 3(x10)
__memcpy_3:
    lb x13, 2(x11)
    sb x13, 2(x10)
__memcpy_2:
    lb x13, 1(x11)
    sb x13, 1(x10)
__memcpy_1:
    lb x13, 0(x11)
    sb x13, 0(x10)
__memcpy_0:
    jr ra

But... then again...

In BGBCC, for fixed-size "memcpy()":

memcpy 0..64: will often generate inline.
  Direct loads/stores, up to 64 bits at a time;
  uses smaller accesses for any tail bytes.
memcpy 96..512: will call into an auto-generated slide
  (for multiples of 32 bytes).
  For sizes that are not a multiple of 32 bytes, it auto-generates a
  tail copy for that specific size, which then branches into the slide.

So: __memcpy_512 and __memcpy_480 will go directly to the slide.
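As a rough illustration of the size-based selection just described (a
minimal sketch, not BGBCC's actual code), the dispatch might look
something like the following C: inline expansion for small copies, a
direct jump into the slide for multiples of 32 bytes up to 512, a
per-size tail stub for the other sizes up to 512, and a call to the
generic routine beyond that. All names here are made up, and the post
does not say how the 65..95 byte range is handled, so the sketch lumps
it in with the slide/stub path.

    /* Hypothetical sketch of the fixed-size memcpy dispatch. */
    #include <stdio.h>

    typedef enum {
        EMIT_INLINE,       /* expand loads/stores at the call site */
        EMIT_SLIDE_ENTRY,  /* call __memcpy_<N>, an entry into the slide */
        EMIT_TAIL_STUB,    /* call a generated stub: tail copy, then slide */
        EMIT_GENERIC_CALL  /* call the generic memcpy routine */
    } CopyStrategy;

    static CopyStrategy pick_copy_strategy(size_t n) {
        if (n <= 64)
            return EMIT_INLINE;
        if (n <= 512)
            return (n % 32 == 0) ? EMIT_SLIDE_ENTRY : EMIT_TAIL_STUB;
        return EMIT_GENERIC_CALL;
    }

    int main(void) {
        static const char *names[] = {
            "inline", "slide entry", "tail stub + slide", "generic call"
        };
        size_t tests[] = { 16, 480, 488, 492, 512, 4096 };
        for (size_t i = 0; i < sizeof(tests) / sizeof(tests[0]); i++)
            printf("memcpy of %4zu bytes -> %s\n",
                   tests[i], names[pick_copy_strategy(tests[i])]);
        return 0;
    }

Under these assumptions, a 488-byte copy would report "tail stub +
slide", matching the __memcpy_488 case discussed next.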
Whereas, say, __memcpy_488 will generate a more specialized function
that copies 8 bytes and then branches into the slide.

The reason for using multiples of 32 bytes is that this is the minimum
copy size that does not suffer interlock penalties. Say, in XG2:

__memcpy_512:
    MOV.Q (R5, 480), X20
    MOV.Q (R5, 488), X21
    MOV.Q (R5, 496), X22
    MOV.Q (R5, 504), X23
    MOV.Q X20, (R4, 480)
    MOV.Q X21, (R4, 488)
    MOV.Q X22, (R4, 496)
    MOV.Q X23, (R4, 504)
__memcpy_480:
    ...

Then, say:

__memcpy_488:
    MOV.Q (R5, 480), X20
    MOV.Q X20, (R4, 480)
    BRA __memcpy_480
__memcpy_496:
    MOV.Q (R5, 480), X20
    MOV.Q (R5, 488), X21
    MOV.Q X20, (R4, 480)
    MOV.Q X21, (R4, 488)
    BRA __memcpy_480
...

Then:

__memcpy_492:
    MOV.L (R5, 488), X20
    MOV.L X20, (R4, 488)
    BRA __memcpy_488
...

For these latter cases, it keeps track of them via bitmaps (with a bit
for each size), so that it knows which sizes need to be generated.

In this case, these special functions and slides were the fastest
option that also doesn't waste excessive amounts of space (beyond the
cost of the slide itself, which is why it only handles copies of up to
512 bytes).

I ended up not bothering with special aligned cases, as the cost of
detecting whether the copy was aligned was generally more than what was
saved by having a separate aligned version.

If the copy is bigger than 512 bytes, it calls the generic memcpy
function... which generally copies larger chunks of memory (say, 512
bytes at a time) and then uses some smaller loops to clean up whatever
is left. Say, chunk sizes (bytes): 512, 128, 32, 16, 8, 4, 1.

> whereas clang 9.0.0 and following produce
>
> [fedora-starfive:/tmp:111383] clang -O -S x.c
> [fedora-starfive:/tmp:111384] cat x.s
>         .text
>         .attribute      4, 16
>         .attribute      5, "rv64i2p0_m2p0_a2p0_f2p0_d2p0_c2p0"
>         .file   "x.c"
>         .globl  uload                   # -- Begin function uload
>         .p2align        1
>         .type   uload,@function
> uload:                                  # @uload
>         .cfi_startproc
> # %bb.0:
>         ld      a0, 0(a0)
>         ret
> .Lfunc_end0:
>         .size   uload, .Lfunc_end0-uload
>         .cfi_endproc
>                                         # -- End function

========== REMAINDER OF ARTICLE TRUNCATED ==========
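Going back to the large-copy fallback mentioned above (copies bigger
than 512 bytes going to the generic routine), a minimal sketch in C
might look like the following, assuming the chunk ladder of 512, 128,
32, 16, 8, 4, 1 bytes described in the post. The name generic_memcpy is
made up, and a real implementation would copy each chunk with wide
loads/stores (or by branching into a copy slide) rather than the
byte-wise memcpy used here as a stand-in for an unrolled chunk copy.

    #include <stddef.h>
    #include <string.h>

    /* Hypothetical sketch: copy in 512-byte chunks, then clean up the
       remainder with progressively smaller chunk sizes. */
    void *generic_memcpy(void *dst, const void *src, size_t n) {
        static const size_t chunks[] = { 512, 128, 32, 16, 8, 4, 1 };
        unsigned char *d = dst;
        const unsigned char *s = src;

        for (size_t i = 0; i < sizeof(chunks) / sizeof(chunks[0]); i++) {
            size_t c = chunks[i];
            while (n >= c) {         /* copy as many c-byte chunks as fit */
                memcpy(d, s, c);     /* stand-in for an unrolled c-byte copy */
                d += c;
                s += c;
                n -= c;
            }
        }
        return dst;
    }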