Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Cost of handling misaligned access
Date: Mon, 3 Feb 2025 15:40:21 -0600
Organization: A noiseless patient Spider
Lines: 155
Message-ID: <vnrd49$1f52h$2@dont-email.me>
References: <5lNnP.1313925$2xE6.991023@fx18.iad> <vnosj6$t5o0$1@dont-email.me>
 <2025Feb3.075550@mips.complang.tuwien.ac.at>
 <wi7oP.2208275$FOb4.591154@fx15.iad> <vnr64m$1e7sb$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 03 Feb 2025 22:40:26 +0100 (CET)
Injection-Info: dont-email.me; posting-host="f4fe680962b51b8bca49b43582a45572";
	logging-data="1545297"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX18TZGU+nyL7N+p7pOSFICahzcTyn7TU6xw="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:UK4IF/ORuovd2h+JnjLdqLN8A6Y=
Content-Language: en-US
In-Reply-To: <vnr64m$1e7sb$1@dont-email.me>
Bytes: 7056

On 2/3/2025 1:41 PM, Thomas Koenig wrote:
> EricP <ThatWouldBeTelling@thevillage.com> schrieb:
> 
>> That is fine for code that is being actively maintained and backward
>> data structure compatibility is not required (like those inside a kernel).
>>
>> However for x86 there was a few billion lines of legacy code that likely
>> assumed 2-byte alignment, or followed the fp64 aligned to 32-bits advice,
>> and a C language that mandates structs be laid out in memory exactly as
>> specified (no automatic struct optimization). Also I seem to recall some
>> amount of squawking about SIMD when it required naturally aligned buffers.
>> As SIMD no longer requires alignment, presumably code no longer does so.
> 
> Looking at Intel's optimization manual, they state in
> "15.6 DATA ALIGNMENT FOR INTEL® AVX"
> 
> "Assembly/Compiler Coding Rule 65. (H impact, M generality) Align
> data to 32-byte boundary when possible. Prefer store alignment
> over load alignment."
> 
> and further down, about AVX-512,
> 
> "18.23.1 Align Data to 64 Bytes"
> 
> "Aligning data to vector length is recommended. For best results,
> when using Intel AVX-512 instructions, align data to 64 bytes.
> 
> When doing a 64-byte Intel AVX-512 unaligned load/store, every
> load/store is a cache-line split, since the cache-line is 64
> bytes. This is double the cache line split rate of Intel AVX2
> code that uses 32-byte registers. A high cache-line split rate in
> memory-intensive code can cause poor performance."
> 
> This sounds reasonable, and good advice if you want to go
> down SIMD lane.
> 

This is, ironically, a place where SIMD via ganged registers has an 
advantage over SIMD via large monolithic registers.


With ganged registers, one can load/store them piecewise as
needed, using unaligned loads/stores for the smaller pieces (while the
larger forms can strictly require natural alignment).


Though, granted, large monolithic registers are the more popular option vs
ganged registers.

With monolithic registers, you can make the registers larger without
either effectively halving the number of longer registers, or needing to
double the number of shorter registers.

But, this comes at the cost that many of the high-order bits of the
registers will be essentially wasted for code operating on narrower vectors.


Say, if one has:
   64x 64-bit vectors (group of 1);
   32x 128-bit vectors (group of 2);
   16x 256-bit vectors (group of 4);
   8x 512-bit vectors (group of 8).

If one wanted a 1024-bit vector, there is a choice to make:
   Live with only 4 vectors;
   Expand the size of the register file to 128x 64-bit vectors;
   Live with asymmetric wonk:
     Parts of the register space only being accessible at larger sizes.
   ...
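
The arithmetic in the two lists above can be checked with a small C sketch. It assumes a ganged register file where a vector of a given width occupies (width / 64) consecutive 64-bit registers; the function names are made up for illustration and do not correspond to any real ISA or tool.

```c
#include <stdio.h>

/* Number of addressable vectors of a given width in a ganged register
 * file: each vector occupies (vector_bits / reg_bits) base registers.
 * Assumes vector_bits is a multiple of reg_bits. */
unsigned ganged_vector_count(unsigned num_regs, unsigned reg_bits,
                             unsigned vector_bits)
{
    unsigned group = vector_bits / reg_bits;  /* registers per vector */
    return num_regs / group;
}

/* Print the width/count table for num_regs 64-bit base registers. */
void print_ganged_table(unsigned num_regs)
{
    for (unsigned bits = 64; bits <= 1024; bits *= 2)
        printf("%4u-bit vectors: %3u (group of %u)\n",
               bits, ganged_vector_count(num_regs, 64, bits), bits / 64);
}
```

With 64 base registers this gives only 4 vectors at 1024 bits; going to 128 base registers brings that back up to 8.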


Though, with monolithic registers, each doubling of the register size 
also effectively mandates either a whole new set of instructions to deal 
with the larger size, or some other way to encode or specify the size 
(or, "who knows, it is whatever it is, software can figure it out"...).

This is less true of ganged registers.
   Say, if the CPU supported it, one could add:
     PADDX4.F //256-bit Binary32 ADD
     PSUBX4.F //256-bit Binary32 SUB
     PMULX4.F //256-bit Binary32 MUL
     ...
   While leaving everything else the same as before.
     The addition of wider load/store operations would be optional.
     Don't have 256-bit Ld/St? Use 128-bit Ld/St.
       Need fully unaligned access? Use 64-bit Ld/St.
     ...
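
As a rough illustration of the "use 64-bit Ld/St" fallback, here is a C sketch that fills a 256-bit gang with four independent 64-bit loads. The type `gang256` and both function names are hypothetical; the fixed-size memcpy is just the usual portable way to express an unaligned load in C, not a claim about how any particular ISA encodes it.

```c
#include <stdint.h>
#include <string.h>

/* A 256-bit ganged vector held as four 64-bit registers. */
typedef struct { uint64_t r[4]; } gang256;

/* Load one 64-bit piece from any address. */
uint64_t load64_unaligned(const void *p)
{
    uint64_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* Fill the gang piecewise: four separate 64-bit loads, so src needs
 * no 256-bit (or even 64-bit) alignment. */
gang256 gang256_load_piecewise(const void *src)
{
    const unsigned char *p = src;
    gang256 g;
    for (int i = 0; i < 4; i++)
        g.r[i] = load64_unaligned(p + 8 * i);
    return g;
}
```

A wider aligned-only load could then exist alongside this as an optional fast path for data known to be naturally aligned.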


This also makes it easy for narrower implementations to simply crack the
instructions into 128-bit vector operations internally (which may
actually be implemented as two 64-bit vector ops running in parallel).

But, say, the pipeline could be designed internally around 64-bit vector 
ops, with a 4-wide machine able to do 256-bit vector operations mostly 
by supporting a 64-bit vector operation on each lane.

And, you can more easily "pretend" in the compiler to have whichever
vector size you want. Code asks for 256-bit vectors but the target only
has 128? Just fake it using 128-bit ops.
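
A minimal C sketch of the "fake it" case, assuming a 256-bit Binary32 vector is represented as a gang of two 128-bit halves. The types `vec128f` / `vec256f` and the function names are invented for the example; the scalar loop in `vec128f_add` merely stands in for whatever native 128-bit add the target has.

```c
/* A 128-bit vector of four Binary32 lanes (stand-in for a native type). */
typedef struct { float f[4]; } vec128f;

/* A 256-bit vector represented as a gang of two 128-bit halves. */
typedef struct { vec128f lo, hi; } vec256f;

/* Stand-in for the target's native 128-bit Binary32 add. */
vec128f vec128f_add(vec128f a, vec128f b)
{
    vec128f r;
    for (int i = 0; i < 4; i++)
        r.f[i] = a.f[i] + b.f[i];
    return r;
}

/* 256-bit add, split into two independent 128-bit adds. */
vec256f vec256f_add(vec256f a, vec256f b)
{
    vec256f r;
    r.lo = vec128f_add(a.lo, b.lo);
    r.hi = vec128f_add(a.hi, b.hi);
    return r;
}
```

The two halves have no dependency on each other, which is what lets either the compiler or a narrower pipeline split the operation freely.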


But, granted, most ISAs aren't doing SIMD this way.

....



>> Also in going from 32 to 64 bits, data structures that contain pointers
>> now could find those 8-byte pointers aligned on 4-byte boundaries.
> 
> This is mandated by the relevant ABI, and ABIs usually mandate
> alignment on natural boundaries.
> 
> 
>> While the Linux kernel may not use many misaligned values,
>> I'd guess there is a lot of application code that does.
> 
> Unless it is generating external binary data (a _very_ bad idea,
> XDR was developed for a reason), there is no big reason to use
> unaligned data, unless somebody is playing fast and loose
> with C pointer types, and that is a bad idea anyway.
> 

Unaligned access is often needed for speed.


> Alternatively, a compiler could use it to implement something like
> memcpy or memmove when it knows that unaligned accesses are safe.
> 

Basically required unless you want them to be slow.

The aligned-only versions will almost invariably be slower, potentially 
significantly slower.
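
To sketch what the unaligned version buys: when src and dst have different misalignment, an aligned-only copy cannot use word accesses on both sides at once, whereas the loop below moves 8 bytes per step regardless. This is only an illustration under stated assumptions; `copy_words_unaligned` is a made-up name, the inner fixed-size memcpy calls are the portable C idiom for an unaligned 64-bit load/store, and how fast they are depends on the target.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes in 8-byte chunks without caring about the alignment of
 * dst or src. The regions must not overlap. */
void copy_words_unaligned(void *dst, const void *src, size_t n)
{
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n >= 8) {
        uint64_t w;
        memcpy(&w, s, 8);   /* unaligned 64-bit load */
        memcpy(d, &w, 8);   /* unaligned 64-bit store */
        d += 8;
        s += 8;
        n -= 8;
    }
    while (n--)             /* remaining tail bytes */
        *d++ = *s++;
}
```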


> But it would be really interesting to have a access to a system
> where unaligned accesses trap, in order to find (and fix) ABI
> issues and some undefined behavior on the C side.

It may make sense to add some form of categorical separations:
   Pointers that may be unaligned;
   Pointers that must be aligned.

Trapping on unaligned access would be a reasonable option for the latter case.
It really needs to be per-pointer or per-access though, and not a global
flag (which makes it kind of useless).

Some compilers have __aligned and __unaligned keywords.

Something like "[[aligned]]" and "[[unaligned]]" could also make sense, 
with the default likely depending on type and implementation...
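
In plain C, the per-access distinction can be approximated as below. This is a sketch only: the helper names are hypothetical, the assert stands in for a hardware trap, and the `uintptr_t` modulo check assumes the usual flat address space rather than anything the C standard guarantees.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Nonzero if p is a multiple of align (assumes a flat address space). */
int is_aligned(const void *p, size_t align)
{
    return (uintptr_t)p % align == 0;
}

/* "May be unaligned" category: valid for any address. */
uint32_t load_u32_unaligned(const void *p)
{
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}

/* "Must be aligned" category: the assert plays the role of the trap,
 * and it applies to this access only, not globally. */
uint32_t load_u32_aligned(const void *p)
{
    assert(is_aligned(p, _Alignof(uint32_t)));
    uint32_t v;
    memcpy(&v, p, sizeof v);
    return v;
}
```

Code that expects packed or external data would use the first form, and everything else the second, so a misaligned pointer shows up at the access that assumed alignment.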