Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Cost of handling misaligned access
Date: Tue, 18 Feb 2025 04:53:28 -0600
Organization: A noiseless patient Spider
Lines: 253
Message-ID: <vp1ori$1llrm$1@dont-email.me>
References: <5lNnP.1313925$2xE6.991023@fx18.iad> <vnosj6$t5o0$1@dont-email.me>
 <2025Feb3.075550@mips.complang.tuwien.ac.at> <volg1m$31ca1$1@dont-email.me>
 <voobnc$3l2dl$1@dont-email.me>
 <0fc4cc997441e25330ff5c8735247b54@www.novabbs.org>
 <vp0m3f$1cth6$1@dont-email.me> <vp14j8$1ibtp$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Tue, 18 Feb 2025 11:53:39 +0100 (CET)
Injection-Info: dont-email.me; posting-host="1fe6835acbe1e7d2aa43c1dadd73de15";
	logging-data="1759094"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1+eu+VBaJ5RgNPjjXAty/bD60Y142hR/BI="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:LU7+z0UBrLBNi+ruR2PDMbINfmA=
In-Reply-To: <vp14j8$1ibtp$1@dont-email.me>
Content-Language: en-US
Bytes: 10005

On 2/17/2025 11:07 PM, Robert Finch wrote:
> On 2025-02-17 8:00 p.m., BGB wrote:
>> On 2/14/2025 3:52 PM, MitchAlsup1 wrote:
>>> On Fri, 14 Feb 2025 21:14:11 +0000, BGB wrote:
>>>
>>>> On 2/13/2025 1:09 PM, Marcus wrote:
>>> -------------
>>>>>
>>>>> The problem arises when the programmer *deliberately* does unaligned
>>>>> loads and stores in order to improve performance. Or rather, if the
>>>>> programmer knows that the hardware supports unaligned loads and 
>>>>> stores,
>>>>> he/she can use that to write faster code in some special cases.
>>>>>
>>>>
>>>> Pretty much.
>>>>
>>>>
>>>> This is partly why I am in favor of potentially adding explicit 
>>>> keywords
>>>> for some of these cases, or to reiterate:
>>>>    __aligned:
>>>>      Inform compiler that a pointer is aligned.
>>>>      May use a faster version if appropriate.
>>>>        If a faster aligned-only variant exists of an instruction.
>>>>        On an otherwise unaligned-safe target.
>>>>    __unaligned: Inform compiler that an access is unaligned.
>>>>      May use a runtime call or similar if necessary,
>>>>        on an aligned-only target.
>>>>      May do nothing on an unaligned-safe target.
>>>>    None: Do whatever is the default.
>>>>      Presumably, assume aligned by default,
>>>>        unless target is known unaligned-safe.
>>>
>>> It would take LESS total man-power world-wide and over-time to
>>> simply make HW perform misaligned accesses.
>>
>>
> 
>> I think the usual issue is that on low-end hardware, it is seen as 
>> "better" to skip out on misaligned access in order to save some cost 
>> in the L1 cache.
>>
> I always include support for unaligned accesses even with a ‘low-end’ 
> CPU. I think it is not that expensive and sure makes some things a lot 
> easier when handled in hardware. For Q+ it just runs two bus cycles if 
> the data spans a cache line and pastes results together as needed.
> 
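
As an aside, that split-and-merge can be modeled in C roughly like so 
(a sketch only; LINE_BYTES and read_line here are illustrative 
stand-ins, not anything from Q+):

   #include <stdint.h>
   #include <string.h>

   #define LINE_BYTES 16

   /* Stand-in for one cache-line / bus access. */
   extern void read_line(uint64_t line_addr, uint8_t dst[LINE_BYTES]);

   /* Misaligned 64-bit load: one line access normally, a second bus
      cycle when the value spans a line boundary, with the results
      pasted together (little-endian host assumed). */
   uint64_t load_u64(uint64_t addr)
   {
       uint8_t buf[2 * LINE_BYTES];
       uint64_t base = addr & ~(uint64_t)(LINE_BYTES - 1);
       uint64_t off  = addr - base;
       uint64_t v;

       read_line(base, buf);
       if (off + 8 > LINE_BYTES)   /* spans into the next line */
           read_line(base + LINE_BYTES, buf + LINE_BYTES);

       memcpy(&v, buf + off, 8);   /* paste the two halves */
       return v;
   }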

I had gone aligned-only with some 32-bit cores in the past.

The whole CPU core fit into fewer LUTs than I currently spend on just 
the L1 D$...

Granted, some of these used a very minimal L1 cache design:
   Only holds a single cache line.
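
A rough C model of that single-line arrangement (fetch_line being a 
hypothetical stand-in for the miss/refill path):

   #include <stdint.h>

   #define LINE_BYTES 16

   /* Hypothetical backing fetch on a miss. */
   extern void fetch_line(uint32_t line_addr, uint8_t dst[LINE_BYTES]);

   /* One tag, one line of data; any access to a different line is a
      miss and replaces the only line held. */
   static struct {
       uint32_t tag;
       int      valid;
       uint8_t  data[LINE_BYTES];
   } l1;

   uint8_t l1_read_u8(uint32_t addr)
   {
       uint32_t line = addr & ~(uint32_t)(LINE_BYTES - 1);
       if (!l1.valid || l1.tag != line) {
           fetch_line(line, l1.data);
           l1.tag   = line;
           l1.valid = 1;
       }
       return l1.data[addr & (LINE_BYTES - 1)];
   }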

The smallest cores I managed used a simplified SH-based design:
   Fixed-length 16-bit instructions, with 16 registers;
   Only (Reg) and (Reg, R0) addressing;
   Aligned only;
   No shift or multiply;
   ...
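
For what it is worth, the sort of runtime fallback implied by the 
__unaligned keyword above (on an aligned-only target) could look 
something like this hypothetical helper, which builds the value from 
aligned byte loads (little-endian assumed):

   #include <stdint.h>

   /* Hypothetical helper: misaligned 32-bit load via four byte loads. */
   uint32_t lw_unaligned(const uint8_t *p)
   {
       return  (uint32_t)p[0]
            | ((uint32_t)p[1] <<  8)
            | ((uint32_t)p[2] << 16)
            | ((uint32_t)p[3] << 24);
   }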

Where, say:
   SH-4 -> BJX1-32 (Added features)
   SH-4 -> B32V (Stripped down)
   BJX1-32 -> BJX1-64A (64-bit, Modal Encoding)
   B32V -> B64V (64-bit, Encoding Space Reorganizations)
   B64V ~> BJX1-64C (No longer Modal)

BJX1-64C was the end of this project (before I effectively did a 
soft-reboot).


Then transition phase:
   B64V -> BtSR1 (Dropped to 32-bit, More Encoding Changes)
     Significant reorganization.
     Was trying to optimize for code density closer to MSP430.
   BtSR1 -> BJX2 (Back to 64-bit, re-adding features from BJX1-64C)
     A few features added for BtSR1 were dropped again in BJX2.

The original form of BJX2 was still a primarily 16-bit ISA encoding, but 
by this point it had pretty much mutated beyond recognition (and 
relatively few instructions were still in the same places they had been 
in SH-4).


For example (original 16-bit space):
   0zzz:
     SH-4: Ld/St (Rm,R0); also 0R and 1R spaces, etc.
     BJX2: Ld/St Only (Rm) and (Rm,R0)
   1zzz:
     SH-4: Store (Rn, Disp4)
     BJX2: 2R ALU ops
   2zzz:
     SH-4: Store (@Rn, @-Rn), ALU ops
     BJX2: Branch Ops (Disp8), etc
   3zzz:
     SH-4: ALU ops
     BJX2: 0R and 1R ops
   4zzz:
     SH-4: 1R ops
     BJX2: Ld/St (SP, Disp4); MOV-CR, LEA
   5zzz:
     SH-4: Load (Rm, Disp4)
     BJX2: Load (Unsigned), ALU ops
   6zzz:
     SH-4: Load (@Rm+ and @Rm), ALU
     BJX2: FPU ops, CMP-Imm4
   7zzz:
     SH-4: ADD Imm8, Rn
     BJX2: (XGPR 32-bit Escape Block)
   8zzz:
     SH-4: Branch (Disp8)
     BJX2: Ld/St (Rm, Disp3)
   9zzz:
     SH-4: Load (PC-Rel)
     BJX2: (XGPR 32-bit Escape Block)
   Azzz:
     SH-4: BRA Disp12
     BJX2: MOV Imm12u, R0
   Bzzz:
     SH-4: BSR Disp12
     BJX2: MOV Imm12n, R0
   Czzz:
     SH-4: Some Imm8 ops
     BJX2: ADD Imm8, Rn
   Dzzz:
     SH-4: Load (PC-Rel)
     BJX2: MOV Imm8, Rn
   Ezzz:
     SH-4: MOV Imm8, Rn
     BJX2: (32-bit Escape, Predicated Ops)
   Fzzz:
     SH-4: FPU Ops
     BJX2: (32-bit Escape, Unconditional Ops)
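
A first-level decode keyed on the top nibble, per the BJX2 column 
above, would look something like this (illustrative only; the real 
decoder obviously keys on further bits within each block):

   #include <stdint.h>

   const char *bjx2_top_nibble(uint16_t iw)
   {
       switch ((iw >> 12) & 0xF) {
       case 0x0: return "Ld/St (Rm) and (Rm,R0)";
       case 0x1: return "2R ALU ops";
       case 0x2: return "Branch (Disp8), etc";
       case 0x3: return "0R and 1R ops";
       case 0x4: return "Ld/St (SP,Disp4); MOV-CR, LEA";
       case 0x5: return "Load (Unsigned), ALU ops";
       case 0x6: return "FPU ops, CMP-Imm4";
       case 0x7: return "XGPR 32-bit escape";
       case 0x8: return "Ld/St (Rm,Disp3)";
       case 0x9: return "XGPR 32-bit escape";
       case 0xA: return "MOV Imm12u, R0";
       case 0xB: return "MOV Imm12n, R0";
       case 0xC: return "ADD Imm8, Rn";
       case 0xD: return "MOV Imm8, Rn";
       case 0xE: return "32-bit escape, predicated ops";
       default:  return "32-bit escape, unconditional ops";
       }
   }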

For the 16-bit ops, SH-4 had more addressing modes than BJX2:
   SH-4: @Reg, @Rm+, @-Rn, @(Reg,R0), @(Reg,Disp4) @(PC,Disp8)
   BJX2: (Rm), (Rm,R0), (Rm,Disp3), (SP,Disp4)

Though it may seem like it, I didn't just completely start over on the 
layout; rather, it was sort of an "ant-hill reorganization".


Say, for example:
   1zzz and 5zzz were merged into 8zzz, reducing Disp by 1 bit
   2zzz and 3zzz were partly folded into 0zzz and 1zzz
   8zzz's contents were moved to 2zzz
   4zzz and part of 0zzz were merged into 3zzz
   ...


A few CRs are still in the same places, and SR still has a similar 
layout, I guess, ...



Early on, the idea was that the 32-bit ops were prefix-modified versions 
of the 16-bit ops, but this symmetry soon broke and the 16- and 32-bit 
encoding spaces became independent of each other.

Though, the 32-bit F0 space still has some amount of similarity to the 
16-bit space.


Later on I did some testing and performance comparisons, and realized 
that using 32-bit encodings primarily (or exclusively) gave 
significantly better performance than relying primarily or exclusively 
on 16-bit ops. At this point the ISA transitioned from a primarily 
16-bit ISA (with 32-bit extension ops) to a primarily 32-bit ISA with a 
16-bit encoding space. This transition didn't directly affect encodings, 
but it did affect how the ISA developed from then on (namely, there was 
no longer an idea that the 16-bit ISA would need to be able to exist 
standalone; rather, the 32-bit ISA now needed to be able to exist 
standalone).
========== REMAINDER OF ARTICLE TRUNCATED ==========