From: BGB
Newsgroups: comp.arch
Subject: Re: Cost of handling misaligned access
Date: Tue, 18 Feb 2025 04:53:28 -0600
References: <5lNnP.1313925$2xE6.991023@fx18.iad> <2025Feb3.075550@mips.complang.tuwien.ac.at> <0fc4cc997441e25330ff5c8735247b54@www.novabbs.org>

On 2/17/2025 11:07 PM, Robert Finch wrote:
> On 2025-02-17 8:00 p.m., BGB wrote:
>> On 2/14/2025 3:52 PM, MitchAlsup1 wrote:
>>> On Fri, 14 Feb 2025 21:14:11 +0000, BGB wrote:
>>>
>>>> On 2/13/2025 1:09 PM, Marcus wrote:
>>> -------------
>>>>>
>>>>> The problem arises when the programmer *deliberately* does unaligned
>>>>> loads and stores in order to improve performance. Or rather, if the
>>>>> programmer knows that the hardware supports unaligned loads and
>>>>> stores,
>>>>> he/she can use that to write faster code in some special cases.
>>>>>
>>>>
>>>> Pretty much.
>>>>
>>>>
>>>> This is partly why I am in favor of potentially adding explicit
>>>> keywords
>>>> for some of these cases, or to reiterate:
>>>>    __aligned:
>>>>      Inform compiler that a pointer is aligned.
>>>>      May use a faster version if appropriate.
>>>>        If a faster aligned-only variant exists of an instruction.
>>>>        On an otherwise unaligned-safe target.
>>>>    __unaligned: Inform compiler that an access is unaligned.
>>>>      May use a runtime call or similar if necessary,
>>>>        on an aligned-only target.
>>>>      May do nothing on an unaligned-safe target.
>>>>    None: Do whatever is the default.
>>>>      Presumably, assume aligned by default,
>>>>        unless target is known unaligned-safe.
>>>
>>> It would take LESS total man-power world-wide and over-time to
>>> simply make HW perform misaligned accesses.
>>
>>
>
>> I think the usual issue is that on low-end hardware, it is seen as
>> "better" to skip out on misaligned access in order to save some cost
>> in the L1 cache.
>>
> I always include support for unaligned accesses even with a ‘low-end’
> CPU. I think it is not that expensive and sure makes some things a lot
> easier when handled in hardware. For Q+ it just runs two bus cycles if
> the data spans a cache line and pastes results together as needed.
>

I had gone aligned-only with some 32-bit cores in the past.

The whole CPU core fit into fewer LUTs than I currently spend on just the L1 D$...

Granted, some of these used a very minimal L1 cache design:
   Only holds a single cache line.

The smallest cores I managed used a simplified SH-based design:
   Fixed-length 16-bit instructions, with 16 registers;
   Only (Reg) and (Reg, R0) addressing;
   Aligned only;
   No shift or multiply;
   ...

Where, say:
   SH-4 -> BJX1-32 (Added features)
   SH-4 -> B32V (Stripped down)
   BJX1-32 -> BJX1-64A (64-bit, Modal Encoding)
   B32V -> B64V (64-bit, Encoding Space Reorganizations)
   B64V ~> BJX1-64C (No longer Modal)
Where BJX1-64C was the end of this project (before I effectively did a soft-reboot).

Then transition phase:
   B64V -> BtSR1 (Dropped to 32-bit, More Encoding Changes)
     Significant reorganization.
     Was trying to optimize for code density, to get closer to MSP430.
   BtSR1 -> BJX2 (Back to 64-bit, re-adding features from BJX1-64C)
     A few features added for BtSR1 were dropped again in BJX2.

The original form of BJX2 was still primarily a 16-bit ISA encoding, but by this point it had pretty much mutated beyond recognition (relatively few instructions were still in the same places they had been in SH-4).

For example (original 16-bit space):
   0zzz:
     SH-4: Ld/St (Rm,R0); also 0R and 1R spaces, etc.
     BJX2: Ld/St only (Rm) and (Rm,R0)
   1zzz:
     SH-4: Store (Rn, Disp4)
     BJX2: 2R ALU ops
   2zzz:
     SH-4: Store (@Rn, @-Rn), ALU ops
     BJX2: Branch ops (Disp8), etc.
   3zzz:
     SH-4: ALU ops
     BJX2: 0R and 1R ops
   4zzz:
     SH-4: 1R ops
     BJX2: Ld/St (SP, Disp4); MOV-CR, LEA
   5zzz:
     SH-4: Load (Rm, Disp4)
     BJX2: Load (Unsigned), ALU ops
   6zzz:
     SH-4: Load (@Rm+ and @Rm), ALU
     BJX2: FPU ops, CMP-Imm4
   7zzz:
     SH-4: ADD Imm8, Rn
     BJX2: (XGPR 32-bit Escape Block)
   8zzz:
     SH-4: Branch (Disp8)
     BJX2: Ld/St (Rm, Disp3)
   9zzz:
     SH-4: Load (PC-Rel)
     BJX2: (XGPR 32-bit Escape Block)
   Azzz:
     SH-4: BRA Disp12
     BJX2: MOV Imm12u, R0
   Bzzz:
     SH-4: BSR Disp12
     BJX2: MOV Imm12n, R0
   Czzz:
     SH-4: Some Imm8 ops
     BJX2: ADD Imm8, Rn
   Dzzz:
     SH-4: Load (PC-Rel)
     BJX2: MOV Imm8, Rn
   Ezzz:
     SH-4: MOV Imm8, Rn
     BJX2: (32-bit Escape, Predicated Ops)
   Fzzz:
     SH-4: FPU Ops
     BJX2: (32-bit Escape, Unconditional Ops)

For the 16-bit ops, SH-4 had more addressing modes than BJX2:
   SH-4: @Reg, @Rm+, @-Rn, @(Reg,R0), @(Reg,Disp4), @(PC,Disp8)
   BJX2: (Rm), (Rm,R0), (Rm,Disp3), (SP,Disp4)

Although it may seem that way, I didn't just start over completely on the layout; rather, it was sort of an "ant-hill reorganization".

For example:
   1zzz and 5zzz were merged into 8zzz, reducing Disp by 1 bit
   2zzz and 3zzz were partly folded into 0zzz and 1zzz
   8zzz's contents were moved to 2zzz
   4zzz and part of 0zzz were merged into 3zzz
   ...

A few CRs are still in the same places, and SR still has a similar layout, I guess...
Early on, there was the idea that the 32-bit ops were prefix-modified versions of the 16-bit ops, but this symmetry broke early and the 16-bit and 32-bit encoding spaces became independent of each other.

Though, the 32-bit F0 space still bears some similarity to the 16-bit space.

Later on, I did some testing and performance comparisons, and realized that using 32-bit encodings primarily (or exclusively) gave significantly better performance than relying primarily or exclusively on 16-bit ops. At this point, the ISA transitioned from a primarily 16-bit ISA (with 32-bit extension ops) to a primarily 32-bit ISA with a 16-bit encoding space. This transition didn't directly affect the encodings, but did affect how the ISA developed from then on (more so, there was no longer an idea that the 16-bit ISA would need to be able to exist standalone; but now the 32-bit ISA did need to be able to exist

========== REMAINDER OF ARTICLE TRUNCATED ==========