From: BGB
Newsgroups: comp.arch
Subject: Re: Cost of handling misaligned access
Date: Thu, 6 Feb 2025 17:34:27 -0600

On 2/6/2025 2:36 PM, Terje Mathisen wrote:
> Michael S wrote:
>> On Thu, 6 Feb 2025 17:47:30 +0100
>> Terje Mathisen wrote:
>>
>>> Terje Mathisen wrote:
>>>> Michael S wrote:
>>>>> The point of my proposal is not reduction of loop overhead and not
>>>>> reduction of the # of x86 instructions (in fact, with my proposal
>>>>> the # of x86 instructions is increased), but reduction in # of
>>>>> uOps due to reuse of loaded values.
>>>>> The theory behind it is that most typically in code with very high
>>>>> IPC like the one above the main bottleneck is the # of uOps that
>>>>> flows through the rename stage.
>>>>
>>>> Aha! I see what you mean: Yes, this would be better if the
>>>>
>>>>     VPAND reg,reg,[mem]
>>>>
>>>> instructions actually took more than one cycle each, but as the
>>>> size of the arrays were just 1000 bytes each (250 keys + 250
>>>> locks), everything fits easily in $L1. (BTW, I did try to add 6
>>>> dummy keys and locks just to avoid any loop end overhead, but that
>>>> actually ran slower.)
>>>
>>> I've just tested it by running either 2 or 4 locks in parallel in the
>>> inner loop: The fastest time I saw actually did drop a smidgen, from
>>> 5800 ns to 5700 ns (for both 2 and 4 wide), with 100 ns being the
>>> timing resolution I get from the Rust run_benchmark() function.
>>>
>>> So yes, it is slightly better to run a stripe instead of just a
>>> single row in each outer loop.
>>>
>>> Terje
>>>
>>
>> Assuming that your CPU is new and runs at decent frequency (4-4.5 GHz),
>> the results are 2-3 times slower than expected. I would guess that it
>> happens because there are too few iterations in the inner loop.
>> Turning unrolling upside down, as I suggested in the previous post,
>> should fix it.
>> Very easy to do in C with intrinsics. Probably not easy in Rust.
>
> I did mention that this is a (cheap) laptop? It is about 15 months old,
> and with a base frequency of 2.676 GHz. I guess that would explain most
> of the difference between what I see and what you expected?
>
> BTW, when I timed 1000 calls to that 5-6 us program, to get around the
> 100 ns timer resolution, each iteration ran in 5.23 us.
>
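As an aside, my reading of the "stripe" idea being discussed above,
sketched as C intrinsics rather than Terje's actual Rust (the array
layout and names here are made up for illustration; each lock/key is
assumed to be one 128-bit bitmask, with the arrays 16-byte aligned and
counts a multiple of 4): hold a few locks in registers across the inner
loop, so each loaded key value gets reused for several AND+test pairs
instead of being re-fetched for each one.

  #include <immintrin.h>

  /* "Fits" means (lock & key) == 0.  PTEST (_mm_testz_si128) returns 1
   * when the AND of its operands is all zero. */
  int count_fits(const __m128i *locks, int n_locks,
                 const __m128i *keys, int n_keys)
  {
      int fits = 0;
      for (int i = 0; i < n_locks; i += 4) {
          /* Stripe of 4 locks, held in registers across the inner loop. */
          __m128i l0 = locks[i + 0], l1 = locks[i + 1];
          __m128i l2 = locks[i + 2], l3 = locks[i + 3];
          for (int j = 0; j < n_keys; j++) {
              /* One key load feeds 4 AND+test pairs. */
              __m128i k = keys[j];
              fits += _mm_testz_si128(l0, k);
              fits += _mm_testz_si128(l1, k);
              fits += _mm_testz_si128(l2, k);
              fits += _mm_testz_si128(l3, k);
          }
      }
      return fits;
  }

With a stripe of 4, the load uOps per compare drop roughly 4x, which is
the reduction in rename-stage traffic Michael S is after.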
FWIW: The idea of running a CPU at 4+ GHz seems a bit much. IME, CPUs
tend to run excessively hot at these kinds of clock speeds; 3.2 to 3.6
GHz seems more reasonable, so that the chip "doesn't melt" or run into
thermal throttling or stability issues.

But, then again, I guess "modern" is relative, and most of the PC
hardware I end up buying tends to be roughly 2 generations behind,
mostly because, as in this case, it is significantly cheaper (actual new
hardware tending to be a lot more expensive).

Can note though that on my PC, enabling AVX in the compiler (where it
actually tries to use it in the program) tends to put a significant hurt
on performance, so it is better left off. The CPU is new enough to
support AVX, but apparently does not do the 256-bit stuff natively,
still using 128-bit SIMD internally (see the small sketch at the end of
this post).

Well, and the slight wonk that the machine will accept 112GB of RAM, but
as soon as I try to put in a full 128GB it boot-loops a few times and
then concludes that there is only 4GB. Not entirely sure of the MOBO
chipset (don't have the box around anymore, and it isn't clearly listed
anywhere); can note that the BIOS date is from 2018, seemingly the
newest version available.

A lot of the "less modern" PC hardware around here is mostly XP and
Vista era (eg, 2002-2009). This is the era of hardware that most readily
appears, and sometimes there is a slight value-add for stuff old enough
to still have a parallel port and a 3.5" FDD; PATA support is sometimes
still useful as well, ...

Still not crossed over into the world of newfangled M.2 SSDs... My PC
has a SATA SSD for the OS, but mostly uses 5400 RPM HDDs for the other
drives: 1TB 7200 RPM drives (WD Black, *1) mostly hold the pagefiles and
similar, and two larger 4TB and 6TB WD Red drives are used for copying
large files (generally around 75-100 MB/sec). Say, 112GB RAM + ~400GB
swap.

*1: WD seemingly uses a color scheme:
  Black: 7200 RPM speed-oriented drives, usually lower capacity (eg, 1TB).
    2x 1TB drives: WD1003FZEX, WD1002FAEX.
    Both CMR, getting ~ 150 MB/sec or so.
    As noted, the pagefiles are on these drives.
  Red: 5400 RPM NAS-oriented drives.
    4TB drive: CMR (WD40EFRX), mostly for file storage.
    6TB drive: SMR (WD60EFAX), mostly for bulk files.
  Blue: 5400/7200 RPM end-user-oriented drives. May be bigger and/or
    cheaper, but typically use SMR. No Blue drives ATM.

A smaller pagefile still exists on the SSD, but mostly because Windows
is unhappy if there is no pagefile on 'C'. Don't generally want a
pagefile on an SSD though, as it is worse for lifespan. That one is 8GB,
which Windows accepts, with around 192GB on each of the other drives,
for roughly 400GB of swap space in total. Not sure how well Windows
load-balances swap; apparently not very well (when it starts paging,
most of the load seems to land on one drive; it would be better if it
could give a more even spread).

The SSD seems to get ~ 300 MB/sec.
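PS: Regarding the AVX remark above, a minimal illustration (C
intrinsics; the reduction loop and names are made up, not code from this
thread) of the same loop at 128-bit and 256-bit width. On a core that
internally splits 256-bit operations into two 128-bit halves, the wider
version issues about the same number of internal ops, so the nominal 2x
vector width gains little and can lose out to other overheads.

  #include <immintrin.h>

  /* 128-bit SSE version: 4 floats per step (n assumed a multiple of 4). */
  float sum128(const float *a, int n)
  {
      __m128 acc = _mm_setzero_ps();
      for (int i = 0; i < n; i += 4)
          acc = _mm_add_ps(acc, _mm_loadu_ps(a + i));
      float t[4];
      _mm_storeu_ps(t, acc);
      return t[0] + t[1] + t[2] + t[3];
  }

  /* 256-bit AVX version: 8 floats per step (n assumed a multiple of 8);
   * nominally 2x wider, but only a real win if the core executes
   * 256-bit ops natively. */
  float sum256(const float *a, int n)
  {
      __m256 acc = _mm256_setzero_ps();
      for (int i = 0; i < n; i += 8)
          acc = _mm256_add_ps(acc, _mm256_loadu_ps(a + i));
      float t[8];
      _mm256_storeu_ps(t, acc);
      return t[0] + t[1] + t[2] + t[3] + t[4] + t[5] + t[6] + t[7];
  }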