<20250206233200.00001fc3@yahoo.com>

Path: ...!weretis.net!feeder9.news.weretis.net!news.quux.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: Michael S <already5chosen@yahoo.com>
Newsgroups: comp.arch
Subject: Re: Cost of handling misaligned access
Date: Thu, 6 Feb 2025 23:32:00 +0200
Organization: A noiseless patient Spider
Lines: 77
Message-ID: <20250206233200.00001fc3@yahoo.com>
References: <5lNnP.1313925$2xE6.991023@fx18.iad>
	<vnosj6$t5o0$1@dont-email.me>
	<2025Feb3.075550@mips.complang.tuwien.ac.at>
	<wi7oP.2208275$FOb4.591154@fx15.iad>
	<2025Feb4.191631@mips.complang.tuwien.ac.at>
	<vo061a$2fiql$1@dont-email.me>
	<20250206000143.00000dd9@yahoo.com>
	<2025Feb6.115939@mips.complang.tuwien.ac.at>
	<20250206152808.0000058f@yahoo.com>
	<vo2iqq$30elm$1@dont-email.me>
	<vo2p33$31lqn$1@dont-email.me>
	<20250206211932.00001022@yahoo.com>
	<vo36go$345o3$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
Injection-Date: Thu, 06 Feb 2025 22:32:03 +0100 (CET)
Injection-Info: dont-email.me; posting-host="c56742e47efb69d8e4c2d1a3629b738b";
	logging-data="3300746"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX19SOe//7K58PVqCEedfcupQmteIBna6Kdk="
Cancel-Lock: sha1:mvzs1uJ4ohxTsKatWBml90619jc=
X-Newsreader: Claws Mail 4.1.1 (GTK 3.24.34; x86_64-w64-mingw32)
Bytes: 4343

On Thu, 6 Feb 2025 21:36:38 +0100
Terje Mathisen <terje.mathisen@tmsw.no> wrote:

> Michael S wrote:
> > On Thu, 6 Feb 2025 17:47:30 +0100
> > Terje Mathisen <terje.mathisen@tmsw.no> wrote:
> >
> >> Terje Mathisen wrote:
> >>> Michael S wrote:
> >>>> The point of my proposal is not reduction of loop overhead and
> >>>> not reduction of the # of x86 instructions (in fact, with my
> >>>> proposal the # of x86 instructions is increased), but reduction
> >>>> in # of uOps due to reuse of loaded values.
> >>>> The theory behind it is that most typically in code with very
> >>>> high IPC like the one above the main bottleneck is the # of uOps
> >>>> that flows through rename stage.
> >>>
> >>> Aha! I see what you mean: Yes, this would be better if the
> >>>
> >>>     VPAND reg,reg,[mem]
> >>>
> >>> instructions actually took more than one cycle each, but as the
> >>> size of the arrays were just 1000 bytes each (250 keys + 250
> >>> locks), everything fits easily in $L1. (BTW, I did try to add 6
> >>> dummy keys and locks just to avoid any loop end overhead, but that
> >>> actually ran slower.)
> >>
> >> I've just tested it by running either 2 or 4 locks in parallel in
> >> the inner loop: The fastest time I saw actually did drop a
> >> smidgen, from 5800 ns to 5700 ns (for both 2 and 4 wide), with 100
> >> ns being the timing resolution I get from the Rust run_benchmark()
> >> function.
> >>
> >> So yes, it is slightly better to run a stripe instead of just a
> >> single row in each outer loop.
> >>
> >> Terje
> >>
> >
> > Assuming that your CPU is new and runs at decent frequency (4-4.5
> > GHz), the results are 2-3 times slower than expected. I would guess
> > that it happens because there are too few iterations in the inner
> > loop. Turning unrolling upside down, as I suggested in the previous
> > post, should fix it.
> > Very easy to do in C with intrinsics. Probably not easy in Rust.
>
> I did mention that this is a (cheap) laptop? It is about 15 months
> old, and with a base frequency of 2.676 GHz.

You describe it in different ways but omit the only detail that would
give us sufficient information: the CPU model number.

> I guess that would
> explain most of the difference between what I see and what you
> expected?
>
> BTW, when I timed 1000 calls to that 5-6 us program, to get around
> the 100 ns timer resolution, each iteration ran in 5.23 us.
>
> Terje
>
>

That measurement could be good enough on a desktop. Or not.
It is certainly not good enough on a laptop, and even less so on a server.
On a laptop I wouldn't be satisfied until I lock my program to a
particular core, then do something like 21 measurements with 100K calls
in each measurement (~10 sec total) and report the median of the 21.
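
The measurement procedure described above (21 measurements, many calls per measurement, report the median) could be sketched in C roughly as follows. This is only an illustration under stated assumptions: `bench_median_ns`, `median`, `now_ns` and `dummy_work` are hypothetical names, `dummy_work` is a placeholder for the real routine, and the clock is POSIX `clock_gettime`. Pinning to a core is platform-specific and left outside the code; on Linux, for example, the program could be started with `taskset -c 2 ./bench`.

```c
#define _POSIX_C_SOURCE 200809L  /* for clock_gettime() */
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

enum { N_MEAS = 21 };  /* odd count, so the median is a single sample */

/* Monotonic clock reading in nanoseconds. */
uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Sorts v in place and returns the middle element (the median for odd n). */
double median(double *v, size_t n)
{
    assert(n > 0);
    qsort(v, n, sizeof v[0], cmp_double);
    return v[n / 2];
}

/* Runs N_MEAS measurements of `calls` back-to-back calls to fn and
   returns the median time per call in nanoseconds. */
double bench_median_ns(void (*fn)(void), unsigned calls)
{
    double per_call[N_MEAS];
    assert(calls > 0);
    for (int m = 0; m < N_MEAS; m++) {
        uint64_t t0 = now_ns();
        for (unsigned i = 0; i < calls; i++)
            fn();
        per_call[m] = (double)(now_ns() - t0) / calls;
    }
    return median(per_call, N_MEAS);
}

/* Hypothetical stand-in for the routine under test. */
volatile unsigned sink;
void dummy_work(void) { sink++; }
```

Calling `bench_median_ns(dummy_work, 100000)` with the real routine in place of `dummy_work` gives 21 measurements of 100K calls each; at about 5 us per call that is roughly 10 seconds in total. The median is used rather than the mean so that a few measurements disturbed by interrupts or frequency changes do not shift the reported figure.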