Path: eternal-september.org!mx02.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: The cost of unaligned accesses
Date: Tue, 31 Mar 2015 14:03:16 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 63
Message-ID: <2015Mar31.160316@mips.complang.tuwien.ac.at>
References: <2015Mar31.135537@mips.complang.tuwien.ac.at>
Injection-Info: mx02.eternal-september.org; posting-host="d47d3421039fe8026514328ad0ebacae";
logging-data="20879"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX19GtNNGYUOpXvAygTipYb85"
X-newsreader: xrn 10.00-beta-3
Cancel-Lock: sha1:G10pZZVoVXmgNmkq8NHGPP0eiJo=

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>I was playing around with my hashing code, and noticed that unaligned
>accesses can cost much more than I expected. So I wrote a
>microbenchmark for exercising unaligned accesses. You can find the
>source code at
>
>http://www.complang.tuwien.ac.at/anton/tmp/unaligned.c
>
>and a Linux binary at
>
>http://www.complang.tuwien.ac.at/anton/tmp/unaligned
>
>You call it with
>
>unaligned 0
>
>where the offset specifies how far from a page start the 8-byte access
>happens. The benchmark performs 100M such accesses and the inner loop
>is:
>
>  4006bc:  49 8b 55 00       mov    0x0(%r13),%rdx
>  4006c0:  83 c0 01          add    $0x1,%eax
>  4006c3:  3d 00 e1 f5 05    cmp    $0x5f5e100,%eax
>  4006c8:  75 f2             jne    4006bc
>
>The results for different offsets are (in cycles per loop iteration):
>
>    0    1   56   57   63   64 4088 4089 4095 4096
>    3    3    3    3    3    3    3    3    3    3 Opteron 270 (Italy)
>    1    1    1   13   13    1    1  164  164    1 Core 2 Duo E8400 (Wolfdale)
>    1    1    1  4.8  4.8    1    1   27   27    1 Xeon E31220 (Sandy Bridge)
>    1    1    1 1.98 1.98    1    1   31   31    1 Core i3-3227U (Ivy Bridge)

OK, now I have made a variant that uses dependent loads and is
unrolled (by 2) to reduce the loop overhead, turning the inner loop
into:

  4006a8:  4d 8b 24 24       mov    (%r12),%r12
  4006ac:  4d 8b 24 24       mov    (%r12),%r12
  4006b0:  83 c0 01          add    $0x1,%eax
  4006b3:  3d 80 f0 fa 02    cmp    $0x2faf080,%eax
  4006b8:  75 ee             jne    4006a8

The files are at:
http://www.complang.tuwien.ac.at/anton/tmp/unaligned2.c
http://www.complang.tuwien.ac.at/anton/tmp/unaligned2
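
For readers who only want the idea, a dependent-load loop of this shape could be written in C roughly as follows. This is a sketch and not the contents of unaligned2.c: the function names make_self_pointer and chase are hypothetical, and memcpy is used for the unaligned load and store only to stay within defined C behaviour (on x86-64 compilers typically turn it into a single mov). The trick is to store, at the chosen offset, a pointer that points to itself, so that every load delivers the address for the next load.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: store a self-referencing pointer at buf+offset
   and return its address.  The cell must fit inside the buffer. */
static void *make_self_pointer(unsigned char *buf, size_t buf_size,
                               size_t offset)
{
    assert(offset + sizeof(void *) <= buf_size);
    void *cell = buf + offset;
    /* Equivalent to *(void **)cell = cell, but valid for any alignment. */
    memcpy(cell, &cell, sizeof cell);
    return cell;
}

/* Hypothetical helper: follow the pointer chain.  The address of each
   load is the result of the previous load, so the loads cannot overlap
   and the loop time reflects load latency.  Unrolled by 2, like the
   disassembly above. */
static void *chase(void *p, long iterations)
{
    for (long i = 0; i < iterations; i++) {
        memcpy(&p, p, sizeof p);
        memcpy(&p, p, sizeof p);
    }
    return p;
}
```

To place the cell relative to a page start, the buffer has to be page-aligned and at least two pages long, for example from aligned_alloc(4096, 8192) under the assumption of 4096-byte pages. The constant 0x2faf080 in the disassembly is 50000000, so chase(cell, 50000000) would perform the same 100M loads. An optimizing compiler that can see both functions at once may simplify the loop, so a real benchmark needs to guard against that.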
Results:

    0    1   56   57   63   64 4088 4089 4095 4096
    3    6    3    6    6    3    3    6    6    3 Opteron 270 (Italy)
    3    3    3   15   15    3    3  159  159    3 Core 2 Duo E8400 (Wolfdale)
    4    4    4    9    9    4    4   28   28    4 Xeon E31220 (Sandy Bridge)
    4    4    4    9    9    4    4   32   32    4 Core i3-3227U (Ivy Bridge)

So on the AMD K8, any misalignment costs latency; on Intel's P6
descendants, only accesses that cross a cache-line or page boundary
are slower, but there the penalty is much larger.
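
Which of the tested offsets fall into which class can be checked with a little arithmetic. The sketch below assumes 64-byte cache lines and 4096-byte pages (passed in as parameters, not taken from the post), and the names is_misaligned and classify are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum crossing { NO_CROSSING, CROSSES_LINE, CROSSES_PAGE };

/* An access is misaligned if its offset is not a multiple of its size. */
static int is_misaligned(size_t offset, size_t size)
{
    assert(size > 0);
    return offset % size != 0;
}

/* Compare the line (or page) number of the first and last byte of the
   access; if they differ, the access straddles a boundary.  A page
   crossing is also a line crossing, so the page check comes first. */
static enum crossing classify(size_t offset, size_t size,
                              size_t line, size_t page)
{
    assert(size > 0 && line > 0 && page % line == 0);
    size_t last = offset + size - 1;
    if (offset / page != last / page)
        return CROSSES_PAGE;
    if (offset / line != last / line)
        return CROSSES_LINE;
    return NO_CROSSING;
}
```

With size 8, line 64 and page 4096, offsets 57 and 63 cross a cache line, 4089 and 4095 cross a page, and 0, 1, 56, 64, 4088 and 4096 cross nothing, which matches the columns where the Intel rows slow down. is_misaligned is true for 1, 57, 63, 4089 and 4095, which are the columns where the Opteron row shows 6 instead of 3.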
- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html