Path: eternal-september.org!mx02.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: The cost of unaligned accesses
Date: Tue, 31 Mar 2015 14:03:16 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 63
Message-ID: <2015Mar31.160316@mips.complang.tuwien.ac.at>
References: <2015Mar31.135537@mips.complang.tuwien.ac.at>
Injection-Info: mx02.eternal-september.org; posting-host="d47d3421039fe8026514328ad0ebacae";
	logging-data="20879"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX19GtNNGYUOpXvAygTipYb85"
X-newsreader: xrn 10.00-beta-3
Cancel-Lock: sha1:G10pZZVoVXmgNmkq8NHGPP0eiJo=

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>I was playing around with my hashing code, and noticed that unaligned
>accesses can cost much more than I expected.  So I wrote a
>microbenchmark for exercising unaligned accesses.  You can find the
>source code at
>
>http://www.complang.tuwien.ac.at/anton/tmp/unaligned.c
>
>and a Linux binary at
>
>http://www.complang.tuwien.ac.at/anton/tmp/unaligned
>
>You call it with
>
>unaligned <offset> 0
>
>where the offset specifies how far from a page start the 8-byte access
>happens.  The benchmark performs 100M such accesses and the inner loop
>is:
>
>  4006bc:       49 8b 55 00             mov    0x0(%r13),%rdx
>  4006c0:       83 c0 01                add    $0x1,%eax
>  4006c3:       3d 00 e1 f5 05          cmp    $0x5f5e100,%eax
>  4006c8:       75 f2                   jne    4006bc <main+0xa8>
>
>The results for different offsets are (in cycles per loop iteration):
>
>0   1    56   57   63   64   4088 4089 4095 4096
>3   3    3    3    3    3    3    3    3    3       Opteron 270 (Italy)
>1   1    1    13   13   1    1    164  164  1       Core 2 Duo E8400 (Wolfdale)
>1   1    1    4.8  4.8  1    1    27   27   1       Xeon E31220 (Sandy Bridge)
>1   1    1    1.98 1.98 1    1    31   31   1       Core i3-3227U (Ivy Bridge)

Ok, now I have made a variant that uses dependent loads, unrolled by
2 to reduce the loop overhead, turning the inner loop into:

  4006a8:       4d 8b 24 24             mov    (%r12),%r12
  4006ac:       4d 8b 24 24             mov    (%r12),%r12
  4006b0:       83 c0 01                add    $0x1,%eax
  4006b3:       3d 80 f0 fa 02          cmp    $0x2faf080,%eax
  4006b8:       75 ee                   jne    4006a8 <main+0x94>

The files are at:
http://www.complang.tuwien.ac.at/anton/tmp/unaligned2.c
http://www.complang.tuwien.ac.at/anton/tmp/unaligned2

Results:

0   1    56   57   63   64   4088 4089 4095 4096
3   6    3    6    6    3    3    6    6    3     Opteron 270 (Italy)
3   3    3    15   15   3    3    159  159  3     Core 2 Duo E8400 (Wolfdale)
4   4    4    9    9    4    4    28   28   4     Xeon E31220 (Sandy Bridge)
4   4    4    9    9    4    4    32   32   4     Core i3-3227U (Ivy Bridge)

So on the AMD K8 every misaligned access costs extra latency; on
Intel's P6 descendants only accesses that cross a cache-line or page
boundary do, but there the penalty is larger.

- anton
-- 
M. Anton Ertl                    Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html