From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: The cost of unaligned accesses
Date: Tue, 31 Mar 2015 14:03:16 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2015Mar31.160316@mips.complang.tuwien.ac.at>
References: <2015Mar31.135537@mips.complang.tuwien.ac.at>

anton@mips.complang.tuwien.ac.at (Anton Ertl) writes:
>I was playing around with my hashing code, and noticed that unaligned
>accesses can cost much more than I expected.  So I wrote a
>microbenchmark for exercising unaligned accesses.  You can find the
>source code at
>
>http://www.complang.tuwien.ac.at/anton/tmp/unaligned.c
>
>and a Linux binary at
>
>http://www.complang.tuwien.ac.at/anton/tmp/unaligned
>
>You call it with
>
>unaligned 0
>
>where the offset specifies how far from a page start the 8-byte access
>happens.
>The benchmark performs 100M such accesses and the inner loop
>is:
>
>  4006bc:       49 8b 55 00             mov    0x0(%r13),%rdx
>  4006c0:       83 c0 01                add    $0x1,%eax
>  4006c3:       3d 00 e1 f5 05          cmp    $0x5f5e100,%eax
>  4006c8:       75 f2                   jne    4006bc
>
>The results for different offsets are (in cycles per loop iteration):
>
>   0    1   56   57   63   64 4088 4089 4095 4096
>   3    3    3    3    3    3    3    3    3    3  Opteron 270 (Italy)
>   1    1    1   13   13    1    1  164  164    1  Core 2 Duo E8400 (Wolfdale)
>   1    1    1  4.8  4.8    1    1   27   27    1  Xeon E31220 (Sandy Bridge)
>   1    1    1 1.98 1.98    1    1   31   31    1  Core i3-3227U (Ivy Bridge)

Ok, now I have made a variant that uses dependent loads, and uses
unrolling (by 2) to reduce the loop overhead, turning the inner loop
into:

  4006a8:       4d 8b 24 24             mov    (%r12),%r12
  4006ac:       4d 8b 24 24             mov    (%r12),%r12
  4006b0:       83 c0 01                add    $0x1,%eax
  4006b3:       3d 80 f0 fa 02          cmp    $0x2faf080,%eax
  4006b8:       75 ee                   jne    4006a8

The files are at:

http://www.complang.tuwien.ac.at/anton/tmp/unaligned2.c
http://www.complang.tuwien.ac.at/anton/tmp/unaligned2

Results:

   0    1   56   57   63   64 4088 4089 4095 4096
   3    6    3    6    6    3    3    6    6    3  Opteron 270 (Italy)
   3    3    3   15   15    3    3  159  159    3  Core 2 Duo E8400 (Wolfdale)
   4    4    4    9    9    4    4   28   28    4  Xeon E31220 (Sandy Bridge)
   4    4    4    9    9    4    4   32   32    4  Core i3-3227U (Ivy Bridge)

So on the AMD K8, misalignment costs latency; on Intel's P6
descendants it costs latency only when the access crosses a cache-line
or page boundary, but there it costs more.

- anton
-- 
M. Anton Ertl                    Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
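
For readers who cannot fetch the linked files, here is a minimal sketch of
the dependent-load idea.  It is not the code in unaligned2.c; the function
name chase_self_pointer, the constant PAGE_BYTES (an assumed 4096-byte
page) and the GCC/Clang-style empty asm barrier are all assumptions made
for illustration.  An 8-byte pointer that holds its own address is stored
at the given offset from a page start, and each load's result is the
address of the next load.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_BYTES 4096u   /* assumption: 4 KiB pages */

/* Hypothetical helper: performs 2*iterations dependent loads through a
   pointer stored `offset` bytes past a page start.  Returns the number
   of loads done, or 0 on a bad offset or allocation failure. */
unsigned long chase_self_pointer(size_t offset, unsigned long iterations)
{
    assert(sizeof(void *) <= 8);
    if (offset > PAGE_BYTES)
        return 0;

    /* Two page-aligned pages, so an access at offset 4089..4096 fits. */
    unsigned char *buf = aligned_alloc(PAGE_BYTES, 2 * PAGE_BYTES);
    if (buf == NULL)
        return 0;

    unsigned char *slot = buf + offset;
    void *p = slot;
    memcpy(slot, &p, sizeof p);        /* the slot now holds its own address */

    unsigned long loads = 0;
    for (unsigned long i = 0; i < iterations; i++) {
        /* memcpy keeps the possibly unaligned load well defined in C;
           on x86-64, compilers typically turn it into a single mov. */
        memcpy(&p, p, sizeof p);
        /* Empty asm (GCC/Clang extension) stops the compiler from
           folding the chain of loads away. */
        __asm__ volatile("" : "+r"(p) : : "memory");
        memcpy(&p, p, sizeof p);       /* unrolled by 2, as in the post */
        __asm__ volatile("" : "+r"(p) : : "memory");
        loads += 2;
    }

    int intact = (p == (void *)slot);  /* chain must still point at the slot */
    free(buf);
    return intact ? loads : 0;
}
```

Calling chase_self_pointer(4095, 50000000UL) from a small main and running
it under an external cycle counter (perf stat, for example) would exercise
the page-crossing case; timing is deliberately left out of the sketch.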