Path: ...!weretis.net!feeder6.news.weretis.net!feeder8.news.weretis.net!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.lang.forth
Subject: Efficient CMOVE
Date: Wed, 01 Sep 2021 21:34:40 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 71
Message-ID: <2021Sep1.233440@mips.complang.tuwien.ac.at>
References: <sghpg8$1sh6$1@gioia.aioe.org> <sgiipl$bbc$1@dont-email.me> <sgiouu$k89$1@dont-email.me> <612d0520$0$705$14726298@news.sunsite.dk> <sgj7q8$t36$1@dont-email.me> <87mtoytul6.fsf@nightsong.com> <sgjuci$7hb$1@dont-email.me> <2021Aug31.091034@mips.complang.tuwien.ac.at> <sgl7dk$6q5$1@dont-email.me> <2021Aug31.181205@mips.complang.tuwien.ac.at> <sgo4qk$ton$1@dont-email.me> <2021Sep1.175835@mips.complang.tuwien.ac.at>
Injection-Info: reader02.eternal-september.org; posting-host="b8d65dc0511cc8ebe9ae0a45acf55f13";
	logging-data="22210"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX18fCbCz2ICI2rM/jXWHlham"
Cancel-Lock: sha1:cJNE1axDIy83ZMU2XUq978HvQaA=
X-newsreader: xrn 10.00-beta-3
Bytes: 4197

I suggested that using a single CMOVE for filling a memory region with
a pattern is inefficient and that we could use an approach that
doubles the filled buffer size at each step (and without overlaps in
copying, so one could use either CMOVE, MOVE, or CMOVE>, as
convenient).  So I wanted to test this.

As it turns out, the resulting word also satisfies the requirements
for CMOVE, so I call it CMOVE, too:

: cmove {: afrom ato u | u1 -- :}
    ato afrom - u u< if \ overlapping case
	ato afrom - u over + to u1 begin
	    afrom afrom 2 pick + 2 pick u1 over - min cmove
	    2* dup u u>= until
	drop
    else \ non-overlapping case
	ato afrom u cmove
    then ;

This CMOVE calls the simple CMOVE, because that is the fastest variant
for VFX and lxf.

Eliminating the locals is left as an exercise to the puristic readers.

I benchmarked this with:

: bench {: usize ulength -- :}
    usize ulength + allocate throw
    ulength 0 ?do 'a' i + over i + c! loop
    10000000 0 ?do dup dup ulength + usize cmove loop ;

usize is the size of the yet-unfilled buffer, ulength is the length of
the pattern.  I used usize=1000 and ulength from 1..5.

Here are the results in cycles/iteration on Zen 3 (Ryzen 5800X):

  VFX64       lxf       VFX32
orig  new  orig  new  orig  new  ulength
7474 1125  7624 1095  7807  423     1
3360  965  3140  933  4273  420     2
2367  957  2380  922  3018  411     3
2068  790  2147  761  2418  400     4
1645  779  1766  754  2083  426     5

As can be seen, the new CMOVE is substantially faster than the REP
MOVSB based one for this use case.  Of course, the question remains
how relevant this use case is; but then, this is the only use case
where CMOVE behaves differently from MOVE, so it may be a
justification for keeping CMOVE.

We see that, the larger ulength is, the faster does the REP MOVSB
perform, because it can copy more bytes during the latency of the
store-load-store chain that determines performance.  For the new
CMOVE, we also see a slight improvement with increasing ulength,
because there are then fewer iterations of the loop in the new CMOVE;
the cost of such an iteration at ~160 cycles is surprisingly high;
this may have to do with the cache behaviour of REP MOVSB.

VFX32 uses a simple byte-copying loop rather than REP MOVSB for the
overlapping case (orig), and performs similar to the REP MOVSB
implemenation.  VFX32 uses a cell-copying loop in the non-overlapping
case (new), resulting in even faster performance; also, we see that
the number of iterations within the new CMOVE does not have as big an
influence as for VFX64 and lxf.

- anton
-- 
M. Anton Ertl  http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
     New standard: http://www.forth200x.org/forth200x.html
   EuroForth 2021: https://euro.theforth.net/2021