Path: ...!weretis.net!feeder6.news.weretis.net!feeder8.news.weretis.net!eternal-september.org!reader02.eternal-september.org!.POSTED!not-for-mail From: anton@mips.complang.tuwien.ac.at (Anton Ertl) Newsgroups: comp.lang.forth Subject: Efficient CMOVE Date: Wed, 01 Sep 2021 21:34:40 GMT Organization: Institut fuer Computersprachen, Technische Universitaet Wien Lines: 71 Message-ID: <2021Sep1.233440@mips.complang.tuwien.ac.at> References: <612d0520$0$705$14726298@news.sunsite.dk> <87mtoytul6.fsf@nightsong.com> <2021Aug31.091034@mips.complang.tuwien.ac.at> <2021Aug31.181205@mips.complang.tuwien.ac.at> <2021Sep1.175835@mips.complang.tuwien.ac.at> Injection-Info: reader02.eternal-september.org; posting-host="b8d65dc0511cc8ebe9ae0a45acf55f13"; logging-data="22210"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX18fCbCz2ICI2rM/jXWHlham" Cancel-Lock: sha1:cJNE1axDIy83ZMU2XUq978HvQaA= X-newsreader: xrn 10.00-beta-3 Bytes: 4197 I suggested that using a single CMOVE for filling a memory region with a pattern is inefficient and that we could use an approach that doubles the filled buffer size at each step (and without overlaps in copying, so one could use either CMOVE, MOVE, or CMOVE>, as convenient). So I wanted to test this. As it turns out, the resulting word also satisfies the requirements for CMOVE, so I call it CMOVE, too: : cmove {: afrom ato u | u1 -- :} ato afrom - u u< if \ overlapping case ato afrom - u over + to u1 begin afrom afrom 2 pick + 2 pick u1 over - min cmove 2* dup u u>= until drop else \ non-overlapping case ato afrom u cmove then ; This CMOVE calls the simple CMOVE, because that is the fastest variant for VFX and lxf. Eliminating the locals is left as an exercise to the puristic readers. I benchmarked this with: : bench {: usize ulength -- :} usize ulength + allocate throw ulength 0 ?do 'a' i + over i + c! loop 10000000 0 ?do dup dup ulength + usize cmove loop ; usize is the size of the yet-unfilled buffer, ulength is the length of the pattern. I used usize=1000 and ulength from 1..5. Here are the results in cycles/iteration on Zen 3 (Ryzen 5800X): VFX64 lxf VFX32 orig new orig new orig new ulength 7474 1125 7624 1095 7807 423 1 3360 965 3140 933 4273 420 2 2367 957 2380 922 3018 411 3 2068 790 2147 761 2418 400 4 1645 779 1766 754 2083 426 5 As can be seen, the new CMOVE is substantially faster than the REP MOVSB based one for this use case. Of course, the question remains how relevant this use case is; but then, this is the only use case where CMOVE behaves differently from MOVE, so it may be a justification for keeping CMOVE. We see that, the larger ulength is, the faster does the REP MOVSB perform, because it can copy more bytes during the latency of the store-load-store chain that determines performance. For the new CMOVE, we also see a slight improvement with increasing ulength, because there are then fewer iterations of the loop in the new CMOVE; the cost of such an iteration at ~160 cycles is surprisingly high; this may have to do with the cache behaviour of REP MOVSB. VFX32 uses a simple byte-copying loop rather than REP MOVSB for the overlapping case (orig), and performs similar to the REP MOVSB implemenation. VFX32 uses a cell-copying loop in the non-overlapping case (new), resulting in even faster performance; also, we see that the number of iterations within the new CMOVE does not have as big an influence as for VFX64 and lxf. - anton -- M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html New standard: http://www.forth200x.org/forth200x.html EuroForth 2021: https://euro.theforth.net/2021