From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.lang.forth
Subject: Re: Parsing timestamps?
Date: Wed, 16 Jul 2025 11:25:04 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2025Jul16.132504@mips.complang.tuwien.ac.at>
mhx@iae.nl (mhx) writes:
>On Mon, 14 Jul 2025 7:50:04 +0000, Anton Ertl wrote:
>> C) Perform tree addition
>>
>> a) Using 80-bit addition. This will be faster than sequential
>> addition because in many cases several additions can run in
>> parallel. It will also be quite accurate because it uses 80-bit
>> addition, and because the addition chains are reduced to
>> ld(length(vector)).
>
>This looks very interesting. I can find Kahan and Neumaier, but
>"tree addition" didn't turn up (There is a suspicious looking
>reliability paper about the approach which surely is not what
>you meant). Or is it pairwise addition what I should look for?
Yes, "tree addition" is not a common term, and Wikipedia calls it
pairwise addition. Except that unlike suggeseted in
<https://en.wikipedia.org/wiki/Pairwise_summation> I would not switch to
a sequential approach for small n, for both accuracy and performance.
In any case the idea is to turn the evaluation tree from a degenerate
tree into a balanced tree. E.g., if you add up a, b, c, and d, then
the naive evaluation
a b f+ c f+ d f+
has the evaluation tree
a b
 \ /
  f+ c
   \ /
    f+ d
     \ /
      f+
with each of the three F+ depending on the result of the previous one,
which also increases the rounding errors.  If you balance the tree
a b c d
 \ / \ /
  f+  f+
    \ /
     f+
corresponding to
a b f+ c d f+ f+
the first two f+ can run in parallel (increasing performance), and the
rounding errors tend to be smaller.
So how to implement this for an arbitrary N? We had an extensive
discussion of a similar problem in the thread on the subject "balanced
REDUCE: a challenge for the brave", and you can find that discussion
at
<https://comp.lang.forth.narkive.com/GIg9V9HK/balanced-reduce-a-challenge-for-the-brave>
But I decided to use a recursive approach (recursive-sum, REC) that
uses the largest 2^k<n as the left child and the rest as the right
child, and that, as base cases for the recursion, uses straight-line
balanced-tree evaluations for 2^k with k<=7 (combining these for n
that are not 2^k). For systems with tiny FP stacks, I added the
option to save intermediate results on a software stack in the
recursive word. Concerning the straight-line code, it turned out that
the highest k I could use on sf64 and vfx64 is 5 (corresponding to 6
FP stack items); it's not clear to me why; on lxf I can use k=7 (and
it uses the 387 stack, too).
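To make the recursive idea concrete, here is a minimal sketch (not the
code from pairwise-sum.4th; rec-sum is an illustrative name, and it
splits at the midpoint instead of at the largest 2^k<n, with single
elements as base cases):

\ minimal sketch of a recursive pairwise sum over u floats at f-addr
: rec-sum ( f-addr u -- ) ( F: -- r )
  dup 2 u< if
    if f@ else drop 0e then exit        \ base cases: one element or none
  then
  2dup 2/ recurse                       \ F: r1  sum of the first u/2 elements
  dup 2/ >r r@ - swap r> floats + swap  \ address and count of the rest
  recurse                               \ F: r1 r2
  f+ ;

The pending partial sums live on the FP stack, so the depth grows like
log2(u); for 100_000 elements that exceeds the 8 items of the 387
stack, which is where a software stack for intermediate results helps
on systems with tiny FP stacks.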
I also coded the shift-reduce-sum algorithm (shift-reduce-sum, SR)
described in <https://en.wikipedia.org/wiki/Pairwise_summation> in
Forth, because it can make use of Forth's features (such as the FP
stack) where the C code has to hand-code it. It uses the FP stack
beyond 8 elements if there are more than 128 elements in the array, so
it does not work for the benchmark (with 100_000 elements in the
array) on lxf, sf64, and vfx64. As you will see, this is no loss.
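As a rough sketch of the shift-reduce idea (again not the code from
pairwise-sum.4th; sr-sum and popcount are illustrative names): after
pushing element number i onto the FP stack, the two topmost partial
sums are merged while i is even (halving i each time), so the partial
sums on the FP stack always correspond to the one bits of i, and their
number grows like log2 of the number of elements:

: popcount ( u -- n )  \ number of set bits in u
  0 swap begin dup while dup 1 and rot + swap 1 rshift repeat drop ;

: sr-sum ( f-addr u -- ) ( F: -- r )
  dup 0= if 2drop 0e exit then
  dup >r
  0 ?do
    dup f@ float+                \ push the next element
    i 1+                         \ number of elements consumed so far
    begin dup 1 and 0= while     \ while that count is even ...
      f+ 1 rshift                \ ... merge the two topmost partial sums
    repeat drop
  loop drop
  r> popcount 1 ?do f+ loop ;    \ merge the remaining partial sums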
I also coded the naive, sequential approach (naive-sum, NAI).
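For comparison, the naive loop is essentially a one-liner; this is just
an illustrative sketch, not necessarily the code in pairwise-sum.4th:

: naive-sum ( f-addr u -- ) ( F: -- r )
  0e 0 ?do dup f@ f+ float+ loop drop ;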
One might argue that the straight-line stuff in REC puts REC at an
advantage, so I also produced an unrolled version of the naive code
(unrolled-sum, UNR) that uses straight-line sequences for adding up to
2^7 elements to the intermediate result.
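Such a straight-line sequence is just the loop body repeated; e.g., a
piece that adds four consecutive elements to the running sum might look
like this (add4 is an illustrative name, not from pairwise-sum.4th):

: add4 ( f-addr -- f-addr' ) ( F: r -- r' )
  dup f@ f+ float+  dup f@ f+ float+
  dup f@ f+ float+  dup f@ f+ float+ ;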
You can find a file containing all these versions, compatibility
configurations for various Forth systems, and testing and benchmarking
code and data, at
https://www.complang.tuwien.ac.at/forth/programs/pairwise-sum.4th
I did not do any accuracy measurements, but I did performance
measurements on a Ryzen 5800X:
cycles:u
    gforth-fast         iforth            lxf     SwiftForth            VFX
  3_057_979_501  6_482_017_334  6_087_130_593  6_021_777_424  6_034_560_441  NAI
  6_601_284_920  6_452_716_125  7_001_806_497  6_606_674_147  6_713_703_069  UNR
  3_787_327_724  2_949_273_264  1_641_710_689  7_437_654_901  1_298_257_315  REC
  9_150_679_812 14_634_786_781                                               SR
instructions:u
    gforth-fast         iforth            lxf     SwiftForth            VFX
 13_113_842_702  6_264_132_870  9_011_308_923 11_011_828_048  8_072_637_768  NAI
  6_802_702_884  2_553_418_501  4_238_099_417 11_277_658_203  3_244_590_981  UNR
  9_370_432_755  4_489_562_792  4_955_679_285 12_283_918_226  3_915_367_813  REC
 51_113_853_111 29_264_267_850                                               SR
The versions used are:
Gforth 0.7.9_20250625
iForth 5.1-mini
lxf 1.7-172-983
SwiftForth x64-Linux 4.0.0-RC89
VFX Forth 64 5.43 [build 0199] 2023-11-09
The ":u" means that I measured what happened at the user-level, not at
the kernel-level.
Each benchmark run performs 1G f@ and f+ operations, and the naive
approach performs 1G iterations of the loop.
The NAIve and UNRolled results show that performance in both is
limited by the latency of the F+: 3 cycles for the DP SSE2 operation
in Gforth-fast, 6 cycles for the 80-bit 387 fadd on the other systems.
With 1G data-dependent F+ operations, that latency alone accounts for
about 3G and 6G cycles, respectively, which matches the NAI results.
It's unclear to me why UNR is much slower on gforth-fast than NAI.
The RECursive balanced-tree sum is faster on iForth, lxf and VFX than
the NAIve and UNRolled versions. It is slower on Gforth: My guess is
that, despite all hardware advances, the lack of multi-state stack
caching in Gforth means that the hardware of the Ryzen 5800X does not
just see the real data flow, but a lot of additional dependences; or
it may be related to whatever causes the slowdown for UNRolled.
The SR (shift-reduce) sum looks cute, but performs so many additional
instructions, even on iForth, that it is uncompetitive. It's unclear
to me what slows it down so much on iForth, however.
I expect that vectorized implementations using AVX will be several
times faster than the fastest scalar stuff we see here.
- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html