From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.lang.forth
Subject: Re: Parsing timestamps?
Date: Wed, 16 Jul 2025 11:25:04 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Message-ID: <2025Jul16.132504@mips.complang.tuwien.ac.at>
mhx@iae.nl (mhx) writes:
>On Mon, 14 Jul 2025 7:50:04 +0000, Anton Ertl wrote:
>> C) Perform tree addition
>>
>> a) Using 80-bit addition. This will be faster than sequential
>> addition because in many cases several additions can run in
>> parallel. It will also be quite accurate because it uses 80-bit
>> addition, and because the addition chains are reduced to
>> ld(length(vector)).
>
>This looks very interesting. I can find Kahan and Neumaier, but
>"tree addition" didn't turn up (There is a suspicious looking
>reliability paper about the approach which surely is not what
>you meant). Or is it pairwise addition what I should look for?
Yes, "tree addition" is not a common term, and Wikipedia calls it
pairwise addition. Except that unlike suggeseted in
<https://en.wikipedia.org/wiki/Pairwise_summation> I would not switch to
a sequential approach for small n, for both accuracy and performance.
In any case the idea is to turn the evaluation tree from a degenerate
tree into a balanced tree. E.g., if you add up a, b, c, and d, then
the naive evaluation
a b f+ c f+ d f+
has the evaluation tree
a b
 \ /
  f+ c
   \ /
    f+ d
     \ /
      f+
with each of the three F+ depending on the result of the previous one,
which also increases the rounding errors.  If you balance the tree
a b c d
 \ / \ /
  f+  f+
    \ /
     f+
corresponding to
a b f+ c d f+ f+
the first two f+ can run in parallel (increasing performance), and the
rounding errors tend to be smaller.
So how to implement this for an arbitrary N? We had an extensive
discussion of a similar problem in the thread on the subject "balanced
REDUCE: a challenge for the brave", and you can find that discussion
at
<https://comp.lang.forth.narkive.com/GIg9V9HK/balanced-reduce-a-challenge-for-the-brave>
But I decided to use a recursive approach (recursive-sum, REC) that
uses the largest 2^k<n as the left child and the rest as the right
child, and that, as base cases for the recursion, uses straight-line
balanced-tree evaluations for 2^k with k<=7 (combining these for n
that are not 2^k). For systems with tiny FP stacks, I added the
option to save intermediate results on a software stack in the
recursive word. Concerning the straight-line code, it turned out that
the highest k I could use on sf64 and vfx64 is 5 (corresponding to 6
FP stack items); it's not clear to me why; on lxf I can use k=7 (and
it uses the 387 stack, too).
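To make the recursive idea concrete, here is a minimal sketch (not the
code from pairwise-sum.4th; rec-sum is an illustrative name, and it
splits at the midpoint instead of at the largest 2^k<n, with single
elements as base cases):

\ minimal sketch of a recursive pairwise sum over u floats at f-addr
: rec-sum ( f-addr u -- ) ( F: -- r )
  dup 2 u< if
    if f@ else drop 0e then exit        \ base cases: one element or none
  then
  2dup 2/ recurse                       \ F: r1  sum of the first u/2 elements
  dup 2/ >r r@ - swap r> floats + swap  \ address and count of the rest
  recurse                               \ F: r1 r2
  f+ ;

The pending partial sums live on the FP stack, so the depth grows like
log2(u); for 100_000 elements that exceeds the 8 items of the 387
stack, which is where a software stack for intermediate results helps
on systems with tiny FP stacks.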
I also coded the shift-reduce-sum algorithm (shift-reduce-sum, SR)
described in <https://en.wikipedia.org/wiki/Pairwise_summation> in
Forth, because it can make use of Forth's features (such as the FP
stack) where the C code has to hand-code it. It uses the FP stack
beyond 8 elements if there are more than 128 elements in the array, so
it does not work for the benchmark (with 100_000 elements in the
array) on lxf, sf64, and vfx64. As you will see, this is no loss.
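As a rough sketch of the shift-reduce idea (again not the code from
pairwise-sum.4th; sr-sum and popcount are illustrative names): after
pushing element number i onto the FP stack, the two topmost partial
sums are merged while i is even (halving i each time), so the partial
sums on the FP stack always correspond to the one bits of i, and their
number grows like log2 of the number of elements:

: popcount ( u -- n )  \ number of set bits in u
  0 swap begin dup while dup 1 and rot + swap 1 rshift repeat drop ;

: sr-sum ( f-addr u -- ) ( F: -- r )
  dup 0= if 2drop 0e exit then
  dup >r
  0 ?do
    dup f@ float+                \ push the next element
    i 1+                         \ number of elements consumed so far
    begin dup 1 and 0= while     \ while that count is even ...
      f+ 1 rshift                \ ... merge the two topmost partial sums
    repeat drop
  loop drop
  r> popcount 1 ?do f+ loop ;    \ merge the remaining partial sums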
I also coded the naive, sequential approach (naive-sum, NAI).
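For comparison, the naive loop is essentially a one-liner; this is just
an illustrative sketch, not necessarily the code in pairwise-sum.4th:

: naive-sum ( f-addr u -- ) ( F: -- r )
  0e 0 ?do dup f@ f+ float+ loop drop ;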
One might argue that the straight-line stuff in REC puts REC at an
advantage, so I also produced an unrolled version of the naive code
(unrolled-sum, UNR) that uses straight-line sequences for adding up to
2^7 elements to the intermediate result.
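Such a straight-line sequence is just the loop body repeated; e.g., a
piece that adds four consecutive elements to the running sum might look
like this (add4 is an illustrative name, not from pairwise-sum.4th):

: add4 ( f-addr -- f-addr' ) ( F: r -- r' )
  dup f@ f+ float+  dup f@ f+ float+
  dup f@ f+ float+  dup f@ f+ float+ ;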
You can find a file containing all these versions, compatibility
configurations for various Forth systems, and testing and benchmarking
code and data, at
https://www.complang.tuwien.ac.at/forth/programs/pairwise-sum.4th
I did not do any accuracy measurements, but I did performance
measurements on a Ryzen 5800X:
cycles:u
    gforth-fast         iforth            lxf     SwiftForth            VFX
  3_057_979_501  6_482_017_334  6_087_130_593  6_021_777_424  6_034_560_441  NAI
  6_601_284_920  6_452_716_125  7_001_806_497  6_606_674_147  6_713_703_069  UNR
  3_787_327_724  2_949_273_264  1_641_710_689  7_437_654_901  1_298_257_315  REC
  9_150_679_812 14_634_786_781                                               SR
instructions:u
    gforth-fast         iforth            lxf     SwiftForth            VFX
 13_113_842_702  6_264_132_870  9_011_308_923 11_011_828_048  8_072_637_768  NAI
  6_802_702_884  2_553_418_501  4_238_099_417 11_277_658_203  3_244_590_981  UNR
  9_370_432_755  4_489_562_792  4_955_679_285 12_283_918_226  3_915_367_813  REC
 51_113_853_111 29_264_267_850                                               SR
The versions used are:
Gforth 0.7.9_20250625
iForth 5.1-mini
lxf 1.7-172-983
SwiftForth x64-Linux 4.0.0-RC89
VFX Forth 64 5.43 [build 0199] 2023-11-09
The ":u" means that I measured what happened at the user-level, not at
the kernel-level.
Each benchmark run performs 1G f@ and f+ operations, and the naive
approach performs 1G iterations of the loop.
The NAIve and UNRolled results show that performance in both is
limited by the latency of the F+: 3 cycles for the DP SSE2 operation
in Gforth-fast, 6 cycles for the 80-bit 387 fadd on the other systems.
With 1G data-dependent F+ operations, that latency alone accounts for
about 3G and 6G cycles, respectively, which matches the NAI results.
It's unclear to me why UNR is much slower on gforth-fast than NAI.
The RECursive balanced-tree sum is faster on iForth, lxf and VFX than
the NAIve and UNRolled versions. It is slower on Gforth: My guess is
that, despite all hardware advances, the lack of multi-state stack
caching in Gforth means that the hardware of the Ryzen 5800X does not
just see the real data flow, but a lot of additional dependences; or
it may be related to whatever causes the slowdown for UNRolled.
The SR (shift-reduce) sum looks cute, but performs so many additional
instructions, even on iForth, that it is uncompetitive. It's unclear
to me what slows it down so much on iForth, however.
I expect that vectorized implementations using AVX will be several
times faster than the fastest scalar stuff we see here.
- anton
--
M. Anton Ertl http://www.complang.tuwien.ac.at/anton/home.html
comp.lang.forth FAQs: http://www.complang.tuwien.ac.at/forth/faq/toc.html
New standard: https://forth-standard.org/
EuroForth 2025 CFP: http://www.euroforth.org/ef25/cfp.html