Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.lang.forth
Subject: Re: Stack vs stackless operation
Date: Tue, 25 Feb 2025 09:07:19 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 238
Message-ID: <2025Feb25.100719@mips.complang.tuwien.ac.at>
References: <591e7bf58ebb1f90bd34fba20c730b83@www.novabbs.com> <2025Feb24.225021@mips.complang.tuwien.ac.at> <aa770ad25cdcbae657eaf1d07ab09b53@www.novabbs.com> <2025Feb25.082658@mips.complang.tuwien.ac.at> <263758f7d46bb65dbfbab41c4bc2fcd3@www.novabbs.com>
Injection-Date: Tue, 25 Feb 2025 11:34:01 +0100 (CET)
Injection-Info: dont-email.me; posting-host="025acfbbe1d78c9898f57037ac05cd9b";
	logging-data="2013918"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1/x9fGFFmru6Bm5uiR6t9HJ"
Cancel-Lock: sha1:Ur3FsxHd68IIe6MnKtAd9J5TA28=
X-newsreader: xrn 10.11
Bytes: 10397

zbigniew2011@gmail.com (LIT) writes:
[Anton Ertl:]
>> Earlier you wrote about performance, now you switch to clarity of the
>> code.  What is the goal?
>
>Both — one isn't contrary to another.

Sometimes the clearer code is slower and the faster code is less clear
(as in the FAXPY-NOSTRIDE example).

>What I have in mind is: by performing OOS operation
>we don't have to employ the whole "Forth machine" to
>do the usual things (I mean, of course, the usual
>steps described by Brad Rodriguez in his "Moving
>Forth" paper).

What does "OOS" stand for?  And what do you mean by "the usual
steps"?  I am not going to read the whole paper and guess which of the
code shown there you have in mind.

>It comes with a cost: usual Forth words, that use
>the stack, are versatile, while such OOS words
>aren't that versatile anymore — yet (at least in
>the case of ITC non-optimizing Forths) they should
>be faster.

One related thing is the work on "register"-based virtual machines.
For interpreted implementations the VM registers are in memory, but
they are accessed by "register number"; these usually correspond to
locals slots on machines like the JavaVM.  A well-known example of
that is the switch of the Lua VM from stack-based to register-based.
A later example is Android's Dalvik VM for Java, in contrast to the
stack-based JavaVM.

There is a paper [shi+08] that provides an academic justification for
this approach.  The gist of it is that, with some additional compiler
complexity, the register-based machine can reduce the number of NEXTs
(in Forth threaded-code terminology); depending on the implementation
approach and the hardware, the NEXTs could be the major cost at the
time.  However, already at the time, dynamic superinstructions (an
implementation technique for virtual-machine interpreters) reduced the
number of NEXTs to one per basic block, and VM registers did nothing
to reduce NEXTs in that case; Shi et al. also showed that with a lot
of compiler sophistication (data flow analysis etc.) VM registers can
be as fast as stacks even with dynamic superinstructions.

However, given that dynamic superinstructions are easier to implement
and VM registers give no benefit once dynamic superinstructions are
employed, why
would one go for VM registers?  Of course, in the Forth setting one
could offload the optimization onto the programmer, but even Chuck
Moore did not go there.

In any case, here's an example extracted from Figure 6 of the paper:

   Java VM (stack)     VM registers
19 iload_1
20 bipush #31          iconst #31 -> r1
21 imul                imul r6 r1 -> r3
22 aload_0
23 getfield value      getfield r0.value -> r5
24 iload_3
26 caload              caload r5 r7 -> r5
27 iadd                iadd r3 r5 -> r6
28 istore_1

So yes, the VM register code contains fewer VM instructions.  Is it
clearer?

The corresponding Gforth code is the stuff between IF and THEN in the
following:

0
value: some-field
value: value
constant some-struct

: foo
  {: r0 r1 r3 :}
  if
    r1 31 * r0 value r3 + c@ + to r1
  then
  r1 ;

The code that Gforth produces for the basic block under consideration
is:

$7FC624AA0958 @local1    1->1 
7FC62464A5BA:   mov     [r10],r13
7FC62464A5BD:   sub     r10,$08
7FC62464A5C1:   mov     r13,$08[rbp]
$7FC624AA0960 lit    1->2 
$7FC624AA0968 #31 
7FC62464A5C5:   sub     rbx,$50
7FC62464A5C9:   mov     r15,-$08[rbx]
$7FC624AA0970 *    2->1 
7FC62464A5CD:   imul    r13,r15
$7FC624AA0978 @local0    1->2 
7FC62464A5D1:   mov     r15,$00[rbp]
$7FC624AA0980 lit+    2->2 
$7FC624AA0988 #8 
7FC62464A5D5:   add     r15,$18[rbx]
$7FC624AA0990 @    2->2 
7FC62464A5D9:   mov     r15,[r15]
$7FC624AA0998 @local2    2->3 
7FC62464A5DC:   mov     r9,$10[rbp]
$7FC624AA09A0 +    3->2 
7FC62464A5E0:   add     r15,r9
$7FC624AA09A8 c@    2->2 
7FC62464A5E3:   movzx   r15d,byte PTR [r15]
$7FC624AA09B0 +    2->1 
7FC62464A5E7:   add     r13,r15
$7FC624AA09B8 !local1    1->1 
7FC62464A5EA:   add     r10,$08
7FC62464A5EE:   mov     $08[rbp],r13
7FC62464A5F2:   mov     r13,[r10]
7FC62464A5F5:   add     rbx,$50

There are 8 loads and 2 stores in that code.  If the VM registers are
held in memory (as they usually are, and as the Gforth locals are),
the VM register code performs at least 9 loads (7 register accesses,
the getfield, and the caload) and 5 stores.  Of course, in Forth one
would write the block as:

: foo1 ( n3 a0 n1 -- n )
  31 * swap value rot + c@ + ;

and the code for that is (without the ";"):

$7FC624AA0A10 lit    1->2 
$7FC624AA0A18 #31 
7FC62464A617:   mov     r15,$08[rbx]
$7FC624AA0A20 *    2->1 
7FC62464A61B:   imul    r13,r15
$7FC624AA0A28 swap    1->2 
7FC62464A61F:   mov     r15,$08[r10]
7FC62464A623:   add     r10,$08
$7FC624AA0A30 lit+    2->2 
$7FC624AA0A38 #8 
7FC62464A627:   add     r15,$28[rbx]
$7FC624AA0A40 @    2->2 
7FC62464A62B:   mov     r15,[r15]
$7FC624AA0A48 rot    2->3 
7FC62464A62E:   mov     r9,$08[r10]
7FC62464A632:   add     r10,$08
$7FC624AA0A50 +    3->2 
7FC62464A636:   add     r15,r9
$7FC624AA0A58 c@    2->2 
7FC62464A639:   movzx   r15d,byte PTR [r15]
$7FC624AA0A60 +    2->1 
7FC62464A63D:   add     r13,r15

6 loads, 0 stores.

And if we feed the equivalent standard code

0
field: some-field
field: value-addr
constant some-struct

: foo1 ( n3 a0 n1 -- n )
  31 * swap value-addr @ rot + c@ + ;

into other Forth systems, some produce even better code:

VFX Forth 64 5.43 [build 0199] 2023-11-09 for Linux x64
FOO1 
( 0050A310    486BDB1F )              IMUL    RBX, RBX, # 1F
( 0050A314    488B5500 )              MOV     RDX, [RBP]
( 0050A318    488B4D08 )              MOV     RCX, [RBP+08]
( 0050A31C    48034A08 )              ADD     RCX, [RDX+08]
( 0050A320    480FB609 )              MOVZX   RCX, Byte 0 [RCX]
( 0050A324    4803D9 )                ADD     RBX, RCX
( 0050A327    488D6D10 )              LEA     RBP, [RBP+10]
( 0050A32B    C3 )                    RET/NEXT
( 28 bytes, 8 instructions )

5 loads, 0 stores.  And VFX does not do data-flow analysis across
basic blocks, unlike the Java VM -> VM register compiler that Shi
used; i.e., VFX is probably simpler than the compiler Shi used.

@Article{shi+08,
  author =       {Yunhe Shi and Kevin Casey and M. Anton Ertl and
                  David Gregg},
========== REMAINDER OF ARTICLE TRUNCATED ==========