| Deutsch English Français Italiano |
|
<2025Mar1.091806@mips.complang.tuwien.ac.at> View for Bookmarking (what is this?) Look up another Usenet article |
Path: ...!weretis.net!feeder9.news.weretis.net!news.quux.org!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!eternal-september.org!.POSTED!not-for-mail From: anton@mips.complang.tuwien.ac.at (Anton Ertl) Newsgroups: comp.lang.forth Subject: Stack caching (: Stack vs stackless operation) Date: Sat, 01 Mar 2025 08:18:06 GMT Organization: Institut fuer Computersprachen, Technische Universitaet Wien Lines: 214 Message-ID: <2025Mar1.091806@mips.complang.tuwien.ac.at> References: <591e7bf58ebb1f90bd34fba20c730b83@www.novabbs.com> <34df278ef0a52d0eab9d035f45795389@www.novabbs.com> <a6c9d8a0aa7e1046af7948093e07cff0@www.novabbs.com> <2025Feb26.153250@mips.complang.tuwien.ac.at> <2025Feb26.184613@mips.complang.tuwien.ac.at> <2025Feb28.225505@mips.complang.tuwien.ac.at> <87wmd9n0xx.fsf@nightsong.com> <2025Mar1.083209@mips.complang.tuwien.ac.at> Injection-Date: Sat, 01 Mar 2025 10:45:47 +0100 (CET) Injection-Info: dont-email.me; posting-host="8853c40709c89d32b754b24417cf3527"; logging-data="190185"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1+bSm4G8TKDiYNmt+o9z+rH" Cancel-Lock: sha1:EHQSeB/egKjYxeTnNHo+v1U+GGE= X-newsreader: xrn 10.11 Bytes: 10148 anton@mips.complang.tuwien.ac.at (Anton Ertl) writes: >see-code exchange see-code exchange4 see-code exchange2 >over 1->2 dup 1->2 dup >r 1->1 > mov r15,$08[r12] mov r15,r8 >r 1->1 >@ 2->2 @ 2->2 mov -$08[r13],r8 > mov r15,[r15] mov r15,[r15] sub r13,$08 >swap 2->3 rot 2->3 @ 1->1 > add r12,$08 mov r9,$08[r12] mov r8,[r8] > mov r9,r8 add r12,$08 over 1->2 > mov r8,[r12] !@ 3->2 mov r15,$08[r12] >!@ 3->2 mov rax,r15 @ 2->2 > mov rax,r15 mov r15,[r9] mov r15,[r15] > mov r15,[r9] mov [r9],rax r> 2->3 > mov [r9],rax swap 2->3 mov r9,$00[r13] >swap 2->3 add r12,$08 add r13,$08 > add r12,$08 mov r9,r8 ! 3->1 > mov r9,r8 mov r8,[r12] mov [r9],r15 > mov r8,[r12] ! 3->1 swap 1->2 >! 3->1 mov [r9],r15 mov r15,$08[r12] > mov [r9],r15 ;s 1->1 add r12,$08 >;s 1->1 mov rbx,$00[r13] ! 2->0 > mov rbx,$00[r13] add r13,$08 mov [r15],r8 > add r13,$08 mov rax,[rbx] ;s 0->1 > mov rax,[rbx] jmp eax mov r8,$08[r12] > jmp eax add r12,$08 > mov rbx,$00[r13] > add r13,$08 > mov rax,[rbx] > jmp eax The difference between exchange and exchange4 shows how stack caching can have a hard-to-predict effect. Gforth searches for the shortest path through the available stack-cache states, where shortness is defined by the native-code length. E.g., it starts with state 1, and from there it can use any of the dup variants starting in state 1, or first transition to another state and use a dup variant starting from there. For SWAP and ROT gforth-fast has the following variants: primitive in-out # code bytes swap 1-1 132 len= 4+ 13+ 3 swap 2-2 37 len= 4+ 9+ 3 swap 3-3 4 len= 4+ 9+ 3 swap 0-2 8 len= 4+ 14+ 3 swap 1-2 82 len= 4+ 9+ 3 swap 2-1 74 len= 4+ 8+ 3 swap 2-3 30 len= 4+ 11+ 3 swap 3-2 3 len= 4+ 11+ 3 swap 2-0 20 len= 4+ 13+ 3 rot 1-1 46 len= 4+ 23+ 3 rot 3-3 6 len= 4+ 12+ 3 rot 3-1 24 len= 4+ 13+ 3 rot 2-3 15 len= 4+ 9+ 3 rot 1-3 17 len= 4+ 17+ 3 rot 0-3 1 len= 4+ 19+ 3 You get these data (in a rawer form) with gforth-fast --print-prim -e bye |& grep ^swap gforth-fast --print-prim -e bye |& grep ^rot The in column is the stack-cache state on entering the word, the out column is the stack-cache state on leaving the word. The # column shows how many times this variant of the primitive is used (static counts). The code length colum shows three parts, the middle of which is the part that's copied to dynamic superinstructions like the ones shown above, and this length is what is used in the search for the shortest path. In EXCHANGE4, the shortest variant of ROT is used: ROT 2->3; and the primitives of the other variants are selected to also result in the shortest overall code. In EXCHANGE and EXCHANGE4, SWAP 2->3 is not the shortest variant of SWAP, not even the shortest variant starting from state 2, but ending in state 3 allows to use cheap variants of further primitives such as !@ and !, resulting in the overall shortest code for this sequence. In EXCHANGE2, we see the selection of a shorter version of SWAP, but one of the costs is that ;s becomes longer (but in this case the overall savings from using a shorter version of SWAP and shorter versions of earlier instructions make up for that). Why am I looking at this? For stack-shuffling primitives like SWAP and ROT, it's not obvious which variant is how long and which variant should be selected. These stack-shiffling words therefore are also good candidates for performing stack-cache state transitions that would otherwise require inserting extra transition code: E.g., EXCHANGE and its variants consume two stack items, but need to start in stack-cache state 1 and end in stack-cache state 1 (gforth is currently not smart enough to deal with other states at basic-block boundaries), so not everything can be done in the stack cache; the stack pointer needs to be increased by two cells, and there need to be accesses to the memory part of the stack for two stack items. In EXCHANGE, the adjustment by one cell and memory access for one stack item is done in the first SWAP 2->3, and another one in the second one. In EXCHANGE4, ROT 2->3 and SWAP 2->3 perform these tasks. In EXCHANGE2, the SWAP 1->3 does it for one cell, and the stack-cache state transition 0->1 in the first two instructions of ;s do it for the other cell (gforth-fast actually does not have a built-in variant ;S 0->1 and the code shown as ;S 0->1 by SEE-CODE is actually composed of a transition 0->1 and the ;S 1->1 variant). I wanted to know how often which variant of these stack-shuffling primitives is used, and how this relates to their length. One interesting result is that ROT 1->3 is used relatively frequently despite having relatively long code. Apparently the code that comes before these 17 instances of ROT benefits a lot from being in stack-cache state 1 and this amortizes the longer code for the ROT 1->3 compared to ROT 2->3. Another interesting result is the low usage of SWAP 3->2 compared to SWAP 2->3. This may say something about how SWAP is used in Forth programs. Or it may be an artifact of tie-breaking: If two paths have the same length, one is chosen rather arbitrarily, but consistently, and this may make one variant appear more useful than merited by the benefit that the existence of the variant has on code length. For those interested, here's the code for the various variants shown above: r12: stack pointer r8: stack cache register a (tos in state 1, 2nd in state 2, 3rd in state 3) r15: stack-cache register b (tos in state 2, 2nd in state 3) r9: stack-cache register c (tos in state 3) swap 1-1 559E7F769425: mov rax,$08[r12] 559E7F76942A: mov $08[r12],r8 559E7F76942F: mov r8,rax swap 2-2 559E7F76E5B1: mov rax,r8 559E7F76E5B4: mov r8,r15 559E7F76E5B7: mov r15,rax swap 3-3 559E7F76E5C3: mov rax,r15 559E7F76E5C6: mov r15,r9 559E7F76E5C9: mov r9,rax swap 0-2 559E7F76E5D5: mov r15,$10[r12] 559E7F76E5DA: mov r8,$08[r12] 559E7F76E5DF: add r12,$10 swap 1-2 559E7F76E5EC: mov r15,$08[r12] 559E7F76E5F1: add r12,$08 swap 2-1 559E7F76E5FE: mov [r12],r15 559E7F76E602: sub r12,$08 swap 2-3 559E7F76E60F: add r12,$08 559E7F76E613: mov r9,r8 559E7F76E616: mov r8,[r12] swap 3-2 559E7F76E623: mov [r12],r8 559E7F76E627: mov r8,r9 559E7F76E62A: sub r12,$08 swap 2-0 559E7F76E637: mov [r12],r15 559E7F76E63B: sub r12,$10 559E7F76E63F: mov $08[r12],r8 rot 1-1 559E7F76944C: mov rdx,$08[r12] 559E7F769451: mov rax,$10[r12] 559E7F769456: mov $08[r12],r8 559E7F76945B: mov $10[r12],rdx 559E7F769460: mov r8,rax rot 3-3 559E7F76EDC0: mov rax,r8 559E7F76EDC3: mov r8,r15 559E7F76EDC6: mov r15,r9 ========== REMAINDER OF ARTICLE TRUNCATED ==========