Path: ...!eternal-september.org!feeder2.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Interpreters and indirect-branch prediction (was: Reverse ...)
Date: Thu, 14 Nov 2024 08:35:41 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 261
Message-ID: <2024Nov14.093541@mips.complang.tuwien.ac.at>
References: <8928500a87002966d6282465c037003e@www.novabbs.org> <18bfd4ce1048f68ad4d39d972c459e99@www.novabbs.org>
Injection-Date: Thu, 14 Nov 2024 11:34:06 +0100 (CET)
Injection-Info: dont-email.me; posting-host="27a9388c3b9f84a9c241ad944e183c04"; logging-data="2916060"; mail-complaints-to="abuse@eternal-september.org"; posting-account="U2FsdGVkX1/JDHuz69sgVt6f64jsplTY"
Cancel-Lock: sha1:SZppZSW7RNeKtqHoWBhSNs11DJY=
X-newsreader: xrn 10.11
Bytes: 13704

Thomas Koenig writes:
[Why indirect-branch predictors work in "bytecode" interpreters]
>To me, that looks like the bytecode is insufficiently expressive, if
>an often-repeating sequence occurs :-)

As already mentioned, what a hardware branch predictor works on is the
dynamic sequence of branches.  This might come from a straight-line
sequence of VM instructions that occurs only once, in one program, and
nowhere else, but is executed repeatedly; e.g., because it is in a loop
(as mentioned earlier), or in a subroutine that's called from several
places.  If that one program is not among the programs that the VM
designer used for determining frequent code sequences that should be
turned into static VM superinstructions, on what basis should a VM
superinstruction for that sequence be added?  Even if that program is
among those programs, should this VM superinstruction be added?  It is
only useful in one program.

But the hardware branch prediction is also based on dynamic sequences
that are not statically sequential.  E.g., there might be a VM branch
(possibly conditional or indirect), a VM call, or a VM return in that
sequence.  The hardware branch predictor will use the repetition of the
sequence for prediction, but it is usually impractical to form static
VM superinstructions from such sequences.

And in any case, whatever VM superinstructions you use, at some point
every VM superinstruction ends with an indirect branch, and then you
need a good hardware indirect-branch predictor, or the performance will
suffer.  If you have a good indirect-branch predictor, superinstruction
boundaries do not matter for prediction: it predicts the indirect
branch at the end of every VM instruction implementation based on the
preceding branches and their targets.

>In the presence of a good predictor, it would still be an interesting
>question if a more compressed version would actually perform better.

There is a benefit to using VM superinstructions: the number of
instruction dispatches is reduced, and consequently less code has to be
executed; fewer branches are performed, too, so the instruction fetcher
can be more effective.  You can see the effect in the difference
between "--no-super" and "default" in the data below.
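To make concrete what "dispatch" and "superinstruction" mean here,
first a minimal sketch of a threaded-code interpreter, assuming GNU C
computed gotos (roughly the technique Gforth uses, but with a toy
instruction set I made up; this is not Gforth's engine).  Each VM
instruction implementation ends in NEXT, i.e., in an indirect branch
that the hardware predictor has to predict; the superinstruction
op_inc_double merges two implementations, so one dispatch disappears:

#include <stdio.h>

#define NEXT goto **ip++  /* dispatch: one indirect branch per VM instruction */

static void **labels;     /* label addresses, exported by the setup call below */

long run(void **prog, long x)
{
  /* GNU C labels-as-values: &&label is the address of a label */
  static void *table[] = { &&op_inc, &&op_double, &&op_inc_double, &&op_halt };
  void **ip = prog;       /* VM instruction pointer */
  long acc = x;           /* the toy VM's only register */

  if (prog == NULL) { labels = table; return 0; }  /* setup call */
  NEXT;

op_inc:        acc += 1;           NEXT;
op_double:     acc *= 2;           NEXT;
op_inc_double: acc += 1; acc *= 2; NEXT; /* superinstruction: one dispatch instead of two */
op_halt:       return acc;
}

int main(void)
{
  run(NULL, 0);                                      /* get the label addresses */
  void *p1[] = { labels[0], labels[1], labels[3] };  /* inc double halt */
  void *p2[] = { labels[2], labels[3] };             /* inc_double halt */
  printf("%ld %ld\n", run(p1, 3), run(p2, 3));       /* both print 8 */
  return 0;
}

Concatenating the code at op_inc and op_double (minus their NEXTs) at
run time is what the dynamic superinstructions discussed below do;
op_inc_double stands for a static superinstruction compiled into the
engine.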
Here's that data extracted for your convenience; results on Haswell:

|Run time in seconds:
|
|--no-
|super   default
|2.276   1.468   benchgc
|1.064   0.544   brainless
|2.288   1.520   brew
|2.232   1.232   cd16sim
|1.484   0.864   fcp
|9.280   7.840   lexex
|
|For mispredictions the picture was as follows:
|
|     --no-
|     super      default
|12,584,698    9,426,835  benchgc
|51,375,412   18,391,920  brainless
|47,928,843   21,390,209  brew
|70,879,605   22,105,620  cd16sim
|33,496,775   17,280,944  fcp
| 8,269,104    6,885,712  lexex

"default" here is dynamic superinstructions, i.e., every straight-line
sequence of code (even across conditional branches) is turned into a VM
superinstruction; so the only indirect branches are for taken VM-level
branches.

As you can see, the speedup from dynamic superinstructions is quite
nice, and only a little of it is due to better branch prediction.
E.g., in the cd16sim case (the one with the largest reduction in
mispredictions), the mispredictions account for about
((71-22)M mispredictions)*(20 cycles/misprediction)/(4400M cycles/s)
= 0.22s out of the 1s run-time difference, so about 0.78s are due to
the reduction in dispatch code.

These dynamic superinstructions are formed by concatenating the code
snippets coming from the individual VM instructions, so eliminating
dispatch code is their only benefit.  In recent work [ertl&paysan24]
we have also reduced VM-instruction-pointer updates in Gforth, and
found speedups by a factor >2 on loop-dominated benchmarks on recent
high-performance microarchitectures; our explanation is that the VM
instruction pointer updates are the critical dependence path in
executing these benchmarks unless we use our reduction techniques.

In the early 2000s we looked at using static VM superinstructions.  At
the time the main effect was the reduction in mispredicted indirect
branches, but they can also have the effect of reducing other
instructions.  E.g., in Gforth (engine gforth-fast) the VM instruction
F@ has the following code snippet (Intel-style syntax):

560BC7692D89: add    r13,$08
560BC7692D8D: movsd  [r12],xmm15
560BC7692D93: movsd  xmm15,[r8]
560BC7692D98: sub    r12,$08
560BC7692D9C: mov    r8,$00[r13]

and the VM instruction F+ has the following code snippet:

560BC7692E3B: mov    rax,r12
560BC7692E3E: lea    r12,$08[r12]
560BC7692E43: addsd  xmm15,$08[rax]

Register assignments:

r13:   data-stack pointer
r8:    top of data stack
r12:   FP-stack pointer
xmm15: top of FP stack

(Don't ask me why gcc introduced the mov at 560BC7692E3B rather than
doing the lea afterwards or using the result of the lea in the addsd.)

Gforth has a static VM superinstruction for the sequence "F@ F+".  Its
code is:

7FC65EA14E20: add    r13,$08
7FC65EA14E24: addsd  xmm15,[r8]
7FC65EA14E29: mov    r8,$00[r13]

In this case, the static VM superinstruction avoided a push to and pop
from the FP stack, and the associated code (the FP-stack pointer
updates, and storing and loading the top of the FP stack to and from
memory).

We first implemented static superinstructions and tried up to 1600 of
them [ertl&gregg03], mainly for improved branch prediction on the CPUs
of the time (which used the BTB for indirect-branch prediction).  Once
dynamic superinstructions (with replication) solved the
branch-prediction problem, the remaining benefit of static
superinstructions was the one shown above, and it helps only for some
sequences, not for others.  In Gforth 0.7 we have 13 static
superinstructions; in the current development version we have 41.
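To connect the F@ and F+ snippets above back to the source level, here
is a hedged C sketch of the same stack effects (the variable names are
my invention and this is not actual Gforth source, which is generated
from a separate primitives description).  The engine caches the top of
the data stack and the top of the FP stack in variables (r8 and xmm15
above); in the combined F@F+, the FP-stack push and pop cancel out:

#include <stdio.h>
#include <stdint.h>

typedef intptr_t cell;

cell   dstack[16], *sp = dstack + 16, tos;   /* data stack; pop = increment, as above */
double fstack[16], *fp = fstack + 16, ftos;  /* FP stack; same convention */

void f_fetch(void)       /* F@ ( addr -- ) ( F: -- r ) */
{
  double r = *(double *)tos;  /* fetch from the address in the data-stack TOS */
  tos = *sp++;                /* pop data stack    (add r13,$08; mov r8,$00[r13]) */
  *--fp = ftos;               /* spill old FP TOS  (movsd [r12],xmm15; sub r12,$08) */
  ftos = r;                   /*                   (movsd xmm15,[r8]) */
}

void f_plus(void)        /* F+ ( F: r1 r2 -- r3 ) */
{
  ftos += *fp++;              /* pop one FP item and add  (lea r12,$08[r12]; addsd) */
}

void f_fetch_plus(void)  /* static superinstruction "F@ F+" */
{
  ftos += *(double *)tos;     /* no FP-stack traffic left  (addsd xmm15,[r8]) */
  tos = *sp++;                /* pop data stack    (add r13,$08; mov r8,$00[r13]) */
}

int main(void)
{
  static double x = 2.5;
  *--sp = 0; tos = (cell)&x; ftos = 1.0;   /* FP TOS 1.0, data-stack TOS &x */
  f_fetch_plus();                          /* same effect as f_fetch(); f_plus(); */
  printf("%g\n", ftos);                    /* prints 3.5 */
  return 0;
}

Writing F@ and F+ separately forces the old FP TOS through memory and
back; the combined primitive keeps everything in registers, which is
what the three-instruction sequence shown above reflects.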
Among the reasons to be stingy with adding static superinstructions
are:

* The more VM instruction implementations we add, the longer the
  compile time of the VM becomes.

* The effect of static superinstructions in addition to dynamic
  superinstructions is quite small (see "with static super" in Figure
  10 of [ertl&gregg03]); and the first few provide most of the benefit
  (see below).

* At one point we had 400, but the engine crashed (probably not with
  all gcc versions, or we would not have been able to produce numbers
  for up to 1600).

So we switched to having static superinstructions only for the most
frequent sequences where we also expect a benefit like the one shown
above.

Looking at our 2003 paper [ertl&gregg03] again, I also see an answer
to your first question: Look at Figure 13.  There you see the effect
of static superinstructions and static replicas of VM instructions.
Static replicas are just that: instead of + you now have +A +B +C, and
you use them round-robin whenever an instance of + occurs in the VM
program; the idea here was just to improve branch prediction on
microarchitectures with BTB-based indirect-branch prediction.

It turns out that, in the balance between static superinstructions and
static replicas, a relatively small number of static superinstructions
(e.g., about 200 out of a total of 1600 additional VM instruction
implementations, or 35 out of a total of 400) provides a benefit, but
increasing the number of superinstructions further results in a
slowdown.  My explanation is that the additional superinstructions
reduce the number of replicas, but help in only a few places, whereas
the additional replicas help in a lot of places.  Conversely, there
are a lot of sequences that occur throughout the code that are not
covered by 1600 static superinstructions or fewer.

So how many static superinstructions would you want to add before you
consider the VM to be "sufficiently expressive"?

- anton

@InProceedings{ertl&paysan24,
  author    = {Ertl, M. Anton and Paysan, Bernd},
  title     = {{The Performance Effects of Virtual-Machine Instruction Pointer Updates}},
  booktitle = {38th European Conference on Object-Oriented Programming (ECOOP 2024)},

========== REMAINDER OF ARTICLE TRUNCATED ==========