From: "Chris M. Thomasson" <chris.m.thomasson.1@gmail.com>
Newsgroups: comp.arch
Subject: Re: "Mini" tags to reduce the number of op codes
Date: Tue, 9 Apr 2024 22:01:00 -0700
Organization: A noiseless patient Spider
Message-ID: <uv56ec$ooj6$1@dont-email.me>
In-Reply-To: <uv4ghh$gfsv$1@dont-email.me>

On 4/9/2024 3:47 PM, BGB-Alt wrote:
> On 4/9/2024 4:05 PM, MitchAlsup1 wrote:
>> BGB wrote:
>>
>>> On 4/9/2024 1:24 PM, Thomas Koenig wrote:
>>>> I wrote:
>>>>
>>>>> MitchAlsup1 <mitchalsup@aol.com> schrieb:
>>>>>> Thomas Koenig wrote:
>>>>>>
>>>> Maybe one more thing: In order to justify the more complex encoding,
>>>> I was going for 64 registers, and that didn't work out too well
>>>> (missing bits).
>>>>
>>>> Having learned about M-Core in the meantime, a pure 32-register,
>>>> 21-bit instruction ISA might actually work better.
>>
>>
>>> For 32-bit instructions at least, 64 GPRs can work out OK.
>>
>>> Though, the gain of 64 over 32 seems to be fairly small for most
>>> "typical" code, mostly bringing a benefit if one is spending a lot of
>>> CPU time in functions that have large numbers of local variables all
>>> being used at the same time.
>>
>>
>>> Seemingly:
>>>   16/32/48 bit instructions, with 32 GPRs, seems likely optimal for
>>> code density;
>>>   32/64/96 bit instructions, with 64 GPRs, seems likely optimal for
>>> performance.
>>
>>> Where, 16 GPRs isn't really enough (lots of register spills), and 128
>>> GPRs is wasteful (would likely need lots of monster functions with
>>> 250+ local variables to make effective use of this, *, which probably
>>> isn't going to happen).
>>
>> 16 GPRs would be "almost" enough if IP, SP, FP, TLS, GOT were not part
>> of GPRs AND you have good access to constants.
>>
>
> On the main ISAs I had tried to generate code for, 16 GPRs was kind of
> a pain, as it resulted in fairly high spill rates.
>
> Though, it would probably be less bad if the compiler were able to use
> all of the registers at the same time without stepping on itself (such
> as dealing with register allocation involving scratch registers while
> also not conflicting with the use of function arguments, ...).
>
>
> My code generators had typically only used callee-save registers for
> variables in basic blocks which ended in a function call (in my compiler
> design, both function calls and branches terminate the current
> basic-block).
>
> On SH, the main way of getting constants (larger than 8 bits) was via
> PC-relative memory loads, which kinda sucked.
>
>
> This is slightly less bad on x86-64, since one can use memory operands
> with most instructions, and the CPU tends to deal fairly well with code
> that has lots of spill-and-fill; this, along with most instructions
> having access to 32-bit immediate values.
>
>
>>> *: Where, it appears it is most efficient (for non-leaf functions) if
>>> the number of local variables is roughly twice the number of CPU
>>> registers. If there are more local variables than this, the spill/fill
>>> rate goes up significantly; if fewer, the registers aren't utilized
>>> as effectively.
>>
>>> Well, except in "tiny leaf" functions, where the criterion is instead
>>> that the number of local variables be less than the number of scratch
>>> registers. However, for many/most small leaf functions, the total
>>> number of variables isn't all that large either.
>>
>> The vast majority of leaf functions use fewer than 16 GPRs, given one
>> has a SP not part of GPRs {including arguments and return values}. Once
>> one starts placing things like memmove(), memset(), sin(), cos(),
>> exp(), log() in the ISA, it goes up even more.
>>
>
> Yeah.
>
> Things like memcpy/memmove/memset/etc. are function calls in cases when
> not directly transformed into register load/store sequences.
>
> Did end up with an intermediate "memcpy slide", which can handle
> medium-size memcpy and memset style operations by branching into a
> slide.
>
>
>
> As noted, on a 32 GPR machine, most leaf functions can fit entirely in
> scratch registers. On a 64 GPR machine, this percentage is slightly
> higher (but not significantly, since there are few leaf functions
> remaining at this point).
>
>
> If one had a 16 GPR machine with 6 usable scratch registers, it is a
> little harder though (as typically these need to cover both any
> variables used by the function and any temporaries used, ...). There
> are a whole lot more leaf functions that exceed a limit of 6 than of 14.
>
> But, say, a 32 GPR machine could still do well here.
>
>
> Note that there are reasons why I don't claim 64 GPRs as a large
> performance advantage:
> On programs like Doom, the difference is small at best.
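(Aside: the "memcpy slide" above can be pictured as a single unrolled run of load/store pairs where the copy length selects the entry point, so one slide covers every size up to its limit. Below is a rough C analogue using switch fallthrough, in the spirit of Duff's device; the actual implementation presumably branches into a run of machine instructions, and the name and word size here are illustrative only.)

```c
#include <stddef.h>
#include <stdint.h>

/* C analogue of a "memcpy slide": one entry point handles any copy of
 * up to 8 words by jumping partway into an unrolled sequence of
 * load/store pairs (here via switch fallthrough). A real slide would
 * be a run of instructions entered at a computed branch target, and
 * could cover memset similarly with stores of a constant. */
static void memcpy_slide8(uint32_t *dst, const uint32_t *src, size_t nwords)
{
    switch (nwords) {          /* jump into the slide... */
    case 8: dst[7] = src[7];   /* ...and fall through to the end */
    case 7: dst[6] = src[6];
    case 6: dst[5] = src[5];
    case 5: dst[4] = src[4];
    case 4: dst[3] = src[3];
    case 3: dst[2] = src[2];
    case 2: dst[1] = src[1];
    case 1: dst[0] = src[0];
    case 0: break;
    }
}
```

The appeal is that small-to-medium copies avoid both a call into a general memcpy and a per-word loop: the branch into the slide is the only control flow.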
>
>
> It mostly affects things like GLQuake in my case, mostly because TKRA-GL
> has a lot of functions with large numbers of local variables (some
> exceeding 100 local variables).
>
> Partly, though, this is due to code that is highly inlined and unrolled
> and uses lots of variables tending to perform better in my case (and
> tightly looping code, with lots of small functions, not so much...).
>
>
>>
>>> Where, function categories:
>>>   Tiny Leaf:
>>>     Everything fits in scratch registers, no stack frame, no calls.
>>>   Leaf:
>>>     No function calls (either explicit or implicit);
>>>     Will have a stack frame.
>>>   Non-Leaf:
>>>     May call functions, has a stack frame.
>>
>> You are forgetting about FP, GOT, TLS, and whatever resources are
>> required to do try-throw-catch stuff as demanded by the source
>> language.
>>
>
> Yeah, possibly true.
>
> In my case:
>   There is no frame pointer, as BGBCC doesn't use one;
>   All stack frames are fixed size; VLAs and alloca use the heap;
>   GOT: N/A in my ABI (stuff is GBR-relative, but GBR is not a GPR);
>   TLS: accessed via TBR.
[...]

alloca using the heap? Strange to me...