
Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: "Mini" tags to reduce the number of op codes
Date: Wed, 10 Apr 2024 02:41:01 -0500
Organization: A noiseless patient Spider
Lines: 167
Message-ID: <uv5fqf$qs8a$1@dont-email.me>
References: <uuk100$inj$1@dont-email.me>
 <6mqu0j1jf5uabmm6r2cb2tqn6ng90mruvd@4ax.com>
 <15d1f26c4545f1dbae450b28e96e79bd@www.novabbs.org>
 <lf441jt9i2lv7olvnm9t7bml2ib19eh552@4ax.com> <uuv1ir$30htt$1@dont-email.me>
 <d71c59a1e0342d0d01f8ce7c0f449f9b@www.novabbs.org>
 <uv02dn$3b6ik$1@dont-email.me> <uv415n$ck2j$1@dont-email.me>
 <uv46rg$e4nb$1@dont-email.me>
 <a81256dbd4f121a9345b151b1280162f@www.novabbs.org>
 <uv4ghh$gfsv$1@dont-email.me> <uv56ec$ooj6$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Wed, 10 Apr 2024 07:41:04 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="adf62e5ff09325073b660a4ffaf2aa0c";
	logging-data="880906"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1+vDGnq4NyCa+j10Fl8rn+XdoHfd36L0gM="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:xc3CRBHath7ojB8TlAXH03gRI6w=
Content-Language: en-US
In-Reply-To: <uv56ec$ooj6$1@dont-email.me>
Bytes: 8110

On 4/10/2024 12:01 AM, Chris M. Thomasson wrote:
> On 4/9/2024 3:47 PM, BGB-Alt wrote:
>> On 4/9/2024 4:05 PM, MitchAlsup1 wrote:
>>> BGB wrote:
>>>
>>>> On 4/9/2024 1:24 PM, Thomas Koenig wrote:
>>>>> I wrote:
>>>>>
>>>>>> MitchAlsup1 <mitchalsup@aol.com> schrieb:
>>>>>>> Thomas Koenig wrote:
>>>>>>>
>>>>> Maybe one more thing: In order to justify the more complex encoding,
>>>>> I was going for 64 registers, and that didn't work out too well
>>>>> (missing bits).
>>>>>
>>>>> Having learned about M-Core in the meantime, a pure 32-register, 
>>>>> 21-bit instruction ISA might actually work better.
>>>
>>>
>>>> For 32-bit instructions at least, 64 GPRs can work out OK.
>>>
>>>> Though, the gain of 64 over 32 seems to be fairly small for most 
>>>> "typical" code, mostly bringing a benefit if one is spending a lot 
>>>> of CPU time in functions that have large numbers of local variables 
>>>> all being used at the same time.
>>>
>>>
>>>> Seemingly:
>>>> 16/32/48 bit instructions, with 32 GPRs, seems likely optimal for 
>>>> code density;
>>>> 32/64/96 bit instructions, with 64 GPRs, seems likely optimal for 
>>>> performance.
>>>
>>>> Where, 16 GPRs isn't really enough (lots of register spills), and 
>>>> 128 GPRs is wasteful (would likely need lots of monster functions 
>>>> with 250+ local variables to make effective use of this, *, which 
>>>> probably isn't going to happen).
>>>
>>> 16 GPRs would be "almost" enough if IP, SP, FP, TLS, and GOT were 
>>> not part of the GPRs AND one had good access to constants.
>>>
>>
>> On the main ISA's I had tried to generate code for, 16 GPRs was kind 
>> of a pain as it resulted in fairly high spill rates.
>>
>> Though, it would probably be less bad if the compiler was able to use 
>> all of the registers at the same time without stepping on itself (such 
>> as dealing with register allocation involving scratch registers while 
>> also not conflicting with the use of function arguments, ...).
>>
>>
>> My code generators had typically only used callee-save registers for 
>> variables in basic blocks which ended in a function call (in my 
>> compiler design, both function calls and branches terminate the 
>> current basic block).
>>
>> On SH, the main way of getting constants (larger than 8 bits) was via 
>> PC-relative memory loads, which kinda sucked.
>>
>>
>> This is slightly less bad on x86-64, since one can use memory operands 
>> with most instructions, and the CPU tends to deal fairly well with 
>> code that has lots of spill-and-fill. This along with instructions 
>> having access to 32-bit immediate values.
>>
>>
>>>> *: Where, it appears it is most efficient (for non-leaf functions) 
>>>> if the number of local variables is roughly twice that of the number 
>>>> of CPU registers. If more local variables than this, then spill/fill 
>>>> rate goes up significantly, and if less, then the registers aren't 
>>>> utilized as effectively.
>>>
>>>> Well, except in "tiny leaf" functions, where the criterion is 
>>>> instead that the number of local variables be less than the number 
>>>> of scratch registers. However, for many/most small leaf functions, 
>>>> the total number of variables isn't all that large either.
>>>
>>> The vast majority of leaf functions use fewer than 16 GPRs, given 
>>> one has an SP not part of GPRs {including arguments and return 
>>> values}. Once one starts placing things like memmove(), memset(), 
>>> sin(), cos(), exp(), log() in the ISA, it goes up even more.
>>>
>>
>> Yeah.
>>
>> Things like memcpy/memmove/memset/etc. are function calls in cases 
>> where they are not directly transformed into register load/store 
>> sequences.
>>
>> Did end up with an intermediate "memcpy slide", which can handle 
>> medium-size memcpy and memset style operations by branching into a slide.
>>
>>
>>
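The "memcpy slide" above can be illustrated in C as a switch whose cases fall through, so that entering at case N performs exactly N word copies (a hypothetical sketch, not BGBCC's actual output; a compiler would emit this as a computed branch into a run of load/store instructions):

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical "slide": entering the switch at case N falls through
 * all lower cases, performing exactly N word copies. Sizes beyond the
 * slide would be handled by a loop or a real memcpy call. */
void memcpy_slide8(uint64_t *dst, const uint64_t *src, size_t nwords)
{
    switch (nwords) {           /* branch into the slide */
    case 8: dst[7] = src[7];    /* fall through */
    case 7: dst[6] = src[6];    /* fall through */
    case 6: dst[5] = src[5];    /* fall through */
    case 5: dst[4] = src[4];    /* fall through */
    case 4: dst[3] = src[3];    /* fall through */
    case 3: dst[2] = src[2];    /* fall through */
    case 2: dst[1] = src[1];    /* fall through */
    case 1: dst[0] = src[0];    /* fall through */
    case 0: break;
    }
}
```

The win is that one branch replaces a size-check loop for small/medium copies.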
>> As noted, on a 32 GPR machine, most leaf functions can fit entirely in 
>> scratch registers. On a 64 GPR machine, this percentage is slightly 
>> higher (but, not significantly, since there are few leaf functions 
>> remaining at this point).
>>
>>
>> With a 16 GPR machine and only 6 usable scratch registers, it is a 
>> little harder though (as typically these need to cover both any 
>> variables used by the function and any temporaries used, ...). There 
>> are a whole lot more leaf functions that exceed a limit of 6 than of 14.
>>
>> But, say, a 32 GPR machine could still do well here.
>>
>>
>> Note that there are reasons why I don't claim 64 GPRs as a large 
>> performance advantage:
>> On programs like Doom, the difference is small at best.
>>
>>
>> It mostly affects things like GLQuake in my case, mostly because 
>> TKRA-GL has a lot of functions with large numbers of local variables 
>> (some exceeding 100 local variables).
>>
>> Partly, though, this is because highly inlined and unrolled code 
>> that uses lots of variables tends to perform better in my case (and 
>> tightly looping code, with lots of small functions, not so much...).
>>
>>
>>>
>>>> Where, function categories:
>>>>    Tiny Leaf:
>>>>      Everything fits in scratch registers, no stack frame, no calls.
>>>>    Leaf:
>>>>      No function calls (either explicit or implicit);
>>>>      Will have a stack frame.
>>>>    Non-Leaf:
>>>>      May call functions, has a stack frame.
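As illustrative C (hypothetical examples; whether a given function actually needs a frame depends on the ABI and register pressure):

```c
/* Tiny leaf: everything fits in scratch registers, no frame, no calls. */
int add3(int a, int b, int c) { return a + b + c; }

/* Leaf: no calls, but (on a register-poor target) enough live values
 * to need spill slots, hence a stack frame. */
int sum16(const int *p)
{
    int s = 0;
    for (int i = 0; i < 16; i++)
        s += p[i];
    return s;
}

/* Non-leaf: makes calls, so it needs a frame to hold the return
 * address and any callee-save registers live across the calls. */
int sum_twice(const int *p)
{
    return sum16(p) + sum16(p);
}
```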
>>>
>>> You are forgetting about FP, GOT, TLS, and whatever resources are 
>>> required
>>> to do try-throw-catch stuff as demanded by the source language.
>>>
>>
>> Yeah, possibly true.
>>
>> In my case:
>>    There is no frame pointer, as BGBCC doesn't use one;
>>      All stack-frames are fixed size, VLA's and alloca use the heap;
>>    GOT, N/A in my ABI (stuff is GBR relative, but GBR is not a GPR);
>>    TLS, accessed via TBR.[...]
> 
> alloca using the heap? Strange to me...
> 

Well, in this case:
The alloca calls are turned into runtime calls which allocate the memory 
blob and add it to a linked list; when the function returns, everything 
in the linked list is freed. Internally, this is handled via malloc and 
free.
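A minimal sketch of that transform in C (hypothetical helper names; BGBCC's actual runtime calls will differ, and error checking is omitted):

```c
#include <stdlib.h>

/* Per-function linked list of heap blocks standing in for alloca. */
typedef struct AllocaNode {
    struct AllocaNode *next;
} AllocaNode;

/* What "p = alloca(n);" is rewritten into: allocate header + payload,
 * push the block onto the function's list, return the payload. */
static void *alloca_heap(AllocaNode **list, size_t n)
{
    AllocaNode *node = malloc(sizeof(AllocaNode) + n);
    node->next = *list;
    *list = node;
    return node + 1;    /* payload starts just past the header */
}

/* Emitted on every return path: free everything on the list. */
static void alloca_free_all(AllocaNode *list)
{
    while (list) {
        AllocaNode *next = list->next;
        free(list);
        list = next;
    }
}
```

The compiler gives each function that uses alloca a hidden list-head local, and expands every return into a call to the free-all helper first.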

Also the typical default stack size in this case is 128K, so trying to 
put big allocations on the stack is more liable to result in a stack 
overflow.

A bigger stack needs more memory, so it is not ideal for NOMMU use. 
Luckily, heap allocation is not too slow in this case.


Though, at the same time, ideally one limits use of language features 
where the code-generation degenerates into a mess of hidden runtime 
calls. These cases are not ideal for performance...