Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: My 66000 and High word facility
Date: Mon, 12 Aug 2024 14:27:22 -0500
Organization: A noiseless patient Spider
Lines: 327
Message-ID: <v9dnmv$3efnj$1@dont-email.me>
References: <v98asi$rulo$1@dont-email.me>
 <38055f09c5d32ab77b9e3f1c7b979fb4@www.novabbs.org>
 <v991kh$vu8g$1@dont-email.me> <2024Aug11.163333@mips.complang.tuwien.ac.at>
 <v9ath5$2qgnb$1@dont-email.me> <2024Aug12.082936@mips.complang.tuwien.ac.at>
 <130df049c4c97984986767736b5b037a@www.novabbs.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Mon, 12 Aug 2024 21:27:28 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="fdd5012966402c1d824c86f16771acca";
	logging-data="3620595"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1+v55FKqq9dEMrFVMWRZK7Hirx6BuF3U9E="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:42hCswvVtoR/d5cozArM6I+4QSs=
In-Reply-To: <130df049c4c97984986767736b5b037a@www.novabbs.org>
Content-Language: en-US
Bytes: 10511

On 8/12/2024 12:36 PM, MitchAlsup1 wrote:
> On Mon, 12 Aug 2024 6:29:36 +0000, Anton Ertl wrote:
> 
>> Brett <ggtgp@yahoo.com> writes:
>>> Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
>>>> Brett <ggtgp@yahoo.com> writes:
>>>>> The lack of CPUs with 64 registers is what makes for a market; that 4%
>>>>> that could benefit have no options to pick from.
>>>>
>>>> They had:
>>>>
>>>> SPARC: Ok, only 32 GPRs available at a time, but more in hardware
>>>> through the Window mechanism.
>>>>
>>>> AMD29K: IIRC a 128-register stack and 64 additional registers
>>>>
>>>> IA-64: 128 GPRs and 128 FPRs with register stack and rotating register
>>>> files to make good use of them.
>>>
>>> All antiques no longer available.
>>
>> SPARC is still available: <https://en.wikipedia.org/wiki/SPARC> says:
>>
>> |Fujitsu will also discontinue their SPARC production [...] end-of-sale
>> |in 2029, of UNIX servers and a year later for their mainframe.
>>
>> No word of when Oracle will discontinue (or has discontinued) sales,
>> but both companies introduced their last SPARC CPUs in 2017.
>>
>> In any case, my point still stands: these architectures were
>> available, and the large number of registers failed to give them a
>> decisive advantage.  Maybe it even gave them a decisive disadvantage:
>> AMD29K and IA-64 never had OoO implementations, and SPARC got them
>> only with the Fujitsu SPARC64 V in 2002 and the Oracle SPARC T4 in
>> 2011, years after Intel, MIPS, and HP switched to OoO in 1995/1996 and
>> Power and Alpha switched in 1998 (POWER3, 21264).
>>
>>>> Where is your 4% number coming from?
>>>
>>> The 4% number is poor memory and a guess.
>>> Here is an antique paper on the issue:
>>>
>>> https://www.eecs.umich.edu/techreports/cse/00/CSE-TR-434-00.pdf
>>
>> Interesting.  I only skimmed the paper, but I read a lot about
>> inlining and interprocedural register allocation.  SPARC's register
>> windows and AMD29K's and IA-64's register stacks were intended to be
>> useful for that, but somehow the other architectures did not suffer a
>> big-enough disadvantage to make them adopt one of these concepts, and
>> that's despite register windows/stacks working even for indirect calls
>> (e.g., method calls in the general case), where interprocedural
>> register allocation or inlining don't help.
>>
>> It seems to me that with OoO the cycle cost of spilling and refilling
>> on call boundaries was lowered: the spills can be delayed until the
>> computation is complete, and the refills can start early because the
>> stack pointer tends to be available early.
>>
>> And recent OoO CPUs even have zero-cycle store-to-load forwarding, so
>> even if the called function is short, the spilling and refilling
>> around it (if any) does not increase the latency of the value that's
>> spilled and refilled.  But that consideration is only relevant for
>> Intel APX; ARM A64 and RISC-V went for 32 registers several years
>> before zero-cycle store-to-load forwarding was implemented.
>>
>> One other optimization that they use the additional registers for is
>> "register promotion", i.e., putting values from memory into registers
>> for a while (if absence of aliasing can be proven).  One interesting
>> aspect here is that register promotion with 64 or 256 registers (RP-64
>> and RP-256) is usually not much better (if better at all) than
>> register promotion with 32 registers (RP-32); see Figure 1.  So
>> register promotion does not make a strong case for more registers,
>> either, at least in this paper.
> 
> With full access to constants, there is even less need to promote
> addresses or immediates into registers as you can simply poof them
> up anytime you want one.


There are still tradeoffs, since constants need space to encode...

Inline is still better than a memory load, granted.

It may make sense to consolidate multiple uses of a value into a register 
rather than encoding it as an immediate each time.

....


For example, I was working on adding the code to display HDR pixels on 
the screen (which needs conversion to RGB555).

First attempt:
TKGDI_CopyPixelSpan_GetRGB24H:
	MOVU.L		(R4), R6
	PSHUF.W		R5, 0x00, R20	//word shuffle
	PSHUF.W		R5, 0x55, R21

	PLDCM8UH	R6, R16		//FP8U to Binary16
	PMUL.H		R16, R20, R16	//Scale
	PADD.H		R16, R21, R16	//Bias

	MOV		0x3C003C003C003C00, R17	// 4x 1.0
	PADD.H		R16, R17, R18	// Map to 1.0 .. 1.999

	TSTQ		0x0000C000C000C000, R18
	BF		.L1
	.L0:

	MOV		0xFFFF000000000000, R7	//alpha ones
	PCVTH2UW	R18, R19	//Convert to packed word
	OR		R19, R7, R5	//Set alpha all ones
	RGB5PCK64	R5, R2		//convert to RGB555

	RTS

	.L1:

	TSTQ		0x00000000C000, R18
	AND?F		0xFFFFFFFF0000, R18
	OR?F		0x000000003BFF, R18

	TSTQ		0x0000C0000000, R18
	AND?F		0xFFFF0000FFFF, R18
	OR?F		0x00003BFF0000, R18

	TSTQ		0xC00000000000, R18
	AND?F		0x0000FFFFFFFF, R18
	OR?F		0x3BFF00000000, R18

	BRA		.L0

This was valid ASM in my case, but the constants are still bulky.

The unsigned SIMD convert works over the 1.0 to 1.999 range or so. The 
RGB555 converter needs alpha set so that it knows the pixel is opaque 
(otherwise, it may try to use the alpha encoding and reduce color fidelity).

For now, it assumes opaque images for screen and window framebuffers.
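
As a rough illustration of the 1.0 to 1.999 range (a sketch in Python, not the actual converter): in binary16, every value in [1.0, 2.0) has a bit pattern in 0x3C00..0x3FFF, so the low 10 bits act as a plain fixed-point fraction, and a result of 2.0 or more, or a negative result, sets bit 14 or bit 15. The helper names here are made up for the example, and this only models the number format, not what PCVTH2UW does internally.

```python
import struct

ONE_BITS = 0x3C00     # binary16 bit pattern for 1.0
RANGE_MASK = 0xC000   # bit 15 (sign) and bit 14 (top exponent bit)

def half_bits(x: float) -> int:
    """Round x to IEEE binary16 and return its 16-bit pattern."""
    return struct.unpack("<H", struct.pack("<e", x))[0]

def bits_half(b: int) -> float:
    """Interpret a 16-bit pattern as an IEEE binary16 value."""
    return struct.unpack("<e", struct.pack("<H", b))[0]

def biased(x: float) -> int:
    """Model 'add 1.0' on a binary16 lane; returns the result's bit pattern."""
    return half_bits(bits_half(half_bits(x)) + 1.0)

def in_range(b: int) -> bool:
    """True if neither bit 15 nor bit 14 is set (same mask idea as the TSTQ)."""
    return (b & RANGE_MASK) == 0

def fraction(b: int) -> float:
    """Low 10 mantissa bits read as a fixed-point fraction in [0, 1)."""
    return (b & 0x03FF) / 1024.0

if __name__ == "__main__":
    for x in (0.0, 0.5, 0.75, 1.0, -1.5):
        b = biased(x)
        print(f"{x:6.2f} -> 0x{b:04X} in_range={in_range(b)}")
```

Note that a biased value between 0.0 and 1.0 has both bits clear, so this mask alone does not flag it; it only catches results of 2.0 or more and negative results.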



Then I noticed that I already had a few instructions for the purpose of 
range clamping, so it became:

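TKGDI_CopyPixelSpan_GetRGB24H:
	MOVU.L		(R4), R6
	PSHUF.W		R5, 0x00, R20	//word shuffle
	PSHUF.W		R5, 0x55, R21

	PLDCM8UH	R6, R16		//FP8U to Binary16
	MOV		0x3C003C003C003C00, R17	// 4x 1.0
	PMUL.H		R16, R20, R16	//Scale
	MOV		0x3FFF3FFF3FFF3FFF, R22	// 4x 1.999
	PADD.H		R16, R21, R16	//Bias

	PADD.H		R16, R17, R18	// Map to 1.0 .. 1.999

	PCMPGT.H	R22, R18	//lanes above 1.999?
	PCSELT.W	R22, R18, R18	//clamp those to 1.999

	PCMPGT.H	R18, R17	//lanes below 1.0?
	PCSELT.W	R17, R18, R18	//clamp those to 1.0

	MOV		0xFFFF000000000000, R7	//alpha ones
	PCVTH2UW	R18, R19	//Convert to packed word
	OR		R19, R7, R5	//Set alpha all ones
	RGB5PCK64	R5, R2		//convert to RGB555

	RTS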

Worked a little better...
But, still, not very fast.
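
For reference, the clamp step can be modeled per lane roughly as follows. This is only a sketch in Python with made-up helper names. It compares the 16-bit patterns as signed integers rather than as floats, which is an assumption about modeling and not a claim about how PCMPGT.H works; for non-NaN values it makes the same decisions against these two positive bounds.

```python
LO = 0x3C003C003C003C00   # 4x 1.0 in binary16
HI = 0x3FFF3FFF3FFF3FFF   # 4x 1.999 in binary16

def lanes(q: int) -> list:
    """Split a 64-bit value into four 16-bit lanes, lowest lane first."""
    return [(q >> (16 * i)) & 0xFFFF for i in range(4)]

def pack(ws: list) -> int:
    """Pack four 16-bit lanes (lowest lane first) into a 64-bit value."""
    q = 0
    for i, w in enumerate(ws):
        q |= (w & 0xFFFF) << (16 * i)
    return q

def as_signed(w: int) -> int:
    """Reinterpret a 16-bit pattern as a signed integer."""
    return w - 0x10000 if w & 0x8000 else w

def clamp_lanes(q: int, lo: int = LO, hi: int = HI) -> int:
    """Clamp each lane to [lo, hi], comparing the patterns as signed ints."""
    out = []
    for w, l, h in zip(lanes(q), lanes(lo), lanes(hi)):
        s = as_signed(w)
        if s > h:
            w = h        # too large: take the upper bound
        elif s < l:
            w = l        # too small or negative: take the lower bound
        out.append(w)
    return pack(out)

if __name__ == "__main__":
    q = pack([0x3E00, 0x4000, 0xB800, 0x3A00])  # 1.5, 2.0, -0.5, 0.75
    print(" ".join(f"0x{w:04X}" for w in lanes(clamp_lanes(q))))
```

Unlike the bit-mask test in the first attempt, this also catches lanes that end up between 0.0 and 1.0 after the bias.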

========== REMAINDER OF ARTICLE TRUNCATED ==========