Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: "Mini" tags to reduce the number of op codes
Date: Wed, 10 Apr 2024 22:21:33 -0500
Organization: A noiseless patient Spider
Lines: 137
Message-ID: <uv7l00$1fc2u$2@dont-email.me>
References: <uuk100$inj$1@dont-email.me>
 <6mqu0j1jf5uabmm6r2cb2tqn6ng90mruvd@4ax.com>
 <15d1f26c4545f1dbae450b28e96e79bd@www.novabbs.org>
 <lf441jt9i2lv7olvnm9t7bml2ib19eh552@4ax.com>
 <uuv1ir$30htt$1@dont-email.me>
 <d71c59a1e0342d0d01f8ce7c0f449f9b@www.novabbs.org>
 <uv02dn$3b6ik$1@dont-email.me>
 <uv415n$ck2j$1@dont-email.me>
 <uv46rg$e4nb$1@dont-email.me>
 <a81256dbd4f121a9345b151b1280162f@www.novabbs.org>
 <uv4ghh$gfsv$1@dont-email.me>
 <8e61b7c856aff15374ab3cc55956be9d@www.novabbs.org>
 <uv7h9k$1ek3q$1@dont-email.me>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Thu, 11 Apr 2024 05:21:36 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="059e35bc5e274e101eeeb06f16103042";
 logging-data="1552478"; mail-complaints-to="abuse@eternal-september.org";
 posting-account="U2FsdGVkX19y7LfSOloJZ6oHCUo6Vbjn5bhFepfZDUw="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:4JT0sjweHF+HrsfwzcOmN1+QaVs=
In-Reply-To: <uv7h9k$1ek3q$1@dont-email.me>
Content-Language: en-US
Bytes: 7148

On 4/10/2024 9:18 PM, Paul A. Clayton wrote:
> On 4/9/24 8:28 PM, MitchAlsup1 wrote:
>> BGB-Alt wrote:
> [snip]
>>> Things like memcpy/memmove/memset/etc, are function calls in cases
>>> when not directly transformed into register load/store sequences.
>>
>> My 66000 does not convert them into LD-ST sequences, MM is a single
>> instruction.
>
> I wonder if it would be useful to have an immediate count form of
> memory move. Copying fixed-size structures would be able to use an
> immediate.
> Aside from not having to load an immediate for such
> cases, there might be microarchitectural benefits to using a
> constant. Since fixed-sized copies would likely be limited to
> smaller regions (with the possible exception of 8 MiB page copies)
> and the overhead of loading a constant for large sizes would be
> tiny, only providing a 16-bit immediate form might be reasonable.
>

As noted, in my case, the whole thing of Ld/St sequences, and memcpy
slides, mostly applies to constant cases.

If the copy size is variable, the compiler merely calls "memcpy()",
which will then generally figure out which loop to use, and one has to
pay the penalty of the runtime overhead of memcpy needing to figure out
what it needs to do.

>>> Did end up with an intermediate "memcpy slide", which can handle
>>> medium size memcpy and memset style operations by branching into a
>>> slide.
>>
>> MMs and MSs that do not cross page boundaries are ATOMIC. The entire
>> system sees only the before or only the after state and nothing in
>> between.
>
> I still feel that this atomicity should somehow be included with
> ESM just because they feel related, but the benefit seems likely
> to be extremely small. How often would software want to copy
> multiple regions atomically or combine region copying with
> ordinary ESM atomicity?? There *might* be some use for an atomic
> region copy and an updating of a separate data structure (moving a
> structure and updating one or a very few pointers??). For
> structures three cache lines in size where only one region
> occupies four cache lines, ordinary ESM could be used.
>
> My feeling based on "relatedness" is not a strong basis for such
> an architectural design choice.
>
> (Simple page masking would allow false conflicts when smaller
> memory moves are used.
> If there is a separate pair of range
> registers that is checked for coherence of memory moves, this
> issue would only apply for multiple memory moves _and_ all eight
> of the buffer entries could be used for smaller accesses.)
>

All seems a bit complicated to me.

But, as noted, I went for a model of weak memory coherence and leaving
most of this stuff for software to sort out.

> [snip]
>>> As noted, on a 32 GPR machine, most leaf functions can fit entirely
>>> in scratch registers.
>>
>> Which is why one can blow GPRs for SP, FP, GOT, TLS, ... without
>> getting totally screwed.
>
> I wonder how many instructions would have to have access to such a
> set of "special registers" and if a larger number of extra
> registers would be useful. (One of the issues, in my opinion,
> with PowerPC's link register and count register was that they
> could not be directly loaded from or stored to memory [or loaded
> with a constant from the instruction stream]. For counted loops,
> loading the count register from the instruction stream would
> presumably have allowed early branch determination even for deep
> pipelines and small loop counts.) SP, FP, GOT, and TLS hold
> "stable values", which might facilitate some microarchitectural
> optimizations compared to more frequently modified register names.
>
> (I am intrigued by the possibility of small contexts for some
> multithreaded workloads, similar to how some GPUs allow variable
> context sizes.)

In my case, yeah, there are two semi-separate register spaces here:
   GPRs: R0..R63
     R0, R1, and R15 are special
     R0/DLR: Hard-coded register for some instructions;
       Assembler may stomp without warning for pseudo-instructions.
     R1/DHR: Was originally intended similar to DLR;
       Now mostly used as an auxiliary link register.
     R15/SP: Stack Pointer.
   CRs: C0..C63
     Various special purpose registers;
     Most are privileged only.
     LR, GBR, etc, are in CR space.
Though, internally, GPRs and CRs both exist within a combined register
space in the CPU:
   00..3F: Mostly GPR space
   40..7F: CR and SPR space.

Generally, CRs may only be accessed by certain register ports though.

By default, the only way to save/restore CRs is by shuffling them
through GPRs. There is an optional MOV.C instruction for this, but
generally it is not enabled as it isn't clear that it saves enough to be
worth the added LUT cost.

There is a subset version, where MOV.C exists, but is only really able
to be used with LR and GBR and similar. Generally, this version exists
as RISC-V Mode needs to be able to save/restore these registers (they
exist in the GPR space in RISC-V).

As I can note, if I did a new ISA, most likely the register assignment
scheme would differ, say:
   R0: ZR / PC
   R1: LR / TP (TBR)
   R2: SP
   R3: GP (GBR)

Where the interpretation of R0 and R1 would depend on context (ZR and LR
for most instructions, PC and TP when used as a Ld/St base address).

Though, some ideas had involved otherwise keeping a similar register
space layout to my existing ABI, mostly because significant ABI changes
would not be easy for my compiler as-is.
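
For illustration only, here is a rough C model of the two ideas above: the combined internal register space (00..3F for GPRs, 40..7F for CRs/SPRs) and the context-dependent reading of R0 and R1 in the hypothetical new ISA. This is a sketch under assumptions, not the actual core or compiler: every identifier here (reg_space_of, gpr_index, cr_index, reg_name, the enums) is made up for the example, and treating bit 6 as a clean GPR/CR split simplifies the "mostly GPR" wording above.

```c
#include <assert.h>
#include <string.h>

/* Assumed model: a 7-bit internal register index.
 * 0x00..0x3F is taken as GPR space, 0x40..0x7F as CR/SPR space. */
enum reg_space { SPACE_GPR, SPACE_CR };

static enum reg_space reg_space_of(unsigned idx)
{
    return (idx & 0x40) ? SPACE_CR : SPACE_GPR;
}

/* Map an architectural Rn or Cn number (0..63) to the internal index. */
static unsigned gpr_index(unsigned n) { return n & 0x3F; }
static unsigned cr_index(unsigned n)  { return 0x40 | (n & 0x3F); }

/* Hypothetical new-ISA scheme: R0 and R1 change meaning when used
 * as the base register of a load/store. */
enum reg_use { USE_NORMAL, USE_LDST_BASE };

static const char *reg_name(unsigned n, enum reg_use use)
{
    switch (n) {
    case 0:  return (use == USE_LDST_BASE) ? "PC" : "ZR";
    case 1:  return (use == USE_LDST_BASE) ? "TP" : "LR";
    case 2:  return "SP";
    case 3:  return "GP";
    default: return "GPR"; /* ordinary general-purpose register */
    }
}

/* Small self-check that exercises the model. */
static void check_register_model(void)
{
    assert(reg_space_of(gpr_index(15)) == SPACE_GPR); /* R15/SP */
    assert(reg_space_of(cr_index(0)) == SPACE_CR);    /* C0 */
    assert(strcmp(reg_name(0, USE_NORMAL), "ZR") == 0);
    assert(strcmp(reg_name(0, USE_LDST_BASE), "PC") == 0);
    assert(strcmp(reg_name(1, USE_LDST_BASE), "TP") == 0);
}
```

Calling check_register_model() from a main() runs the checks. The point of the second half is that the same 5- or 6-bit register field can name four "special" values in two slots, at the cost of the decoder needing to know whether the field is a Ld/St base.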