From: mitchalsup@aol.com (MitchAlsup1)
Newsgroups: comp.arch
Subject: Re: "Mini" tags to reduce the number of op codes
Date: Wed, 10 Apr 2024 21:19:20 +0000
Organization: Rocksolid Light
Message-ID: <9fb548d5b81e65bf1ececd070d8085c9@www.novabbs.org>

BGB-Alt wrote:
> On 4/10/2024 12:12 PM, MitchAlsup1 wrote:
>> BGB wrote:
>>
>>> On 4/9/2024 7:28 PM, MitchAlsup1 wrote:
>>>> BGB-Alt wrote:
>>>>
>>
>>> Also the blob of constants needed to be within 512 bytes of the load
>>> instruction, which was also kind of an evil mess for branch handling
>>> (and extra bad if one needed to spill the constants in the middle of a
>>> basic block and then branch over it).
>>
>> In My 66000's case, the constant is the word following the instruction.
>> Easy to find, easy to access, no register pollution, no DCache pollution.
>>
> Yeah.
> This was why some of the first things I did when I started extending
> SH-4 were:
>   Adding mechanisms to build constants inline;
>   Adding Load/Store ops with a displacement (albeit with encodings
>   borrowed from SH-2A);
>   Adding 3R and 3RI encodings (originally Imm8 for 3RI).

My suggestion is this:: "Now that you have screwed around for a while,
why not take that experience and do a new ISA without any of those
mistakes in it" ??

> Did have a mess when I later extended the ISA to 32 GPRs, as (like with
> BJX2 Baseline+XGPR) only part of the ISA had access to R16..R31.

>>> Usually they were spilled between basic-blocks, with the basic-block
>>> needing to branch to the following basic-block in these cases.
>>
>>> Also 8-bit branch displacements are kinda lame, ...
>>
>> Why do that to yourself ??
>>
> I didn't design SuperH, Hitachi did...

But you did not fix them en masse, and you complain about them at least
once a week. There comes a time when it takes less time and less courage
to do the big switch and clean up all that mess.

> But, with BJX1, I had added Disp16 branches.
> With BJX2, they were replaced with 20-bit branches. These have the merit
> of being able to branch anywhere within a Doom- or Quake-sized binary.

>>> And, if one wanted a 16-bit branch:
>>>     MOV.W (PC, 4), R0   //load a 16-bit branch displacement
>>>     BRA/F R0
>>> .L0:
>>>     NOP                 // delay slot
>>>     .WORD $(Label - .L0)
>>
>>> Also kinda bad...
>>
>> Can you say Yech !!
>>
> Yeah.
> This sort of stuff created a strong incentive for ISA redesign...

Maybe consider now as the appropriate time to start.

> Granted, had I started with RISC-V instead of SuperH, it is probable
> that BJX2 wouldn't exist.
> Though, at the time, the original thinking was that SuperH having
> smaller instructions meant it would have better code density than RV32I
> or similar. Turns out not really, as the penalty of the 16-bit ops was
> needing almost twice as many on average.
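For scale, the displacement arithmetic above can be made concrete. This is a hedged sketch: the 2-byte scaling granule is an assumption based on the 16-bit instruction units of the SH-derived encodings under discussion, not a confirmed detail of BJX2, and the helper names are mine.

```c
#include <stdint.h>

/* Reach of a signed branch-displacement field `bits` wide, scaled by a
 * `granule` of bytes.  Granule 2 is an assumption here, based on the
 * 16-bit instruction units of the SH-derived encodings discussed above. */
static int64_t disp_min(int bits, int granule)
{
    return -((int64_t)1 << (bits - 1)) * granule;
}

static int64_t disp_max(int bits, int granule)
{
    return (((int64_t)1 << (bits - 1)) - 1) * granule;
}
```

By this arithmetic an 8-bit displacement reaches only about -256..+254 bytes, while a 20-bit one reaches roughly +/-1 MiB, comfortably covering a Doom- or Quake-sized binary.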
My 66000 only requires 70% of the instruction count of RISC-V.
Yours could too ................

>>>>> Things like memcpy/memmove/memset/etc. are function calls in cases
>>>>> when not directly transformed into register load/store sequences.
>>>>
>>>> My 66000 does not convert them into LD-ST sequences; MM is a single
>>>> instruction.
>>>>
>>
>>> I have no high-level memory move/copy/set instructions.
>>> Only loads/stores...
>>
>> You have the power to fix it.........
>>
> But, at what cost...

You would not have to spend hours a week defending the indefensible !!

> I had generally avoided anything that would have required microcode or
> shoving state-machines into the pipeline or similar.

Things as simple as IDIV and FDIV require sequencers. But LDM, STM, and
MM require sequencers simpler than those for IDIV and FDIV !!

> Things like Load/Store-Multiple or

If you like polluted ICaches.............

>>> For small copies, can encode them inline, but past a certain size this
>>> becomes too bulky.
>>
>>> A copy loop makes more sense for bigger copies, but has a high
>>> overhead for small to medium copies.
>>
>>> So, there is a size range where doing it inline would be too bulky,
>>> but a loop carries an undesirable level of overhead.
>>
>> All the more reason to put it (a highly useful unit of work) into an
>> instruction.
>>
> This is an area where "slides" work well; the main cost is mostly the
> bulk that the slide adds to the binary (albeit, it is one-off).

Consider that the predictor, on getting into the slide the first time,
always mispredicts !!

> Which is why it is a 512B memcpy slide vs., say, a 4kB memcpy slide...

What if you only wanted to copy 63 bytes ?? Your DW slide fails
miserably, yet a HW sequencer only has to avoid asserting a single byte
write-enable once.

> For looping memcpy, it makes sense to copy 64 or 128 bytes per loop
> iteration or so to try to limit looping overhead.
On low-end machines, you want to operate at cache-port width; on
high-end machines, you want to operate at cache-line width per port.
This is essentially impossible using slides.....here, the same code is
not optimal across a line of implementations.

> Though, leveraging the memcpy slide for the interior part of the copy
> could be possible in theory as well.

What do you do when the SATA drive wants to write a whole page ??

> For LZ memcpy, it is typically smaller, as LZ copies tend to be a lot
> shorter (a big part of LZ decoder performance mostly being in
> fine-tuning the logic for the match copies).

> Though, this is part of why my runtime library had added
> "_memlzcpy(dst, src, len)" and "_memlzcpyf(dst, src, len)" functions,
> which can consolidate this rather than needing to do it one-off for
> each LZ decoder (as I see it, it is a similar issue to not wanting code
> to endlessly re-roll stuff for functions like memcpy or malloc/free, *).

> *: Though, never mind that the standard C interface for malloc is
> annoyingly minimal, and ends up requiring most non-trivial programs to
> roll their own memory management.

>>> Ended up doing these with "slides", which end up eating roughly
>>> several kB of code space, but were more compact than using larger
>>> inline copies.
>>
>>> Say (IIRC):
>>>   128 bytes or less:  inline Ld/St sequence
>>>   129 bytes to 512B:  slide
>>>   Over 512B:          call "memcpy()" or similar
>>
>> Versus::

========== REMAINDER OF ARTICLE TRUNCATED ==========
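On the _memlzcpy point above: the reason an LZ match copy cannot simply be memcpy or memmove is that a match may overlap its own output (distance smaller than length), and the required semantics are forward byte replication. A hedged sketch of such a helper follows; the chunked fast path and its 8-byte threshold are illustrative assumptions, not the actual runtime-library code from the post.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* LZ match copy: src precedes dst within the same output buffer.  With
 * distance 1 this must replicate src[0] across the whole span, so
 * memcpy (overlap undefined) and memmove (copies as if backward) are
 * both wrong; forward byte order is mandatory. */
void lz_match_copy(uint8_t *dst, const uint8_t *src, size_t len)
{
    size_t dist = (size_t)(dst - src);
    if (dist >= 8) {
        /* Distance >= chunk size: an 8-byte chunk never overlaps
         * itself, so a chunked forward copy is safe and faster. */
        while (len >= 8) {
            memcpy(dst, src, 8);
            dst += 8; src += 8; len -= 8;
        }
    }
    while (len--)          /* overlapping match or tail: byte-serial */
        *dst++ = *src++;
}
```

For example, after emitting one literal byte, lz_match_copy(buf + 1, buf, 7) fills buf[1..7] with copies of buf[0]: the distance-1 run case that a plain memcpy gets wrong.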