From: Michael S
Newsgroups: comp.arch
Subject: Re: 80286 protected mode
Date: Mon, 14 Oct 2024 19:08:56 +0300
Message-ID: <20241014190856.00003a58@yahoo.com>

On Mon, 14 Oct 2024 17:19:40 +0200
David Brown wrote:

> On 14/10/2024 16:40, Terje Mathisen wrote:
> > David Brown wrote:
> >> On 13/10/2024 21:21, Terje Mathisen wrote:
> >>> David Brown wrote:
> >>>> On 10/10/2024 20:38, MitchAlsup1 wrote:
> >>>>> On Thu, 10 Oct 2024 6:31:52 +0000, David Brown wrote:
> >>>>>
> >>>>>> On 09/10/2024 23:37, MitchAlsup1 wrote:
> >>>>>>> On Wed, 9 Oct 2024 20:22:16 +0000, David Brown wrote:
> >>>>>>>
> >>>>>>>> On 09/10/2024 20:10, Thomas Koenig wrote:
> >>>>>>>>> David Brown wrote:
> >>>>>>>>>
> >>>>>>>>>> When would you ever /need/ to compare pointers to
> >>>>>>>>>> different objects?
> >>>>>>>>>> For almost all C programmers, the answer is "never".
> >>>>>>>>>
> >>>>>>>>> Sometimes, it is handy to encode certain conditions in
> >>>>>>>>> pointers, rather than having only a valid pointer or
> >>>>>>>>> NULL. A compiler, for example, might want to store the
> >>>>>>>>> fact that an error occurred while parsing a subexpression
> >>>>>>>>> as a special pointer constant.
> >>>>>>>>>
> >>>>>>>>> Compilers often have the unfair advantage, though, that
> >>>>>>>>> they can rely on what application programmers cannot, their
> >>>>>>>>> implementation details. (Some do not, such as f2c).
> >>>>>>>>
> >>>>>>>> Standard library authors have the same superpowers, so that
> >>>>>>>> they can implement an efficient memmove() even though a pure
> >>>>>>>> standard C programmer cannot (other than by simply calling
> >>>>>>>> the standard library memmove() function!).
> >>>>>>>
> >>>>>>> This is more a symptom of bad ISA design/evolution than of
> >>>>>>> libc writers needing superpowers.
> >>>>>>
> >>>>>> No, it is not. It has absolutely /nothing/ to do with the
> >>>>>> ISA.
> >>>>>
> >>>>> For example, if ISA contains an MM instruction which is the
> >>>>> embodiment of memmove() then absolutely no heroics are needed
> >>>>> or desired in the libc call.
> >>>>>
> >>>>
> >>>> The existence of a dedicated assembly instruction does not let
> >>>> you write an efficient memmove() in standard C. That's why I
> >>>> said there was no connection between the two concepts.
> >>>>
> >>>> For some targets, it can be helpful to write memmove() in
> >>>> assembly or using inline assembly, rather than in non-portable C
> >>>> (which is the common case).
> >>>>
> >>>>> Thus, it IS a symptom of ISA evolution that one has to rewrite
> >>>>> memmove() every time wider SIMD registers are available.
> >>>>
> >>>> It is not that simple.
> >>>>
> >>>> There can often be trade-offs between the speed of memmove() and
> >>>> memcpy() on large transfers, and the overhead in setting things
> >>>> up that is proportionally more costly for small transfers.
> >>>> Often that can be eliminated when the compiler optimises the
> >>>> functions inline - when the compiler knows the size of the
> >>>> move/copy, it can optimise directly.
> >>>
> >>> What you are missing here David is the fact that Mitch's MM is a
> >>> single instruction which does the entire memmove() operation, and
> >>> has the inside knowledge about cache (residency at level x? width
> >>> in bytes)/memory ranges/access rights/etc needed to do so in a
> >>> very close to optimal manner, for both short and long transfers.
> >>
> >> I am not missing that at all. And I agree that an advanced
> >> hardware MM instruction could be a very efficient way to implement
> >> both memcpy and memmove. (For my own kind of work, I'd worry
> >> about such looping instructions causing an unbounded increase in
> >> interrupt latency, but that too is solvable given enough hardware
> >> effort.)
> >>
> >> And I agree that once you have an "MM" (or similar) instruction,
> >> you don't need to re-write the implementation for your memmove()
> >> and memcpy() library functions for every new generation of
> >> processors of a given target family.
> >>
> >> What I /don't/ agree with is the claim that you /do/ need to keep
> >> re-writing your implementations all the time. You will
> >> /sometimes/ get benefits from doing so, but it is not as simple as
> >> Mitch made out.
> >>>
> >>> I.e. totally removing the need for compiler tricks or wide
> >>> register operations.
> >>>
> >>> Also apropos the compiler library issue:
> >>>
> >>> You start by teaching the compiler about the MM instruction, and
> >>> to recognize common patterns (just as most compilers already do
> >>> today), and then the memmove() calls will usually be inlined.
> >>>
> >>
> >> The original compiler library issue was that it is impossible to
> >> write an efficient memmove() implementation using pure portable
> >> standard C. That is independent of any ISA, any specialist
> >> instructions for memory moves, and any compiler optimisations.
> >> And it is independent of the fact that some good compilers can
> >> inline at least some calls to memcpy() and memmove() today, using
> >> whatever instructions are most efficient for the target.
> >
> > David, you and Mitch are among my most cherished writers here on
> > c.arch, I really don't think any of us really disagree, it is just
> > that we have been discussing two (mostly) orthogonal issues.
>
> I agree. It's a "god dag mann, økseskaft" situation.
>
> I have a huge respect for Mitch, his knowledge and experience, and
> his willingness to share that freely with others. That's why I have
> found this very frustrating.
>
> >
> > a) memmove/memcpy are so important that people have been spending a
> > lot of time & effort trying to make it faster, with the
> > complication that in general it cannot be implemented in pure C
> > (which disallows direct comparison of arbitrary pointers).
> >
>
> Yes.
>
> (Unlike memmove(), memcpy() can be implemented in standard C as a
> simple byte-copy loop, without needing to compare pointers. But an
> implementation that copies in larger blocks than a byte requires
> implementation dependent behaviour to determine alignments, or it
> must rely on unaligned accesses being allowed by the implementation.)
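
[Editorial illustration, not code posted by anyone in the thread.] The
distinction drawn in the parenthetical above can be sketched in C roughly as
follows. The names sketch_memcpy and sketch_memmove are hypothetical. The
byte-copy loop needs no pointer comparison. The overlap-safe move has to pick
a copy direction, and here it does so by comparing the pointers after
converting them to uintptr_t. That conversion is implementation-defined, so
the second function assumes a flat address space and is not portable standard
C, which is the point being made in the discussion.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte-wise copy for non-overlapping regions: no pointer comparison
   is needed, so this part is plain standard C. */
void *sketch_memcpy(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    unsigned char *d = dst;
    const unsigned char *s = src;
    while (n--)
        *d++ = *s++;
    return dst;
}

/* Overlap-safe move.
   ASSUMPTION: the uintptr_t values order the same way as the underlying
   addresses. That holds on typical flat-address-space targets but is
   implementation-defined, not guaranteed by the C standard. */
void *sketch_memmove(void *dst, const void *src, size_t n)
{
    assert(dst != NULL && src != NULL);
    unsigned char *d = dst;
    const unsigned char *s = src;
    if ((uintptr_t)d <= (uintptr_t)s) {
        while (n--)              /* destination is lower: copy forwards */
            *d++ = *s++;
    } else {
        d += n;                  /* destination is higher: copy backwards */
        s += n;
        while (n--)
            *--d = *--s;
    }
    return dst;
}
```

Both functions still move one byte at a time. Copying in wider units would
additionally need alignment checks or tolerated unaligned accesses, which is
again implementation-dependent, as noted above.
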
>
> > b) Mitch has, like Andy ("Crazy") Glew many years before, realized
> > that if a cpu architecture actually has an instruction designed to
> > do this particular job, it behooves cpu architects to make sure
> > that it is in fact so fast that it obviates any need for tricky
> > coding to replace it.
>
> Yes.
>
> > Ideally, it should be able to copy a single object, up to a cache
> > line in size, in the same or less time needed to do so manually
> > with a SIMD 512-bit load followed by a 512-bit store (both ops
> > masked to not touch anything it shouldn't)
> >
>
> Yes.
>
> > REP MOVSB on x86 does the canonical memcpy() operation, originally
> > by moving single bytes, and this was so slow that we also had REP
> > MOVSW (moving 16-bit entities) and then REP MOVSD on the 386 and
> > REP MOVSQ on 64-bit cpus.
> >
> > With a suitable chunk of logic, the basic MOVSB operation could in
> > fact handle any kinds of alignments and sizes, while doing the
> > actual transfer at maximum bus speeds, i.e. at least one cache
> > line/cycle for things already in $L1.
> >
>
> I agree on all of that.

========== REMAINDER OF ARTICLE TRUNCATED ==========