Article <2024May30.135409@mips.complang.tuwien.ac.at>

Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Byte Addressability And Beyond
Date: Thu, 30 May 2024 11:54:09 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 86
Message-ID: <2024May30.135409@mips.complang.tuwien.ac.at>
References: <v0s17o$2okf4$2@dont-email.me> <v31c4r$3u28v$1@dont-email.me> <v327n3$1use$1@gal.iecc.com> <BM25O.40665$HBac.4762@fx15.iad> <v32lpv$1u25$1@gal.iecc.com> <v33bqg$9cst$11@dont-email.me> <v34v62$ln01$1@dont-email.me> <v36bva$10k3v$2@dont-email.me> <2024May29.090435@mips.complang.tuwien.ac.at> <v39dpj$1k4hm$1@dont-email.me>
Injection-Date: Thu, 30 May 2024 14:26:36 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="46db1f935b2b8b941d470a80861909b8";
	logging-data="1786467"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1/PWl3Re/P1+GAkCUryIgtk"
Cancel-Lock: sha1:nqBft6yxyBZ+8RE7UTBUv61B+fI=
X-newsreader: xrn 10.11
Bytes: 4789

Terje Mathisen <terje.mathisen@tmsw.no> writes:
>Anton Ertl wrote:
>> Anyway, such instructions can be done in a RISCy way (pure
>> register-to-register instructions) or in a CISCy way
>> (memory-to-memory).
>>
>> A RISCy way to do UTF-8 -> UTF-32 would be to have the first 4 bytes
>> of the remaining string in a register and producing an UTF-32 code
>> point in another register and a length in a third register (or in the
>> high part of the destination register to reduce write port
>> requirements).  Similarly for UTF-32->UTF-8, with the length
>> specifying the length of the result; that would need to be combined
>> with a length masked store to make it easy to store the result.
>>
>> This approach can also be SIMDified, converting regbits/32 code points
>> in one representation to the same number of code points in the other
>> representation plus a length of the UTF-8 representation.
>>
>> The disadvantage of this approach exists particularly for
>> UTF-8->UTF-32: this is a very sequential approach full of dependences:
>> each use of the conversion instruction is followed by a dependent load
>> of the next input fragment, and the next use of the conversion
>> instruction depends on that load.
>
>Rather the opposite:
>
>UTF8->UTF32 looks a _lot_ like an easier example of a byte-oriented
>variable length (x86?) instruction decoder, but with the big
>simplification that the first byte directly tells you how long the
>sequence is.

The SIMD version of the RISCy instruction is no problem.  So you can
process regbits/32 code points in one go.  But what I wrote above
still applies: You use this instruction in a loop like

# s* are SIMD registers, g* are GPRs
l: s0= load(g0)
   s1,g1= cu14(s0)
   store (g2)<-s1
   g0 = g0+g1
   g2 = g2+SIMD_width
   if g0>=input_end goto end
   if g2<output_limit goto l
end:

(probably some fine tuning of the last iteration and the termination
is necessary).
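
A minimal software sketch of the loop above, to make the data flow
concrete.  Everything here is an assumption for illustration: cu14 is
modelled as an ordinary function, the register width is taken as 128
bits, the input is assumed to be well-formed UTF-8, and the output
limit check is left out.

```python
# Software model of the loop above; not a description of any real ISA.
WIDTH_BITS = 128                  # assumed SIMD register width
LANES = WIDTH_BITS // 32          # regbits/32 code points per cu14


def decode_one(buf, pos):
    """Decode one UTF-8 sequence at buf[pos]; return (code_point, length).

    Assumes well-formed input; no validation is attempted.
    """
    b0 = buf[pos]
    if b0 < 0x80:
        return b0, 1
    if b0 < 0xE0:
        n, cp = 2, b0 & 0x1F
    elif b0 < 0xF0:
        n, cp = 3, b0 & 0x0F
    else:
        n, cp = 4, b0 & 0x07
    for b in buf[pos + 1:pos + n]:
        cp = (cp << 6) | (b & 0x3F)
    return cp, n


def cu14(s0):
    """Model of the instruction: returns (s1, g1) for the fragment s0,
    i.e. up to LANES code points and the number of bytes consumed."""
    out, used = [], 0
    while len(out) < LANES and used < len(s0):
        cp, n = decode_one(s0, used)
        out.append(cp)
        used += n
    return out, used


def utf8_to_utf32(data):
    g0, result = 0, []
    while g0 < len(data):
        s0 = data[g0:g0 + WIDTH_BITS // 8]   # load: address comes from g0
        s1, g1 = cu14(s0)                    # convert: needs the loaded data
        result.extend(s1)                    # store
        g0 += g1                             # next address needs cu14's length
    return result
```

The three commented steps in utf8_to_utf32 each consume the result of
the previous one, and the load of the next iteration consumes the
updated g0.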

And here you have a dependence chain from load to cu14 to the g0+g1 to
the load of the next iteration.  With cu14 and the addition as
single-cycle operations and the load taking 5 cycles as for D-cache
hits on recent Intel CPUs, that's 7 cycles per iteration, limiting the
throughput of your conversion routine to 1/7th of what your cu14 and
your load/store unit would be capable of in throughput-limited code.
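
The arithmetic behind the 7 cycles, written out.  The latencies are
the ones given above; the comparison figure of one iteration per cycle
for throughput-limited code is an assumption that matches the "1/7th"
estimate, not a measured value.

```python
# Latencies in cycles, as stated in the text above.
LOAD_LATENCY = 5   # D-cache hit
CU14_LATENCY = 1   # cu14 taken as a single-cycle operation
ADD_LATENCY = 1    # g0 = g0+g1

# The chain load -> cu14 -> add -> next load is serial, so latencies add up.
cycles_per_iteration = LOAD_LATENCY + CU14_LATENCY + ADD_LATENCY

# Assumption: without the chain, one iteration could start every cycle.
throughput_bound_cycles = 1
slowdown = cycles_per_iteration / throughput_bound_cycles

print(cycles_per_iteration, slowdown)
```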

With a byte-stream buffer as architectural feature, and a cu14 that
takes its UTF-8 input from that and automatically advances the stream,
this could be quite a bit more efficient.  Something like:

.... set up stream buffer ...
l: s1 = cu14(stream-buffer)
   store (g2)<-s1
   g2 = g2+SIMD_width
   if stream-buffer empty goto end
   if g2<output_limit goto l
end:

(again with some fine-tuning for the last iteration and termination).
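
A software sketch of the stream-buffer variant, under the same
caveats: StreamBuffer and cu14 are hypothetical models, a 128-bit
register is assumed, the input is assumed to be well-formed UTF-8, and
the output limit check is left out.

```python
LANES = 4  # assumed: 128-bit SIMD register, 32 bits per code point


class StreamBuffer:
    """Hypothetical model of an architectural byte-stream buffer."""

    def __init__(self, data):
        self._data = data
        self._pos = 0          # read position, advanced by the buffer itself

    def empty(self):
        return self._pos >= len(self._data)

    def next_code_point(self):
        """Consume one UTF-8 sequence (assumes well-formed input)."""
        b0 = self._data[self._pos]
        n = 1 if b0 < 0x80 else 2 if b0 < 0xE0 else 3 if b0 < 0xF0 else 4
        # Payload bits of the lead byte: 5, 4 or 3 bits for n = 2, 3, 4.
        cp = b0 if n == 1 else b0 & (0xFF >> (n + 1))
        for b in self._data[self._pos + 1:self._pos + n]:
            cp = (cp << 6) | (b & 0x3F)
        self._pos += n
        return cp


def cu14(stream):
    """Model: take up to LANES code points from the stream and advance it."""
    out = []
    while len(out) < LANES and not stream.empty():
        out.append(stream.next_code_point())
    return out


def utf8_to_utf32(data):
    stream = StreamBuffer(data)       # set up stream buffer
    result = []
    while not stream.empty():
        result.extend(cu14(stream))   # no input address in the loop body
    return result
```

The loop body no longer computes an input address, so the only state
carried from one iteration to the next on the input side lives inside
the stream buffer.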

For a technically unnecessary marketing gimmick like CU14 one probably
won't add a stream buffer, but, e.g., compression and decompression
are probably more relevant and may also benefit from such a feature.

>Doing a SIMD version corresponds to a superscalar x86 in that the
>decoder needs to grab a variable number of bytes for each instruction,
>starting the next immediately after.

The instructions are fetched into a stream buffer rather than waiting
for the decoder to produce a length result before starting the next
instruction fetch (and of course the instruction fetcher also has to
deal with branches).

- anton
-- 
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>