
Path: ...!weretis.net!feeder6.news.weretis.net!feeder8.news.weretis.net!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Short Vectors Versus Long Vectors
Date: Tue, 23 Apr 2024 16:23:33 -0500
Organization: A noiseless patient Spider
Lines: 206
Message-ID: <v098so$1rp16$1@dont-email.me>
References: <v06vdb$17r2v$1@dont-email.me>
 <5451dcac941e1f569397a5cc7818f68f@www.novabbs.org>
 <v078td$1df76$4@dont-email.me> <2024Apr23.082238@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Injection-Date: Tue, 23 Apr 2024 23:23:37 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="e4c9793258b9d913f28848bbc0503c1c";
	logging-data="1958950"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX19PiXobpn+u9AATQmZSRLeg9tNgYH04Zsw="
User-Agent: Mozilla Thunderbird
Cancel-Lock: sha1:LRX87wD7CKaf0eSTJQHnTsMVZ70=
In-Reply-To: <2024Apr23.082238@mips.complang.tuwien.ac.at>
Content-Language: en-US
Bytes: 9947

On 4/23/2024 1:22 AM, Anton Ertl wrote:
> Lawrence D'Oliveiro <ldo@nz.invalid> writes:
>> On Tue, 23 Apr 2024 02:14:32 +0000, MitchAlsup1 wrote:
>>
>>> CRAY machines stayed "in style" as long as memory latency remained
>>> smaller than the length of a vector (64 cycles) and fell out of favor
>>> when the cores got fast enough that memory could no longer keep up.
> 
> Mitch Alsup repeatedly makes this claim without giving any
> justification.  Your question may shed some light on that.
> 
>> So why would conventional short vectors work better, then? Surely the
>> latency discrepancy would be even worse for them.
> 
> Thinking about it, they probably don't work better.  They just don't
> work worse, so why spend area on 4096-bit vector registers like the
> Cray-1 did when 128-512-bit vector registers work just as well?  Plus,
> they have 200 or so of these registers, so 4096-bit registers would be
> really expensive.  How many vector registers does the Cray-1 (and its
> successors) have?
> 

Yeah.

Or if you can already saturate the RAM bandwidth with 128-bit vectors, 
why go wider?...
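
As a rough sanity check (assuming, say, a ~3 GHz core and the ~4 
GB/s per-core figure mentioned further down):

   4 GB/s / 3 GHz       ~= 1.3 bytes/cycle from DRAM
   128-bit (16B) load   -> needs to issue once per ~12 cycles
   512-bit (64B) load   -> needs to issue once per ~48 cycles

Once a loop is DRAM-bound, the wider vector mostly just idles 
longer between loads.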

Or, one may find that even the difference between 64- and 128-bit 
vectors goes away once one's working set exceeds 1/3 to 1/2 the size 
of the L1 cache.

Meanwhile, it remains an issue that wider vectors are more expensive.


Though, it is unclear whether 64-bit is even a clear win over 32-bit 
in terms of performance. Arguably, many uses of 64-bit could have 
been served by a primarily 32-bit machine that allows paired 
registers for things like memory addressing and similar.

Though, OTOH, 64/128 allows unifying the GPRs, FPU, and SIMD into a 
single register space.

Also, 32/64/128-bit splitting/pairing isn't really workable, as it 
would end up needing 8 or 12 register read ports (and so would be 
more expensive than the "use 64-bit registers and effectively waste 
half the register for 32-bit operations" option).

Well, unless the number of register ports remains constant (with 
64-bit ports) and the 32-bit registers are effectively faked (by 
splitting the registers in half and merging the halves on 
write-back). But there is little obvious advantage to this over the 
"just waste half the register" option, and the merging logic would 
make it more expensive besides.
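
A minimal C model of the "faked 32-bit registers" idea (just a 
software sketch of split-on-read / merge-on-write-back; the names 
are made up):

   #include <stdint.h>

   /* 32 physical 64-bit registers, visible as 64 32-bit halves. */
   static uint64_t regs[32];

   static uint32_t read32(int r) {            /* r = 0..63 */
       return (uint32_t)(regs[r >> 1] >> ((r & 1) * 32));
   }

   static void write32(int r, uint32_t v) {   /* merge on write-back */
       int      sh   = (r & 1) * 32;
       uint64_t mask = 0xFFFFFFFFull << sh;
       regs[r >> 1] = (regs[r >> 1] & ~mask) | ((uint64_t)v << sh);
   }

The read-modify-write in write32 is the half-merging step; in 
hardware, that is the part that costs more than just wasting the 
high half.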


> On modern machines OoO machinery bridges the latency gap between the
> L2 cache, maybe even the L3 cache and the core for data-parallel code.
> For the latency gap to main memory there are the hardware prefetchers,
> and they use the L1 or L2 cache as intermediate buffer, while the
> Cray-1 and followons use vector registers.
> 

On my current PC, while the OoO machinery hides latency, one is 
hard-pressed to exceed roughly 4 GB/sec of memory bandwidth per 
core, though the overall system memory bandwidth seems to be higher.

   Say: 8C/16T (memcpy)
     Each core has a local peak of ~ 4GB/s;
     System seems to be ~ 12-16 GB/s
       Seemingly ~ 6-8 GB/s per group of 4 cores.

Peak memcpy bandwidth (L1-local) is in the area of 24 GB/s.
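
For reference, the sort of microbenchmark behind numbers like these 
might look something like the following (a sketch; buffer sizes, rep 
counts, and the timing source would need tuning per machine):

   #include <stdio.h>
   #include <stdlib.h>
   #include <string.h>
   #include <time.h>

   int main(void) {
       size_t sz = 256 * 1024 * 1024;  /* well past LLC: DRAM-bound */
       char *src = malloc(sz), *dst = malloc(sz);
       memset(src, 1, sz);             /* touch pages up front */
       memset(dst, 0, sz);

       clock_t t0 = clock();
       int reps = 8;
       for (int i = 0; i < reps; i++)
           memcpy(dst, src, sz);
       double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

       printf("%.2f GB/s\n", (double)reps * sz / secs / 1e9);
       free(src); free(dst);
       return 0;
   }

Shrinking sz to fit in L1 gives the L1-local figure instead.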


Latency is hidden fairly well, granted, but that doesn't make as big 
a difference if the task is bandwidth-limited.


> So what's the benefit of using vector/SIMD instructions at all rather
> than doing it with scalar code?  A SIMD instruction that replaces n
> scalar instructions consumes fewer resources for instruction fetching,
> decoding, register renaming, administering the instruction in the OoO
> engine, and in retiring the instruction.
> 

In my case, with my custom CPU core:
The elements are packaged in a way that makes them easier to work 
with, for either parallel or pipelined execution.

The low-precision unit can work on all 4 elements at the same time, 
if 4 are available. This unit does Binary16 or (optionally) Binary32.

In the main FPU, the SIMD packaging allows the FPU to pipeline the 
operations despite the FPU having too high a latency to be pipelined 
normally.

The advantage of SIMD would be reduced if the pipeline were long 
enough to handle Binary64 values directly, but 6 EX stages would be 
asking a bit much (further increasing either pipeline length or 
width would have a significant impact on the cost of the 
register-forwarding logic).
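
Roughly, the effect of the SIMD packaging on a 6-stage FPU looks 
like this (a simplified timing sketch, one lane issued per cycle):

   Scalar FADDs, each waiting out the full latency:
     op0:  EX1 EX2 EX3 EX4 EX5 EX6
     op1:                          EX1 EX2 ...   (~6 cycles/result)

   One 4-wide SIMD op, lanes staggered down the same pipeline:
     ln0:  EX1 EX2 EX3 EX4 EX5 EX6
     ln1:      EX1 EX2 EX3 EX4 EX5 EX6
     ln2:          EX1 EX2 EX3 EX4 EX5 EX6
     ln3:              EX1 EX2 EX3 EX4 EX5 EX6

   4 results in 9 cycles, approaching 1 element/cycle.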


Arguably, one could have a separate (and longer) pipeline for FPU, but 
this would add complexity with a shared register space.


> So why not use SIMD instructions with longer vector registers?  The
> progression from 128-bit SSE through AVX-256 to AVX-512 by Intel
> suggests that this is happening, but with every doubling the cost in
> area doubles but the returns are diminishing thanks to Amdahl's law.
> So at some point you stop.  Intel introduced AVX-512 for Larrabee (a
> special-purpose machine), and now is backpedaling with desktop, laptop
> and small-server CPUs (even though only the Golden/Raptor Cove cores
> are enabled on the small-server CPUs) only supporting AVX, and with
> AVX10 only guaranteeing 256-bit vector registers, so maybe 512-bit
> vector registers are already too costly for the benefit they give in
> general-purpose computing.
> 

Yeah.

As I see it, in general, 128-bit SIMD seems to be the local optimum; 
both 256 and 512 end up having more drawbacks than merits.



> Back to old-style vector processors.  There have been machines that
> supported longer vector registers and AFAIK also memory-to-memory
> machines.  The question is why have they not been the answer of the
> vector-processor community to the problem of covering the latency?  Or
> maybe they have?  AFAIK NEC SX has been available in some form even in
> recent years, maybe still.
> 

Going outside of Load/Store adds a lot of hair for comparatively 
little benefit.


Like, technically, I could go Load-Op / Op-Store for a subset of 
operations, as I ended up with the logic to support it. But it 
doesn't seem like it would bring enough benefit to really be worth 
it: it would not improve code density (they require 64-bit encodings 
in my case), and in most cases it seems unlikely to bring a 
performance advantage either. Given some limitations of the 
WEXifier, using them might actually make performance worse by 
interfering with shuffle-and-bundle.

The main merit they would have is if the CPU were register-pressure 
limited, but in my case, with 64 GPRs, this isn't really the case either.
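
For illustration (pseudo-assembly, not any particular ISA's actual 
syntax or encodings), the trade-off is roughly:

   ; Load-Op form: one instruction, but a 64-bit encoding here
   ADD.Q   (R4), R5          ; R5 += mem[R4]

   ; Load/Store form: two 32-bit encodings, same total size
   MOV.Q   (R4), R3          ; R3  = mem[R4]
   ADD     R3, R5            ; R5 += R3

With equal total code size, and 64 GPRs keeping the scratch register 
cheap, the fused form mostly just gives the bundler less to work with.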


> Anyway, after thinking about this, the reason behind Mitch Alsup's
> statement is that in a
> 
> doall(load process store)
> 
> computation (like what SIMD is good at), the loads precede the
> corresponding processing by the load latency (i.e., memory latency on
> the Cray machines).  If your OoO capabilities are limited (and I think
> they are on the Cray machines), you cannot start the second iteration
> of the doall loop before the processing step of the first iteration
> has finished with the register.  You can do a bit of software
> pipelining and software register renaming by transforming this into
> 
> load1 doall(load2 process1 store1 load1 process2 store2)
> 
> but at some point you run out of vector registers.
> 
> One thing that comes to mind is tracking individual parts of the
> vector registers, which allows starting the next iteration as soon
> as the first part of the vector register no longer has any readers.
> However, it's probably not that far off in complexity to tracking
> shorter vector registers in an OoO engine.  And if you support
> exceptions (the Crays probably don't), this becomes messy, while with
> short vector registers it's easier to implement the (ISA)
> architecture.
> 
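
The transformation Anton describes, sketched in C-like pseudocode 
(vload/vprocess/vstore are placeholder vector intrinsics, remainder 
and bounds handling omitted):

   v1 = vload(a); a += VL;                  /* load1 */
   while (n >= 2*VL) {                      /* doall( ... ) */
       v2 = vload(a); a += VL;              /* load2 */
       vstore(b, vprocess(v1)); b += VL;    /* process1 store1 */
       v1 = vload(a); a += VL;              /* load1 */
       vstore(b, vprocess(v2)); b += VL;    /* process2 store2 */
       n -= 2*VL;
   }

Each load is hoisted a full iteration ahead of its use, at the cost 
of one extra live vector register per stage of pipelining.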


========== REMAINDER OF ARTICLE TRUNCATED ==========