From: BGB <cr88192@gmail.com>
Newsgroups: comp.arch
Subject: Re: Short Vectors Versus Long Vectors
Date: Tue, 23 Apr 2024 16:23:33 -0500
Organization: A noiseless patient Spider
Message-ID: <v098so$1rp16$1@dont-email.me>
References: <v06vdb$17r2v$1@dont-email.me>
 <5451dcac941e1f569397a5cc7818f68f@www.novabbs.org>
 <v078td$1df76$4@dont-email.me>
 <2024Apr23.082238@mips.complang.tuwien.ac.at>

On 4/23/2024 1:22 AM, Anton Ertl wrote:
> Lawrence D'Oliveiro <ldo@nz.invalid> writes:
>> On Tue, 23 Apr 2024 02:14:32 +0000, MitchAlsup1 wrote:
>>
>>> CRAY machines stayed "in style" as long as memory latency remained
>>> smaller than the length of a vector (64 cycles) and fell out of favor
>>> when the cores got fast enough that memory could no longer keep up.
>
> Mitch Alsup repeatedly makes this claim without giving any
> justification. Your question may shed some light on that.
>
>> So why would conventional short vectors work better, then? Surely the
>> latency discrepancy would be even worse for them.
>
> Thinking about it, they probably don't work better. They just don't
> work worse, so why spend area on 4096-bit vector registers like the
> Cray-1 did when 128-512-bit vector registers work just as well? Plus,
> they have 200 or so of these registers, so 4096-bit registers would be
> really expensive. How many vector registers does the Cray-1 (and its
> successors) have?
>

Yeah. Or, if you can already saturate the RAM bandwidth with 128-bit
vectors, why go wider?...

Or, one may find that even the difference between 64- and 128-bit
vectors goes away once one's working set exceeds 1/3 to 1/2 the size
of the L1 cache. Meanwhile, it remains an issue that wider vectors are
more expensive.

Though, it is unclear whether even 64-bit is a clear win over 32-bit
in terms of performance. Arguably, many uses of 64-bit could have been
served by a primarily 32-bit machine that allows paired registers for
things like memory addressing and similar. Though, OTOH, 64/128 allows
unifying GPRs, FPU, and SIMD into a single register space.

Also, 32/64/128-bit splitting/pairing isn't really workable, as it
would end up needing 8 or 12 register read ports (so, it would be more
expensive than the "use 64-bit registers and effectively waste half
the register for 32-bit operations" option).

Well, unless the number of register ports remains constant (with
64-bit ports), and the 32-bit registers are effectively faked (by
splitting the registers in half and merging the halves on write-back).
But there is little obvious advantage to this over the "just waste
half the register" option (and it would be more expensive than just
wasting half the register).

> On modern machines OoO machinery bridges the latency gap between the
> L2 cache, maybe even the L3 cache and the core for data-parallel code.
> For the latency gap to main memory there are the hardware prefetchers,
> and they use the L1 or L2 cache as intermediate buffer, while the
> Cray-1 and followons use vector registers.
>

On my current PC, while it hides latency well, one is hard-pressed to
exceed roughly 4 GB/s of memory bandwidth per core, though the overall
system memory bandwidth seems to be higher.

Say, 8C/16T (memcpy):
  Each core has a local peak of ~4 GB/s;
  The system as a whole seems to be ~12-16 GB/s;
  Seemingly ~6-8 GB/s per group of 4 cores.

Peak memcpy bandwidth (L1 local) is in the area of 24 GB/s.

Latency is hidden fairly well, granted, but it doesn't make as big of
a difference if the task is bandwidth limited.
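For reference, per-core numbers of this sort can be measured with a
simple single-threaded test along these lines (a rough sketch,
assuming a POSIX-ish environment; the buffer size and iteration count
are arbitrary, and it counts bytes copied rather than actual bus
traffic, which would be roughly 2x higher):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

#define BUFSZ (64*1024*1024)    /* 64MB, well past L2/L3 */
#define ITERS 16

int main(void)
{
    char *src = malloc(BUFSZ);
    char *dst = malloc(BUFSZ);
    struct timespec t0, t1;
    double secs;
    int i;

    memset(src, 1, BUFSZ);      /* fault the pages in up front */
    memset(dst, 0, BUFSZ);

    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (i = 0; i < ITERS; i++)
        memcpy(dst, src, BUFSZ);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec)*1e-9;
    printf("~%.1f GB/s\n", ((double)BUFSZ*ITERS)/secs/1e9);

    free(src);
    free(dst);
    return 0;
}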
> So what's the benefit of using vector/SIMD instructions at all rather
> than doing it with scalar code? A SIMD instruction that replaces n
> scalar instructions consumes fewer resources for instruction fetching,
> decoding, register renaming, administering the instruction in the OoO
> engine, and in retiring the instruction.
>

In my case, with my custom CPU core:
The elements are packaged in a way that makes them easier to work
with, for either parallel or pipelined execution.

The low-precision unit can work on all 4 elements at the same time, if
4 are available. This unit does Binary16 or (optionally) Binary32.

In the main FPU, the SIMD packaging allows the FPU to pipeline the
operations despite the FPU having too high a latency to be pipelined
normally. The advantage of SIMD would be reduced if the pipeline were
long enough to handle Binary64 values directly, but 6 EX stages would
be asking a bit much (further increasing either pipeline length or
width has a significant impact on the cost of the register-forwarding
logic). Arguably, one could have a separate (and longer) pipeline for
the FPU, but this would add complexity with a shared register space.

> So why not use SIMD instructions with longer vector registers? The
> progression from 128-bit SSE through AVX-256 to AVX-512 by Intel
> suggests that this is happening, but with every doubling the cost in
> area doubles but the returns are diminishing thanks to Amdahl's law.
> So at some point you stop. Intel introduced AVX-512 for Larrabee (a
> special-purpose machine), and now is backpedaling with desktop, laptop
> and small-server CPUs (even though only the Golden/Raptor Cove cores
> are enabled on the small-server CPUs) only supporting AVX, and with
> AVX10 only guaranteeing 256-bit vector registers, so maybe 512-bit
> vector registers are already too costly for the benefit they give in
> general-purpose computing.
>

Yeah. As I see it, 128-bit SIMD seems to be the local optimum in
general; both 256 and 512 bits end up having more drawbacks than
merits.
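As a small illustration of why going wider buys little here: a
streaming kernel like the following, written with plain 128-bit SSE
intrinsics (a sketch, assuming x86; the function name is made up),
already hits the per-core memory bandwidth limit once the arrays
exceed the caches, so rewriting it with 256- or 512-bit vectors mostly
just leaves the wider ALUs waiting on memory:

#include <stddef.h>
#include <xmmintrin.h>   /* SSE */

/* dst[i] = a[i] + b[i]; n is assumed to be a multiple of 4 floats */
void vec_add_f32(float *dst, const float *a, const float *b, size_t n)
{
    size_t i;
    for (i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);   /* 128-bit loads */
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(dst + i, _mm_add_ps(va, vb));
    }
}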
> Back to old-style vector processors. There have been machines that
> supported longer vector registers and AFAIK also memory-to-memory
> machines. The question is why have they not been the answer of the
> vector-processor community to the problem of covering the latency? Or
> maybe they have? AFAIK NEC SX has been available in some form even in
> recent years, maybe still.
>

Going outside of Load/Store adds a lot of hair for comparatively
little benefit.

Like, technically, I could go Load-Op / Op-Store for a subset of
operations, as I ended up with the logic to support it. But it doesn't
seem like it would bring enough benefit to really be worth it (it
would not improve code density, as these forms require 64-bit
encodings in my case, and in most cases they seem unlikely to bring a
performance advantage either; and, given some limitations of the
WEXifier, using them might actually make performance worse by
interfering with shuffle-and-bundle).

The main merit they would have is if the CPU were register-pressure
limited, but in my case, with 64 GPRs, this isn't really the case
either.

> Anyway, after thinking about this, the reason behind Mitch Alsup's
> statement is that in a
>
>   doall(load process store)
>
> computation (like what SIMD is good at), the loads precede the
> corresponding processing by the load latency (i.e., memory latency on
> the Cray machines). If your OoO capabilities are limited (and I think
> they are on the Cray machines), you cannot start the second iteration
> of the doall loop before the processing step of the first iteration
> has finished with the register. You can do a bit of software
> pipelining and software register renaming by transforming this into
>
>   load1 doall(load2 process1 store1 load1 process2 store2)
>
> but at some point you run out of vector registers.
>
> One thing that comes to mind is tracking individual parts of the
> vector registers, which allows to starting the next iteration as soon
> as the first part of the vector register no longer has any readers.
> However, it's probably not that far off in complexity to tracking
> shorter vector registers in an OoO engine. And if you support
> exceptions (the Crays probably don't), this becomes messy, while with
> short vector registers it's easier to implement the (ISA)
> architecture.
>
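To spell out the pipelining transform a few lines up: with two vector
registers alternating as buffers, the load for the next chunk can be
issued while the current chunk is still being processed. A rough
sketch in C (vload/vprocess/vstore are made-up stand-ins for the
vector instructions; VLEN plays the role of the Cray's 64-element
vector length):

#include <stddef.h>

#define VLEN 64   /* stand-in for the 64-element Cray vector length */

/* stand-ins for the vector load / process / store instructions */
static void vload(double *vr, const double *mem)
    { for (int i = 0; i < VLEN; i++) vr[i] = mem[i]; }
static void vprocess(double *vr)
    { for (int i = 0; i < VLEN; i++) vr[i] *= 2.0; }
static void vstore(const double *vr, double *mem)
    { for (int i = 0; i < VLEN; i++) mem[i] = vr[i]; }

/* n is assumed to be a multiple of 2*VLEN */
void pipelined(double *dst, const double *src, size_t n)
{
    double v0[VLEN], v1[VLEN];   /* the two "vector registers" */
    size_t i;

    vload(v0, src);                         /* load1 */
    for (i = 0; i + 2*VLEN <= n; i += 2*VLEN) {
        vload(v1, src + i + VLEN);          /* load2 (next chunk) */
        vprocess(v0);                       /* process1 */
        vstore(v0, dst + i);                /* store1 */
        if (i + 2*VLEN < n)
            vload(v0, src + i + 2*VLEN);    /* load1 (chunk after next) */
        vprocess(v1);                       /* process2 */
        vstore(v1, dst + i + VLEN);         /* store2 */
    }
}

With only the two buffers, each load still has to complete before the
matching process step, which is Anton's point about running out of
vector registers once the latency exceeds what two-deep (or N-deep)
buffering can cover.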