Article <2024Apr24.081658@mips.complang.tuwien.ac.at>

Path: ...!news.nobody.at!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Short Vectors Versus Long Vectors
Date: Wed, 24 Apr 2024 06:16:58 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 30
Message-ID: <2024Apr24.081658@mips.complang.tuwien.ac.at>
References: <v06vdb$17r2v$1@dont-email.me> <5451dcac941e1f569397a5cc7818f68f@www.novabbs.org> <hqmg2j1vbkf6suddfnsh3h3uhtkqqio4uk@4ax.com>
Injection-Date: Wed, 24 Apr 2024 08:26:30 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="68f13f15e74c6cc1e6ed32f2711e82b5";
	logging-data="2295708"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX1/8iwosbhEqSnX3QsIbwpkt"
Cancel-Lock: sha1:ai88ncY9CsDDpG9BIRRIXNHxy+0=
X-newsreader: xrn 10.11
Bytes: 2331

John Savard <quadibloc@servername.invalid> writes:
>And if memory bandwidth issues make Cray-style vector machines
>impractical, then wouldn't it be even worse for GPUs?

The claim by Mitch Alsup is that latency makes the Crays impractical,
because of chaining issues.  Do GPUs have chaining?  My understanding
is that GPUs deal with latency in the barrel processor way: use
another data-parallel thread while waiting for memory.  Tera also
pursued this idea, but the GPUs succeeded with it.
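
As a rough illustration of the barrel-processor way of hiding latency,
here is a minimal Python sketch. It assumes a simplified model in which
every thread alternates between a fixed number of compute cycles and one
memory access of fixed latency. The function names and the cycle counts
are made up for illustration and do not describe any particular GPU or
the Tera MTA.

```python
import math

def threads_to_hide_latency(mem_latency_cycles, compute_cycles_per_access):
    """Threads needed so that some thread is always ready to issue.

    Hypothetical model: each thread computes for
    compute_cycles_per_access cycles, then waits mem_latency_cycles
    for memory. While one thread waits, the other threads have to
    supply mem_latency_cycles worth of work.
    """
    return 1 + math.ceil(mem_latency_cycles / compute_cycles_per_access)

def utilization(n_threads, mem_latency_cycles, compute_cycles_per_access):
    """Fraction of cycles in which the core issues useful work."""
    period = compute_cycles_per_access + mem_latency_cycles
    return min(1.0, n_threads * compute_cycles_per_access / period)

if __name__ == "__main__":
    latency, compute = 400, 10   # assumed numbers, not measurements
    n = threads_to_hide_latency(latency, compute)
    print("threads needed:", n)
    print("utilization with 1 thread:", utilization(1, latency, compute))
    print("utilization with n threads:", utilization(n, latency, compute))
```

In this model a single thread leaves the core idle most of the time,
while enough independent threads keep it busy without any chaining.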

>If
>most problems anyone would want to use a vector CPU for today do
>involve a large amount of memory, used in a random fashion, so as to
>fit poorly in cache

When the working set is larger than the cache, it does not fit even
when it is accessed regularly.  Prefetchers can reduce the latency, but
they will not increase the bandwidth.

So if you have a problem that walks through a lot of memory and
performs only a few operations per data item, that's where CPUs will
wait for memory a lot, due to limited bandwidth (and you won't benefit
from SIMD/vector instructions on these kinds of problems).  For that
kind of stuff you are better off using GPUs, which have memory systems
with more bandwidth.
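
The bandwidth argument can be sketched with a simple roofline-style
bound: attainable throughput is the smaller of the peak compute rate
and bandwidth times operations per byte. The following Python sketch
assumes a streaming kernel like a[i] = b[i] + c[i] on doubles (one
operation per 24 bytes of traffic); the peak and bandwidth figures are
invented round numbers, not data for any real CPU or GPU.

```python
def attainable_gflops(peak_gflops, bandwidth_gb_s, flops_per_byte):
    """Roofline-style upper bound on throughput in GFLOP/s."""
    return min(peak_gflops, bandwidth_gb_s * flops_per_byte)

if __name__ == "__main__":
    # a[i] = b[i] + c[i] on doubles: 1 flop per 2 loads + 1 store.
    intensity = 1 / 24           # flops per byte

    # Hypothetical machines (assumed numbers for illustration only).
    cpu = attainable_gflops(1000, 50, intensity)
    cpu_wider_simd = attainable_gflops(2000, 50, intensity)
    gpu = attainable_gflops(10000, 1000, intensity)

    print("CPU:", cpu)
    print("CPU with twice the SIMD width:", cpu_wider_simd)
    print("GPU:", gpu)
```

Under these assumptions doubling the CPU's peak compute rate changes
nothing, because the bound is set by bandwidth, and the gap between the
two machines is their bandwidth ratio rather than their compute ratio.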

- anton
-- 
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>