From: Michael S <already5chosen@yahoo.com>
Newsgroups: comp.arch
Subject: Re: Cost of handling misaligned access
Date: Sun, 9 Feb 2025 14:17:44 +0200
Organization: A noiseless patient Spider
Lines: 107
Message-ID: <20250209141744.00007a4a@yahoo.com>
References: <5lNnP.1313925$2xE6.991023@fx18.iad>
<2025Feb6.115939@mips.complang.tuwien.ac.at>
<20250206152808.0000058f@yahoo.com>
<vo2iqq$30elm$1@dont-email.me>
<vo2p33$31lqn$1@dont-email.me>
<20250206211932.00001022@yahoo.com>
<vo36go$345o3$1@dont-email.me>
<20250206233200.00001fc3@yahoo.com>
<vo4lvl$3eu3c$1@dont-email.me>
<20250207124138.00006c8d@yahoo.com>
<vo551p$3hhbc$1@dont-email.me>
<20250207170423.000023b7@yahoo.com>
<2025Feb8.091104@mips.complang.tuwien.ac.at>
<20250208192119.0000148e@yahoo.com>
<2025Feb8.184632@mips.complang.tuwien.ac.at>
MIME-Version: 1.0
Content-Type: text/plain; charset=US-ASCII
Content-Transfer-Encoding: 7bit
X-Newsreader: Claws Mail 3.19.1 (GTK+ 2.24.33; x86_64-w64-mingw32)
On Sat, 08 Feb 2025 17:46:32 GMT
anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
> Michael S <already5chosen@yahoo.com> writes:
> >On Sat, 08 Feb 2025 08:11:04 GMT
> >anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>
> >That's very disappointing. Haswell has a 4-wide front end, and the
> >majority of AVX2 integer instructions are limited to a throughput of
> >two per clock. Golden Cove has a 5+ wide front end, and nearly all
> >AVX2 integer instructions have a throughput of three per clock.
> >Could it be that clang introduced some sort of latency bottleneck?
>
> As far as I looked into the code, I did not see such a bottleneck.
> Also, Zen4 has significantly higher IPC on this variant (5.36 IPC for
> clang keylocks2-256), and if there was a general latency bottleneck,
> I would expect it to suffer from it, too. Rocket Lake is also faster
> on this program than Haswell and Golden Cove. It seems that this
> program just rubs Golden Cove the wrong way.
>
Did you look at the code in the outer loop as well?
The number of iterations in the inner loop is not huge, so excessive
folding of accumulators in the outer loop could be a problem as well.
In theory it shouldn't be, but somehow it could be; the sketch below
shows the kind of pattern I mean.
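
To make that concern concrete, here is a hypothetical sketch; I have
not checked whether clang actually emits anything like it. If the
four accumulators were folded together inside the outer loop instead
of once after it, the folding adds would form a loop-carried chain:

#include <stdint.h>
#include <immintrin.h>

/* Hypothetical pattern, for illustration only: folding four partial
   sums into one accumulator on every iteration. All three extra adds
   feed the loop-carried 'total', so iterations serialize on it. */
static __m256i fold_inside_loop(const uint32_t* p, int n)
{
  __m256i total = _mm256_setzero_si256();
  for (int i = 0; i < n; i += 32) {  /* n assumed a multiple of 32 */
    __m256i r0 = _mm256_loadu_si256((const __m256i*)&p[i +  0]);
    __m256i r1 = _mm256_loadu_si256((const __m256i*)&p[i +  8]);
    __m256i r2 = _mm256_loadu_si256((const __m256i*)&p[i + 16]);
    __m256i r3 = _mm256_loadu_si256((const __m256i*)&p[i + 24]);
    total = _mm256_add_epi32(total,
            _mm256_add_epi32(_mm256_add_epi32(r0, r1),
                             _mm256_add_epi32(r2, r3)));
  }
  return total;
}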
And if you still haven't managed to get my source to compile, here is
another version, slightly less clever but, more importantly, formatted
with shorter lines:
#include <stdint.h>
#include <immintrin.h>

/* broadcast one 32-bit word to all 8 lanes of a 256-bit vector */
#define BROADCAST_u32(p) \
  _mm256_castps_si256(_mm256_broadcast_ss((const float*)(p)))

/* acc += 1 in each lane where (x & y) == 0: cmpeq against zero
   yields -1 in those lanes, and subtracting -1 adds 1 */
#define ADD_NZ(acc, x, y) _mm256_sub_epi32(acc, _mm256_cmpeq_epi32( \
  _mm256_and_si256(x, y), _mm256_setzero_si256()))

int foo_tst(const uint32_t* keylocks, int len, int li)
{
  if (li >= len || li <= 0)
    return 0;

  const uint32_t* px = &keylocks[li]; // x values: keylocks[li..len)
  unsigned nx = len - li;
  __m256i res0 = _mm256_setzero_si256();
  __m256i res1 = _mm256_setzero_si256();
  __m256i res2 = _mm256_setzero_si256();
  __m256i res3 = _mm256_setzero_si256();
  int nx1 = nx & 31;
  if (nx1) {
    // process head, 8 x values per loop
    const uint32_t* px_last = &px[nx1];
    static const int32_t masks[15] = {
      -1, -1, -1, -1, -1, -1, -1, -1,
       0,  0,  0,  0,  0,  0,  0,
    };
    // mask off the top rem0 lanes of the first, partial vector;
    // the masked lanes read as 0 and are compensated for at the end
    int rem0 = (-nx) & 7;
    __m256i mask = _mm256_loadu_si256((const __m256i*)&masks[rem0]);
    __m256i x = _mm256_maskload_epi32((const int*)px, mask);
    px += 8 - rem0;
    const uint32_t* py1 = &keylocks[li & -4]; // end of unrolled y part
    const uint32_t* py2 = &keylocks[li];      // end of the y range
    for (;;) {
      const uint32_t* py;
      for (py = keylocks; py != py1; py += 4) {
        res0 = ADD_NZ(res0, x, BROADCAST_u32(&py[0]));
        res1 = ADD_NZ(res1, x, BROADCAST_u32(&py[1]));
        res2 = ADD_NZ(res2, x, BROADCAST_u32(&py[2]));
        res3 = ADD_NZ(res3, x, BROADCAST_u32(&py[3]));
      }
      for (; py != py2; py += 1) // remaining 0..3 y values
        res0 = ADD_NZ(res0, x, BROADCAST_u32(py));
      if (px == px_last)
        break;
      x = _mm256_loadu_si256((const __m256i*)px);
      px += 8;
    }
  }
  // process the bulk, 32 x values per iteration of the outer loop
  int nx2 = nx & -32;
  const uint32_t* px_last = &px[nx2];
  for (; px != px_last; px += 32) {
    __m256i x0 = _mm256_loadu_si256((const __m256i*)&px[0*8]);
    __m256i x1 = _mm256_loadu_si256((const __m256i*)&px[1*8]);
    __m256i x2 = _mm256_loadu_si256((const __m256i*)&px[2*8]);
    __m256i x3 = _mm256_loadu_si256((const __m256i*)&px[3*8]);
    for (const uint32_t* py = keylocks; py != &keylocks[li]; ++py) {
      __m256i y = BROADCAST_u32(py);
      res0 = ADD_NZ(res0, y, x0);
      res1 = ADD_NZ(res1, y, x1);
      res2 = ADD_NZ(res2, y, x2);
      res3 = ADD_NZ(res3, y, x3);
    }
  }
  // fold accumulators
  res0 = _mm256_add_epi32(res0, res2);
  res1 = _mm256_add_epi32(res1, res3);
  res0 = _mm256_add_epi32(res0, res1);
  res0 = _mm256_hadd_epi32(res0, res0); // per-lane horizontal sums
  res0 = _mm256_hadd_epi32(res0, res0);
  int res = _mm256_extract_epi32(res0, 0)  // low 128-bit lane
          + _mm256_extract_epi32(res0, 4); // high 128-bit lane
  // the rem0 masked-off zero lanes matched every y; subtract them
  return res - (-nx & 7) * li;
}
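
In case it helps with testing, here is a plain scalar routine that,
as far as I can tell, computes the same thing as foo_tst above: the
number of pairs (x, y), with x from keylocks[li..len) and y from
keylocks[0..li), for which (x & y) == 0. The name foo_ref is made up
for this sketch.

#include <stdint.h>

/* scalar reference for foo_tst: count the pairs between the two
   parts of the array that have no overlapping bits */
int foo_ref(const uint32_t* keylocks, int len, int li)
{
  if (li >= len || li <= 0)
    return 0;
  int res = 0;
  for (int i = li; i < len; ++i)      /* x: keylocks[li..len) */
    for (int j = 0; j < li; ++j)      /* y: keylocks[0..li)   */
      res += (keylocks[i] & keylocks[j]) == 0;
  return res;
}

For any 0 < li < len, foo_tst(a, len, li) should return the same
value as foo_ref(a, len, li).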