Article <2024Sep6.152642@mips.complang.tuwien.ac.at>

Deutsch English Français Italiano
<2024Sep6.152642@mips.complang.tuwien.ac.at>

View for Bookmarking (what is this?)
Look up another Usenet article
Path: ...!eternal-september.org!feeder3.eternal-september.org!news.eternal-september.org!.POSTED!not-for-mail
From: anton@mips.complang.tuwien.ac.at (Anton Ertl)
Newsgroups: comp.arch
Subject: Re: Computer architects leaving Intel...
Date: Fri, 06 Sep 2024 13:26:42 GMT
Organization: Institut fuer Computersprachen, Technische Universitaet Wien
Lines: 181
Message-ID: <2024Sep6.152642@mips.complang.tuwien.ac.at>
References: <2024Aug30.161204@mips.complang.tuwien.ac.at> <vautmu$vr5r$1@dont-email.me> <2024Aug31.170347@mips.complang.tuwien.ac.at> <vavpnh$13tj0$2@dont-email.me> <vb2hir$1ju7q$1@dont-email.me> <8lcadjhnlcj5se1hrmo232viiccjk5alu4@4ax.com> <vb4amr$2rcbt$1@dont-email.me> <2024Sep5.133102@mips.complang.tuwien.ac.at> <vbchiv$cde4$1@dont-email.me> <2024Sep5.174939@mips.complang.tuwien.ac.at> <ljuc4fF86o3U2@mid.individual.net> <2024Sep6.092535@mips.complang.tuwien.ac.at> <20240906135718.00004f84@yahoo.com>
Injection-Date: Fri, 06 Sep 2024 16:27:06 +0200 (CEST)
Injection-Info: dont-email.me; posting-host="2e85e44354343afe68fe387c30e9d02d";
	logging-data="908895"; mail-complaints-to="abuse@eternal-september.org";	posting-account="U2FsdGVkX18U+KOCHGRuRScBLLmHCAOY"
Cancel-Lock: sha1:ub351VFsQUJ3LjI3P9M4HYqMBgU=
X-newsreader: xrn 10.11
Bytes: 7419

Michael S <already5chosen@yahoo.com> writes:
>On Fri, 06 Sep 2024 07:25:35 GMT
>anton@mips.complang.tuwien.ac.at (Anton Ertl) wrote:
>> What gcc produces for both formulations is longer than
>> 
>> dec %rdi
>> jno ...
>>
>
>Good trick.

Thanks.  It's not from me.  I published it in 2015
<https://www.complang.tuwien.ac.at/kps2015/proceedings/KPS_2015_submission_29.pdf>,
but unfortunately did not give a reference to where I have it from (I
read it elsewhere).

>The same trick in non-destructive form would be 1 byte longer.
> cmp $1, %rdi
> jno ...
>
>But I was not able to force any of compilers currently installed on my
>home desktop (gcc 13.2, clang 18.1, MSVC 19.30.30706 == VS2022) to
>produce such code. 
>
>The closest was MSVC that sometimes (not in all circumstances) produces
>2 bytes longer versiin:
>  49 8d 49 ff             lea    -0x1(%r9),%rcx
>  4c 3b c9                cmp    %rcx,%r9
>
>Of course, it's still good deal shorter than 
>  48 ba 00 00 00 00 00 00 00 80  movabs $0x8000000000000000,%rdx
>  4c 3b ca                       cmp    %rdx,%r9
>
>Both gcc and clang [under -fwrapv] insisted on turning x<=x-1 into
>x==LLONG_MIN.
>
>However even if we were able to force compiler to produce desired code,
>the space saving is architecture-specific.

With this gcc-specific code we can force it:

extern long foo1(long);
extern long foo2(long);

long bar(long a, long b)
{
  long c;
  if (__builtin_sub_overflow(b,1,&c))
    return foo1(a);
  else
    return foo2(a);
}

gcc -O3 -c and gcc -Os -c (gcc-12.2) produce, on AMD64:

   0:   48 83 c6 ff             add    $0xffffffffffffffff,%rsi
   4:   70 05                   jo     b <bar+0xb>
   6:   e9 00 00 00 00          jmp    b <bar+0xb>
   b:   e9 00 00 00 00          jmp    10 <bar+0x10>

So, even though %rsi is dead afterwards, it does not use dec, but it's
certainly better than the other variants.

On Arch A64 both gcc invocations (gcc-10.2) produce:

   0:   f1000421        subs    x1, x1, #0x1
   4:   54000046        b.vs    c <bar+0xc>
   8:   14000000        b       0 <foo2>
   c:   14000000        b       0 <foo1>

On RV64GC bith gcc invocations (gcc-10.3) produce:

0000000000000000 <bar>:
   0:   fff58793                addi    a5,a1,-1
   4:   00f5c663                blt     a1,a5,10 <.L6>
   8:   00000317                auipc   t1,0x0
   c:   00030067                jr      t1 # 8 <bar+0x8>

0000000000000010 <.L6>:
  10:   00000317                auipc   t1,0x0
  14:   00030067                jr      t1 # 10 <.L6>

So on RISC-V gcc manages to actually translate the if back into "if (b
< b-1)" without pessimising that (but gcc-10 does not pessimize this
code on AMD64, either.

>E.g. I expect no saving on ARM64 where both variants occupie 8 bytes.

Here we have the three variants:

#include <limits.h>

extern long foo1(long);
extern long foo2(long);

long bar(long a, long b)
{
  long c;
  if (__builtin_sub_overflow(b,1,&c))
    return foo1(a);
  else
    return foo2(a);
}

long bar2(long a, long b)
{
  if (b < b-1)
    return foo1(a);
  else
    return foo2(a);
}

long bar3(long a, long b)
{
  if (b == LONG_MIN)
    return foo1(a);
  else
    return foo2(a);
}

And here is what gcc-10 -Os -fwrapv -Wall -c produces:

ARM A64:
subs x1, x1, #0x1    sub  x2, x1, #0x1      mov  x2, #0x8000000000000000 
b.vs c <bar+0xc>     cmp  x2, x1            cmp  x1, x2                  
                     b.le 20 <bar2+0x10>    b.ne 34 <bar3+0x10>

RV64GC:
addi a5,a1,-1        addi a5,a1,-1          li   a5,-1         
bge  a1,a5,10 <.L4>  bge  a1,a5,28 <.L6>    slli a5,a5,0x3f    
                                            bne  a1,a5,40 <.L8>
8 Bytes              8 Bytes                8 Bytes

AMD64:
add $-1,%rsi         lea -0x1(%rsi),%rax    mov $0x1,%eax     
jo  b <bar+0xb>      cmp %rsi,%rax          shl $0x3f,%rax       
                     jle 1e <bar2+0xe>      cmp %rax,%rsi     
                                            jne 36 <bar3+0x13>
6 Bytes              9 Bytes                14 Bytes

With gcc-12 on AMD64:
add -1,%rsi          mov $0x1,%eax          mov $0x1,%eax     
jo  b <bar+0xb>      shl $0x3f,%rax         shl $0x3f,%rax      
                     cmp %rax,%rsi          cmp %rax,%rsi     
                     jne 23 <bar2+0x13>     jne 23 <bar2+0x13>
6 Bytes              14 Bytes               14 Bytes

(Actually in the latter case gcc recognizes that bar2 and bar3 are
equivalent and jumps from bar3 to bar2, but I am sure that without
bar2, bar3 would look the same as bar2 does now).

So when gcc does not pessimize "b < b-1" into "b == LONG_MIN", the
straightforward code for the former has the same or smaller size, and
the same or smaller number of instructions on these architectures.
The "__builtin_sub_overflow(b,1,&c)" has the same or fewer bytes than
"b < b-1" and the same or fewer instructions.  So, with
straightforward translations "__builtin_sub_overflow(b,1,&c)"
dominates "b < b-1", which dominates "b == LONG_MIN".

As a new feature, gcc-12 recognizes "b < b-1" and pessimizes it into
the same code as "b == LONG_MIN".

>> Interestingly, the first idiom is a case where gcc recognizes what the
>> intention of the programmer is, and warns that it is going to
>> miscompile it.  The warning is good, the miscompilation not (but it
>> would be worse without the warning).
>>
>
>You had more luck with warnings than I did.
>In all my test cases both gcc and clang [in absence of -fwrapv]
>silently dropped the check and depended code.

Interesting.  I tried both "b < b-1" and "b >= b+1" and got no warning
(with gcc-10 and gcc-12), but I have seen a warning with one of those
idioms in the past.  Maybe someone decided that warning about this
idiom is unnecessary, while "optimizing" it is.

- anton
-- 
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
  Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>