From: Farley Flud
Subject: GNU/Linux Greatness: AVX 512 Assembly
Newsgroups: comp.os.linux.advocacy
Date: Sat, 25 Jan 2025 14:37:47 +0000
Message-Id: <181df652f994e0cb$34540$2484$802601b3@news.usenetexpress.com>

Assembly language programming is both extremely simple and extremely
fun.

Yes, simple. A CPU is a stupid beast and can only perform very simple
tasks.

Yes, fun. There is much enjoyment to be had in using these simple CPU
tasks, like Lego, to construct complex functionality.

AVX-512 is currently the way to go with assembly programming. AVX-512
operates on 512 bits at a time: 8 doubles, 16 floats, 8 long ints,
16 ints, or 64 chars (uint8_t) simultaneously.

With GNU/Linux, AVX-512 is totally at your command.

What follows is a very basic program that essentially does nothing.
It merely uses AVX-512 assembly to read a data block of arbitrary
length and then write that block back into different memory.

Its purpose is to illustrate how to step through memory at a given
stride to read all the data. Since not all data is a multiple of
512 bits, the code shows how to deal with any trailing elements.

For the sake of illustration the following assembly code reads/writes
37 unsigned integers. These will fill 2 AVX-512 registers with 5 uints
left over. Those final 5 are handled with masking. But any data block,
up to 2^64 bytes (whew!), can be handled with this simple code.

This program is written in NASM assembly. NASM is the fucking best
assembler on planet Earth, hands down.
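
For anyone who wants to check the bookkeeping before firing up a
debugger, here is a rough scalar sketch in C of the same idea: split
the block into full 16-lane chunks plus a masked tail. The names
LANES, tail_mask and copy_blocks are made up for illustration and
appear nowhere in the assembly. It assumes 32-bit elements and models
the masked step with an ordinary loop, so it is only a model of the
arithmetic, not AVX-512 code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { LANES = 16 };    /* assumed: 32-bit lanes in one 512-bit register */

/* Mask with the low `rem` bits set; rem must be less than LANES. */
static uint16_t tail_mask(uint64_t rem)
{
    assert(rem < LANES);
    return (uint16_t)((1u << rem) - 1u);
}

/* Scalar model of the copy: whole blocks first, then one masked block. */
static void copy_blocks(uint32_t *dst, const uint32_t *src, uint64_t n)
{
    uint64_t full = n / LANES;      /* number of complete blocks  */
    uint64_t rem  = n % LANES;      /* elements left for the tail */
    uint16_t k    = tail_mask(rem);

    for (uint64_t b = 0; b < full; b++)
        memcpy(dst + b * LANES, src + b * LANES, LANES * sizeof *src);

    /* Only lanes whose mask bit is set are read or written. */
    for (uint64_t i = 0; i < LANES; i++)
        if (k & (1u << i))
            dst[full * LANES + i] = src[full * LANES + i];
}
```

For the 37-element case this gives 37 / 16 = 2 full blocks, a
remainder of 5, and tail_mask(5) = 0x1f, i.e. five one-bits. Calling
copy_blocks(out, in, 37) touches exactly 37 elements of each array.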

As I indicated, this program does essentially nothing. There is no
output. To view the "results" use the GDB debugger or, better, the
front end DDD. With DDD one can step through the code to watch the
action unfold.

Feast thine bloodshot, jaundiced eyeballs on absolutely perfect
AVX-512 assembly code:

==================================
Begin AVX-512 NASM Assembly
==================================

BITS 64

segment .text
global _start

_start:
    mov     r8, data_in
    mov     r9, data_out
    mov     rbx, qword [stride]
    xor     rdx, rdx
    mov     rax, qword [N]
    div     rbx             ; rax = quotient, rdx = remainder
    test    rax, rax        ; fewer elements than one full register?
    jz      tail            ; then go straight to the masked part

load:
    vmovdqa32 zmm1, zword [r8]
    vmovdqa32 zword [r9], zmm1
    add     r8, 64          ; increment data pointers
    add     r9, 64
    dec     rax
    jnz     load

tail:
    xor     r11, r11        ; load mask, i.e. only rdx left over to load
    mov     r10, -1
    mov     rcx, rdx
    shld    r11, r10, cl    ; shift cl one-bits into the low end of r11
    kmovq   k1, r11
    vmovdqa32 zmm1{k1}{z}, zword [r8]
    vmovdqa32 zword [r9]{k1}, zmm1  ; masked store, stays inside data_out

exit:
    xor     edi, edi
    mov     eax, 60         ; sys_exit
    syscall

segment .data
align 64
N:      dq 37               ; set length of block and stride
stride: dq 16

align 64
data_in:
    dd 16 dup (0xefbeadde)  ; dummy data
    dd 16 dup (0xfecaafde)
    dd 5 dup (0xefbeadde)

segment .bss
alignb 64
data_out: resd 37

==================================
End AVX-512 NASM Assembly
==================================

-- 
Gentoo: The Fastest GNU/Linux Hands Down