<181df652f994e0cb$34540$2484$802601b3@news.usenetexpress.com>


From: Farley Flud <fflud@gnu.rocks>
Subject: GNU/Linux Greatness: AVX 512 Assembly
Newsgroups: comp.os.linux.advocacy
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
Lines: 100
Path: ...!eternal-september.org!feeder3.eternal-september.org!usenet.blueworldhosting.com!diablo1.usenet.blueworldhosting.com!feeder.usenetexpress.com!tr2.iad1.usenetexpress.com!news.usenetexpress.com!not-for-mail
Date: Sat, 25 Jan 2025 14:37:47 +0000
Nntp-Posting-Date: Sat, 25 Jan 2025 14:37:47 +0000
X-Received-Bytes: 3073
Organization: UsenetExpress - www.usenetexpress.com
X-Complaints-To: abuse@usenetexpress.com
Message-Id: <181df652f994e0cb$34540$2484$802601b3@news.usenetexpress.com>
Bytes: 3482

Assembly language programming is both extremely simple and
extremely fun.

Yes, simple.  A CPU is a stupid beast and can only perform
very simple tasks.

Yes, fun.  There is much enjoyment to be had in using these
simple CPU tasks, like Lego, to construct complex functionality. 

AVX-512 is currently the way to go with assembly programming.
AVX-512 operates on 512 bits at a time: 8 doubles, 16 floats,
8 long ints, 16 ints, or 64 chars (uint8_t) simultaneously.

With GNU/Linux, AVX-512 is totally at your command.

What follows is a very basic program that essentially does
nothing.  It merely uses AVX-512 assembly to read a data
block of arbitrary length and then write that block back
into different memory.

Its purpose is to illustrate how to step through memory
at a given stride to read all the data.  Since not every data
block is a multiple of 512 bits, the code shows how to deal
with any trailing elements.

For the sake of illustration the following assembly code
reads/writes 37 unsigned integers.  These will fill 2 AVX-512
registers with 5 uints left over.  Those final 5 are handled
with masking.

But any data block, up to 2^64 bytes (whew!), can be handled with
this simple code.

This program is written in NASM assembly.  NASM is the fucking
best assembler on planet Earth, hands down.

As I indicated, this program does essentially nothing.  There is
no output.  To view the "results" use the GDB debugger or, better,
the front end DDD.  With DDD one can step through the code to watch
the action unfold. 

Feast thine bloodshot, jaundiced eyeballs on absolutely perfect
AVX-512 assembly code:

==================================
Begin AVX-512 NASM Assembly
==================================

BITS 64

segment .text
	global _start

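_start:
	mov r8, data_in		; r8 = source pointer
	mov r9, data_out	; r9 = destination pointer
	mov rbx, qword [stride]
	xor rdx, rdx		; clear high half of dividend rdx:rax
	mov rax, qword [N]
	div rbx 	; rax = quotient (full blocks), rdx = remainder
	test rax, rax	; N < stride means no full blocks at all
	jz tail
load:
	vmovdqa32 zmm1, zword [r8]
	vmovdqa32 zword [r9], zmm1
	add r8, 64 ; increment data pointers
	add r9, 64
	dec rax
	jnz load
tail:
	xor r11, r11 	; build mask with only the low rdx bits set
	mov r10, -1
	mov rcx, rdx
	shld r11, r10, cl	; shift cl one-bits from r10 into r11
	kmovq k1, r11		; kmovq needs AVX-512BW
	vmovdqa32 zmm1{k1}{z}, zword [r8]	; masked load, other lanes zeroed
	vmovdqa32 zword [r9]{k1}, zmm1	; masked store, stays inside data_out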
exit:	
	xor edi,edi
	mov eax,60
	syscall

segment .data
align 64
N:		dq 37 	;set length of block and stride
stride:		dq 16
align 64
data_in:	dd 16 dup (0xefbeadde) ;dummy data
		dd 16 dup (0xfecaafde)
		dd 5 dup (0xefbeadde)

segment .bss
alignb 64
data_out:	resd 37

==================================
End AVX-512 NASM Assembly
==================================



-- 
Gentoo: The Fastest GNU/Linux Hands Down