Path: news.eternal-september.org!eternal-september.org!feeder3.eternal-september.org!weretis.net!feeder8.news.weretis.net!reader5.news.weretis.net!news.solani.org!.POSTED!not-for-mail
From: Mild Shock <janburse@fastmail.fm>
Newsgroups: comp.lang.prolog
Subject: I didn't invent these things (Was: Will a decoder-only transformer
 also work?)
Date: Sun, 2 Mar 2025 22:39:27 +0100
Message-ID: <vq2j6f$v95h$1@solani.org>
References: <vpis5p$n6g2$1@solani.org> <vq0gvr$tuur$1@solani.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 8bit
Injection-Date: Sun, 2 Mar 2025 21:39:27 -0000 (UTC)
Injection-Info: solani.org;
	logging-data="1025201"; mail-complaints-to="abuse@news.solani.org"
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:128.0) Gecko/20100101
 Firefox/128.0 SeaMonkey/2.53.20
Cancel-Lock: sha1:zsLvHa8MHDgIWgSS/fNFohcotms=
X-User-ID: eJwFwQkBwDAIA0BLhCcUOZQO/xJ2F0Zw0hn02NhT6E/JeHqx0bpb10dwLJ8+80AKh1Wgmai0pPmpvoSnzQ80jxQL
In-Reply-To: <vq0gvr$tuur$1@solani.org>

Thank you for thinking that
I would invent these things:

 > Are you thinking that autoencoders
 > could play a bigger role in tasks like
 > language modeling

Nope, it is all in the papers, like here:

 > **Attention Is All You Need**
 > Vaswani et al., 2017
 > https://arxiv.org/abs/1706.03762

The conclusion says it is the same architecture
as autoencoders:

 > In this work, we presented the Transformer,
 > the first sequence transduction model based
 > entirely on attention, replacing the recurrent
 > layers most commonly used in encoder-decoder
 > architectures with multi-headed self-attention.

Same architecture, with a latent space between
the encoder and the decoder.
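
As a minimal sketch (toy sizes of my own choosing,
plain PyTorch modules, not the paper's
hyperparameters), the two wirings look like this:
the encoder output is the latent representation
the decoder cross-attends to, while a decoder-only
model, as in the Grokking code, is just one
causally masked self-attention stack:

# Minimal sketch, toy sizes only; standard PyTorch modules.
import torch
import torch.nn as nn

d_model, nhead, vocab = 64, 4, 100
emb = nn.Embedding(vocab, d_model)
src = emb(torch.randint(0, vocab, (10, 2)))  # (src_len, batch, d_model)
tgt = emb(torch.randint(0, vocab, (7, 2)))   # (tgt_len, batch, d_model)
causal = torch.triu(torch.full((7, 7), float("-inf")), diagonal=1)

# 1) Encoder-decoder: the encoder output ("memory") is the
#    latent representation the decoder cross-attends to.
enc_dec = nn.Transformer(d_model=d_model, nhead=nhead,
                         num_encoder_layers=2, num_decoder_layers=2)
out_ed = enc_dec(src, tgt, tgt_mask=causal)

# 2) Decoder-only: one stack of causally masked self-attention
#    layers; no separate encoder, no cross-attention.
block = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead)
dec_only = nn.TransformerEncoder(block, num_layers=2)
out_do = dec_only(tgt, mask=causal)

print(out_ed.shape, out_do.shape)  # both (7, 2, 64)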
Training the EN-DE ConvS2S Ensemble model
reported in Table 2 of the paper would take,
using my laptop GPU:

7.7e19 / 3e13 = 1 month

If I tried to train GPT 4.5 on my
laptop, it would take:

1e23 / 3e13 ≈ 100 years
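
Spelled out as a back-of-envelope sketch, under my
assumptions (a sustained 3e13 FLOP/s on the laptop
GPU, the 7.7e19 FLOPs reported for the ConvS2S
Ensemble in Table 2, and a guessed 1e23 FLOPs for
GPT 4.5, which is not a published figure):

# Back-of-envelope estimate; the throughput and the GPT 4.5
# figure are assumptions, not measurements.
SECONDS_PER_DAY = 86_400
SECONDS_PER_YEAR = 365.25 * SECONDS_PER_DAY

def train_time(total_flops, flops_per_sec=3e13):
    """Days and years at the given sustained throughput."""
    s = total_flops / flops_per_sec
    return s / SECONDS_PER_DAY, s / SECONDS_PER_YEAR

print(train_time(7.7e19))  # ConvS2S Ensemble: ~30 days, about a month
print(train_time(1e23))    # GPT 4.5 guess: ~38,600 days, about a century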

P.S.: The paper is the same Vaswani et al., 2017
that is referenced in the Python code of the
Grokking paper quoted below.

Mild Shock wrote:
> 
> Ok, my bad. You can of course also try a decoder-only.
> Just like here in this Python code example:
> 
>  > **Simple PyTorch Implementation of “Grokking”**
>  > We trained a standard decoder-only transformer (Vaswani et al., 2017)
>  > https://github.com/teddykoker/grokking
> 
> The transformer need not necessarily have an encoder and
> a latent space. It can also be decoder-only.
> 
> Mild Shock wrote:
>>
>> Very simple challenge conceptually, develop the idea
>> of Centipawn towards TicTacToe and implement the
>> game based on learning / training a transformer, and
>>
>> then executing it. All written in Prolog itself! Optional
>> bonus exercise, make the execution NNUE style, i.e.
>> incremental evaluation of the transformer.
>>
>> Centipawn - Chess Wiki
>> https://chess.fandom.com/wiki/Centipawn
>>
>> NNUE - Chess Programming Wiki
>> https://www.chessprogramming.org/NNUE
>