Path: ...!feeds.phibee-telecom.net!news.mixmin.net!news.swapon.de!fu-berlin.de!uni-berlin.de!individual.net!not-for-mail
From: rbowman <bowman@montana.com>
Newsgroups: comp.os.linux.advocacy
Subject: Re: If you're a fucking moron
Date: 16 Oct 2024 01:39:10 GMT
Lines: 31
Message-ID: <ln8jpuFlipbU3@mid.individual.net>
References: <dvhpfjd1rh7uheoien02arle31q9fhcd57@4ax.com>
 <8m6dnczkO_GAPpz6nZ2dnZfqnPSdnZ2d@giganews.com>
 <slrnvgdh2b.2nfc6.candycanearter07@candydeb.host.invalid>
 <pan$78b32$3adf0bd2$1deb5e7a$b8c37cfa@linux.rocks>
 <slrnvgg5g4.1gkvp.candycanearter07@candydeb.host.invalid>
 <lmqvr7FiegjU1@mid.individual.net>
 <slrnvgtmjv.3lkna.candycanearter07@candydeb.host.invalid>
Mime-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
X-Trace: individual.net SVPV3MHFzJTWs/zA9EMWTAibmq21LyQoNdqfnc7C8EQMCPUi3Z
Cancel-Lock: sha1:M3mFdOsgdDx+vmg9MC543luaaHQ= sha256:sCt/FDPKQatr5Ye6D8NWxoHm4HbFHvo++G+ESMPz9Ss=
User-Agent: Pan/0.149 (Bellevue; 4c157ba)
Bytes: 2622

On Tue, 15 Oct 2024 21:20:04 -0000 (UTC), candycanearter07 wrote:

> No wonder youtube autocaptions are so unreliable.

One project for a TinyML course I took was using an Arduino Nano 33 BLE
Sense to handle wake words. Those are phrases like 'Alexa'. Currently
the wake word triggers the system but almost all subsequent speech
processing is done by a server in the cloud. The objective someday is
to have the capability in a phone or an edge device to handle the whole
process. Eliminating a massive backend would be a big savings and would
also address the privacy issues.

Anyway, I could train the board to recognize a few words like start,
stop, up, and down. Some were more reliable than others. Messing
around, I could get some feel for what the neural network model was
looking for, so to speak, and trick it. That's the problem with NNs:
it's not clear what they are really doing even when you understand the
process.

In this case the microphone output was sampled by an A/D converter and
used to create a spectrogram.

https://en.wikipedia.org/wiki/Spectrogram

Ultimately, deciding whether the spoken command was 'start' or 'stop'
came down to image classification using the spectrogram. There is
clipping, scaling, and other manipulation to simplify the image all
along the way, but it worked. Mostly.

Autocaptioning probably breaks the speech into phonemes to be more
flexible, but given accents, inflections, poor pronunciations, and the
other factors human listeners are skilled at handling, it is a
challenge.
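
For anyone curious, the spectrogram step looks roughly like this in
Python with scipy. The 16 kHz sample rate and 30 ms windows are my
guesses at typical keyword-spotting settings, not necessarily what the
course material used:

import numpy as np
from scipy.signal import spectrogram

fs = 16000                           # assumed ADC sample rate
t = np.arange(fs) / fs               # one second of samples
audio = np.sin(2 * np.pi * 440 * t)  # stand-in for real mic output

# Short-time FFT: 30 ms windows with a 10 ms hop, a common choice
# for keyword spotting
freqs, times, Sxx = spectrogram(audio, fs=fs, nperseg=480, noverlap=320)

# Log scaling compresses the dynamic range before classification
log_spec = np.log(Sxx + 1e-10)
print(log_spec.shape)                # (freq bins, time frames), the "image"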
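
And the classification step, equally rough: a toy Keras model of the
kind these courses use, small enough to quantize down for the Nano's
flash. The layer sizes and the 49x40 input are assumptions on my part:

import tensorflow as tf

# Four keywords: start, stop, up, down
model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(49, 40, 1)),  # time frames x freq bins
    tf.keras.layers.Conv2D(8, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Conv2D(16, (3, 3), activation='relu'),
    tf.keras.layers.MaxPooling2D((2, 2)),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(4, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# After training, tf.lite.TFLiteConverter.from_keras_model() plus int8
# quantization gets it down to something a microcontroller can run.

The "trick it" part falls out of this picture: the model only ever sees
the spectrogram image, so anything that produces a similar-looking
image scores as the word.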