Sujet : Re: If you're a fucking moron
De : bowman (at) *nospam* montana.com (rbowman)
Groupes : comp.os.linux.advocacy
Date : 16. Oct 2024, 02:39:10
Message-ID : <ln8jpuFlipbU3@mid.individual.net>
References : 1 2 3 4 5 6 7
User-Agent : Pan/0.149 (Bellevue; 4c157ba)
On Tue, 15 Oct 2024 21:20:04 -0000 (UTC), candycanearter07 wrote:
> No wonder youtube autocaptions are so unreliable.
One project for a TinyML course I took was using an Arduino Nano 33 BLE Sense
to handle wake words. Those are phrases like 'Alexa'. Currently the wake
word triggers the system, but almost all subsequent speech processing is
done by a server in the cloud. The objective is to someday have the
capability in a phone or an edge device to handle the whole process.
Eliminating a massive backend would be a big savings and would also
address privacy issues.
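The split described above is basically a two-stage loop; here's a minimal sketch, where the detector and the cloud backend are stand-ins I made up, not any real API:

```python
# Hypothetical sketch of the wake-word split: a cheap on-device check
# runs on every audio chunk, and only after the trigger fires does
# anything get handed to the (stubbed-out) cloud backend.
def wake_word_loop(audio_chunks, is_wake_word, send_to_cloud):
    awake = False
    for chunk in audio_chunks:
        if not awake:
            # cheap on-device check runs locally on every chunk
            awake = is_wake_word(chunk)
        else:
            # everything after the trigger goes to the backend
            send_to_cloud(chunk)

# Simulated stream: nothing leaves the device until 'alexa' is heard
sent = []
wake_word_loop(
    ["noise", "alexa", "turn", "on", "the", "lights"],
    is_wake_word=lambda c: c == "alexa",
    send_to_cloud=sent.append,
)
print(sent)  # ['turn', 'on', 'the', 'lights']
```

The privacy point falls out of the structure: everything before the wake word never leaves the loop.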
Anyway, I could train the board to recognize a few words like start, stop,
up, and down. Some were more reliable than others. Messing around, I could
get some feel for what the neural network model was looking for, so to
speak, and trick it. That's the problem with NNs: it's not clear what
they're really doing even when you understand the process. In this case the
microphone output was sampled by an A/D converter and used to create a
spectrogram.
https://en.wikipedia.org/wiki/Spectrogram

Ultimately, deciding whether the spoken command was 'start' or 'stop' came
down to image classification using the spectrogram. There are clipping,
scaling, and other manipulations to simplify the image all along the way,
but it worked. Mostly.
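The sampled-audio-to-spectrogram step can be sketched with plain numpy: slice the samples into overlapping frames, window each frame, and take the FFT magnitude. The frame and hop sizes here are illustrative, not what the course actually used:

```python
# Sketch of the spectrogram step: the 2-D array (frames x frequency
# bins) is what the classifier then treats as an image.
import numpy as np

def spectrogram(samples, frame_len=256, hop=128):
    """Return a (frames, bins) magnitude spectrogram."""
    window = np.hanning(frame_len)
    frames = []
    for start in range(0, len(samples) - frame_len + 1, hop):
        frame = samples[start:start + frame_len] * window
        # rfft keeps only the non-negative frequency bins
        frames.append(np.abs(np.fft.rfft(frame)))
    return np.array(frames)

# Fake "A/D converter" output: a 440 Hz tone sampled at 16 kHz
fs = 16000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 440 * t)

spec = spectrogram(tone)
print(spec.shape)                 # (124, 129)
peak_bin = spec[0].argmax()
print(peak_bin * fs / 256)        # 437.5 -- nearest bin to 440 Hz
```

A steady tone shows up as one bright row of bins; a spoken 'start' vs 'stop' differs in how that energy moves over time, which is what the image classifier picks up on.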
Autocaptioning probably breaks the speech into phonemes to be more
flexible, but given accents, inflections, poor pronunciations, and other
factors human listeners are skilled at handling, it's a challenge.