This month was spent primarily assisting David with RADAE testing. Most of this involved using a web-based interface to generate a file to send out over the air (and subsequently record from a remote SDR and decode).
I also did some work on a version of freedv-gui that is able to use the existing RADAE scripts to have a two-way QSO with someone else also running the same software. So far this appears to work fine on Linux and macOS, but I am running into challenges on Windows. The main challenge is that PyTorch and/or Python seem to run significantly slower in the Windows VM that I’m using than on other platforms, which means that decoding unfortunately can’t happen in real time using this setup. I’ll investigate this further, time permitting, but it’s possible that Windows users will need to use a PC with an NVIDIA GPU to use the modified version of freedv-gui.
Other than that, some minor bug fixes and GUI tweaks were made to ezDV and freedv-gui, namely adding the configuration filename to the title bar (for the latter) and increasing the maximum HTTP header length (for the former).
More information can be found in the commit history below:
Many digital voice systems have the ability to send small amounts of digital data in parallel with the compressed voice. For example in FreeDV we allocate a few bits/frame for call sign and grid square (location) information. This is a bit complex with RADAE, as we don’t actually send any “bits” over the system – it’s all analog PSK symbols.
So I’ve worked out a way to inject 25 bits/s of data into the ML network alongside the vocoder features. The ML magic spreads these bits across OFDM carriers and appears to do some sort of error protection, as I note the BER is quite low and it shows some robustness to multipath. I can tune the bit error rate (BER) by adjusting the loss function and bit rate; a few percent BER at low SNRs (where the voice link falls over) is typical.
The plot below shows the “loss” (RMS error) of the vocoder features as a function of SNR (Energy per symbol/noise density). The vertical axis is the mean square error of the vocoder features through the system – lower is better. It’s useful for comparing networks.
So “red” is model17, which is our control with no auxiliary data. Yellow was my first attempt at injecting data, and purple the final version. You can see purple and red are almost on top of each other, which suggests the vocoder speech quality has barely changed, despite the injection of the data. Something for nothing? Or perhaps this suggests the data bits consume a small amount of power compared to the vocoder features.
Much of this month was spent preparing for the August test campaign. I performed a dry run of some over the air (OTA) tests, leading to many tweaks and bug fixes. As usual, I spent a lot of time on making acquisition reliable. Sigh.
The automated tests (ctests) were invaluable, as they show up any effects of tuning one parameter on other system functions. They also let me test in simulation, rather than finding obscure problems through unrepeatable OTA tests. The loss function is a very useful measure for trapping subtle issues. A useful objective measure of speech quality is something I have been missing in many years of speech coding development. It’s sensitive to small errors, and saves a lot of time with listening tests.
I have developed a test procedure for the stored file phase of the August 2024 test campaign. The first phase of testing uses stored files (just like the April test campaign) but this time using the new PAPR optimised waveform and with a chirp header that lets us measure SNR. To make preparation and processing easier, I have developed a web based system for processing the Tx and Rx samples. This means the test team can decode RADAE samples by themselves, without using the command line Linux tools. A test team of about 10 people has been assembled and a few of them have already posted some interesting samples (thanks Yuichi, Simon, and Mooneer).
If you would like to actively participate in RADAE testing, please see this post.
The next phase of testing is real time PTT. The Python code runs in real time, so I have cobbled together a bash script based system (ptt_test.sh) – think of it as a crude command line version of freedv-gui. It works OK for me – I can transmit in real time using my IC-7200 to KiwiSDRs, and receive off air from the IC-7200. By using loop back sound devices I can also receive from a KiwiSDR. The script only runs on Linux and requires some knowledge of sound cards, but if I can find a few Linux-savvy testers we can use ptt_test.sh to obtain valuable early on-air experience with RADAE. This is an opportunity for someone to make the first live RADAE QSO.
An interesting side project was working with Mooneer to establish the feasibility of running RADAE on ezDV. Unfortunately, this looks unlikely. Modern machine learning systems really require a bit more CPU (like a 1GHz multi-core machine). Fortunately, this sort of CPU is pretty common now (e.g. a Raspberry Pi or cell phone). Once RADAE matures, we will need to reconsider our options for a “headless” adapter type platform.
We are ready to start another test campaign for the radio autoencoder (RADAE). This will consist of stored file tests (like the April campaign), and some real time PTT testing. The draft test procedure is here.
If you would like to join the team testing RADAE, please reach out to us directly or via the comments below.
To use FreeDV with commercial radios we have developed a series of “rig adapters” such as the SM1000 and now ezDV. These are embedded devices that run “headless” (no GUI) and connect between your SSB radio and a microphone/headset to allow it to run FreeDV.
Our latest prototype speech waveform is RADAE, which is showing promise of improved voice quality and robustness over our existing FreeDV modes and indeed SSB. RADAE uses machine learning (ML) and requires significantly more CPU and memory than existing FreeDV modes.
We would like to know if we can run RADAE on the ezDV, which is based around an ESP32-S3.
The RADAE “stack” consists of the RADAE encoder and decoder, and the FARGAN vocoder. The RADAE encoder and decoder require around 80 MMAC/s (million multiply-accumulates per second) each, and 1 Mbyte of RAM for the weights. The FARGAN vocoder (used only on receive) requires 1 Mbyte of weights, and around 300 MMAC/s of CPU. The CPU load is dominated by the FARGAN vocoder, which runs on receive. As the weights are quantised to 8 bits, the MACs can use 8 bit multiply accumulates, which suits many machines with 8 bit SIMD support.
In practice, you want plenty of overhead, so for a 300 MMAC/s algorithm a machine with more than 3x this capability will make the port “easy” (e.g. a recompile with a little SIMD assembly language for the heavy lifting). It also allows you to tweak the algorithm, and run other code on the same machine without worrying about real time issues. If the CPU is struggling you will spend a great deal of time optimizing the code and the algorithm – time that could be better spent elsewhere.
ezDV is based on an ESP32-S3 CPU which has two cores that run at about 240 MHz, has 512 kbytes of local (fast) memory, and 8 MBytes of slower PSRAM that is accessed over an SPI bus. It does have hardware acceleration for integer multiply accumulates.
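The budget described above can be written down as a few lines of arithmetic. This is only a sketch using the figures quoted in the text; the idea that the decoder and FARGAN loads simply add on receive, and the use of exactly 3x as the headroom factor, are assumptions made for illustration.

```python
# Figures quoted in the text (MMAC/s = million multiply-accumulates per second).
ENCODER_MMACS = 80
DECODER_MMACS = 80
FARGAN_MMACS = 300   # vocoder, used on receive only

HEADROOM = 3.0       # assumed margin, taken from the "3x" rule of thumb


def required_capability(load_mmacs, headroom=HEADROOM):
    """Machine capability (MMAC/s) needed to run a load with the given headroom."""
    return load_mmacs * headroom


tx_load = ENCODER_MMACS
rx_load = DECODER_MMACS + FARGAN_MMACS   # assumption: the two loads simply add

print(f"Tx load {tx_load} MMAC/s -> target {required_capability(tx_load):.0f} MMAC/s")
print(f"Rx load {rx_load} MMAC/s -> target {required_capability(rx_load):.0f} MMAC/s")
```

On these assumptions the receive side sets the requirement, which is why the vocoder figure is the one to compare against any candidate CPU.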
To answer our question, we developed a simple test program to characterize the ESP32. Many ML operations are “dot products”, or multiplying two vectors together. So we generated a 1Mbyte matrix in PSRAM, and performed a dot product with it one “row” at a time. The other input vector in the dot product was in fast internal memory. The idea was to exercise both the CPU and memory access performance in a way similar to RADAE, but without the hassle of porting the entire algorithm across.
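For readers who want to try the same idea on a desktop machine, here is a rough NumPy analogue of that measurement. It is not the program that was run on the ESP32-S3; the function name, the value range of the test data, and the repeat count are all hypothetical, and the Python loop overhead means the absolute numbers are not comparable with the embedded results.

```python
import time
import numpy as np


def measure_mmacs(matrix, vector, repeats=3):
    """Time row-by-row dot products and return (MMAC/s, result vector)."""
    rows, cols = matrix.shape
    vec32 = vector.astype(np.int32)
    out = np.zeros(rows, dtype=np.int32)

    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        for r in range(rows):
            # Accumulate in int32 so the narrow products cannot overflow.
            out[r] = np.dot(matrix[r].astype(np.int32), vec32)
        best = min(best, time.perf_counter() - start)

    macs = rows * cols   # one multiply-accumulate per matrix element
    return macs / best / 1e6, out


rng = np.random.default_rng(0)
for size in (128, 1024):   # small (cache friendly) and large matrices
    m = rng.integers(-8, 8, size=(size, size)).astype(np.int8)
    v = rng.integers(-8, 8, size=size).astype(np.int8)
    rate, _ = measure_mmacs(m, v)
    print(f"{size} x {size} int8: {rate:.1f} MMAC/s")
```

Processing one row at a time mirrors the structure of the test described above, where the large matrix streams in from slow memory while the other operand stays in fast memory.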
Results using a matrix containing 1M elements (1024 x 1024) for various data types. This does not fit entirely within the 32-64 KB of on-chip cache, so the ESP32-S3 needs to repeatedly access PSRAM to complete the operation. PSRAM was configured to execute at 120 MHz (currently experimental per Espressif).
| Data type | SIMD? | Raw time (us) | MMACS |
|---|---|---|---|
| int8 | No | 486 | 33 |
| int8 | Yes | 84 | 195 |
| int16 | No | 631 | 25 |
| int16 | Yes | 98 | 167 |
| int16 | Yes (using ESP-DSP matrix multiply) | 88 | 186 |
| int32 | No | 419 | 39 |
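The time and MMACS figures tabulated above can be cross-checked against each other, since multiply-accumulates per microsecond is numerically the same as MMAC/s. In the sketch below the per-run count of 128 x 128 = 16384 multiply-accumulates is an inference from the numbers, not something stated alongside them; with that count, each derived rate lands within 1 MMAC/s of the reported one.

```python
# (data type, SIMD, time in microseconds, reported MMACS) as tabulated above
rows = [
    ("int8",  "No",            486,  33),
    ("int8",  "Yes",            84, 195),
    ("int16", "No",            631,  25),
    ("int16", "Yes",            98, 167),
    ("int16", "Yes (ESP-DSP)",  88, 186),
    ("int32", "No",            419,  39),
]

MACS_PER_RUN = 128 * 128   # assumption: one multiply-accumulate per element

for dtype, simd, time_us, reported in rows:
    derived = MACS_PER_RUN / time_us   # MACs per microsecond == MMAC/s
    print(f"{dtype:5s} {simd:14s} reported {reported:3d}  derived {derived:6.1f}")
```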
Results using a matrix containing 16384 (128 x 128) elements for various data types. This smaller matrix fits entirely within the ESP32-S3’s cache, reducing the number of times that it has to go out to PSRAM.
Here is the source code for the program used to measure the ezDV performance.
As shown above, the performance of the matrix multiplication operation on the ESP32-S3 is highly dependent on the size of the matrices involved. For matrices that fit entirely within its internal RAM (either because it can fit within the internal RAM-backed PSRAM cache without many cache misses or because it was originally allocated entirely within internal RAM), performance is fairly reasonable for a micro-controller. In other applications, the ESP32-S3 is able to perform inference on smaller ML models with good performance.
Unfortunately, with larger matrices, the system becomes memory bandwidth limited extremely quickly. For instance, with int16, ESP-DSP’s matrix multiplication function is slightly more performant than handwritten SIMD assembly when the dataset fits entirely in internal RAM, but both are limited to approximately the same MMACS when the system repeatedly has to go out to PSRAM. Additionally, int8 using SIMD performs 2x better than int16 because it has to access PSRAM only half as often.
These results suggest we will not be able to run the RADAE stack on ezDV. While unfortunate, it is useful to reach this conclusion early so we can consider alternatives for an adapter style implementation of RADAE.
We thought this characterization testing might be useful for others using the ESP32 for ML and other CPU-intense applications, so as part of our open source project philosophy, we have written it up here to share.
This post was jointly written by Mooneer and David.
This month, the FreeDV application got a few updates:
The previous work on updating the Voice Keyer feature was finally completed and merged into the repository. This mainly consisted of updating the appearance of the voice keyer file’s name in the Voice Keyer button based on user feedback.
wxWidgets inside the Windows and macOS binary builds was updated to version 3.2.5.
Adjustment dials for the monitor volume (for both Voice Keyer and standard PTT) were added to their respective right-click menus.
Logic to automatically adjust the audio configuration upon detection of missing devices was removed by user request (mainly due to the feature never working properly).
ezDV also got the following updates:
The in-progress work on Ethernet support for ezDV was finally merged. This resulted in version 1.1.0 of the firmware being released as well as additional content added to the User’s Guide to document the required hardware modifications.
Minor code cleanup of the I2C bus handling due to deprecation of the “legacy” I2C driver by Espressif.
Updated the minimum ESP-IDF version to 5.3.
Reenabled asynchronous HTTP request handling (previously disabled due to an ESP-IDF bug that is now fixed).
More information can be found in the commit history below:
This month I’ve been working on a real time implementation of the Radio Autoencoder (RADAE), suitable for Push To Talk (PTT) use over the air.
One big step was refactoring the core Machine Learning (ML) encoder and decoder to a “stateful” design, that can be run on short (120ms) sequences of data, preserving state each time it is called. The result is a set of command line utilities that can work with streaming audio from a headset or radio. This example demonstrates the full receiver stack: the rx.f32 file (off-air float IQ samples) is decoded to audio samples that are played through your speakers:
I spent some time profiling, and with a little optimisation we now have a RADAE Tx and Rx that encode and decode in real time on desktop and laptop PCs. Quite surprising given it’s still Python code (with the heavy lifting performed in PyTorch and NumPy). With a little more work, we could use these streaming utilities to build a network based RADAE server, a sound card plug-in, or a “headless” RADAE system like the ezDV/SM1000.
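The “stateful” idea mentioned above can be illustrated with a toy example. This is not RADAE code: a plain FIR filter stands in for the ML encoder/decoder, the class name is hypothetical, and the 8 kHz sample rate is an assumption chosen only to turn 120 ms into a sample count. The point is that a block which carries its own state between calls gives the same output on short chunks as it would on the whole signal at once.

```python
import numpy as np


class StatefulFilter:
    """FIR filter that preserves its tail between calls (stand-in for a stateful layer)."""

    def __init__(self, taps):
        self.taps = np.asarray(taps, dtype=float)
        self.history = np.zeros(len(self.taps) - 1)

    def process(self, chunk):
        # Prepend the saved samples so the output is continuous across chunks.
        padded = np.concatenate([self.history, chunk])
        self.history = padded[len(padded) - len(self.history):]
        return np.convolve(padded, self.taps, mode="valid")


FS = 8000                    # assumed sample rate in Hz
CHUNK = FS * 120 // 1000     # 120 ms of samples per call

rng = np.random.default_rng(0)
signal = rng.standard_normal(10 * CHUNK)
taps = np.ones(32) / 32      # simple moving average

filt = StatefulFilter(taps)
streamed = np.concatenate(
    [filt.process(signal[i:i + CHUNK]) for i in range(0, len(signal), CHUNK)]
)
one_shot = np.convolve(signal, taps)[:len(signal)]

print("streaming matches one-shot:", np.allclose(streamed, one_shot))
```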
Our end goal for a RADAE implementation is a C callable library. While low technical risk, a C port is time consuming, and would delay testing the big unknowns in a new speech communication system such as RADAE. There is also the risk of significant rework of the C code if (when) there are any problems with the waveform. So our priority is to test the RADAE waveform against our requirements, and fortunately the Python version is fast enough for that already.
Over the years we’ve discovered many ways to break digital voice systems. These issues are much easier to fix in simulation so I’ve developed many intricate automated tests, for example tests that simulate slowly varying, stationary channels, and other tests that simulate fast fading like the northern European winter. Do carriers (sine waves) in the middle of a RADAE signal cause it to fall over or make it sync by accident? What happens if the Tx and Rx stations have slightly different sample clock frequencies? I won’t bore you with the details here, but a lot of work goes into this stuff.
While giving RADAE a hard time in simulation I tried the multipath disturbed (MPD) channel. This has 2 Hz fading and 4ms delay spread, and is encountered in Winter at high latitudes (e.g. NVIS communications during the UK Winter). It’s tough on HF modems. The mission here is “do not fall over with fast fading” – it’s OK if a few more dB of SNR is required. Here is a sample of what the off air received signal sounds like at 3dB SNR, followed by the decoded audio.
Despite the received signal dipping into the noise at times, RADAE seems to handle it OK. I designed the DSP equalization to handle fast fading, but only trained the ML network with a simulation of 1 Hz fading. So I was concerned the ML might fall over but this time we got lucky! Here is the spectrogram of the same signal – at times the fading completely wipes it out.
One innovation is an “End of Over” system. When a transmission ends, an “end of over” frame is sent and the Rx cleanly “squelches” the receive audio. Previous FreeDV modes would run on for a few seconds making R2D2 sounds, as from the receiver’s perspective it’s hard to know if the transmitter has finished or you are just in a fade.
On another topic this month I also set up a new WordPress host for this site, and spruced up the content a little. I’m more at home with DSP than SPF and MX records but with the kind support from VentraIP I got there eventually. Thanks Bruce Perens for hosting this site for the last few years.
If you are interested in helping out with the RADAE work I have been building up a list of small chunks of work that need doing using the GitHub Issues system. Many of them require general GitHub/C coding/Linux skills, and not hard core DSP or ML. I’ve listed the skills required in each Issue. Please (please!) discuss them with me first (using the Issue comment system) before kicking off your own PR – I have a really good idea what needs to be done and we need to stay focused.
I have written a test plan for the next phase of over the air (OTA) RADAE testing. The goals will be (a) crowd sourced testing of the latest PAPR-optimised waveform over a variety of channels using the stored file system (b) test real time, PTT conversations over real radio channels using RADAE. This will build our experience and no doubt uncover bugs that will require some rework. I’m on track to start this test campaign in August.
This month I’ve been working on the DSP detail work required for a practical HF waveform based on RADAE. Not as interesting as the Machine Learning (ML) work, but something we need to grind through for a real world HF speech system.
Acquisition
Acquisition is where we determine (a) is a received signal present and (b) if so what is its frequency offset and where each frame of “data” starts (coarse timing). The general approach is to search for the pilot symbols at the start of each frame over a grid of time and frequency points. The problem is complicated by the presence of noise, multipath, and high power ML data symbols.
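The grid search described above can be sketched as a brute-force correlation. This is a minimal illustration rather than the RADAE acquisition algorithm: the function name, the random-phase pilot, the 8 kHz sample rate, the grid spacing and the noise level are all assumptions, and there are no data symbols or multipath in the test signal.

```python
import numpy as np


def acquire(rx, pilot, fs, freqs_hz):
    """Return (metric, time_index, freq_hz) of the best pilot match in rx."""
    n = np.arange(len(pilot))
    best = (0.0, None, None)
    for f in freqs_hz:
        # Reference pilot shifted to the candidate frequency offset.
        ref = pilot * np.exp(2j * np.pi * f * n / fs)
        for t in range(len(rx) - len(pilot) + 1):
            # np.vdot conjugates its first argument: a correlation at time t.
            metric = abs(np.vdot(ref, rx[t:t + len(pilot)]))
            if metric > best[0]:
                best = (metric, t, f)
    return best


rng = np.random.default_rng(1)
FS = 8000.0                                     # assumed sample rate in Hz
pilot = np.exp(2j * np.pi * rng.random(512))    # unit amplitude, random phase
true_t, true_f = 137, 20.0                      # hypothetical timing and offset

# Noise everywhere, with the frequency-shifted pilot added at true_t.
rx = 0.2 * (rng.standard_normal(1024) + 1j * rng.standard_normal(1024))
n = np.arange(len(pilot))
rx[true_t:true_t + len(pilot)] += pilot * np.exp(2j * np.pi * true_f * n / FS)

freqs = np.arange(-50.0, 52.5, 2.5)             # candidate offsets in Hz
metric, t_hat, f_hat = acquire(rx, pilot, FS, freqs)
print(f"estimated timing {t_hat} samples, frequency offset {f_hat:+.1f} Hz")
```

A practical receiver would also need a threshold on the metric to decide whether a signal is present at all, which is part (a) of the problem above.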
In my earlier FreeDV work I built some ad-hoc acquisition algorithms but this time I took a more mathematical approach. The problem with RADAE is that it operates at very low SNRs which makes acquisition using traditional DSP difficult. Due to the PAPR optimisation the RMS power of the ML data symbols is higher than the classical DSP pilot symbols used for acquisition. While reduced PAPR is in general a good thing, it complicates detection of the pilots.
So I needed a deep dive into the math behind acquisition to get an extra boost in performance. Anyway, the sums showed me two ways I can improve acquisition performance, and it seems to be working well in simulation down to reasonably low SNRs.
Automated Tests
There has been a lot of RADAE code developed over the course of 2024, so much that I’m starting to lose track of it myself. So I’ve added a set of automated tests to make sure everything keeps working and help trap any bugs I might introduce as the code develops. It’s also a neat framework to guide future refactoring and a real time/C port.
Chirp SNR estimator
The April Over the Air (OTA) test campaign showed the need for a way to measure the SNR of off-air samples. It needs to work on HF multipath channels which tend to notch out various frequencies. After a few false starts, I’ve built a “chirp” based SNR estimator. At the start of a transmission, I send a few seconds of chirp signal that sweeps over a range of frequencies. The receiver script knows where this signal is and using a little math can come up with a good estimate of the actual channel SNR.
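To make the idea concrete, here is one simple way a known chirp could be used to estimate SNR. It is an assumption-laden sketch, not the estimator used in the RADAE scripts: it fits a single complex gain to the known chirp by least squares and treats the residual as noise, so it only suits a flat (non-multipath) channel. The function names, sweep range, duration and sample rate are hypothetical.

```python
import numpy as np


def make_chirp(fs, duration_s, f_start, f_stop):
    """Complex linear chirp with constant envelope."""
    t = np.arange(int(fs * duration_s)) / fs
    k = (f_stop - f_start) / duration_s          # sweep rate in Hz per second
    return np.exp(2j * np.pi * (f_start * t + 0.5 * k * t**2))


def estimate_snr_db(rx, chirp):
    """Least-squares fit of the known chirp; the residual is treated as noise."""
    gain = np.vdot(chirp, rx) / np.vdot(chirp, chirp)
    signal = gain * chirp
    noise = rx - signal
    s = np.mean(np.abs(signal) ** 2)
    n = np.mean(np.abs(noise) ** 2)
    return 10 * np.log10(s / n)


rng = np.random.default_rng(2)
chirp = make_chirp(fs=8000, duration_s=2.0, f_start=400.0, f_stop=2000.0)

true_snr_db = 3.0                                 # hypothetical channel SNR
noise_power = 10 ** (-true_snr_db / 10)           # chirp has unit power
noise = np.sqrt(noise_power / 2) * (
    rng.standard_normal(len(chirp)) + 1j * rng.standard_normal(len(chirp))
)
rx = np.exp(0.4j) * chirp + noise                 # unknown phase plus noise

snr_est_db = estimate_snr_db(rx, chirp)
print(f"true SNR {true_snr_db:.1f} dB, estimated {snr_est_db:.1f} dB")
```

On a channel that notches out parts of the sweep, a single-gain fit like this would count the distorted signal as noise, which is one reason a real estimator needs more care.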
Interesting Bugs
The previous round of OTA tests was in April. After thinking about the results I found some bugs in the waveform we tested.
I accidentally omitted the cyclic prefix in the waveform tested in April. The cyclic prefix protects us from intersymbol interference, so it “shouldn’t have worked” on HF channels. Exploring just why it worked (and worked rather well) is on the TODO list, and might explain the poor performance on DX channels (e.g. Japan to Australia). Sometimes accidents lead to “light bulb” moments.
Another possible bug is the use of a fixed timing estimate for the entire 10 second sample (we don’t adjust timing after the initial estimate). The ionosphere is changing all the time, and the Tx DAC and Rx ADC sample clocks are also slightly different, which means the correct timing varies over time. So a fixed timing estimate is a bad idea, and I was kind of lucky it worked on most of the samples we collected.
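A rough feel for the sample clock part of this effect comes from simple arithmetic. The 8 kHz sample rate and 50 ppm clock mismatch below are illustrative assumptions, not measured values; only the 10 second duration comes from the text.

```python
def timing_drift_samples(fs_hz, clock_error_ppm, duration_s):
    """Timing offset, in samples, accumulated from a sample clock mismatch."""
    return fs_hz * duration_s * clock_error_ppm * 1e-6


# Assumed values: 8 kHz sampling and a 50 ppm difference between Tx and Rx clocks.
drift = timing_drift_samples(fs_hz=8000, clock_error_ppm=50, duration_s=10)
print(f"timing drift over 10 s: about {drift:.1f} samples")
```

Under these assumptions the timing moves by a few samples over one stored file, before any change in the ionosphere is considered.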
Recent Progress and OTA Low PAPR Tests
So I figure the last few months of work is probably enough for this round of development:
Two new low PAPR waveforms (750 and 1500Hz RF bandwidth)
Acquisition system improvements
Addressing some bugs from the April 2024 test campaign
Chirp based SNR measurement to calibrate our OTA tests
While there are many possibilities for further development, I don’t want to go too far down any R&D rabbit holes without checking against real world performance. So I’m preparing for some more stored file OTA tests, to see how we are performing against our stated goals of low and high SNR performance that is competitive with SSB.
Here are some initial samples (using a sample of my voice) of the 1500Hz low PAPR waveform (model17) over a 2000km path at 14.250 MHz, at a few watts transmit power:
The SNR is measured from the chirp. The chirp signal has 0dB PAPR, so this is the SNR at the peak power of the SSB and RADAE signals. The RMS power and hence average SNR of the SSB signal would be about 6dB lower (-5.5dB), and the RADAE about 0.8dB lower (-0.3 dB). So with the same power amplifier, RADAE delivers about 5dB more power to the receiver than SSB.
An hour or so later I turned up the power to get a high SNR sample over the same 2000km path:
While much easier to understand, even at high SNR there is quite a bit of background noise with SSB (this could possibly be improved with DSP noise reduction). However there is some “vocoder” distortion on the RADAE signal as well – it’s not totally clean. You actually have to listen fairly carefully to hear differences between the low and high SNR RADAE samples. This might mean we’ve biased the training towards “low SNR”, rather than “highest quality”. These results also suggest we can run 1.5W rather than 100W, for similar speech quality, as 10log10(100/1.5) = 18dB.
While performing these tests I noticed a bunch of little things to look into:
A pop artifact in one of my samples that goes away when the input speech level changes. Suggests the ML is entering territory it hasn’t seen in training.
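The decibel figures quoted in this section follow from a couple of lines of arithmetic, shown here as a small sketch. The peak-to-average values of about 6 dB for SSB and 0.8 dB for RADAE are the ones given earlier in the text; the variable names are just for illustration.

```python
import math


def db(power_ratio):
    """Convert a power ratio to decibels."""
    return 10 * math.log10(power_ratio)


# 100 W versus 1.5 W transmit power.
power_saving_db = db(100 / 1.5)

# Average power relative to peak, for the same peak-limited power amplifier.
ssb_backoff_db = 6.0
radae_backoff_db = 0.8
average_power_advantage_db = ssb_backoff_db - radae_backoff_db

print(f"100 W vs 1.5 W: {power_saving_db:.1f} dB")
print(f"RADAE average power advantage over SSB: {average_power_advantage_db:.1f} dB")
```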
I’m not sure if my Tx power from my SSB radio is staying constant as intended with a low PAPR waveform – need to sample the actual Tx power and plot on the spec-an. I need to confirm all three signals are at the same peak power.
The high SNR RADAE speech quality isn’t consistent across samples, some speakers sound a bit better. This is subjective of course so needs a further look.
Tx Spurious
At high SNRs there is some out of band spurious Tx energy (e.g. from 2000 to 3000 Hz) in the PAPR optimised RADAE signal. We should remove this if possible.
Next Steps
Every time I put this technology over real radio channels I learn a lot and have a bunch more questions and tasks added to my TODO list. However I do feel it’s time to focus on building a real time system that we can test with real PTT conversations. Even a rudimentary system that has some teething problems will teach us a lot. We have several ML models we can try (e.g. high and low PAPR, 750 and 1500 Hz wide waveforms), and it’s quite easy to try others as our experience improves.
So I will continue working towards a real time implementation so we can get on the air and test this technology with real time PTT conversations. Some challenges ahead are (a) a state machine sync system that can acquire and determine when an over is complete (b) refactoring the code to run on modem frame size chunks rather than several seconds of samples (c) some way for anyone to run RADAE in real time (either in Python or a C port) with streaming audio (d) other chunks of DSP like tracking frequency, amplitude, and timing offsets as they evolve (e) a way to perform controlled tests and evaluate quality automatically – subjective reports and ad-hoc testing are not very reliable.
This month, a special demo version of freedv-gui was spun up that allowed the project to demo RADAE at Dayton Hamvention. In addition, functionality was added to allow the user to resize the “Msg” column in FreeDV Reporter and preserve the new size across executions.
ezDV also got the following changes:
Documentation/web UI updates to reflect the recent introduction of the Flex 8000 series radios.
Optimizations for Icom Wi-Fi support to reduce CPU usage.
Fixed a bug preventing ezDV from properly adjusting filters on Flex when in LSB mode.
Fixed a bug where ezDV transmitted the voice keyer one more time than configured.
Fixed a bug where ezDV was using the wrong underlying mode for FDVL mode on Flex.
More information can be found in the commit history below:
FreeDV once again was at Dayton Hamvention in Xenia, OH, where we shared a booth with the M17 project. At the booth, Mooneer Salem K6AQ and Mel Whitten K0PFX demoed the Radio Autoencoder efforts (using pre-generated audio) as well as the SM1000 and ezDV devices:
Additionally, Mooneer gave two presentations at Dayton. One presentation was related to the ezDV device as part of the TAPR Forum and another introduced FreeDV during the annual Digital Modes forum (unfortunately, technical issues prevented this one from properly being recorded).
All in all, it was another successful show. Hope to see you next year!