Beeps and melodies in two-way radio

Lately my listening activities have focused on two-way FM radio. I'm interested in automatic monitoring and visualization of multiple channels simultaneously, and classifying transmitters. There's a lot of in-band signaling to be decoded! This post shall demonstrate this diversity and also explain how my listening station works.

Background: walkie-talkies are fun

The frequency band I've recently been listening to the most is called PMR446. It's a European band of radio frequencies for short-distance UHF walkie-talkies. Unlike ham radio, it doesn't require licenses or technical competence – anyone with 50€ to spare can get a pair of walkie-talkies at the department store. It's very similar to FRS in the US. It's quite popular where I live.

[Image: Photo of three different walkie-talkies.]

The short-distance nature of PMR446 is what I find perhaps most fascinating: in normal conditions, everything you hear has been transmitted from a 2-kilometer (1.3-mile) radius. Transmitter power is limited to 500 mW and directional antennas are not allowed on the transmitter side. But I have a receive-only system and a my only directional antenna is for 450 MHz, which is how I originally found these channels.

Roger beep

The roger beep is a short melody sent by many hand-held radios to indicate the end of transmission.

The end of transmission must be indicated, because two-way radio is 'half-duplex', which means only one person can transmit at a time. Some voice protocols solve the same problem by mandating the use of a specific word like 'over'; others rely on the short burst of static (squelch tail) that can be heard right after the carrier is lost. Roger beeps are especially common in consumer radios, but I've heard them in ham QSOs as well, especially if repeaters are involved.

Other signaling on PMR

PMR also differs from ham radio in that many of its users don't want to hear random people talking on the same frequency; indeed, many devices employ tones or digital codes designed to silence unwanted conversations, called CTCSS, DCS, or coded squelch. They are very low-frequency tones that can't usually be heard at all because of filtering. These won't prevent others from listening to you though; anyone can just disable coded squelch on their device and hear everyone else on the channel.

Many devices also use a tone-based system for preventing the short burst of static, that classic walkie-talkie sound, from sounding whenever a transmission ends. Baofeng calls these squelch tail elimination tones, or STE for short. The practice is not standardized and I've seen several different sub-audible frequencies being used in the wild, namely 55, 62, and 260 Hz. (Edit: As correctly pointed out by several people, another way to do this is to reverse the phase of the CTCSS tone in the end, called a 'reverse burst'. Not all radios use it though; many opt to send a 55 Hz tone instead, even when they are using CTCSS.)

Some radios have a button called 'alarm' that sends a long, repeating melody resembling a 90s mobile phone ring tone. These melodies also vary from one radio to the other.

My receiver

I have a system in place to alert me whenever there's a strong enough signal matching an interesting set of parameters on any of the eight PMR channels. It's based on a Raspberry Pi 3B+ and an Airspy R2 SDR receiver. The program can play the live audio of all channels simultaneously, or one could be selected for listening. It also has an annotated waterfall view that shows traffic on the band during the last couple of hours:

[Image: A user interface with text-mode graphics, showing eight vertical lanes of timestamped information. The lanes are mostly empty, but there's an occasional colored bar with annotations like 'a1' or '62'.]

The computer is a headless Raspberry Pi with only SSH connectivity; that's why it's in text mode. Also, text-mode waterfall plots are cool!

The coloured bars indicate signal strength (colour) and the duty factor (pattern). The numbers around the bars are decoded squelch codes, STEs and roger beeps. Uncertain detections are greyed out. In this view we've detected roger beeps of type 'a1' and 'a2'; a somewhat rare 62 Hz STE tone; and a ring tone, or alarm (RNG).

Because squelch codes are designed to be read by electronic circuits and their frequencies and codewords are specified exactly, writing a digital decoder for them was somewhat straightforward. Roger beeps and ring tones, on the other hand, are only meant for the human listener and detecting them amongst the noise took a bit more trial-and-error.

Melody detection algorithm

The melody detection algorithm in my receiver is based on a fast Fourier transform (FFT). When loss of carrier is detected, the last moments of the audio are searched for tones thusly:

[Image: A diagram illustrating how an FFT is used to search for a melody. The FFT in the image is noisy and some parts of the melody can not be measured.]
  1. The audio buffer is divided up into overlapping 60-millisecond Hann-windowed slices.
  2. Every slice is Fourier transformed and all peak frequencies (local maxima) are found. Their center frequencies are refined using Gaussian peak interpolation (Gasior & Gonzalez 2004). We need this, because we're only going to allow ±15 Hz of frequency error.
  3. The time series formed by the strongest maxima is compared to a list of pre-defined 'tone signatures'. Each candidate tone signature gets a score based on how many FFT slices match (+) corresponding slices of the tone signature. Slices with too much frequency error subtract from the score ().
  4. Most tone signatures have one or more 'quiet zones', the quietness of which further contributes to the score. This is usually placed after the tone, but some tones may also have a pause in the middle.
  5. The algorithm allows second and third harmonics (with half the score), because some transmitters may distort the tones enough for these to momentarily overpower the fundamental frequency.
  6. Every possible time shift (starting position) inside the 1.5-second audio buffer is searched.
  7. The tone signature with the best score is returned, if this score exceeds a set threshold.

This algorithm works quite well. It's not always able to detect the tones, especially if part of the melody is completely lost in noise, but it's good enough to be used for waterfall annotation. False positives are rare; most of them are detections of very short tone signatures that only consist of one or two beeps. My test dataset of 92 recorded transmissions yields only 5 false negatives and no false positives.

For example, this noisy recording:

was succesfully recognized as having a ringtone (RNG), a roger beep of type a1, and CTCSS code XA:

Naming and classification

Because I love classifying stuff I've had to come up with a system for naming these roger tones as well. My current system uses a lower-case letter for classifying the tone into a category, followed by a number that differentiates similar but slightly different tones. This is a work in progress, because every now and then a new kind of tone appears.

My goal would be to map the melodies to specific manufacturers. I've only managed to map a few. Can you recognise any of these devices?

ClassIdentified modelRecording
aCobra AM845 (a1)
cMotorola TLKR T40 (c1)
hBaofeng UV-5RC

I didn't list them all here, but there are even more samples. I've added some alarm tones there as well, and a list of all the tone signatures that I currently know of. (Why no full source code? FAQ)

In my rx log I also have an emoji classification system for CTCSS codes. This way I can recognize a familiar transmission faster. A few examples below (there are 38 different CTCSS codes in total):

[Image: Two-character codes grouped into categories and paired with emoji. Four categories, namely fruit, sound, mammals, and scary. The fruit category has codes beginning with an M, and emoji for different fruit, etc.]

Future directions

There are mainly just minor bugs in my project trello at the moment, like adding the aforementioned emoji. But as the RasPi is not very powerful the DSP chain could be made more efficient. Sometimes a block of samples gets dropped. Currently it uses a bandpass-sampled filterbank to separate the channels, exploiting aliasing to avoid frequency shifting altogether:

This is quite fast. But the 1:20 decimation from the Airspy IQ data is done with SoX's 1024-point FIR filter and could possibly be done with fewer coefficients. Also, the RasPi has four cores, so half of the channels could be demodulated in a second thread. Currently all concurrency is thanks to SoX and pmrsquash being different processes.

Related posts


Animated line drawings with OpenCV

OpenCV is a pretty versatile C++ computer vision library. Because I use it every day it has also become my go-to tool for creating simple animations at pixel level, for fun, and saving them as video files. This is not one of its core functions but happens to be possible using its GUI drawing tools.

Below we'll take a look at some video art I wrote for a music project. It goes a bit further than just line drawings but the rest is pretty much just flavouring. As you'll see, creating images in OpenCV has a lot in common with how you would work with layers and filters in an image editor like GIMP or Photoshop.

Setting it up

It doesn't take a lot of boilerplate to initialize an OpenCV project. Here's my minimal CMakeLists.txt:

cmake_minimum_required (VERSION 2.8)
project                (marmalade)
find_package           (OpenCV REQUIRED)
add_executable         (marmalade
target_link_libraries  (marmalade ${OpenCV_LIBS})

I also like to set compiler flags to enforce the C++11 standard, but this is not necessary.

In the main .cc file I have:

#include "opencv2/imgproc/imgproc.hpp"
#include "opencv2/highgui/highgui.hpp"

Now you can build the project by just typing cmake . && make in the terminal.

Basic shapes

First, we'll need an empty canvas. It will be a matrix (cv::Mat) with three unsigned char channels for RGB at Full HD resolution:

const cv::Size video_size(1920, 1080);
cv::Mat mat_frame = cv::Mat::zeros(video_size, CV_8UC3);

This will also initialize everything to zero, i.e. black.

Now we can draw our graphics!

I had an initial idea of an endless cascade of concentric rings each rotating at a different speed. There might be color and brightness variations as well but otherwise it would stay static the whole time. You can't see a circle's rotation around its center, so we'll add some features to them as well, maybe some kind of bars or spokes.

A simplified render method for a ring would look like this:

void Ring::RenderTo(cv::Mat& mat_output) const {
  cv::circle(mat_output, 8 * center_, 8 * radius_, color_, 1, CV_AA, 3);
  for (const Bar& bar : bars()) {
    cv::line(mat_output, 8 * (center_ + bar.start), 8 * (center_ + bar.end),
             color_, 1, CV_AA, 3);

Drawing antialiased graphics at subpixel coordinates can make for some confusing OpenCV code. Here, all coordinates are multiplied by the magic number 8 and the drawing functions are instructed to do a bit shift of 3 bits (2^3 == 8). These three bits are used for the decimal part of the subpixel position.

The coordinated of the bars are generated for each frame based on the ring's current rotation angle.

Here are some rings at different phases of rotation. A bug leaves the innermost circle with no spokes, but it kind of looks better that way.

[Image: White concentric circles on a black background, with evenly separated lines connecting them.]

Eye candy: Glow effect

I wanted a subtle vector display look to the graphics, even though I wasn't aiming for any sort of realism with it. So the brightest parts of the image would have to glow a little, or spread out in space. This can be done using Gaussian blur.

Gaussian blur requires convolution, which is very CPU-intensive. I think most of the rendering time was spent calculating blur convolution. It could be sped up using threads (cv::parallel_for_) or the GPU (cv::cuda routines) but there was no real-time requirement in this hobby project.

There are a couple of ways to only apply the blur to the brightest pixels. We could blur a copy of the image masked with its thresholded version, for example. But I like to use look-up tables (LUT). This is similar to the curves tool in Photoshop. A look-up table is just a 256-by-1 RGB matrix that maps an 8-bit index to a colour. In this look-up table I just have a linear ramp where everything under 127 maps to black.

cv::Mat mat_lut = GlowLUT();
cv::Mat mat_glow;
cv::LUT(mat_frame, mat_lut, mat_glow);

Now when blurring, if we add the original image on top of the blurred version, its sharpness is preserved:

cv::GaussianBlur(mat_glow, mat_glow, cv::Size(0,0), 3.0);
mat_frame += 2 * mat_glow;
[Image: A zoomed view of a circle, showing the glow effect.]

The effect works unevenly on antialiased lines which adds a nice pearl-stringy look.

Eye candy: Tinted glass and grid lines

I created a vignetted and dirty green-yellow tinted look by multiplying the image per-pixel by an overlay made in GIMP. This has the same effect as having a "Multiply" layer mode in an image editor. Perhaps I was thinking of an old glass display, or Vectrex overlays. The overlay also has black grid lines that will appear black in the result. Multiplication doesn't change the color of black areas in the original, but I also added a copy of the overlay at 10% brightness to make it dimly visible in the background.

cv::Mat mat_overlay = cv::imread("overlay.png");
cv::multiply(mat_frame, mat_overlay, mat_frame, 1.f/255);
mat_frame += mat_overlay * 0.1f;
[Image: A zoomed view of a circle, showing the color overlay effect.]

Eye candy: Flicker

Some objects flicker slightly for an artistic effect. This can be headache-inducing if overdone, so I tried to use it in moderation. The rings have a per-frame probability for a decrease in brightness, which I think looks good at 60 fps.

if (randf(0.f, 1.f) < .0001f)
  color *= .5f;

The spokes will also sometimes blink upon encountering each other, and the whole ring flickers a bit when it first becomes visible.

Title text

An LCD matrix font was used for the title text. This was just a PNG image of 128 characters that was spliced up and rearranged. This can be done in OpenCV by using submatrices and rectangle ROIs:

cv::Mat mat_font = cv::imread("lcd_font.png");
const cv::Size letter_size(24, 32);
const std::string text("finally, the end of the "
                       "marmalade forest!");

int cursor_x = 0;
for (char code : text_) {
  int mx = code % 32;
  int my = code / 32;

  cv::Rect font_roi(cv::Point(mx * letter_size.width,
                              my * letter_size.height),
  cv::Mat mat_letter = mat_font(font_roi);

  cv::Rect target_roi(text_origin_.x + cursor_x,
                      mat_letter.cols, mat_letter.rows)

  cursor_x += letter_size.width;
[Image: A zoomed view of the text 'finally' with a glow and color overlay effect.]

Encoding the video

Now we can save the frames as a video file. OpenCV has a VideoWriter class for just this purpose. But I like to do this a bit differently. I encoded the frame images individually as BMP and just concatenated them one after the other to stdout:

std::vector<uchar> outbuf;
cv::imencode(".bmp", mat_frame, outbuf);
fwrite(, sizeof(uchar), outbuf.size(), stdout);

I then ran this program from a shell script that piped the output to ffmpeg for encoding. This way I could also combine it with the soundtrack in a single run.

make && \
 ./marmalade -p | \
 ffmpeg -y -i $AUDIOFILE -framerate $FPS -f image2pipe \
        -vcodec bmp -i - -s:v $VIDEOSIZE -c:v libx264 \
        -profile:v high -b:a 192k -crf 23 \
        -pix_fmt yuv420p -r $FPS -shortest -strict -2 \
        video.mp4 && \
 open video.mp4


The 1080p/60 version can be viewed by clicking on the gear wheel menu.

In pursuit of Otama's tone

It would be fun to use the Otamatone in a musical piece. But for someone used to keyboard instruments it's not so easy to play cleanly. It has a touch-sensitive (resistive) slider that spans roughly two octaves in just 14 centimeters, which makes it very sensitive to finger placement. And in any case, I'd just like to have a programmable virtual instrument that sounds like the Otamatone.

What options do we have, as hackers? Of course the slider could be replaced with a MIDI interface, so that we could use a piano keyboard to hit the correct frequencies. But what if we could synthesize a similar sound all in software?

Sampling via microphone

We'll have to take a look at the waveform first. The Otamatone has a piercing electronic-sounding tone to it. One is inclined to think the waveform is something quite simple, perhaps a sawtooth wave with some harmonic coloring. Such a primitive signal would be easy to synthesize.

[Image: A pink Otamatone in front of a microphone. Next to it a screenshot of Audacity with a periodic but complex waveform in it.]

A friend lended me her Otamatone for recording purposes. Turns out the wave is nothing that simple. It's not a sawtooth wave, nor a square wave, no matter how the microphone is placed. But it sounds like one! Why could that be?

I suspect this is because the combination of speaker and air interface filters out the lowest harmonics (and parts of the others as well) of square waves. But the human ear still recognizes the residual features of a more primitive kind of waveform.

We have to get to the source!

Sampling the input voltage to the Otamatone's speaker could reveal the original signal. Also, by recording both the speaker input and the audio recorded via microphone, we could perhaps devise a software filter to simulate the speaker and head resonance. Then our synthesizer would simplify into a simple generator and filter. But this would require opening up the instrument and soldering a couple of leads in, to make a Line Out connector. I'm not doing this to my friend's Otamatone, so I bought one of my own. I named it TÄMÄ.

[Image: A Black Otamatone with a cable coming out of its mouth into a USB sound card. A waveform with more binary nature is displayed on a screen.]

I soldered the left channel and ground to the same pads the speaker is connected to. I had no idea about the voltage range in advance, but fortunately it just happens to fit line level and not destroy my sound card. As you can see in the background, we've recorded a signal that seems to be a square wave with a low duty cycle.

[Image: Oscillogram of a square wave.]

This square wave seems to be superimposed with a much quieter sinusoidal "ring" at 584 Hz that gradually fades out in 30 milliseconds.

Next we need to map out the effect the finger position on the slider has on this signal. It seems to not only change the frequency but the duty cycle as well. This happens a bit differently depending on which one of the three octave settings (LO, MID, or HI) is selected.

The Otamatone has a huge musical range of over 6 octaves:

[Image: Musical notation showing a range from A1 to B7.]

In frequency terms this means roughly 55 to 3800 Hz.

The duty cycle changes according to where we are on the slider: from 33 % in the lowest notes to 5 % in the highest ones, on every octave setting. The frequency of the ring doesn't change, it's always at around 580 Hz, but it doesn't seem to appear at all on the HI setting.

So I had my Perl-based software synth generate a square wave whose duty cycle and frequency change according to given MIDI notes.

FIR filter 1: not so good

Raw audio generated this way doesn't sound right; it needs to be filtered to simulate the effects of the little speaker and other parts.

Ideally, I'd like to simulate the speaker and head resonances as an impulse response, by feeding well-known impulses into the speaker. The generated square wave could then be convolved with this response. But I thought a simpler way would be to create a custom FIR frequency response in REAPER, by visually comparing the speaker input and microphone capture spectra. When their spectra are laid on top of each other, we can read the required frequency response as the difference between harmonic powers, using the cursor in baudline. No problem, it's just 70 harmonics until we're outside hearing range!

[Image: Screenshot of Baudline showing lots of frequency spikes, and next to it a CSV list of dozens of frequencies and power readings in the Vim editor.]

I then subtracted one spectrum from another and manually created a ReaFir filter based on the extrema of the resulting graph.

[Image: Screenshot of REAPER's FIR filter editor, showing a frequency response made out of nodes and lines interpolated between them.]

Because the Otamatone's mouth can be twisted to make slightly different vowels I recorded two spectra, one with the mouth fully closed and the other one as open as possible.

But this method didn't quite give the sound the piercing nasalness I was hoping for.

FIR filter 2: better

After all that work I realized the line connection works in both directions! I can just feed any signal and the Otamatone will sound it via the speaker. So I generated a square wave in Audacity, set its frequency to 35 Hz to accommodate 30 milliseconds of response, played it via one sound card and recorded via another one:

[Image: Two waveforms, the top one of which is a square wave and the bottom one has a slowly decaying signal starting at every square transition.]

The waveform below is called the step response. The simplest way to get a FIR convolution kernel is to just copy-paste one of the repetitions. Strictly, to get an impulse response would require us to sound a unit impulse, i.e. just a single sample at maximum amplitude, not a square wave. But I'm not redoing that since recording this was hard enough already. For instance, I had to turn off the fridge to minimize background noise. I forgot to turn it back on, and now I have a box of melted ice cream and a freezer that smells like salmon. The step response gives pretty good results.

One of my favorite audio tools, sox, can do FFT convolution with an impulse response. You'll have to save the impulse response as a whitespace-separated list of plaintext sample values, and then run sox original.wav convolved.wav fir response.csv.

Or one could use a VST plugin like FogConvolver:

[Image: A screenshot of Fog Convolver.]

A little organic touch

There's more to an instrument's sound than its frequency spectrum. The way the note begins and ends, the so-called attack and release, are very important cues for the listener.

The width of a player's finger on the Otamatone causes the pressure to be distributed unevenly at first, resulting in a slight glide in frequency. This also happens at note-off. The exact amount of Hertz to glide depends on the octave, and by experimentation I stuck with a slide-up of 5 % of the target frequency in 0.1 seconds.

It is also very difficult to hit the correct note, so we could add some kind of random tuning error. But turns out this is would be too much; I want the music to at least be in tune.

Glides (glissando) are possible with the virtual instrument by playing a note before releasing the previous one. This glissando also happens in 100 milliseconds. I think it sounds pretty good when used in moderation.

I read somewhere (Wikipedia?) that vibrato is also possible with Otamatone. I didn't write a vibratio feature in the code itself, but it can be added using a VST plugin in REAPER (I use MVibrato from MAudioPlugins). I also added a slight flanger with inter-channel phase difference in the sample below, to make the sound just a little bit easier on the ears (but not too much).

Sometimes the Otamatone makes a short popping sound, perhaps when finger pressure is not firm enough. I added a few of these randomly after note-off.

Working with MIDI

We're getting on a side track, but anyway. Working with MIDI used to be straightforward on the Mac. But GarageBand, the tool I currently use to write music, amazingly doesn't have a MIDI export function. However, you can "File -> Add Region To Loop Library", then find the AIFF file in the loop library folder, and use a tool called GB2MIDI to extract MIDI data from it.

I used mididump from python-midi to read MIDI files.

Tyna Wind - lucid future vector

Here's TÄMÄ's beautiful synthesized voice singing us a song.

Descrambling split-band voice inversion with deinvert

Voice inversion is a primitive method of rendering speech unintelligible to prevent eavesdropping of radio or telephone calls. I wrote about some simple ways to reverse it in a previous post. I've since written a software tool, deinvert (on GitHub), that does all this for us. It can also descramble a slightly more advanced scrambling method called split-band inversion. Let's see how that happens behind the scenes.

Simple voice inversion

Voice inversion works by inverting the audio spectrum at a set maximum frequency called the inversion carrier. Frequencies near this carrier will thus become frequencies near zero Hz, and vice versa. The resulting audio is unintelligible, though familiar sentences can easily be recognized.

Deinvert comes with 8 preset carrier frequencies that can be activated with the -p option. These correspond to a list of carrier frequencies I found in an actual scrambler's manual, dubbed "the most commonly used inversion carriers".

The algorithm behind deinvert can be divided into three phases: 1) pre-filtering, 2) mixing, and 3) post-filtering. Mixing means multiplying the signal by an oscillation at the selected carrier frequency. This produces two sidebands, or mirrored copies of the signal, with the lower one frequency-inverted. Pre-filtering is necessary to prevent this lower sideband from aliasing when its highest components would go below zero Hertz. Post-filtering removes the upper sideband, leaving just the inverted audio. Both filters can be realized as low-pass FIR filters.

[Image: A spectrogram in four steps, where the signal is first cut at 3 kHz, then shifted up, producing two sidebands, the upper of which is then filtered out.]

This operation is its own inverse, like ROT13; by applying the same inversion again we get intelligible speech back. Indeed, deinvert can also be used as a scrambler by just running unscrambled audio through it. The same inversion carrier should be used in both directions.

Split-band inversion

The split-band scrambling method adds another carrier frequency that I call the split point. It divides the spectrum into two parts that are inverted separately and then combined, preventing ordinary inverters from fully descrambling it.

A single filter-inverter pair may already bring back the low end of the spectrum. Descrambling it fully amounts to running the inversion algorithm twice, with different settings for the filters and mixer, and adding the results together.

The problem here is to find these two frequencies. But let's take a look at an example from audio scrambled using the CML CMX264 split-band inverter (from a video by GBPPR2).

[Image: A spectrogram showing a narrow band of speech-like harmonics, but with a constant dip in the middle of the band.]

In this case the filter roll-off is clearly visible in the spectrogram and it's obvious where the split point is. The higher carrier is probably at the upper limit of the full band or slightly above it. Here the full bandwidth seems to be around 3200 Hz and the split point is at 1200 Hz. This could be initially descrambled using deinvert -f 3200 -s 1200; if the result sounds shifted up or down in frequency this could be refined accordingly.


On a single core of an i7-based laptop from 2013, deinvert processes a 44.1 kHz WAV file at 60x realtime speed (120x for simple inversion). Most of the CPU cycles are spent doing filter convolution, i.e. calculating the signal's vector dot product with the low-pass filter kernels:

[Image: A graph of the time spent in various parts of the call tree of the program, with the subtree leading to the dot product operation highlighted. It takes well over 80 % of the tree.]

For this reason deinvert has a quality setting (0 to 3) for controlling the number of samples in the convolution kernels. A filter with a shorter kernel is linearly faster to compute, but has a low roll-off and will leave more unwanted harmonics.

A quality setting of 0 turns filtering off completely, and is very fast. For simple inversion this should be fine, as long as the original doesn't contain much power above the inversion carrier. It's easy to ignore the upper sideband because of its high frequency. In split-band descrambling this leaves some nasty folded harmonics in the speech band though.

Here's a descramble of the above CMX264 split-band audio using all the different quality settings in deinvert. You will first hear it scrambled, and then descrambled with increasing quality setting.

The default quality level is 2. This should be enough for real-time descrambling of simple inversion on a Raspberry Pi 1, still leaving cycles for an FM receiver for instance:

(RasPi 1)Simple inversionSplit-band inversion
-q 016x realtime5.8x realtime
-q 16.5x realtime3.0x realtime
-q 22.8x realtime1.3x realtime
-q 31.2x realtime0.4x realtime

The memory footprint is less than four megabytes.

Future developments

There's a variant of split-band inversion where the inversion carrier changes constantly, called variable split-band. The transmitter informs the receiver about this sequence of frequencies via short bursts of data every couple of seconds or so. This data seems to be FSK, but it shall be left to another time.

I've also thought about ways to automatically estimate the inversion carrier frequency. Shifting speech up or down in frequency breaks the relationships of the harmonics. Perhaps this fact could be exploited to find a shift that would minimize this error?

Gramophone audio from photograph, revisited

"I am the atomic powered robot. Please give my best wishes to everybody!"

Those are the words uttered by Tommy, a childhood toy robot of mine. I've taken a look at his miniature vinyl record sound mechanism a few times before (#1, #2), in an attempt to recover the analog audio signal using only a digital camera. Results were noisy at best. The blog posts resurfaced in a recent IRC discussion which inspired me to try my luck with a slightly improved method.

Source photo

I'm using a photo of Tommy's internal miniature record I already had from previous adventures in 2012. This way, I don't have to perform another invasive operation on Tommy, as I already broke a plastic tab last time I opened him. But it also means I don't have control over the photographing environment.

The picture was taken with a DSLR and it's an uncompressed 8-bit color photo measuring 3000 by 3000 pixels. There's a fair amount of focus blur, chromatic aberration and similar distortions. But at this resolution, a clear pattern can be seen when zooming into the grooves.

[Image: Close-up shot of a miniature vinyl record, with a detail view of the grooves.]

This pattern superficially resembles a variable-area optical audio track seen in old film prints, and that's why I previously tried to decode it as such. But it didn't produce satisfactory results, and there is no physical reason it even should. In fact, I'm not even sure as to which physical parameter the audio is encoded in – does the needle move vertically or horizontally? How would this feature manifest itself in the photograph? Do the bright blobs represent crests in the groove, or just areas that happen to be oriented the right way in this particular lighting?


To make the grooves a little easier to follow I first unwrapped the circular record into a linear image. I did this by remapping the image space from polar to 9000-wide Cartesian coordinates and then resampling it with a windowed sinc kernel:

[Image: The photo of the circular record unwrapped into a long linear strip.]

Mapping the groove path

It's not easy to automatically follow the groove. As one would imagine, it's not a mathematically perfect spiral. Sometimes the groove disappears into darkness, or blurs into the adjacent track. But it wasn't overly tedious to draw a guiding path manually. Most of the work was just copy-pasting from a previous groove and making small adjustments.

I opened the unwrapped image in Inkscape and drew a colored polyline over all obvious grooves. I tried to make sure a polyline at the left image border would neatly continue where the previous one ended on the right side.

The grooves were alternatively labeled as 'a' and 'b', since I knew this record had two different sound effects on interleaved tracks.

[Image: A zoomed-in view of the unwrapped grooves labeled and highlighted with colored lines.]

This polyline was then exported from Inkscape and loaded by a script that extracted a 3-7 pixel high column from the unwrapped original, centered around the groove, for further processing.

Pixels to audio

I had noticed another information-carrying feature besides just the transverse area of the groove: its displacement from center. The white blobs sometimes appear below or above the imaginary center line.

[Image: Parts of a few grooves shown greatly magnified. They appear either as horizontal stripes, or horizontally organized groups of distinct blobs.]

I had my script calculate the brightness mass center (weighted y average) relative to the track polyline at all x positions along the groove. This position was then directly used as a PCM sample value, and the whole groove was written to a WAV file. A noise reduction algorithm was also applied, based on sample noise from the silent end of the groove.

The results are much better than what I previously obtained (see video below, or mp3 here):

Future ideas

Several factors limit the fidelity and dynamic range obtained by this method. For one, the relationship between the white blobs and needle movement is not known. The results could possibly still benefit from more pixel resolution and color bit depth. The blob central displacement (insofar as it is the most useful feature) could also be more accurately obtained using a Gaussian fit or similar algorithm.

The groove guide could be drawn more carefully, as some track slips can be heard in the recovered audio.

Opening up the robot for another photograph would be risky, since I already broke a plastic tab before. But other ways to optically capture the signal would be using a USB microscope or a flatbed scanner. These methods would still be only slightly more complicated that just using a microphone! The linear light source of the scanner would possibly cause problems with the circular groove. I would imagine the problem of the disappearing grooves would still be there, unless some sort of carefully controlled lighting was used.

Virtual music box

A little music project I was writing required a melody be played on a music box. However, the paper-programmable music box I had (pictured) could only play notes on the C major scale. I couldn't easily find a realistic-sounding synthesizer version either. They all seemed to be missing something. Maybe they were too perfectly tuned? I wasn't sure.

Perhaps, if I digitized the sound myself, I could build a flexible virtual instrument to generate just the perfect sample for the piece!

[Image: A paper programmable music box.]

I haven't really made a sampled instrument before, short of perhaps using Impulse Tracker clones with terrible single-sample ones. So I proceeded in an improvised manner. Below I'll post some interesting findings and sound samples of how the instrument developed along the way. There won't be any source code as for now.

By the way, there is a great explanatory video by engineerguy about the workings of music boxes that will explain some terminology ("pins" and "teeth") used in this post.

Recording samples

[Image: A recording setup with a microphone.]

The first step was, obviously, to record the sound to be used as samples. I damped my room using towels and mattresses to minimize room echo; this could be added later if desired, but for now it would only make it harder to cleanly splice the audio. The microphone used was the Audio Technica AT2020, and I digitized it using the Behringer Xenyx 302 USB mixer.

I perforated a paper roll to play all the possible notes in succession, and rolled the paper through. The sound of the paper going through the mechanism posed a problem at first, but I soon learned to stop the paper at just the right moment to make way for the sound of the tooth.

Now I had pretty decent recordings of the whole two-octave range. I used Audacity to extract the notes from the recording, and named the files according to the actual playing MIDI pitch. (The music box actually plays a G# major scale, contrary to what's marked on the blank paper rolls.)

The missing notes

Next, we'll need to generate the missing notes that don't belong in the scale of this music box. Because pitch is proportional to the speed of vibration, this could be done by simply speeding up or slowing down an adjacent note by just the right factor. In equal temperament tuning, this factor would be the 12th root of 2, or roughly 1.05946. Such scaling is straightforward to do on the command line using SoX, for instance (sox c1.wav c_sharp1.wav speed 1.05946).

[Image: Musical notation explaining transposition by multiplication by the 12th root of 2.]

This method can also be used to generate whole new octaves; for example, a transposition of +8 semitones would have a ratio of (12√2)8 ≈ 1.5874. Inter-note variance could be retained by using a random source file for each resampled note. But large-interval transpositions would probably not sound very good due to coloring in the harmonic series.

Here's a table of some intervals and the corresponding speed ratios in equal temperament:

–3= (12√2)–3≈ 0.840896
–2= (12√2)–2≈ 0.890899
–1= (12√2)–1≈ 0.943874
+1= (12√2)1≈ 1.059463
+2= (12√2)2≈ 1.122462
+3= (12√2)3≈ 1.189207

First test!

Now I could finally write a script to play my melody!

It sounds pretty good already - there's no obvious noise and the samples line up seamlessly even though they were just naively glued together sample by sample. There's a lot of power in the lower harmonics, probably because of the big cardboard box I used, but this can easily be changed by EQ if we want to give the impression of a cute little music box.

Adding errors

The above sound still sounded quite artificial, I think mostly because simultaneous notes start on the same exact millisecond. There seems to be a small timing variance in music boxes that is an important contributor to their overall delicate sound. In the below sample I added a timing error from a normal distribution with a standard deviation of 11 milliseconds. It sounds a lot better already!

Other sounds from the teeth

If you listen to recordings of music boxes you can occasionally hear a high-pitched screech as well. It sounds a bit like stopping a tuning fork or guitar string with a metal object. That's why I thought it must be the sound of the pin stopping a vibrating tooth just before playing another note on the same tooth.

[Image: Spectrogram of the beginning of a note with the characteristic screech, centered around 12 kilohertz.]

Sure enough, this sound could always be heard by playing the same note twice in quick succession. I recorded this sound for each tooth and added it to my sound generator. The sound will be generated only if the previous note sample is still playing, and its volume will be scaled in proportion to the tooth's envelope amplitude at that moment. Also, it will silence the note. The amount of silence between the screech and the next note will depend on a tempo setting.

Adding this resonance definitely brings about a more organic feel:

The wind-up mechanism

For a final touch I recorded sounds from the wind-up mechanism of another music box, even though this one didn't have one. It's all stitched up from small pieces, so the number of wind-ups in the beginning and the speed of the whirring sound can all be adjusted. I was surprised at the smoothness of the background sound; it's a three-second loop with no cross-fading involved. You can also hear the box lid being closed in the end.


[Image: VIM screenshot of a text file containing music box markup.]

The native notation of a music box is some kind of a perforated tape or drum, so I ended up using a similar format. There's a tempo marking and tuning information in the beginning, followed by notation one eighth per line. Arpeggios are indicated by a pointy bracket >. I also wrote a script to convert MIDI files into this format; but the number of notes in a music box loop is usually so small that it's not very hard to write manually.

This format could include additional information as well, perhaps controlling the motor sound or box size and shape (properties of the EQ filter).

This format could also potentially be useful when producing or transcribing music from music drums.

Future developments

Currently the music box generator has a hastily written "engineer's UI", which means I probably won't remember how to use it in a couple months' time. Perhaps it could it be integrated into some music software, as a plugin.

Possibilities for live performances are limited, I think. It wouldn't work exactly like a keyboard instrument usually does. At least there should be a way to turn on the background noise, and the player should take into account the 300-millisecond delay caused by the pin slowly rotating over the tooth. But it could be used to play a roll in an endless loop and the settings could be modified on the fly.

As such, the tool performs best at pre-rendering notated music. And I'm happy with the results!

CTCSS fingerprinting: a method for transmitter identification

Identifying unknown radio transmitters by their signals is called radio fingerprinting. It is usually based on rise-time signatures, i.e. characteristic differences in how the transmitter frequency fluctuates at carrier power-up. Here, instead, I investigate the fingerprintability of another feature in hand-held FM transceivers, known as CTCSS or Continuous Tone-Coded Squelch System.

Motivation & data

I came across a long, losslessly compressed recording of some walkie-talkie chatter and wanted to know more about it, things like the number of participants and who's talking with who. I started writing a transcript – a fun pastime – but some voices sounded so similar I wondered if there was a way to tell them apart automatically.

[Image: Screenshot of Audacity showing an audio file over eleven hours long.]

The file comprises several thousand short transmissions as FM demodulated audio lowpass filtered at 4500 Hz. Signal quality is variable; most transmissions are crisp and clear but some are buried under noise. Passages with no signal are squelched to zero.

I considered several potentially fingerprintable features, many of them unrealistic:

  • Carrier power-up; but many transmissions were missing the very beginning because of squelch
  • Voice identification; but it would probably require pretty sophisticated algorithms (too difficult!) and longer samples
  • Mean audio power; but it's not consistent enough, as it depends on text, tone of voice, etc.
  • Maximum audio power; but it's too sensitive to peaks in FM noise

I then noticed all transmissions had a very low tone at 88.5 Hz. It turned out to be CTCSS, an inaudible signal that enables handsets to silence unwanted transmissions on the same channel. This gave me an idea inspired by mains frequency analysis: Could this tone be measured to reveal minute differences in crystal frequencies and modulation depths? Also, knowing that these were recorded using a cheap DVB-T USB stick – would it have a stable enough oscillator to produce consistent measurements?


I used the liquid-dsp library for signal processing. It has several methods for measuring frequencies. I decided to use a phase-locked loop, or PLL; I could have also used FFT with peak interpolation.

In my fingerprinting tool, the recording is first split into single transmissions. The CTCSS tone is bandpass filtered and a PLL starts tracking it. When the PLL frequency stops fluctuating, i.e. the standard deviation is small enough, it's considered locked and its frequency is averaged over this time. The average RMS power is measured similarly.

Here's one such transmission:

[Image: A graph showing frequency and power, first fluctuating but then both stabilize for a moment, where text says 'PLL locked'. Caption says 'No, I did not copy'.]


At least three clusters are clearly distinguishable by eye. Zooming in to one of the clusters reveals it's made up of several smaller clusters. Perhaps the larger clusters correspond to three different models of radios in use, and these smaller ones are the individual transmitters?

[Image: A plot of RMS power versus frequency, with dots scattered all over, but mostly concentrated in a few clusters.]

A heat map reveals even more structure:

[Image: The same clusters presented in a gradual color scheme and numbered from 1 to 12.]

It seems at least 12 clusters, i.e. potential individual transmitters, can be distinguished.

Even though most transmissions are part of some cluster, there are many outliers as well. These appear to correspond to a very noisy or very short transmission. (Could the FFT have produced better results with these?)

Use as transcription aid

My goal was to make these fingerprints useful as labels aiding transcription. This way, a human operator could easily distinguish parties of a conversation and add names or call signs accordingly.

I experimented with automated k-means clustering, but that didn't immediately produce appealing results. Then I manually assigned 12 anchor points at apparent cluster centers and had a script calculate the nearest anchor point for all transmissions. Prior to distance calculations the axes were scaled so that the data seemed uniformly distributed around these points.

This automatic labeling proved quite sensitive to errors. It could be useful when listing possible transmitters for an unknown transmission with no context; distances to previous transmissions positively mentioning call signs could be used. Instead I ended up printing the raw coordinates and colouring them with a continuous RGB scale:

[Image: A few lines from a conversation between Boa 1 and Cobra 1. Numbers in different colors are printed in front of each line.]

Here the colours make it obvious which party is talking. Call signs written in a darker shade are deduced from the context. One sentence, most probably by "Cobra 1", gets lost in noise and the RMS power measurement becomes inaccurate (463e-6). The PLL frequency is still consistent with the conversation flow, though.


If CTCSS is not absolutely required in your network, i.e. there are no unwanted conversations on the frequency, then it can be disabled to prevent this type of fingerprinting. In Motorola radios this is done by setting the CTCSS code to 0. (In the menus it may also be called a PT code or Interference Eliminator code.) In many other consumer radios it's doesn't seem to be that easy.


CTCSS is a suitable signal for fingerprinting transmitters, reflecting minute differences in crystal frequencies and, possibly, FM modulation indices. Even a cheap receiver can recover these differences. It can be used when the signal is already FM demodulated or otherwise not suitable for more traditional rise-time fingerprinting.