Capturing PAL video with an SDR (and a few dead-ends)

I play 1980s games, mostly Super Mario Bros., on the Nintendo NES console. It would be great to be able to capture live video from the console for recording speedrun attempts. Now, how to make the 1985 NES and the 2013 MacBook play together, preferably using hardware that I already have? This project diary documents my search for the answer.

Here's a spoiler – it did work:

[Image: A powered-on NES console and a MacBook on top of it, showing a Tetris title screen.]

Things that I tried first

A capture device

Video capture devices, or capture cards, are devices specially made for this purpose. There was only one cheap (~30€) capture device for composite video available locally, and I bought it in hope. But it wasn't readily recognized as a video device on the Mac, and there seemed to be no Mac drivers available. Having already almost capped my budget for this project, I then ordered a 5€ EasyCap device from eBay, as there was some evidence of Mac drivers online. The EasyCap was still making its way to Finland as of this writing, so I continued to pursue other routes.

PS: When the device finally arrived, it sadly seemed that the EasyCapViewer-Fushicai software only supports opening this device in NTSC mode. There's PAL support in later commits in the GitHub repo, but the project is old and can't be compiled anymore as Apple has deprecated QuickTime.

Even when they do work, a downside to many cheap capture devices is that they can only capture at half the true framerate (that is, at 25 or 30 fps).

CRT TV + DSLR camera

The cathode-ray tube television that I use for gaming could be filmed with a digital camera. This posed interesting problems: the camera's shutter must be timed so that a full scan is captured in every frame, to prevent temporal aliasing (dark stripes). This is why I used a DSLR camera with a full manual mode (a Canon EOS 550D in this case).

For the 50 Hz PAL television screen I used a camera frame rate of 25 fps and an exposure time of 1/50 seconds (set by camera limitations). The camera will miss every other frame of the original 50 fps video, but on the other hand, will get an evenly lit screen every time.

A Moiré pattern will also appear if the camera is focused on the CRT shadow mask. This is due to interference between two regular 2D arrays, the shadow mask in the TV and the CCD array in the camera. I got rid of this by setting the camera on manual focus and defocusing the lens just a bit.

[Image: A screen showing Super Mario Bros., and a smaller picture with Oona in it.]

This produced surprisingly good quality video, save for the slight jerkiness caused by the low frame rate (video). This setup was good for one-off videos. However, I could not use it for live streaming, because the camera could only record onto its SD card and not connect to the computer directly.

LCD TV + webcam

An old LCD TV that I have flickers significantly less than the CRT, and a webcam would give me live video. But the Microsoft LifeCam HD-3000 that I have offers only a binary choice for manual exposure (pretty much "none" and "lots"). At the higher setting the video was quite washed out, with lots of motion blur. The lower setting was so fast that the LCD appeared to have visible vertical scanning. Brightness was also heavily dependent on viewing angle, which caused gradients over the image. I had to film at a slightly elevated angle so that the upper part of the image wouldn't go too dark, and this made the video look like a bootleg movie copy.

[Image: A somewhat blurry photo of an LCD TV showing Super Mario Bros.]

Composite video

Now to capturing the actual video signal. The NES has two analog video outputs: one is composite video and the other an RF modulator, which has the same composite video signal modulated onto an AM carrier in the VHF television band plus a separate FM audio carrier. This is meant for televisions with no composite video input: the TV sees the NES as an analog TV station and can tune to it.

In composite video, information about brightness, colour, and synchronisation is encoded in the signal's instantaneous voltage. The bandwidth of this signal is at least 5 MHz, or 10 MHz when RF modulated, which would require a 10 MHz IQ sampling rate.

[Image: Oscillogram of one PAL scanline, showing hsync, colour burst, and YUV parts.]

I happen to have an Airspy R2 SDR receiver that can listen to VHF and take 10 million samples per second - could it be possible? I made a cable that can take the signal from the NES RCA connector to the Airspy SMA connector. And sure enough, when the NES RF channel selector is at position "3", a strong signal indeed appears on VHF television channel 3, at around 55 MHz.

Software choices

There's already an analog TV demodulator for SDRs - it's a plugin for SDR# called TVSharp. But SDR# is a Windows program and TVSharp doesn't seem to support colour. And it seemed like an interesting challenge to write a real-time PAL demodulator myself anyway.

I had been playing with analog video demodulation recently because of my HDMI Tempest project (video). So I had already written a C++ program that interprets a 10 Msps digitised signal as greyscale values and sync pulses and shows it live on the screen. Perhaps this could be used as a basis to build on. (It was never published, but apparently there is a similar project written in Java, called TempestSDR.)

Data transfer from the SDR is done using airspy_rx from airspy-tools. This is piped to my program that reads the data into a buffer, 256 ksamples at a time.

Automatic gain control is an important part of demodulating an AM signal. I used liquid-dsp's AGC by feeding it the maximum amplitude over every scanline period; this roughly corresponds to sync level. This is suboptimal, but it works in our high-SNR case. AM demodulation was done using std::abs() on the complex-valued samples. The resulting real value had to be subtracted from 1, because TV is transmitted "inverse AM" to save on the power bill. I then scaled the signal so that black level was close to 0, white level close to 1, and sync level below 0.
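The envelope detection and level mapping described above can be sketched as follows. This is a minimal illustration, not the post's actual code; the level constants would in practice come from the AGC's per-scanline measurements:

```cpp
#include <cassert>
#include <cmath>
#include <complex>

// Plain AM demodulation: the envelope of one IQ sample.
inline float am_envelope(std::complex<float> iq) { return std::abs(iq); }

// TV is "inverse AM": sync pulses have the strongest carrier and white the
// weakest. Map the envelope linearly so black ~ 0, white ~ 1, sync < 0.
// black_env and white_env are the measured envelope levels for black and
// white (hypothetical names, not from the original program).
inline float to_luma(float env, float black_env, float white_env) {
    return (black_env - env) / (black_env - white_env);
}
```

With, say, black at envelope 0.7 and white at 0.3, a sync-level envelope of 1.0 maps to a negative value, which is what the sync detector looks for.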

I use SDL2 to display the video and OpenCV for pixel addressing, scaling, cropping, and YUV-RGB conversions. OpenCV is an overkill dependency inherited from the Tempest project and SDL2 could probably do all of those things by itself. This remains TODO.

Removing the audio

The captured AM carrier seems otherwise clean, but there's an interfering peak on the lower sideband side at about –4.5 MHz. I originally saw it in the demodulated signal and thought it would be related to colour, as it's very close to the PAL chroma subcarrier frequency of 4.43361875 MHz. But when it started changing frequency in triangle-wave shapes, I realized it's the audio FM carrier. Indeed, when it is FM demodulated, beautiful NES music can be heard.

[Image: A spectrogram showing the AM carrier centered in zero, with the sidebands, chroma subcarriers and audio alias annotated.]

The audio carrier is actually outside this 10 MHz sampled bandwidth. But it's so close to the edge (and so powerful) that the Airspy's anti-alias filter cannot sufficiently attenuate it, and it becomes folded, i.e. aliased, onto our signal. This caused visible banding in the greyscale image, and some synchronization problems.
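The folding is easy to verify numerically. In PAL B/G the FM sound carrier sits 5.5 MHz above the vision carrier, so with 10 Msps complex sampling centered on the vision carrier it wraps to −4.5 MHz. A small illustration (mine, not from the post):

```cpp
#include <cassert>
#include <cmath>

// With complex (IQ) sampling at rate fs, a frequency f outside
// [-fs/2, fs/2) appears wrapped by an integer multiple of fs.
// std::remainder rounds f/fs to the nearest integer, which is
// exactly the aliasing behaviour we want to model.
inline double alias_frequency(double f, double fs) {
    return std::remainder(f, fs);
}
```

Here alias_frequency(5.5e6, 10e6) gives −4.5e6, matching the interfering peak on the lower sideband.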

I removed the audio using a narrow FIR notch filter from the liquid-dsp library. Now, the picture quality is very much acceptable. Minor artifacts are visible in narrow vertical lines because of a pixel rounding choice I made, but they can be ignored.

[Image: Black-and-white screen capture of NES Tetris being played.]

Decoding colour

PAL colour is a bit complicated. It was designed in the 1960s to be backwards compatible with black-and-white TV receivers. It uses the YUV colourspace, the Y or "luminance" channel being a black-and-white sum signal that already looks good by itself. Even if the whole composite signal is interpreted as Y, the artifacts caused by colour information are bearable. Y also has a lot more bandwidth, and hence resolution, than the U and V (chrominance) channels.

U and V are encoded in a chrominance subcarrier in a way that I still haven't quite grasped. The carrier is suppressed, but a burst of carrier is transmitted just before every scanline for reference (so-called colour burst).

Turns out that much of the chroma information can be recovered by band-pass filtering the chrominance signal, mixing it down to baseband using a PLL locked to the colour burst, rotating it by a magic number (chroma *= std::polar(1.f, deg2rad(170.f))), and plotting the real and imaginary parts of this complex number as the U and V colour channels. This is similar to how NTSC colour is demodulated.
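The rotation step can be sketched like this. The 170° constant is the post's magic number; the input is assumed to be one chroma sample already band-passed and mixed to baseband by the burst-locked PLL:

```cpp
#include <cassert>
#include <cmath>
#include <complex>

inline float deg2rad(float d) { return d * 3.14159265358979f / 180.f; }

// After mixing the chroma to baseband, one complex sample carries both
// colour axes. A fixed rotation aligns them so that the real part reads
// out as U and the imaginary part as V.
inline void chroma_to_uv(std::complex<float> chroma, float& u, float& v) {
    chroma *= std::polar(1.f, deg2rad(170.f));
    u = chroma.real();
    v = chroma.imag();
}
```

A sample that arrives at phase −170° lands exactly on the U axis after the rotation.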

In PAL, every other scanline has its chrominance phase shifted (hence the name, Phase Alternating [by] Line). I couldn't get consistent results demodulating this, so I skipped the chrominance part of every other line and copied it from the line above. This doesn't even look too bad for my purposes. However, there seems to be a pre-echo in UV that's especially visible on a blue background (most of SMB1 sadly), and a faint stripe pattern on the Y channel, most probably crosstalk from the chroma subcarrier that I left intact for now.

[Image: The three chroma channels Y, U, and V shown separately as greyscale images, together with a coloured composite of Mario and two Goombas.]

I used liquid_firfilt to band-pass the chroma signal, and liquid_nco to lock onto the colour burst and shift the chroma to baseband.

Let's play Tetris!


It's not my goal to use this system as a gaming display; I'm still planning to use the CRT. However, total buffer delays are quite small due to the 10 Msps sampling rate, so the latency from controller to screen is pretty good. The laptop can also easily decode and render at 50 fps, which is the native frame rate of the PAL NES. Tetris is playable up to level 12!

Using a slow-mo phone camera, I measured the time it takes for a button press to make Mario jump. The latency is similar to that of a NES emulator:

Method                  Frames @ 240 fps   Latency
RetroArch emulator      28                 117 ms
PAL NES + Airspy SDR    26                 108 ms
PAL NES + LCD TV        20                 83 ms

Performance considerations

A 2013 MacBook Pro is perhaps not the best choice for dealing with live video to begin with. But I want to be able to run the PAL decoder and a screencap / compositing / streaming client on the same laptop, so performance is even more crucial.

When colour is enabled, CPU usage on this quad-core laptop is 110% for palview and 32% for airspy_rx. The CPU temperature is somewhere around 85 °C. Black-and-white decoding lowers palview usage to 84% and CPU temps to 80 °C. I don't think there's enough cycles left for a streaming client just yet. Some CPU headroom would be nice as well; a resync after dropped samples looks quite nasty, and I wouldn't want that to happen very often.

[Image: htop screenshot showing palview and airspy_rx on top, followed by some system processes.]

Profiling reveals that the most CPU-intensive tasks are those related to FIR filtering. FIR filters are based on convolution, which is computationally expensive unless done in hardware. FFT convolution can be faster, but only when the kernel is relatively long.

[Image: Diagram showing that the audio notch FIR takes up 27 % and the chroma bandpass FIR 12 % of CPU time. Several smaller contributors are also mentioned.]

I've thought of having another computer do the Airspy transfer, audio notch filtering, and AM demodulation, and then transmit this preprocessed signal to the laptop via Ethernet. But my other computers (Raspberry Pi 3B+ and a Core 2 Duo T7500 laptop) are not nearly as powerful as the MacBook.

Instead of a FIR bandpass filter, a so-called chrominance comb filter is often used to separate chrominance from luminance. This could be realized very efficiently as a linear-complexity delay line. This is a promising possibility, but so far my experiments have had mixed results.
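The delay-line idea can be sketched as below. This is my illustration of the general technique, not the post's experimental code, and it assumes the subcarrier inverts phase between adjacent lines; PAL's 283.75-cycle-per-line subcarrier offset complicates the real thing, which is likely where the "mixed results" come from:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Separate luma and chroma with a one-scanline delay comb. If the chroma
// subcarrier inverts phase between adjacent lines, summing adjacent lines
// cancels chroma (leaving luma) and differencing cancels luma (leaving
// chroma). Cost is O(1) per sample, just one delay-line tap.
void comb_separate(const std::vector<float>& x, std::size_t samples_per_line,
                   std::vector<float>& luma, std::vector<float>& chroma) {
    luma.assign(x.size(), 0.f);
    chroma.assign(x.size(), 0.f);
    for (std::size_t n = samples_per_line; n < x.size(); n++) {
        luma[n]   = 0.5f * (x[n] + x[n - samples_per_line]);
        chroma[n] = 0.5f * (x[n] - x[n - samples_per_line]);
    }
}
```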

There's no source code release for now (Why? FAQ), but if you want some real-time coverage of this project, I did a multi-threaded tweetstorm: one, two, three.

Beeps and melodies in two-way radio

Lately my listening activities have focused on two-way FM radio. I'm interested in automatic monitoring and visualization of multiple channels simultaneously, and classifying transmitters. There's a lot of in-band signaling to be decoded! This post shall demonstrate this diversity and also explain how my listening station works.

Background: walkie-talkies are fun

The frequency band I've recently been listening to the most is called PMR446. It's a European band of radio frequencies for short-distance UHF walkie-talkies. Unlike ham radio, it doesn't require licenses or technical competence – anyone with 50€ to spare can get a pair of walkie-talkies at the department store. It's very similar to FRS in the US. It's quite popular where I live.

[Image: Photo of three different walkie-talkies.]

The short-distance nature of PMR446 is what I find perhaps most fascinating: in normal conditions, everything you hear has been transmitted from within a 2-kilometer (1.2-mile) radius. Transmitter power is limited to 500 mW and directional antennas are not allowed on the transmitter side. But I have a receive-only system, and my only directional antenna is for 450 MHz, which is how I originally found these channels.

Roger beep

The roger beep is a short melody sent by many hand-held radios to indicate the end of transmission.

The end of transmission must be indicated, because two-way radio is 'half-duplex', which means only one person can transmit at a time. Some voice protocols solve the same problem by mandating the use of a specific word like 'over'; others rely on the short burst of static (squelch tail) that can be heard right after the carrier is lost. Roger beeps are especially common in consumer radios, but I've heard them in ham QSOs as well, especially if repeaters are involved.

Other signaling on PMR

PMR also differs from ham radio in that many of its users don't want to hear random people talking on the same frequency; indeed, many devices employ tones or digital codes designed to silence unwanted conversations, called CTCSS, DCS, or coded squelch. They are very low-frequency tones that can't usually be heard at all because of filtering. These won't prevent others from listening to you though; anyone can just disable coded squelch on their device and hear everyone else on the channel.

Many devices also use a tone-based system for preventing the short burst of static, that classic walkie-talkie sound, from sounding whenever a transmission ends. Baofeng calls these squelch tail elimination tones, or STE for short. The practice is not standardized and I've seen several different sub-audible frequencies being used in the wild, namely 55, 62, and 260 Hz. (Edit: As correctly pointed out by several people, another way to do this is to reverse the phase of the CTCSS tone in the end, called a 'reverse burst'. Not all radios use it though; many opt to send a 55 Hz tone instead, even when they are using CTCSS.)

Some radios have a button called 'alarm' that sends a long, repeating melody resembling a 90s mobile phone ring tone. These melodies also vary from one radio to the other.

My receiver

I have a system in place to alert me whenever there's a strong enough signal matching an interesting set of parameters on any of the eight PMR channels. It's based on a Raspberry Pi 3B+ and an Airspy R2 SDR receiver. The program can play the live audio of all channels simultaneously, or a single channel can be selected for listening. It also has an annotated waterfall view that shows traffic on the band during the last couple of hours:

[Image: A user interface with text-mode graphics, showing eight vertical lanes of timestamped information. The lanes are mostly empty, but there's an occasional colored bar with annotations like 'a1' or '62'.]

The computer is a headless Raspberry Pi with only SSH connectivity; that's why it's in text mode. Also, text-mode waterfall plots are cool!

The coloured bars indicate signal strength (colour) and the duty factor (pattern). The numbers around the bars are decoded squelch codes, STEs and roger beeps. Uncertain detections are greyed out. In this view we've detected roger beeps of type 'a1' and 'a2'; a somewhat rare 62 Hz STE tone; and a ring tone, or alarm (RNG).

Because squelch codes are designed to be read by electronic circuits and their frequencies and codewords are specified exactly, writing a digital decoder for them was somewhat straightforward. Roger beeps and ring tones, on the other hand, are only meant for the human listener and detecting them amongst the noise took a bit more trial-and-error.
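The post doesn't show its squelch-code decoder, but detecting one known sub-audible frequency is classically done with the Goertzel algorithm; here's a minimal sketch of that approach (my illustration, not the author's code):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Goertzel power of one target frequency in a block of audio samples.
// CTCSS tones are sub-audible (roughly 67..254 Hz), so a long block is
// needed for adequate frequency resolution at typical sample rates.
float goertzel_power(const std::vector<float>& x, float f_target, float fs) {
    const float w = 2.f * 3.14159265358979f * f_target / fs;
    const float coeff = 2.f * std::cos(w);
    float s1 = 0.f, s2 = 0.f;
    for (float sample : x) {
        float s0 = sample + coeff * s1 - s2;  // second-order resonator
        s2 = s1;
        s1 = s0;
    }
    // Squared magnitude of the DFT bin at f_target.
    return s1 * s1 + s2 * s2 - coeff * s1 * s2;
}
```

Running this once per code frequency over each audio block, and comparing the winner's power against the rest, gives a simple detector.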

Melody detection algorithm

The melody detection algorithm in my receiver is based on a fast Fourier transform (FFT). When loss of carrier is detected, the last moments of the audio are searched for tones as follows:

[Image: A diagram illustrating how an FFT is used to search for a melody. The FFT in the image is noisy and some parts of the melody can not be measured.]
  1. The audio buffer is divided up into overlapping 60-millisecond Hann-windowed slices.
  2. Every slice is Fourier transformed and all peak frequencies (local maxima) are found. Their center frequencies are refined using Gaussian peak interpolation (Gasior & Gonzalez 2004). We need this, because we're only going to allow ±15 Hz of frequency error.
  3. The time series formed by the strongest maxima is compared to a list of pre-defined 'tone signatures'. Each candidate tone signature gets a score based on how many FFT slices match (+) the corresponding slices of the signature. Slices with too much frequency error subtract from the score (−).
  4. Most tone signatures have one or more 'quiet zones', the quietness of which further contributes to the score. This is usually placed after the tone, but some tones may also have a pause in the middle.
  5. The algorithm allows second and third harmonics (with half the score), because some transmitters may distort the tones enough for these to momentarily overpower the fundamental frequency.
  6. Every possible time shift (starting position) inside the 1.5-second audio buffer is searched.
  7. The tone signature with the best score is returned, if this score exceeds a set threshold.
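Step 2's Gaussian peak interpolation fits a parabola to the log-magnitudes of the peak bin and its two neighbours; for a Gaussian-shaped peak, which a windowed tone approximates, this recovers the centre frequency to a small fraction of a bin. A sketch:

```cpp
#include <cassert>
#include <cmath>

// Given FFT magnitudes at the peak bin k and its neighbours k-1 and k+1,
// return the true peak's offset from bin k, in bins (roughly -0.5..0.5).
// Exact when the peak shape is Gaussian (Gasior & Gonzalez 2004).
float gaussian_peak_offset(float y_prev, float y_peak, float y_next) {
    float a = std::log(y_prev), b = std::log(y_peak), c = std::log(y_next);
    return 0.5f * (a - c) / (a - 2.f * b + c);
}
```

For a Gaussian peak centred 0.3 bins above bin k, the function returns exactly 0.3, which is what makes a ±15 Hz tolerance workable even with 60 ms FFT slices.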

This algorithm works quite well. It's not always able to detect the tones, especially if part of the melody is completely lost in noise, but it's good enough to be used for waterfall annotation. False positives are rare; most of them are detections of very short tone signatures that only consist of one or two beeps. My test dataset of 92 recorded transmissions yields only 5 false negatives and no false positives.

For example, this noisy recording:

was successfully recognized as having a ringtone (RNG), a roger beep of type a1, and CTCSS code XA:

Naming and classification

Because I love classifying stuff I've had to come up with a system for naming these roger tones as well. My current system uses a lower-case letter for classifying the tone into a category, followed by a number that differentiates similar but slightly different tones. This is a work in progress, because every now and then a new kind of tone appears.

My goal would be to map the melodies to specific manufacturers. I've only managed to map a few. Can you recognise any of these devices?

Class   Identified model         Recording
a       Cobra AM845              (a1)
c       Motorola TLKR T40        (c1)
h       Baofeng UV-5RC

I didn't list them all here, but there are even more samples. I've added some alarm tones there as well, and a list of all the tone signatures that I currently know of. (Why no full source code? FAQ)

In my rx log I also have an emoji classification system for CTCSS codes. This way I can recognize a familiar transmission faster. A few examples below (there are 38 different CTCSS codes in total):

[Image: Two-character codes grouped into categories and paired with emoji. Four categories, namely fruit, sound, mammals, and scary. The fruit category has codes beginning with an M, and emoji for different fruit, etc.]

Future directions

There are mainly just minor bugs in my project trello at the moment, like adding the aforementioned emoji. But as the RasPi is not very powerful the DSP chain could be made more efficient. Sometimes a block of samples gets dropped. Currently it uses a bandpass-sampled filterbank to separate the channels, exploiting aliasing to avoid CPU-intensive frequency shifting altogether:

This is quite fast. But the 1:20 decimation from the Airspy IQ data is done with SoX's 1024-point FIR filter and could possibly be done with fewer coefficients. Also, the RasPi has four cores, so half of the channels could be demodulated in a second thread. Currently all concurrency is thanks to SoX and pmrsquash being different processes.

Animated line drawings with OpenCV

OpenCV is a pretty versatile C++ computer vision library. Because I use it every day it has also become my go-to tool for creating simple animations at pixel level, for fun, and saving them as video files. This is not one of its core functions but happens to be possible using its GUI drawing tools.

Below we'll take a look at some video art I wrote for a music project. It goes a bit further than just line drawings but the rest is pretty much just flavouring. As you'll see, creating images in OpenCV has a lot in common with how you would work with layers and filters in an image editor like GIMP or Photoshop.

Setting it up

It doesn't take a lot of boilerplate to initialize an OpenCV project. Here's my minimal CMakeLists.txt:

cmake_minimum_required (VERSION 2.8)
project                (marmalade)
find_package           (OpenCV REQUIRED)
add_executable         (marmalade main.cc)
target_link_libraries  (marmalade ${OpenCV_LIBS})

I also like to set compiler flags to enforce the C++11 standard, but this is not necessary.

In the main .cc file I have:

#include "opencv2/imgproc/imgproc.hpp"
#include "opencv2/highgui/highgui.hpp"

Now you can build the project by just typing cmake . && make in the terminal.

Basic shapes

First, we'll need an empty canvas. It will be a matrix (cv::Mat) with three unsigned char channels for RGB at Full HD resolution:

const cv::Size video_size(1920, 1080);
cv::Mat mat_frame = cv::Mat::zeros(video_size, CV_8UC3);

This will also initialize everything to zero, i.e. black.

Now we can draw our graphics!

I had an initial idea of an endless cascade of concentric rings each rotating at a different speed. There might be color and brightness variations as well but otherwise it would stay static the whole time. You can't see a circle's rotation around its center, so we'll add some features to them as well, maybe some kind of bars or spokes.

A simplified render method for a ring would look like this:

void Ring::RenderTo(cv::Mat& mat_output) const {
  cv::circle(mat_output, 8 * center_, 8 * radius_, color_, 1, CV_AA, 3);
  for (const Bar& bar : bars()) {
    cv::line(mat_output, 8 * (center_ + bar.start), 8 * (center_ + bar.end),
             color_, 1, CV_AA, 3);
  }
}

Drawing antialiased graphics at subpixel coordinates can make for some confusing OpenCV code. Here, all coordinates are multiplied by the magic number 8 and the drawing functions are instructed to do a bit shift of 3 bits (2^3 == 8). These three bits are used for the decimal part of the subpixel position.
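In other words, the shift parameter makes the drawing functions interpret coordinates as fixed-point numbers with 3 fractional bits. The encoding itself is tiny (plain C++, no OpenCV needed):

```cpp
#include <cassert>
#include <cmath>

// Encode a subpixel coordinate for cv::circle / cv::line with shift = 3:
// multiply by 2^3 and round to the nearest integer. The drawing function
// divides the factor back out internally, keeping the fractional position.
inline int to_fixed(float coord, int shift = 3) {
    return static_cast<int>(std::lround(coord * (1 << shift)));
}
```

So a coordinate of 100.125 pixels is passed to OpenCV as the integer 801.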

The coordinates of the bars are generated for each frame based on the ring's current rotation angle.

Here are some rings at different phases of rotation. A bug leaves the innermost circle with no spokes, but it kind of looks better that way.

[Image: White concentric circles on a black background, with evenly separated lines connecting them.]

Eye candy: Glow effect

I wanted a subtle vector display look to the graphics, even though I wasn't aiming for any sort of realism with it. So the brightest parts of the image would have to glow a little, or spread out in space. This can be done using Gaussian blur.

Gaussian blur requires convolution, which is very CPU-intensive. I think most of the rendering time was spent calculating blur convolution. It could be sped up using threads (cv::parallel_for_) or the GPU (cv::cuda routines) but there was no real-time requirement in this hobby project.

There are a couple of ways to only apply the blur to the brightest pixels. We could blur a copy of the image masked with its thresholded version, for example. But I like to use look-up tables (LUT). This is similar to the curves tool in Photoshop. A look-up table is just a 256-by-1 RGB matrix that maps an 8-bit index to a colour. In this look-up table I just have a linear ramp where everything under 127 maps to black.

cv::Mat mat_lut = GlowLUT();
cv::Mat mat_glow;
cv::LUT(mat_frame, mat_lut, mat_glow);
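GlowLUT() itself isn't shown here, but based on the description above, a plausible construction of the thresholded ramp looks like this (a sketch using a plain array; the real code would fill a 256-by-1 cv::Mat the same way):

```cpp
#include <array>
#include <cassert>

// 256-entry look-up table: everything below 127 maps to black (0),
// the rest passes through unchanged, i.e. a thresholded linear ramp.
// Applied per channel, this keeps only the brightest pixels for the glow.
std::array<unsigned char, 256> make_glow_lut() {
    std::array<unsigned char, 256> lut{};
    for (int i = 0; i < 256; i++)
        lut[i] = (i < 127) ? 0 : static_cast<unsigned char>(i);
    return lut;
}
```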

Now when blurring, if we add the original image on top of the blurred version, its sharpness is preserved:

cv::GaussianBlur(mat_glow, mat_glow, cv::Size(0,0), 3.0);
mat_frame += 2 * mat_glow;
[Image: A zoomed view of a circle, showing the glow effect.]

The effect works unevenly on antialiased lines which adds a nice pearl-stringy look.

Eye candy: Tinted glass and grid lines

I created a vignetted and dirty green-yellow tinted look by multiplying the image per-pixel by an overlay made in GIMP. This has the same effect as having a "Multiply" layer mode in an image editor. Perhaps I was thinking of an old glass display, or Vectrex overlays. The overlay also has black grid lines that will appear black in the result. Multiplication doesn't change the color of black areas in the original, but I also added a copy of the overlay at 10% brightness to make it dimly visible in the background.

cv::Mat mat_overlay = cv::imread("overlay.png");
cv::multiply(mat_frame, mat_overlay, mat_frame, 1.f/255);
mat_frame += mat_overlay * 0.1f;
[Image: A zoomed view of a circle, showing the color overlay effect.]

Eye candy: Flicker

Some objects flicker slightly for an artistic effect. This can be headache-inducing if overdone, so I tried to use it in moderation. The rings have a per-frame probability for a decrease in brightness, which I think looks good at 60 fps.

if (randf(0.f, 1.f) < .0001f)
  color *= .5f;

The spokes will also sometimes blink upon encountering each other, and the whole ring flickers a bit when it first becomes visible.

Title text

An LCD matrix font was used for the title text. This was just a PNG image of 128 characters that was spliced up and rearranged. This can be done in OpenCV by using submatrices and rectangle ROIs:

cv::Mat mat_font = cv::imread("lcd_font.png");
const cv::Size letter_size(24, 32);
const std::string text("finally, the end of the "
                       "marmalade forest!");

int cursor_x = 0;
for (char code : text) {
  int mx = code % 32;
  int my = code / 32;

  cv::Rect font_roi(cv::Point(mx * letter_size.width,
                              my * letter_size.height),
                    letter_size);
  cv::Mat mat_letter = mat_font(font_roi);

  cv::Rect target_roi(text_origin_.x + cursor_x, text_origin_.y,
                      mat_letter.cols, mat_letter.rows);
  mat_letter.copyTo(mat_frame(target_roi));

  cursor_x += letter_size.width;
}
[Image: A zoomed view of the text 'finally' with a glow and color overlay effect.]

Encoding the video

Now we can save the frames as a video file. OpenCV has a VideoWriter class for just this purpose. But I like to do this a bit differently. I encoded the frame images individually as BMP and just concatenated them one after the other to stdout:

std::vector<uchar> outbuf;
cv::imencode(".bmp", mat_frame, outbuf);
fwrite(outbuf.data(), sizeof(uchar), outbuf.size(), stdout);

I then ran this program from a shell script that piped the output to ffmpeg for encoding. This way I could also combine it with the soundtrack in a single run.

make && \
 ./marmalade -p | \
 ffmpeg -y -i $AUDIOFILE -framerate $FPS -f image2pipe \
        -vcodec bmp -i - -s:v $VIDEOSIZE -c:v libx264 \
        -profile:v high -b:a 192k -crf 23 \
        -pix_fmt yuv420p -r $FPS -shortest -strict -2 \
        video.mp4 && \
 open video.mp4


The 1080p/60 version can be viewed by clicking on the gear wheel menu.

In pursuit of Otama's tone

It would be fun to use the Otamatone in a musical piece. But for someone used to keyboard instruments it's not so easy to play cleanly. It has a touch-sensitive (resistive) slider that spans roughly two octaves in just 14 centimeters, which makes it very sensitive to finger placement. And in any case, I'd just like to have a programmable virtual instrument that sounds like the Otamatone.

What options do we have, as hackers? Of course the slider could be replaced with a MIDI interface, so that we could use a piano keyboard to hit the correct frequencies. But what if we could synthesize a similar sound all in software?

Sampling via microphone

We'll have to take a look at the waveform first. The Otamatone has a piercing electronic-sounding tone to it. One is inclined to think the waveform is something quite simple, perhaps a sawtooth wave with some harmonic coloring. Such a primitive signal would be easy to synthesize.

[Image: A pink Otamatone in front of a microphone. Next to it a screenshot of Audacity with a periodic but complex waveform in it.]

A friend lent me her Otamatone for recording purposes. Turns out the wave is nothing that simple. It's not a sawtooth wave, nor a square wave, no matter how the microphone is placed. But it sounds like one! Why could that be?

I suspect this is because the combination of speaker and air interface filters out the lowest harmonics (and parts of the others as well) of square waves. But the human ear still recognizes the residual features of a more primitive kind of waveform.

We have to get to the source!

Sampling the input voltage to the Otamatone's speaker could reveal the original signal. Also, by recording both the speaker input and the audio recorded via microphone, we could perhaps devise a software filter to simulate the speaker and head resonance. Then our synthesizer would simplify into a simple generator and filter. But this would require opening up the instrument and soldering a couple of leads in, to make a Line Out connector. I'm not doing this to my friend's Otamatone, so I bought one of my own. I named it TÄMÄ.

[Image: A Black Otamatone with a cable coming out of its mouth into a USB sound card. A waveform with more binary nature is displayed on a screen.]

I soldered the left channel and ground to the same pads the speaker is connected to. I had no idea about the voltage range in advance, but fortunately it just happens to fit line level and not destroy my sound card. As you can see in the background, we've recorded a signal that seems to be a square wave with a low duty cycle.

[Image: Oscillogram of a square wave.]

This square wave seems to be superimposed with a much quieter sinusoidal "ring" at 584 Hz that gradually fades out in 30 milliseconds.

Next we need to map out the effect the finger position on the slider has on this signal. It seems to not only change the frequency but the duty cycle as well. This happens a bit differently depending on which one of the three octave settings (LO, MID, or HI) is selected.

The Otamatone has a huge musical range of over 6 octaves:

[Image: Musical notation showing a range from A1 to B7.]

In frequency terms this means roughly 55 to 3800 Hz.

The duty cycle changes according to where we are on the slider: from 33 % in the lowest notes to 5 % in the highest ones, on every octave setting. The frequency of the ring doesn't change, it's always at around 580 Hz, but it doesn't seem to appear at all on the HI setting.

So I had my Perl-based software synth generate a square wave whose duty cycle and frequency change according to given MIDI notes.
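The post's synth is written in Perl, but the core idea can be sketched in C++ like this. The measured endpoints (33 % duty at the low end, 5 % at the high end) come from the text; interpolating between them in log-frequency is my assumption about how to connect the two figures:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Duty cycle interpolated in log-frequency between the measured endpoints:
// 33 % at 55 Hz down to 5 % at 3800 Hz. The log-frequency interpolation is
// an assumption, not a measured curve.
float duty_for_freq(float f) {
    float t = std::log(f / 55.f) / std::log(3800.f / 55.f);
    if (t < 0.f) t = 0.f;
    if (t > 1.f) t = 1.f;
    return 0.33f + t * (0.05f - 0.33f);
}

// One period of a pulse wave at frequency f, sample rate fs.
std::vector<float> pulse_period(float f, float fs) {
    int n = static_cast<int>(fs / f);
    float duty = duty_for_freq(f);
    std::vector<float> out(n);
    for (int i = 0; i < n; i++)
        out[i] = (i < duty * n) ? 1.f : -1.f;
    return out;
}
```

A real synth would also add the decaying ~580 Hz "ring" on the LO and MID octave settings, as described above.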

FIR filter 1: not so good

Raw audio generated this way doesn't sound right; it needs to be filtered to simulate the effects of the little speaker and other parts.

Ideally, I'd like to simulate the speaker and head resonances as an impulse response, by feeding well-known impulses into the speaker. The generated square wave could then be convolved with this response. But I thought a simpler way would be to create a custom FIR frequency response in REAPER, by visually comparing the speaker input and microphone capture spectra. When their spectra are laid on top of each other, we can read the required frequency response as the difference between harmonic powers, using the cursor in baudline. No problem, it's just 70 harmonics until we're outside hearing range!

[Image: Screenshot of Baudline showing lots of frequency spikes, and next to it a CSV list of dozens of frequencies and power readings in the Vim editor.]

I then subtracted one spectrum from another and manually created a ReaFir filter based on the extrema of the resulting graph.

[Image: Screenshot of REAPER's FIR filter editor, showing a frequency response made out of nodes and lines interpolated between them.]

Because the Otamatone's mouth can be twisted to make slightly different vowels, I recorded two spectra: one with the mouth fully closed and the other with it as open as possible.

But this method didn't quite give the sound the piercing nasalness I was hoping for.

FIR filter 2: better

After all that work I realized the line connection works in both directions! I can just feed any signal and the Otamatone will sound it via the speaker. So I generated a square wave in Audacity, set its frequency to 35 Hz to accommodate 30 milliseconds of response, played it via one sound card and recorded via another one:

[Image: Two waveforms, the top one of which is a square wave and the bottom one has a slowly decaying signal starting at every square transition.]

The waveform below is called the step response. One of the repetitions can readily be used as a FIR convolution kernel. Strictly, to get an impulse response would require us to sound a unit impulse, i.e. just a single sample at maximum amplitude, not a square wave. But I'm not redoing that since recording this was hard enough already. For instance, I had to turn off the fridge to minimize background noise. I forgot to turn it back on, and now I have a box of melted ice cream and a freezer that smells like salmon. The step response gives pretty good results.

One of my favorite audio tools, sox, can do FFT convolution with an impulse response. You'll have to save the impulse response as a whitespace-separated list of plaintext sample values, and then run sox original.wav convolved.wav fir response.csv.
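The same convolution can also be done in a few lines of Python with SciPy. This sketch assumes the response has been saved as one plaintext sample value per line, the same format sox's fir effect reads; the function name is mine:

```python
import numpy as np
from scipy.signal import fftconvolve

def apply_response(dry, kernel):
    """FFT-convolve the dry signal with the recorded response and
    normalize the result to prevent clipping."""
    wet = fftconvolve(dry, kernel)[: len(dry)]
    return wet / np.max(np.abs(wet))

# kernel = np.loadtxt("response.csv")  # one sample value per line
```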

Or one could use a VST plugin like FogConvolver:

[Image: A screenshot of Fog Convolver.]

A little organic touch

There's more to an instrument's sound than its frequency spectrum. The way the note begins and ends, the so-called attack and release, are very important cues for the listener.

The width of a player's finger on the Otamatone causes the pressure to be distributed unevenly at first, resulting in a slight glide in frequency. This also happens at note-off. The exact amount of Hertz to glide depends on the octave, and by experimentation I stuck with a slide-up of 5 % of the target frequency in 0.1 seconds.
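As a sketch, the note-on glide can be described as a short frequency envelope fed to the oscillator. The function and parameter names here are hypothetical, not from the Perl synth:

```python
import numpy as np

def note_on_glide(target_hz, fs=44100, amount=0.05, dur_s=0.1):
    """Start 5 % below the target frequency and slide up linearly
    over 0.1 seconds; the rest of the note stays at the target."""
    n = int(fs * dur_s)
    return np.linspace(target_hz * (1.0 - amount), target_hz, n)
```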

It is also very difficult to hit the correct note, so we could add some kind of random tuning error. But it turns out this would be too much; I want the music to at least be in tune.

Glides (glissando) are possible with the virtual instrument by playing a note before releasing the previous one. This glissando also happens in 100 milliseconds. I think it sounds pretty good when used in moderation.

I read somewhere (Wikipedia?) that vibrato is also possible with Otamatone. I didn't write a vibrato feature in the code itself, but it can be added using a VST plugin in REAPER (I use MVibrato from MAudioPlugins). I also added a slight flanger with inter-channel phase difference in the sample below, to make the sound just a little bit easier on the ears (but not too much).

Sometimes the Otamatone makes a short popping sound, perhaps when finger pressure is not firm enough. I added a few of these randomly after note-off.

Working with MIDI

We're getting on a side track, but anyway. Working with MIDI used to be straightforward on the Mac. But GarageBand, the tool I currently use to write music, amazingly doesn't have a MIDI export function. However, you can "File -> Add Region To Loop Library", then find the AIFF file in the loop library folder, and use a tool called GB2MIDI to extract MIDI data from it.

I used mididump from python-midi to read MIDI files.

Tyna Wind - lucid future vector

Here's TÄMÄ's beautiful synthesized voice singing us a song.

Descrambling split-band voice inversion with deinvert

Voice inversion is a primitive method of rendering speech unintelligible to prevent eavesdropping of radio or telephone calls. I wrote about some simple ways to reverse it in a previous post. I've since written a software tool, deinvert (on GitHub), that does all this for us. It can also descramble a slightly more advanced scrambling method called split-band inversion. Let's see how that happens behind the scenes.

Simple voice inversion

Voice inversion works by inverting the audio spectrum at a set maximum frequency called the inversion carrier. Frequencies near this carrier will thus become frequencies near zero Hz, and vice versa. The resulting audio is unintelligible, though familiar sentences can easily be recognized.

Deinvert comes with 8 preset carrier frequencies that can be activated with the -p option. These correspond to a list of carrier frequencies I found in an actual scrambler's manual, dubbed "the most commonly used inversion carriers".

The algorithm behind deinvert can be divided into three phases: 1) pre-filtering, 2) mixing, and 3) post-filtering. Mixing means multiplying the signal by an oscillation at the selected carrier frequency. This produces two sidebands, or mirrored copies of the signal, with the lower one frequency-inverted. Pre-filtering is necessary to prevent this lower sideband from aliasing when its highest components would go below zero Hertz. Post-filtering removes the upper sideband, leaving just the inverted audio. Both filters can be realized as low-pass FIR filters.

[Image: A spectrogram in four steps, where the signal is first cut at 3 kHz, then shifted up, producing two sidebands, the upper of which is then filtered out.]

This operation is its own inverse, like ROT13; by applying the same inversion again we get intelligible speech back. Indeed, deinvert can also be used as a scrambler by just running unscrambled audio through it. The same inversion carrier should be used in both directions.
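The three phases can be sketched in Python like this. It's a simplified model of the algorithm, not deinvert's actual code, and the function name is mine:

```python
import numpy as np
from scipy.signal import firwin, lfilter

def invert(x, fs, fc, ntaps=255):
    """Simple voice inversion around the carrier frequency fc."""
    lp = firwin(ntaps, fc, fs=fs)       # low-pass FIR, cutoff at carrier
    x = lfilter(lp, 1.0, x)             # 1) pre-filter: remove content above fc
    t = np.arange(len(x)) / fs
    x = 2.0 * x * np.cos(2.0 * np.pi * fc * t)  # 2) mix with the carrier
    return lfilter(lp, 1.0, x)          # 3) post-filter: keep lower sideband
```

Because the operation is its own inverse, running a signal through this twice with the same carrier should give the original band back.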

Split-band inversion

The split-band scrambling method adds another carrier frequency that I call the split point. It divides the spectrum into two parts that are inverted separately and then combined, preventing ordinary inverters from fully descrambling it.

A single filter-inverter pair may already bring back the low end of the spectrum. Descrambling it fully amounts to running the inversion algorithm twice, with different settings for the filters and mixer, and adding the results together.
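That double inversion can be sketched in Python as follows. Again, this is a simplified model rather than deinvert's actual code, and the function names are my own:

```python
import numpy as np
from scipy.signal import firwin, lfilter

def mix(x, fs, f):
    """Multiply the signal by a cosine at frequency f (the 'mixer')."""
    t = np.arange(len(x)) / fs
    return 2.0 * x * np.cos(2.0 * np.pi * f * t)

def descramble_split(x, fs, fc, split, ntaps=511):
    """Invert the band [0, split] and the band [split, fc]
    separately, then add the results together."""
    lp = firwin(ntaps, split, fs=fs)                       # low band
    bp = firwin(ntaps, [split, fc], pass_zero=False, fs=fs)  # high band
    low = lfilter(lp, 1.0, mix(lfilter(lp, 1.0, x), fs, split))
    high = lfilter(bp, 1.0, mix(lfilter(bp, 1.0, x), fs, split + fc))
    return low + high
```

Like simple inversion, this is its own inverse when the same two frequencies are used.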

The problem, then, is to find these two frequencies. Let's take a look at an example: audio scrambled using the CML CMX264 split-band inverter (from a video by GBPPR2).

[Image: A spectrogram showing a narrow band of speech-like harmonics, but with a constant dip in the middle of the band.]

In this case the filter roll-off is clearly visible in the spectrogram and it's obvious where the split point is. The higher carrier is probably at the upper limit of the full band or slightly above it. Here the full bandwidth seems to be around 3200 Hz and the split point is at 1200 Hz. This could be initially descrambled using deinvert -f 3200 -s 1200; if the result sounds shifted up or down in frequency, the parameters can be refined accordingly.

Performance
On a single core of an i7-based laptop from 2013, deinvert processes a 44.1 kHz WAV file at 60x realtime speed (120x for simple inversion). Most of the CPU cycles are spent doing filter convolution, i.e. calculating the signal's vector dot product with the low-pass filter kernels:

[Image: A graph of the time spent in various parts of the call tree of the program, with the subtree leading to the dot product operation highlighted. It takes well over 80 % of the tree.]

For this reason deinvert has a quality setting (0 to 3) for controlling the number of samples in the convolution kernels. A filter with a shorter kernel is linearly faster to compute, but has a gentler roll-off and will leave more unwanted harmonics.

A quality setting of 0 turns filtering off completely, and is very fast. For simple inversion this should be fine, as long as the original doesn't contain much power above the inversion carrier. It's easy to ignore the upper sideband because of its high frequency. In split-band descrambling this leaves some nasty folded harmonics in the speech band though.

Here's a descramble of the above CMX264 split-band audio using all the different quality settings in deinvert. You will first hear it scrambled, and then descrambled with increasing quality setting.

The default quality level is 2. This should be enough for real-time descrambling of simple inversion on a Raspberry Pi 1, still leaving cycles for an FM receiver for instance:

(RasPi 1)    Simple inversion    Split-band inversion
-q 0         16x realtime        5.8x realtime
-q 1         6.5x realtime       3.0x realtime
-q 2         2.8x realtime       1.3x realtime
-q 3         1.2x realtime       0.4x realtime

The memory footprint is less than four megabytes.

Future developments

There's a variant of split-band inversion where the inversion carrier changes constantly, called variable split-band. The transmitter informs the receiver about this sequence of frequencies via short bursts of data every couple of seconds or so. This data seems to be FSK, but decoding it shall be left for another time.

I've also thought about ways to automatically estimate the inversion carrier frequency. Shifting speech up or down in frequency breaks the relationships of the harmonics. Perhaps this fact could be exploited to find a shift that would minimize this error?


Gramophone audio from photograph, revisited

"I am the atomic powered robot. Please give my best wishes to everybody!"

Those are the words uttered by Tommy, a childhood toy robot of mine. I've taken a look at his miniature vinyl record sound mechanism a few times before (#1, #2), in an attempt to recover the analog audio signal using only a digital camera. Results were noisy at best. The blog posts resurfaced in a recent IRC discussion which inspired me to try my luck with a slightly improved method.

Source photo

I will be using an old photo of Tommy's internal miniature record I already had from previous adventures in 2012. I don't want to perform another invasive operation on Tommy to take a new photograph, as I already broke a plastic tab last time I opened him. But it also means I don't have control over the photographing environment. It's part of the challenge.

The picture was taken with a DSLR and it's an uncompressed 8-bit color photo measuring 3000 by 3000 pixels. There's a fair amount of focus blur, chromatic aberration and similar distortions. But at this resolution, a clear pattern can be seen when zooming into the grooves.

[Image: Close-up shot of a miniature vinyl record, with a detail view of the grooves.]

This pattern superficially resembles a variable-area optical audio track seen in old film prints, and that's why I previously tried to decode it as such. But it didn't produce satisfactory results, and there is no physical reason it even should. In fact, I'm not even sure as to which physical parameter the audio is encoded in – does the needle move vertically or horizontally? How would this feature manifest itself in the photograph? Do the bright blobs represent crests in the groove, or just areas that happen to be oriented the right way in this particular lighting?

Unwrapping the record
To make the grooves a little easier to follow I first unwrapped the circular record into a linear image. I did this by remapping the image space from polar to 9000-wide Cartesian coordinates and then resampling it with a windowed sinc kernel:

[Image: The photo of the circular record unwrapped into a long linear strip.]
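Something similar can be done with SciPy's map_coordinates, though with spline interpolation instead of a windowed sinc kernel. The function name and argument layout are my own:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def unwrap_polar(img, center_y, center_x, r_max, width=9000):
    """Remap a circular image into an (r_max, width) strip:
    columns sweep one full revolution, rows sweep the radius."""
    theta = np.linspace(0.0, 2.0 * np.pi, width, endpoint=False)
    r = np.arange(r_max, dtype=float)
    rr, tt = np.meshgrid(r, theta, indexing="ij")
    ys = center_y + rr * np.sin(tt)
    xs = center_x + rr * np.cos(tt)
    return map_coordinates(img, [ys, xs], order=3)
```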

Mapping the groove path

It's not easy to automatically follow the groove. As one would imagine, it's not a mathematically perfect spiral. Sometimes the groove disappears into darkness, or blurs into the adjacent track. But it wasn't overly tedious to draw a guiding path manually. Most of the work was just copy-pasting from a previous groove and making small adjustments.

I opened the unwrapped image in Inkscape and drew a colored polyline over all obvious grooves. I tried to make sure a polyline at the left image border would neatly continue where the previous one ended on the right side.

The grooves were alternately labeled as 'a' and 'b', since I knew this record had two different sound effects on interleaved tracks.

[Image: A zoomed-in view of the unwrapped grooves labeled and highlighted with colored lines.]

This polyline was then exported from Inkscape and loaded by a script that extracted a 3-7 pixel high column from the unwrapped original, centered around the groove, for further processing.

Pixels to audio

I had noticed another information-carrying feature besides just the transverse area of the groove: its displacement from center. The white blobs sometimes appear below or above the imaginary center line.

[Image: Parts of a few grooves shown greatly magnified. They appear either as horizontal stripes, or horizontally organized groups of distinct blobs.]

I had my script calculate the brightness mass center (weighted y average) relative to the track polyline at all x positions along the groove. This position was then directly used as a PCM sample value, and the whole groove was written to a WAV file. A noise reduction algorithm was also applied, based on sample noise from the silent end of the groove.
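In Python terms, the centroid calculation amounts to a brightness-weighted average over each pixel column. The function name here is hypothetical, and the noise reduction step is omitted:

```python
import numpy as np

def groove_to_samples(strip):
    """strip: 2-D brightness array for one groove (rows = y,
    columns = positions along the groove). The brightness-weighted
    y centroid of each column becomes one PCM sample, which is then
    centered around zero and normalized to [-1, 1]."""
    ys = np.arange(strip.shape[0])[:, None]
    weight = np.maximum(strip.sum(axis=0), 1e-9)  # avoid division by zero
    centroid = (strip * ys).sum(axis=0) / weight
    samples = centroid - centroid.mean()
    return samples / np.max(np.abs(samples))
```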

The results are much better than what I previously obtained (see video below, or mp3 here):

Future ideas

Several factors limit the fidelity and dynamic range obtained by this method. For one, the relationship between the white blobs and needle movement is not known. The results could possibly still benefit from more pixel resolution and color bit depth. The blob central displacement (insofar as it is the most useful feature) could also be more accurately obtained using a Gaussian fit or similar algorithm.

The groove guide could be drawn more carefully, as some track slips can be heard in the recovered audio.

Opening up the robot for another photograph would be risky, since I already broke a plastic tab before. But other ways to optically capture the signal would be using a USB microscope or a flatbed scanner. These methods would still be only slightly more complicated than just using a microphone! The linear light source of the scanner would possibly cause problems with the circular groove. I would imagine the problem of the disappearing grooves would still be there, unless some sort of carefully controlled lighting was used.

Virtual music box

A little music project I was writing required a melody be played on a music box. However, the paper-programmable music box I had (pictured) could only play notes on the C major scale. I couldn't easily find a realistic-sounding synthesizer version either. They all seemed to be missing something. Maybe they were too perfectly tuned? I wasn't sure.

Perhaps, if I digitized the sound myself, I could build a flexible virtual instrument to generate just the perfect sample for the piece!

[Image: A paper programmable music box.]

I haven't really made a sampled instrument before, short of perhaps using Impulse Tracker clones with terrible single-sample instruments. So I proceeded in an improvised manner. Below I'll post some interesting findings and sound samples of how the instrument developed along the way. There won't be any source code for now.

By the way, there is a great explanatory video by engineerguy about the workings of music boxes that will explain some terminology ("pins" and "teeth") used in this post.

Recording samples

[Image: A recording setup with a microphone.]

The first step was, obviously, to record the sound to be used as samples. I damped my room using towels and mattresses to minimize room echo; this could be added later if desired, but for now it would only make it harder to cleanly splice the audio. The microphone used was the Audio Technica AT2020, and I digitized the signal using the Behringer Xenyx 302 USB mixer.

I perforated a paper roll to play all the possible notes in succession, and rolled the paper through. The sound of the paper going through the mechanism posed a problem at first, but I soon learned to stop the paper at just the right moment to make way for the sound of the tooth.

Now I had pretty decent recordings of the whole two-octave range. I used Audacity to extract the notes from the recording, and named the files according to the actual playing MIDI pitch. (The music box actually plays a G# major scale, contrary to what's marked on the blank paper rolls.)

The missing notes

Next, we'll need to generate the missing notes that don't belong in the scale of this music box. Because pitch is proportional to the speed of vibration, this could be done by simply speeding up or slowing down an adjacent note by just the right factor. In equal temperament tuning, this factor would be the 12th root of 2, or roughly 1.05946. Such scaling is straightforward to do on the command line using SoX, for instance (sox c1.wav c_sharp1.wav speed 1.05946).

[Image: Musical notation explaining transposition by multiplication by the 12th root of 2.]

This method can also be used to generate whole new octaves; for example, a transposition of +8 semitones would have a ratio of (¹²√2)⁸ ≈ 1.5874. Inter-note variance could be retained by using a random source file for each resampled note. But large-interval transpositions would probably not sound very good due to coloring in the harmonic series.

Here's a table of some intervals and the corresponding speed ratios in equal temperament:

–3    (¹²√2)⁻³    ≈ 0.840896
–2    (¹²√2)⁻²    ≈ 0.890899
–1    (¹²√2)⁻¹    ≈ 0.943874
+1    (¹²√2)¹     ≈ 1.059463
+2    (¹²√2)²     ≈ 1.122462
+3    (¹²√2)³     ≈ 1.189207
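All of these come from one formula: shifting by n semitones multiplies the playback speed by 2^(n/12).

```python
def speed_ratio(semitones):
    """Equal-temperament playback-speed factor for a pitch shift
    of the given number of semitones (negative shifts slow down)."""
    return 2.0 ** (semitones / 12.0)
```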

First test!

Now I could finally write a script to play my melody!

It sounds pretty good already - there's no obvious noise and the samples line up seamlessly even though they were just naively glued together sample by sample. There's a lot of power in the lower harmonics, probably because of the big cardboard box I used, but this can easily be changed by EQ if we want to give the impression of a cute little music box.

Adding errors

The above sound still sounded quite artificial, I think mostly because simultaneous notes start on the same exact millisecond. There seems to be a small timing variance in music boxes that is an important contributor to their overall delicate sound. In the below sample I added a timing error from a normal distribution with a standard deviation of 11 milliseconds. It sounds a lot better already!
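A sketch of that jitter, with note-on times in seconds; the function name is my own:

```python
import random

def humanize(onsets, sigma=0.011, seed=None):
    """Add a normally distributed timing error (standard deviation
    11 ms) to each note-on time, never moving a note before zero."""
    rng = random.Random(seed)
    return [max(0.0, t + rng.gauss(0.0, sigma)) for t in onsets]
```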

Other sounds from the teeth

If you listen to recordings of music boxes you can occasionally hear a high-pitched screech as well. It sounds a bit like stopping a tuning fork or guitar string with a metal object. That's why I thought it must be the sound of the pin stopping a vibrating tooth just before playing another note on the same tooth.

[Image: Spectrogram of the beginning of a note with the characteristic screech, centered around 12 kilohertz.]

Sure enough, this sound could always be heard by playing the same note twice in quick succession. I recorded this sound for each tooth and added it to my sound generator. The sound will be generated only if the previous note sample is still playing, and its volume will be scaled in proportion to the tooth's envelope amplitude at that moment. The screech will also silence the previous note. The amount of silence between the screech and the next note will depend on a tempo setting.

Adding this resonance definitely brings about a more organic feel:

The wind-up mechanism

For a final touch I recorded sounds from the wind-up mechanism of another music box, since this box doesn't have one. It's all stitched up from small pieces, so the number of wind-ups in the beginning and the speed of the whirring sound can all be adjusted. I was surprised at the smoothness of the background sound; it's a three-second loop with no cross-fading involved. You can also hear the box lid being closed in the end.

Notation format
[Image: VIM screenshot of a text file containing music box markup.]

The native notation of a music box is some kind of a perforated tape or drum, so I ended up using a similar format. There's a tempo marking and tuning information in the beginning, followed by notation, one eighth note per line. Arpeggios are indicated by a pointy bracket (>). I also wrote a script to convert MIDI files into this format; but the number of notes in a music box loop is usually so small that it's not very hard to write manually.

This format could include additional information as well, perhaps controlling the motor sound or box size and shape (properties of the EQ filter).

This format could also potentially be useful when producing or transcribing music from music drums.

Future developments

Currently the music box generator has a hastily written "engineer's UI", which means I probably won't remember how to use it in a couple months' time. Perhaps it could be integrated into some music software, as a plugin.

Possibilities for live performances are limited, I think. It wouldn't work exactly like a keyboard instrument usually does. At least there should be a way to turn on the background noise, and the player should take into account the 300-millisecond delay caused by the pin slowly rotating over the tooth. But it could be used to play a roll in an endless loop and the settings could be modified on the fly.

As such, the tool performs best at pre-rendering notated music. And I'm happy with the results!