Capturing PAL video with an SDR (and a few dead-ends)

I play 1980s games, mostly Super Mario Bros., on the Nintendo NES console. It would be great to be able to capture live video from the console for recording speedrun attempts. Now, how to make the 1985 NES and the 2013 MacBook play together, preferably using hardware that I already have? This project diary documents my search for the answer.

Here's a spoiler – it did work:

[Image: A powered-on NES console and a MacBook on top of it, showing a Tetris title screen.]

Things that I tried first

A capture device

Video capture devices, or capture cards, are devices specially made for this purpose. There was only one cheap (~30€) capture device for composite video available locally, and I bought it in hope. But it wasn't readily recognized as a video device on the Mac, and there seemed to be no Mac drivers available. Having already almost capped my budget for this project, I then ordered a 5€ EasyCap device from eBay, as there was some evidence of Mac drivers online. The EasyCap was still making its way to Finland as of this writing, so I continued to pursue other routes.

PS: When the device finally arrived, it sadly seemed that the EasyCapViewer-Fushicai software only supports opening this device in NTSC mode. There's PAL support in later commits in the GitHub repo, but the project is old and can't be compiled anymore as Apple has deprecated QuickTime.

Even when they do work, a downside to many cheap capture devices is that they can only capture at half the true framerate (that is, at 25 or 30 fps).

CRT TV + DSLR camera

The cathode-ray tube television that I use for gaming could be filmed with a digital camera. This posed interesting problems: the camera's shutter must be timed so that a full scan is captured in every frame, to prevent temporal aliasing (dark stripes). This is why I used a DSLR camera with a full manual mode (a Canon EOS 550D in this case).

For the 50 Hz PAL television screen I used a camera frame rate of 25 fps and an exposure time of 1/50 seconds (set by camera limitations). The camera will miss every other frame of the original 50 fps video, but on the other hand, will get an evenly lit screen every time.

A Moiré pattern will also appear if the camera is focused on the CRT shadow mask. This is due to interference between two regular 2D arrays, the shadow mask in the TV and the CCD array in the camera. I got rid of this by setting the camera on manual focus and defocusing the lens just a bit.

[Image: A screen showing Super Mario Bros., and a smaller picture with Oona in it.]

This produced surprisingly good quality video, save for the slight jerkiness caused by the low frame rate (video). This setup was good for one-off videos. However, I could not use it for live streaming, because the camera could only record onto its SD card and not connect to the computer directly.

LCD TV + webcam

An old LCD TV that I have flickers significantly less than the CRT, and a webcam would give me live video. But the Microsoft LifeCam HD-3000 that I have offers only a binary choice for manual exposure (pretty much "none" and "lots"). At the higher setting the video was quite washed out, with lots of motion blur. The lower setting was so fast that the LCD appeared to have visible vertical scanning. Brightness was also heavily dependent on viewing angle, which caused gradients over the image. I had to film at a slightly elevated angle so that the upper part of the image wouldn't go too dark, and this made the video look like a bootleg movie copy.

[Image: A somewhat blurry photo of an LCD TV showing Super Mario Bros.]

Composite video

Now to capturing the actual video signal. The NES has two analog video outputs: one is composite video and the other an RF modulator, which has the same composite video signal modulated onto an AM carrier in the VHF television band plus a separate FM audio carrier. This is meant for televisions with no composite video input: the TV sees the NES as an analog TV station and can tune to it.

In composite video, information about brightness, colour, and synchronisation is encoded in the signal's instantaneous voltage. The bandwidth of this signal is at least 5 MHz, or 10 MHz when RF modulated, which would require a 10 MHz IQ sampling rate.

[Image: Oscillogram of one PAL scanline, showing hsync, colour burst, and YUV parts.]

I happen to have an Airspy R2 SDR receiver that can listen to VHF and take 10 million samples per second - could it be possible? I made a cable that can take the signal from the NES RCA connector to the Airspy SMA connector. And sure enough, when the NES RF channel selector is at position "3", a strong signal indeed appears on VHF television channel 3, at around 55 MHz.

Software choices

There's already an analog TV demodulator for SDRs - it's a plugin for SDR# called TVSharp. But SDR# is a Windows program and TVSharp doesn't seem to support colour. And it seemed like an interesting challenge to write a real-time PAL demodulator myself anyway.

I had been playing with analog video demodulation recently because of my HDMI Tempest project (video). So I had already written a C++ program that interprets a 10 Msps digitised signal as greyscale values and sync pulses and shows it live on the screen. Perhaps this could be used as a basis to build on. (It was never published, but apparently there is a similar project written in Java, called TempestSDR.)

Data transfer from the SDR is done using airspy_rx from airspy-tools. This is piped to my program that reads the data into a buffer, 256 ksamples at a time.

Automatic gain control is an important part of demodulating an AM signal. I used liquid-dsp's AGC by feeding it the maximum amplitude over every scanline period; this roughly corresponds to sync level. This is suboptimal, but it works in our high-SNR case. AM demodulation was done using std::abs() on the complex-valued samples. The resulting real value had to be subtracted from 1, because TV is transmitted "inverse AM" to save on the power bill. I then scaled the signal so that black level was close to 0, white level close to 1, and sync level below 0.
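The envelope detection and level mapping described above can be sketched as follows. This is a minimal illustration, not the post's actual code; the level constants would in practice come from the AGC's per-scanline measurements:

```cpp
#include <cassert>
#include <cmath>
#include <complex>

// Plain AM demodulation: the envelope of one IQ sample.
inline float am_envelope(std::complex<float> iq) { return std::abs(iq); }

// TV is "inverse AM": sync pulses have the strongest carrier and white the
// weakest. Map the envelope linearly so black ~ 0, white ~ 1, sync < 0.
// black_env and white_env are the measured envelope levels for black and
// white (hypothetical names, not from the original program).
inline float to_luma(float env, float black_env, float white_env) {
    return (black_env - env) / (black_env - white_env);
}
```

With, say, black at envelope 0.7 and white at 0.3, a sync-level envelope of 1.0 maps to a negative value, which is what the sync detector looks for.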

I use SDL2 to display the video and OpenCV for pixel addressing, scaling, cropping, and YUV-RGB conversions. OpenCV is an overkill dependency inherited from the Tempest project and SDL2 could probably do all of those things by itself. This remains TODO.

Removing the audio

The captured AM carrier seems otherwise clean, but there's an interfering peak on the lower sideband side at about –4.5 MHz. I originally saw it in the demodulated signal and thought it would be related to colour, as it's very close to the PAL chroma subcarrier frequency of 4.43361875 MHz. But when it started changing frequency in triangle-wave shapes, I realized it's the audio FM carrier. Indeed, when it is FM demodulated, beautiful NES music can be heard.

[Image: A spectrogram showing the AM carrier centered in zero, with the sidebands, chroma subcarriers and audio alias annotated.]

The audio carrier is actually outside this 10 MHz sampled bandwidth. But it's so close to the edge (and so powerful) that the Airspy's anti-alias filter cannot sufficiently attenuate it, and it becomes folded, i.e. aliased, onto our signal. This caused visible banding in the greyscale image, and some synchronization problems.
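The folding is easy to verify numerically. In PAL B/G the FM sound carrier sits 5.5 MHz above the vision carrier, so with 10 Msps complex sampling centered on the vision carrier it wraps to −4.5 MHz. A small illustration (mine, not from the post):

```cpp
#include <cassert>
#include <cmath>

// With complex (IQ) sampling at rate fs, a frequency f outside
// [-fs/2, fs/2) appears wrapped by an integer multiple of fs.
// std::remainder rounds f/fs to the nearest integer, which is
// exactly the aliasing behaviour we want to model.
inline double alias_frequency(double f, double fs) {
    return std::remainder(f, fs);
}
```

Here alias_frequency(5.5e6, 10e6) gives −4.5e6, matching the interfering peak on the lower sideband.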

I removed the audio using a narrow FIR notch filter from the liquid-dsp library. Now, the picture quality is very much acceptable. Minor artifacts are visible in narrow vertical lines because of a pixel rounding choice I made, but they can be ignored.

[Image: Black-and-white screen capture of NES Tetris being played.]

Decoding colour

PAL colour is a bit complicated. It was designed in the 1960s to be backwards compatible with black-and-white TV receivers. It uses the YUV colourspace, the Y or "luminance" channel being a black-and-white sum signal that already looks good by itself. Even if the whole composite signal is interpreted as Y, the artifacts caused by colour information are bearable. Y also has a lot more bandwidth, and hence resolution, than the U and V (chrominance) channels.

U and V are encoded in a chrominance subcarrier in a way that I still haven't quite grasped. The carrier is suppressed, but a burst of carrier is transmitted just before every scanline for reference (so-called colour burst).

Turns out that much of the chroma information can be recovered by band-pass filtering the chrominance signal, mixing it down to baseband using a PLL locked to the colour burst, rotating it by a magic number (chroma *= std::polar(1.f, deg2rad(170.f))), and plotting the real and imaginary parts of this complex number as the U and V colour channels. This is similar to how NTSC colour is demodulated.
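The rotation step can be sketched like this. The 170° constant is the post's magic number; the input is assumed to be one chroma sample already band-passed and mixed to baseband by the burst-locked PLL:

```cpp
#include <cassert>
#include <cmath>
#include <complex>

inline float deg2rad(float d) { return d * 3.14159265358979f / 180.f; }

// After mixing the chroma to baseband, one complex sample carries both
// colour axes. A fixed rotation aligns them so that the real part reads
// out as U and the imaginary part as V.
inline void chroma_to_uv(std::complex<float> chroma, float& u, float& v) {
    chroma *= std::polar(1.f, deg2rad(170.f));
    u = chroma.real();
    v = chroma.imag();
}
```

A sample that arrives at phase −170° lands exactly on the U axis after the rotation.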

In PAL, every other scanline has its chrominance phase shifted (hence the name, Phase Alternating [by] Line). I couldn't get consistent results demodulating this, so I skipped the chrominance part of every other line and copied it from the line above. This doesn't even look too bad for my purposes. However, there seems to be a pre-echo in UV that's especially visible on a blue background (most of SMB1 sadly), and a faint stripe pattern on the Y channel, most probably crosstalk from the chroma subcarrier that I left intact for now.

[Image: The three chroma channels Y, U, and V shown separately as greyscale images, together with a coloured composite of Mario and two Goombas.]

I used liquid_firfilt to band-pass the chroma signal, and liquid_nco to lock onto the colour burst and shift the chroma to baseband.

Let's play Tetris!


It's not my goal to use this system as a gaming display; I'm still planning to use the CRT. However, total buffer delays are quite small due to the 10 Msps sampling rate, so the latency from controller to screen is pretty good. The laptop can also easily decode and render at 50 fps, which is the native frame rate of the PAL NES. Tetris is playable up to level 12!

Using a slow-mo phone camera, I measured the time it takes for a button press to make Mario jump. The latency is similar to that of a NES emulator:

Method                  Frames @ 240 fps   Latency
RetroArch emulator      28                 117 ms
PAL NES + Airspy SDR    26                 108 ms
PAL NES + LCD TV        20                 83 ms

Performance considerations

A 2013 MacBook Pro is perhaps not the best choice for dealing with live video to begin with. But I want to be able to run the PAL decoder and a screencap / compositing / streaming client on the same laptop, so performance is even more crucial.

When colour is enabled, CPU usage on this quad-core laptop is 110% for palview and 32% for airspy_rx. The CPU temperature is somewhere around 85 °C. Black-and-white decoding lowers palview usage to 84% and CPU temps to 80 °C. I don't think there's enough cycles left for a streaming client just yet. Some CPU headroom would be nice as well; a resync after dropped samples looks quite nasty, and I wouldn't want that to happen very often.

[Image: htop screenshot showing palview and airspy_rx on top, followed by some system processes.]

Profiling reveals that the most CPU-intensive tasks are those related to FIR filtering. FIR filters are based on convolution, which is computationally expensive unless done in hardware. FFT convolution can be faster, but only when the kernel is relatively long.

[Image: Diagram showing that the audio notch FIR takes up 27 % and the chroma bandpass FIR 12 % of CPU time. Several smaller contributors are also mentioned.]

I've thought of having another computer do the Airspy transfer, audio notch filtering, and AM demodulation, and then transmit this preprocessed signal to the laptop via Ethernet. But my other computers (Raspberry Pi 3B+ and a Core 2 Duo T7500 laptop) are not nearly as powerful as the MacBook.

Instead of a FIR bandpass filter, a so-called chrominance comb filter is often used to separate chrominance from luminance. This could be realized very efficiently as a linear-complexity delay line. This is a promising possibility, but so far my experiments have had mixed results.
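The delay-line idea can be sketched as below. This is my illustration of the general technique, not the post's experimental code, and it assumes the subcarrier inverts phase between adjacent lines; PAL's 283.75-cycle-per-line subcarrier offset complicates the real thing, which is likely where the "mixed results" come from:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Separate luma and chroma with a one-scanline delay comb. If the chroma
// subcarrier inverts phase between adjacent lines, summing adjacent lines
// cancels chroma (leaving luma) and differencing cancels luma (leaving
// chroma). Cost is O(1) per sample, just one delay-line tap.
void comb_separate(const std::vector<float>& x, std::size_t samples_per_line,
                   std::vector<float>& luma, std::vector<float>& chroma) {
    luma.assign(x.size(), 0.f);
    chroma.assign(x.size(), 0.f);
    for (std::size_t n = samples_per_line; n < x.size(); n++) {
        luma[n]   = 0.5f * (x[n] + x[n - samples_per_line]);
        chroma[n] = 0.5f * (x[n] - x[n - samples_per_line]);
    }
}
```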

There's no source code release for now (Why? FAQ), but if you want some real-time coverage of this project, I did a multi-threaded tweetstorm: one, two, three.

Beeps and melodies in two-way radio

Lately my listening activities have focused on two-way FM radio. I'm interested in automatic monitoring and visualization of multiple channels simultaneously, and classifying transmitters. There's a lot of in-band signaling to be decoded! This post shall demonstrate this diversity and also explain how my listening station works.

Background: walkie-talkies are fun

The frequency band I've recently been listening to the most is called PMR446. It's a European band of radio frequencies for short-distance UHF walkie-talkies. Unlike ham radio, it doesn't require licenses or technical competence – anyone with 50€ to spare can get a pair of walkie-talkies at the department store. It's very similar to FRS in the US. It's quite popular where I live.

[Image: Photo of three different walkie-talkies.]

The short-distance nature of PMR446 is what I find perhaps most fascinating: in normal conditions, everything you hear has been transmitted from within a 2-kilometer (1.2-mile) radius. Transmitter power is limited to 500 mW and directional antennas are not allowed on the transmitter side. But I have a receive-only system, and my only directional antenna is for 450 MHz, which is how I originally found these channels.

Roger beep

The roger beep is a short melody sent by many hand-held radios to indicate the end of transmission.

The end of transmission must be indicated, because two-way radio is 'half-duplex', which means only one person can transmit at a time. Some voice protocols solve the same problem by mandating the use of a specific word like 'over'; others rely on the short burst of static (squelch tail) that can be heard right after the carrier is lost. Roger beeps are especially common in consumer radios, but I've heard them in ham QSOs as well, especially if repeaters are involved.

Other signaling on PMR

PMR also differs from ham radio in that many of its users don't want to hear random people talking on the same frequency; indeed, many devices employ tones or digital codes designed to silence unwanted conversations, called CTCSS, DCS, or coded squelch. They are very low-frequency tones that can't usually be heard at all because of filtering. These won't prevent others from listening to you though; anyone can just disable coded squelch on their device and hear everyone else on the channel.

Many devices also use a tone-based system for preventing the short burst of static, that classic walkie-talkie sound, from sounding whenever a transmission ends. Baofeng calls these squelch tail elimination tones, or STE for short. The practice is not standardized and I've seen several different sub-audible frequencies being used in the wild, namely 55, 62, and 260 Hz. (Edit: As correctly pointed out by several people, another way to do this is to reverse the phase of the CTCSS tone in the end, called a 'reverse burst'. Not all radios use it though; many opt to send a 55 Hz tone instead, even when they are using CTCSS.)

Some radios have a button called 'alarm' that sends a long, repeating melody resembling a 90s mobile phone ring tone. These melodies also vary from one radio to the other.

My receiver

I have a system in place to alert me whenever there's a strong enough signal matching an interesting set of parameters on any of the eight PMR channels. It's based on a Raspberry Pi 3B+ and an Airspy R2 SDR receiver. The program can play the live audio of all channels simultaneously, or a single channel can be selected for listening. It also has an annotated waterfall view that shows traffic on the band during the last couple of hours:

[Image: A user interface with text-mode graphics, showing eight vertical lanes of timestamped information. The lanes are mostly empty, but there's an occasional colored bar with annotations like 'a1' or '62'.]

The computer is a headless Raspberry Pi with only SSH connectivity; that's why it's in text mode. Also, text-mode waterfall plots are cool!

The coloured bars indicate signal strength (colour) and the duty factor (pattern). The numbers around the bars are decoded squelch codes, STEs and roger beeps. Uncertain detections are greyed out. In this view we've detected roger beeps of type 'a1' and 'a2'; a somewhat rare 62 Hz STE tone; and a ring tone, or alarm (RNG).

Because squelch codes are designed to be read by electronic circuits and their frequencies and codewords are specified exactly, writing a digital decoder for them was somewhat straightforward. Roger beeps and ring tones, on the other hand, are only meant for the human listener and detecting them amongst the noise took a bit more trial-and-error.
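The post doesn't show its squelch-code decoder, but detecting one known sub-audible frequency is classically done with the Goertzel algorithm; here's a minimal sketch of that approach (my illustration, not the author's code):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Goertzel power of one target frequency in a block of audio samples.
// CTCSS tones are sub-audible (roughly 67..254 Hz), so a long block is
// needed for adequate frequency resolution at typical sample rates.
float goertzel_power(const std::vector<float>& x, float f_target, float fs) {
    const float w = 2.f * 3.14159265358979f * f_target / fs;
    const float coeff = 2.f * std::cos(w);
    float s1 = 0.f, s2 = 0.f;
    for (float sample : x) {
        float s0 = sample + coeff * s1 - s2;  // second-order resonator
        s2 = s1;
        s1 = s0;
    }
    // Squared magnitude of the DFT bin at f_target.
    return s1 * s1 + s2 * s2 - coeff * s1 * s2;
}
```

Running this once per code frequency over each audio block, and comparing the winner's power against the rest, gives a simple detector.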

Melody detection algorithm

The melody detection algorithm in my receiver is based on a fast Fourier transform (FFT). When loss of carrier is detected, the last moments of the audio are searched for tones as follows:

[Image: A diagram illustrating how an FFT is used to search for a melody. The FFT in the image is noisy and some parts of the melody can not be measured.]
  1. The audio buffer is divided up into overlapping 60-millisecond Hann-windowed slices.
  2. Every slice is Fourier transformed and all peak frequencies (local maxima) are found. Their center frequencies are refined using Gaussian peak interpolation (Gasior & Gonzalez 2004). We need this, because we're only going to allow ±15 Hz of frequency error.
  3. The time series formed by the strongest maxima is compared to a list of pre-defined 'tone signatures'. Each candidate tone signature gets a score based on how many FFT slices match (+) the corresponding slices of the signature. Slices with too much frequency error subtract from the score (−).
  4. Most tone signatures have one or more 'quiet zones', the quietness of which further contributes to the score. This is usually placed after the tone, but some tones may also have a pause in the middle.
  5. The algorithm allows second and third harmonics (with half the score), because some transmitters may distort the tones enough for these to momentarily overpower the fundamental frequency.
  6. Every possible time shift (starting position) inside the 1.5-second audio buffer is searched.
  7. The tone signature with the best score is returned, if this score exceeds a set threshold.
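Step 2's Gaussian peak interpolation fits a parabola to the log-magnitudes of the peak bin and its two neighbours; for a Gaussian-shaped peak, which a windowed tone approximates, this recovers the centre frequency to a small fraction of a bin. A sketch:

```cpp
#include <cassert>
#include <cmath>

// Given FFT magnitudes at the peak bin k and its neighbours k-1 and k+1,
// return the true peak's offset from bin k, in bins (roughly -0.5..0.5).
// Exact when the peak shape is Gaussian (Gasior & Gonzalez 2004).
float gaussian_peak_offset(float y_prev, float y_peak, float y_next) {
    float a = std::log(y_prev), b = std::log(y_peak), c = std::log(y_next);
    return 0.5f * (a - c) / (a - 2.f * b + c);
}
```

For a Gaussian peak centred 0.3 bins above bin k, the function returns exactly 0.3, which is what makes a ±15 Hz tolerance workable even with 60 ms FFT slices.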

This algorithm works quite well. It's not always able to detect the tones, especially if part of the melody is completely lost in noise, but it's good enough to be used for waterfall annotation. False positives are rare; most of them are detections of very short tone signatures that only consist of one or two beeps. My test dataset of 92 recorded transmissions yields only 5 false negatives and no false positives.

For example, this noisy recording:

was successfully recognized as having a ringtone (RNG), a roger beep of type a1, and CTCSS code XA:

Naming and classification

Because I love classifying stuff I've had to come up with a system for naming these roger tones as well. My current system uses a lower-case letter for classifying the tone into a category, followed by a number that differentiates similar but slightly different tones. This is a work in progress, because every now and then a new kind of tone appears.

My goal would be to map the melodies to specific manufacturers. I've only managed to map a few. Can you recognise any of these devices?

Class   Identified model         Recording
a       Cobra AM845              (a1)
c       Motorola TLKR T40        (c1)
h       Baofeng UV-5RC

I didn't list them all here, but there are even more samples. I've added some alarm tones there as well, and a list of all the tone signatures that I currently know of. (Why no full source code? FAQ)

In my rx log I also have an emoji classification system for CTCSS codes. This way I can recognize a familiar transmission faster. A few examples below (there are 38 different CTCSS codes in total):

[Image: Two-character codes grouped into categories and paired with emoji. Four categories, namely fruit, sound, mammals, and scary. The fruit category has codes beginning with an M, and emoji for different fruit, etc.]

Future directions

There are mainly just minor bugs in my project trello at the moment, like adding the aforementioned emoji. But as the RasPi is not very powerful the DSP chain could be made more efficient. Sometimes a block of samples gets dropped. Currently it uses a bandpass-sampled filterbank to separate the channels, exploiting aliasing to avoid CPU-intensive frequency shifting altogether:

This is quite fast. But the 1:20 decimation from the Airspy IQ data is done with SoX's 1024-point FIR filter and could possibly be done with fewer coefficients. Also, the RasPi has four cores, so half of the channels could be demodulated in a second thread. Currently all concurrency is thanks to SoX and pmrsquash being different processes.

Animated line drawings with OpenCV

OpenCV is a pretty versatile C++ computer vision library. Because I use it every day it has also become my go-to tool for creating simple animations at pixel level, for fun, and saving them as video files. This is not one of its core functions but happens to be possible using its GUI drawing tools.

Below we'll take a look at some video art I wrote for a music project. It goes a bit further than just line drawings but the rest is pretty much just flavouring. As you'll see, creating images in OpenCV has a lot in common with how you would work with layers and filters in an image editor like GIMP or Photoshop.

Setting it up

It doesn't take a lot of boilerplate to initialize an OpenCV project. Here's my minimal CMakeLists.txt:

cmake_minimum_required (VERSION 2.8)
project                (marmalade)
find_package           (OpenCV REQUIRED)
add_executable         (marmalade main.cc)
target_link_libraries  (marmalade ${OpenCV_LIBS})

I also like to set compiler flags to enforce the C++11 standard, but this is not necessary.

In the main .cc file I have:

#include "opencv2/imgproc/imgproc.hpp"
#include "opencv2/highgui/highgui.hpp"

Now you can build the project by just typing cmake . && make in the terminal.

Basic shapes

First, we'll need an empty canvas. It will be a matrix (cv::Mat) with three unsigned char channels for RGB at Full HD resolution:

const cv::Size video_size(1920, 1080);
cv::Mat mat_frame = cv::Mat::zeros(video_size, CV_8UC3);

This will also initialize everything to zero, i.e. black.

Now we can draw our graphics!

I had an initial idea of an endless cascade of concentric rings each rotating at a different speed. There might be color and brightness variations as well but otherwise it would stay static the whole time. You can't see a circle's rotation around its center, so we'll add some features to them as well, maybe some kind of bars or spokes.

A simplified render method for a ring would look like this:

void Ring::RenderTo(cv::Mat& mat_output) const {
  cv::circle(mat_output, 8 * center_, 8 * radius_, color_, 1, CV_AA, 3);
  for (const Bar& bar : bars()) {
    cv::line(mat_output, 8 * (center_ + bar.start), 8 * (center_ + bar.end),
             color_, 1, CV_AA, 3);
  }
}

Drawing antialiased graphics at subpixel coordinates can make for some confusing OpenCV code. Here, all coordinates are multiplied by the magic number 8 and the drawing functions are instructed to do a bit shift of 3 bits (2^3 == 8). These three bits are used for the decimal part of the subpixel position.
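In other words, the shift parameter makes the drawing functions interpret coordinates as fixed-point numbers with 3 fractional bits. The encoding itself is tiny (plain C++, no OpenCV needed):

```cpp
#include <cassert>
#include <cmath>

// Encode a subpixel coordinate for cv::circle / cv::line with shift = 3:
// multiply by 2^3 and round to the nearest integer. The drawing function
// divides the factor back out internally, keeping the fractional position.
inline int to_fixed(float coord, int shift = 3) {
    return static_cast<int>(std::lround(coord * (1 << shift)));
}
```

So a coordinate of 100.125 pixels is passed to OpenCV as the integer 801.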

The coordinates of the bars are generated for each frame based on the ring's current rotation angle.

Here are some rings at different phases of rotation. A bug leaves the innermost circle with no spokes, but it kind of looks better that way.

[Image: White concentric circles on a black background, with evenly separated lines connecting them.]

Eye candy: Glow effect

I wanted a subtle vector display look to the graphics, even though I wasn't aiming for any sort of realism with it. So the brightest parts of the image would have to glow a little, or spread out in space. This can be done using Gaussian blur.

Gaussian blur requires convolution, which is very CPU-intensive. I think most of the rendering time was spent calculating blur convolution. It could be sped up using threads (cv::parallel_for_) or the GPU (cv::cuda routines) but there was no real-time requirement in this hobby project.

There are a couple of ways to only apply the blur to the brightest pixels. We could blur a copy of the image masked with its thresholded version, for example. But I like to use look-up tables (LUT). This is similar to the curves tool in Photoshop. A look-up table is just a 256-by-1 RGB matrix that maps an 8-bit index to a colour. In this look-up table I just have a linear ramp where everything under 127 maps to black.

cv::Mat mat_lut = GlowLUT();
cv::Mat mat_glow;
cv::LUT(mat_frame, mat_lut, mat_glow);
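GlowLUT() itself isn't shown here, but based on the description above, a plausible construction of the thresholded ramp looks like this (a sketch using a plain array; the real code would fill a 256-by-1 cv::Mat the same way):

```cpp
#include <array>
#include <cassert>

// 256-entry look-up table: everything below 127 maps to black (0),
// the rest passes through unchanged, i.e. a thresholded linear ramp.
// Applied per channel, this keeps only the brightest pixels for the glow.
std::array<unsigned char, 256> make_glow_lut() {
    std::array<unsigned char, 256> lut{};
    for (int i = 0; i < 256; i++)
        lut[i] = (i < 127) ? 0 : static_cast<unsigned char>(i);
    return lut;
}
```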

Now when blurring, if we add the original image on top of the blurred version, its sharpness is preserved:

cv::GaussianBlur(mat_glow, mat_glow, cv::Size(0,0), 3.0);
mat_frame += 2 * mat_glow;
[Image: A zoomed view of a circle, showing the glow effect.]

The effect works unevenly on antialiased lines which adds a nice pearl-stringy look.

Eye candy: Tinted glass and grid lines

I created a vignetted and dirty green-yellow tinted look by multiplying the image per-pixel by an overlay made in GIMP. This has the same effect as having a "Multiply" layer mode in an image editor. Perhaps I was thinking of an old glass display, or Vectrex overlays. The overlay also has black grid lines that will appear black in the result. Multiplication doesn't change the color of black areas in the original, but I also added a copy of the overlay at 10% brightness to make it dimly visible in the background.

cv::Mat mat_overlay = cv::imread("overlay.png");
cv::multiply(mat_frame, mat_overlay, mat_frame, 1.f/255);
mat_frame += mat_overlay * 0.1f;
[Image: A zoomed view of a circle, showing the color overlay effect.]

Eye candy: Flicker

Some objects flicker slightly for an artistic effect. This can be headache-inducing if overdone, so I tried to use it in moderation. The rings have a per-frame probability for a decrease in brightness, which I think looks good at 60 fps.

if (randf(0.f, 1.f) < .0001f)
  color *= .5f;

The spokes will also sometimes blink upon encountering each other, and the whole ring flickers a bit when it first becomes visible.

Title text

An LCD matrix font was used for the title text. This was just a PNG image of 128 characters that was spliced up and rearranged. This can be done in OpenCV by using submatrices and rectangle ROIs:

cv::Mat mat_font = cv::imread("lcd_font.png");
const cv::Size letter_size(24, 32);
const std::string text("finally, the end of the "
                       "marmalade forest!");

int cursor_x = 0;
for (char code : text) {
  int mx = code % 32;
  int my = code / 32;

  cv::Rect font_roi(cv::Point(mx * letter_size.width,
                              my * letter_size.height),
                    letter_size);
  cv::Mat mat_letter = mat_font(font_roi);

  cv::Rect target_roi(text_origin_.x + cursor_x, text_origin_.y,
                      mat_letter.cols, mat_letter.rows);
  mat_letter.copyTo(mat_frame(target_roi));

  cursor_x += letter_size.width;
}
[Image: A zoomed view of the text 'finally' with a glow and color overlay effect.]

Encoding the video

Now we can save the frames as a video file. OpenCV has a VideoWriter class for just this purpose. But I like to do this a bit differently. I encoded the frame images individually as BMP and just concatenated them one after the other to stdout:

std::vector<uchar> outbuf;
cv::imencode(".bmp", mat_frame, outbuf);
fwrite(outbuf.data(), sizeof(uchar), outbuf.size(), stdout);

I then ran this program from a shell script that piped the output to ffmpeg for encoding. This way I could also combine it with the soundtrack in a single run.

make && \
 ./marmalade -p | \
 ffmpeg -y -i $AUDIOFILE -framerate $FPS -f image2pipe \
        -vcodec bmp -i - -s:v $VIDEOSIZE -c:v libx264 \
        -profile:v high -b:a 192k -crf 23 \
        -pix_fmt yuv420p -r $FPS -shortest -strict -2 \
        video.mp4 && \
 open video.mp4


The 1080p/60 version can be viewed by clicking on the gear wheel menu.

In pursuit of Otama's tone

It would be fun to use the Otamatone in a musical piece. But for someone used to keyboard instruments it's not so easy to play cleanly. It has a touch-sensitive (resistive) slider that spans roughly two octaves in just 14 centimeters, which makes it very sensitive to finger placement. And in any case, I'd just like to have a programmable virtual instrument that sounds like the Otamatone.

What options do we have, as hackers? Of course the slider could be replaced with a MIDI interface, so that we could use a piano keyboard to hit the correct frequencies. But what if we could synthesize a similar sound all in software?

Sampling via microphone

We'll have to take a look at the waveform first. The Otamatone has a piercing electronic-sounding tone to it. One is inclined to think the waveform is something quite simple, perhaps a sawtooth wave with some harmonic coloring. Such a primitive signal would be easy to synthesize.

[Image: A pink Otamatone in front of a microphone. Next to it a screenshot of Audacity with a periodic but complex waveform in it.]

A friend lent me her Otamatone for recording purposes. Turns out the wave is nothing that simple. It's not a sawtooth wave, nor a square wave, no matter how the microphone is placed. But it sounds like one! Why could that be?

I suspect this is because the combination of speaker and air interface filters out the lowest harmonics (and parts of the others as well) of square waves. But the human ear still recognizes the residual features of a more primitive kind of waveform.

We have to get to the source!

Sampling the input voltage to the Otamatone's speaker could reveal the original signal. Also, by recording both the speaker input and the audio recorded via microphone, we could perhaps devise a software filter to simulate the speaker and head resonance. Then our synthesizer would simplify into a simple generator and filter. But this would require opening up the instrument and soldering a couple of leads in, to make a Line Out connector. I'm not doing this to my friend's Otamatone, so I bought one of my own. I named it TÄMÄ.

[Image: A Black Otamatone with a cable coming out of its mouth into a USB sound card. A waveform with more binary nature is displayed on a screen.]

I soldered the left channel and ground to the same pads the speaker is connected to. I had no idea about the voltage range in advance, but fortunately it just happens to fit line level and not destroy my sound card. As you can see in the background, we've recorded a signal that seems to be a square wave with a low duty cycle.

[Image: Oscillogram of a square wave.]

This square wave seems to be superimposed with a much quieter sinusoidal "ring" at 584 Hz that gradually fades out in 30 milliseconds.

Next we need to map out the effect the finger position on the slider has on this signal. It seems to not only change the frequency but the duty cycle as well. This happens a bit differently depending on which one of the three octave settings (LO, MID, or HI) is selected.

The Otamatone has a huge musical range of over 6 octaves:

[Image: Musical notation showing a range from A1 to B7.]

In frequency terms this means roughly 55 to 3800 Hz.

The duty cycle changes according to where we are on the slider: from 33 % in the lowest notes to 5 % in the highest ones, on every octave setting. The frequency of the ring doesn't change, it's always at around 580 Hz, but it doesn't seem to appear at all on the HI setting.

So I had my Perl-based software synth generate a square wave whose duty cycle and frequency change according to given MIDI notes.
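The post's synth is written in Perl, but the core idea can be sketched in C++ like this. The measured endpoints (33 % duty at the low end, 5 % at the high end) come from the text; interpolating between them in log-frequency is my assumption about how to connect the two figures:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Duty cycle interpolated in log-frequency between the measured endpoints:
// 33 % at 55 Hz down to 5 % at 3800 Hz. The log-frequency interpolation is
// an assumption, not a measured curve.
float duty_for_freq(float f) {
    float t = std::log(f / 55.f) / std::log(3800.f / 55.f);
    if (t < 0.f) t = 0.f;
    if (t > 1.f) t = 1.f;
    return 0.33f + t * (0.05f - 0.33f);
}

// One period of a pulse wave at frequency f, sample rate fs.
std::vector<float> pulse_period(float f, float fs) {
    int n = static_cast<int>(fs / f);
    float duty = duty_for_freq(f);
    std::vector<float> out(n);
    for (int i = 0; i < n; i++)
        out[i] = (i < duty * n) ? 1.f : -1.f;
    return out;
}
```

A real synth would also add the decaying ~580 Hz "ring" on the LO and MID octave settings, as described above.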

FIR filter 1: not so good

Raw audio generated this way doesn't sound right; it needs to be filtered to simulate the effects of the little speaker and other parts.

Ideally, I'd like to simulate the speaker and head resonances as an impulse response, by feeding well-known impulses into the speaker. The generated square wave could then be convolved with this response. But I thought a simpler way would be to create a custom FIR frequency response in REAPER, by visually comparing the speaker input and microphone capture spectra. When their spectra are laid on top of each other, we can read the required frequency response as the difference between harmonic powers, using the cursor in baudline. No problem, it's just 70 harmonics until we're outside hearing range!

[Image: Screenshot of Baudline showing lots of frequency spikes, and next to it a CSV list of dozens of frequencies and power readings in the Vim editor.]

I then subtracted one spectrum from another and manually created a ReaFir filter based on the extrema of the resulting graph.

[Image: Screenshot of REAPER's FIR filter editor, showing a frequency response made out of nodes and lines interpolated between them.]

Because the Otamatone's mouth can be twisted to make slightly different vowels, I recorded two spectra: one with the mouth fully closed and the other with it as open as possible.

But this method didn't quite give the sound the piercing nasalness I was hoping for.

FIR filter 2: better

After all that work I realized the line connection works in both directions! I can just feed any signal and the Otamatone will sound it via the speaker. So I generated a square wave in Audacity, set its frequency to 35 Hz to accommodate 30 milliseconds of response, played it via one sound card and recorded via another one:

[Image: Two waveforms, the top one of which is a square wave and the bottom one has a slowly decaying signal starting at every square transition.]

The waveform below is called the step response. One of the repetitions can readily be used as a FIR convolution kernel. Strictly, to get an impulse response would require us to sound a unit impulse, i.e. just a single sample at maximum amplitude, not a square wave. But I'm not redoing that since recording this was hard enough already. For instance, I had to turn off the fridge to minimize background noise. I forgot to turn it back on, and now I have a box of melted ice cream and a freezer that smells like salmon. The step response gives pretty good results.

One of my favorite audio tools, sox, can do FFT convolution with an impulse response. You'll have to save the impulse response as a whitespace-separated list of plaintext sample values, and then run sox original.wav convolved.wav fir response.csv.
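The same convolution can also be done in a few lines of Python with SciPy. This sketch assumes the response has been saved as one plaintext sample value per line, the same format sox's fir effect reads; the function name is mine:

```python
import numpy as np
from scipy.signal import fftconvolve

def apply_response(dry, kernel):
    """FFT-convolve the dry signal with the recorded response and
    normalize the result to prevent clipping."""
    wet = fftconvolve(dry, kernel)[: len(dry)]
    return wet / np.max(np.abs(wet))

# kernel = np.loadtxt("response.csv")  # one sample value per line
```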

Or one could use a VST plugin like FogConvolver:

[Image: A screenshot of Fog Convolver.]

A little organic touch

There's more to an instrument's sound than its frequency spectrum. The way the note begins and ends, the so-called attack and release, are very important cues for the listener.

The width of a player's finger on the Otamatone causes the pressure to be distributed unevenly at first, resulting in a slight glide in frequency. This also happens at note-off. The exact amount of Hertz to glide depends on the octave, and by experimentation I stuck with a slide-up of 5 % of the target frequency in 0.1 seconds.
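As a sketch, the note-on glide can be described as a short frequency envelope fed to the oscillator. The function and parameter names here are hypothetical, not from the Perl synth:

```python
import numpy as np

def note_on_glide(target_hz, fs=44100, amount=0.05, dur_s=0.1):
    """Start 5 % below the target frequency and slide up linearly
    over 0.1 seconds; the rest of the note stays at the target."""
    n = int(fs * dur_s)
    return np.linspace(target_hz * (1.0 - amount), target_hz, n)
```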

It is also very difficult to hit the correct note, so we could add some kind of random tuning error. But it turns out this would be too much; I want the music to at least be in tune.

Glides (glissando) are possible with the virtual instrument by playing a note before releasing the previous one. This glissando also happens in 100 milliseconds. I think it sounds pretty good when used in moderation.

I read somewhere (Wikipedia?) that vibrato is also possible with Otamatone. I didn't write a vibrato feature in the code itself, but it can be added using a VST plugin in REAPER (I use MVibrato from MAudioPlugins). I also added a slight flanger with inter-channel phase difference in the sample below, to make the sound just a little bit easier on the ears (but not too much).

Sometimes the Otamatone makes a short popping sound, perhaps when finger pressure is not firm enough. I added a few of these randomly after note-off.

Working with MIDI

We're getting on a side track, but anyway. Working with MIDI used to be straightforward on the Mac. But GarageBand, the tool I currently use to write music, amazingly doesn't have a MIDI export function. However, you can "File -> Add Region To Loop Library", then find the AIFF file in the loop library folder, and use a tool called GB2MIDI to extract MIDI data from it.

I used mididump from python-midi to read MIDI files.

Tyna Wind - lucid future vector

Here's TÄMÄ's beautiful synthesized voice singing us a song.

Descrambling split-band voice inversion with deinvert

Voice inversion is a primitive method of rendering speech unintelligible to prevent eavesdropping of radio or telephone calls. I wrote about some simple ways to reverse it in a previous post. I've since written a software tool, deinvert (on GitHub), that does all this for us. It can also descramble a slightly more advanced scrambling method called split-band inversion. Let's see how that happens behind the scenes.

Simple voice inversion

Voice inversion works by inverting the audio spectrum at a set maximum frequency called the inversion carrier. Frequencies near this carrier will thus become frequencies near zero Hz, and vice versa. The resulting audio is unintelligible, though familiar sentences can easily be recognized.

Deinvert comes with 8 preset carrier frequencies that can be activated with the -p option. These correspond to a list of carrier frequencies I found in an actual scrambler's manual, dubbed "the most commonly used inversion carriers".

The algorithm behind deinvert can be divided into three phases: 1) pre-filtering, 2) mixing, and 3) post-filtering. Mixing means multiplying the signal by an oscillation at the selected carrier frequency. This produces two sidebands, or mirrored copies of the signal, with the lower one frequency-inverted. Pre-filtering is necessary to prevent this lower sideband from aliasing when its highest components would go below zero Hertz. Post-filtering removes the upper sideband, leaving just the inverted audio. Both filters can be realized as low-pass FIR filters.

[Image: A spectrogram in four steps, where the signal is first cut at 3 kHz, then shifted up, producing two sidebands, the upper of which is then filtered out.]

This operation is its own inverse, like ROT13; by applying the same inversion again we get intelligible speech back. Indeed, deinvert can also be used as a scrambler by just running unscrambled audio through it. The same inversion carrier should be used in both directions.
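The three phases can be sketched in Python like this. It's a simplified model of the algorithm, not deinvert's actual code, and the function name is mine:

```python
import numpy as np
from scipy.signal import firwin, lfilter

def invert(x, fs, fc, ntaps=255):
    """Simple voice inversion around the carrier frequency fc."""
    lp = firwin(ntaps, fc, fs=fs)       # low-pass FIR, cutoff at carrier
    x = lfilter(lp, 1.0, x)             # 1) pre-filter: remove content above fc
    t = np.arange(len(x)) / fs
    x = 2.0 * x * np.cos(2.0 * np.pi * fc * t)  # 2) mix with the carrier
    return lfilter(lp, 1.0, x)          # 3) post-filter: keep lower sideband
```

Because the operation is its own inverse, running a signal through this twice with the same carrier should give the original band back.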

Split-band inversion

The split-band scrambling method adds another carrier frequency that I call the split point. It divides the spectrum into two parts that are inverted separately and then combined, preventing ordinary inverters from fully descrambling it.

A single filter-inverter pair may already bring back the low end of the spectrum. Descrambling it fully amounts to running the inversion algorithm twice, with different settings for the filters and mixer, and adding the results together.
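That double inversion can be sketched in Python as follows. Again, this is a simplified model rather than deinvert's actual code, and the function names are my own:

```python
import numpy as np
from scipy.signal import firwin, lfilter

def mix(x, fs, f):
    """Multiply the signal by a cosine at frequency f (the 'mixer')."""
    t = np.arange(len(x)) / fs
    return 2.0 * x * np.cos(2.0 * np.pi * f * t)

def descramble_split(x, fs, fc, split, ntaps=511):
    """Invert the band [0, split] and the band [split, fc]
    separately, then add the results together."""
    lp = firwin(ntaps, split, fs=fs)                       # low band
    bp = firwin(ntaps, [split, fc], pass_zero=False, fs=fs)  # high band
    low = lfilter(lp, 1.0, mix(lfilter(lp, 1.0, x), fs, split))
    high = lfilter(bp, 1.0, mix(lfilter(bp, 1.0, x), fs, split + fc))
    return low + high
```

Like simple inversion, this is its own inverse when the same two frequencies are used.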

The problem, then, is to find these two frequencies. Let's take a look at an example: audio scrambled using the CML CMX264 split-band inverter (from a video by GBPPR2).

[Image: A spectrogram showing a narrow band of speech-like harmonics, but with a constant dip in the middle of the band.]

In this case the filter roll-off is clearly visible in the spectrogram and it's obvious where the split point is. The higher carrier is probably at the upper limit of the full band or slightly above it. Here the full bandwidth seems to be around 3200 Hz and the split point is at 1200 Hz. This could be initially descrambled using deinvert -f 3200 -s 1200; if the result sounds shifted up or down in frequency, the parameters can be refined accordingly.

Performance
On a single core of an i7-based laptop from 2013, deinvert processes a 44.1 kHz WAV file at 60x realtime speed (120x for simple inversion). Most of the CPU cycles are spent doing filter convolution, i.e. calculating the signal's vector dot product with the low-pass filter kernels:

[Image: A graph of the time spent in various parts of the call tree of the program, with the subtree leading to the dot product operation highlighted. It takes well over 80 % of the tree.]

For this reason deinvert has a quality setting (0 to 3) for controlling the number of samples in the convolution kernels. A filter with a shorter kernel is linearly faster to compute, but has a gentler roll-off and will leave more unwanted harmonics.

A quality setting of 0 turns filtering off completely, and is very fast. For simple inversion this should be fine, as long as the original doesn't contain much power above the inversion carrier. It's easy to ignore the upper sideband because of its high frequency. In split-band descrambling this leaves some nasty folded harmonics in the speech band though.

Here's a descramble of the above CMX264 split-band audio using all the different quality settings in deinvert. You will first hear it scrambled, and then descrambled with increasing quality setting.

The default quality level is 2. This should be enough for real-time descrambling of simple inversion on a Raspberry Pi 1, still leaving cycles for an FM receiver for instance:

(RasPi 1)    Simple inversion    Split-band inversion
-q 0         16x realtime        5.8x realtime
-q 1         6.5x realtime       3.0x realtime
-q 2         2.8x realtime       1.3x realtime
-q 3         1.2x realtime       0.4x realtime

The memory footprint is less than four megabytes.

Future developments

There's a variant of split-band inversion where the inversion carrier changes constantly, called variable split-band. The transmitter informs the receiver about this sequence of frequencies via short bursts of data every couple of seconds or so. This data seems to be FSK, but decoding it shall be left for another time.

I've also thought about ways to automatically estimate the inversion carrier frequency. Shifting speech up or down in frequency breaks the relationships of the harmonics. Perhaps this fact could be exploited to find a shift that would minimize this error?


Gramophone audio from photograph, revisited

"I am the atomic powered robot. Please give my best wishes to everybody!"

Those are the words uttered by Tommy, a childhood toy robot of mine. I've taken a look at his miniature vinyl record sound mechanism a few times before (#1, #2), in an attempt to recover the analog audio signal using only a digital camera. Results were noisy at best. The blog posts resurfaced in a recent IRC discussion which inspired me to try my luck with a slightly improved method.

Source photo

I will be using an old photo of Tommy's internal miniature record I already had from previous adventures in 2012. I don't want to perform another invasive operation on Tommy to take a new photograph, as I already broke a plastic tab last time I opened him. But it also means I don't have control over the photographing environment. It's part of the challenge.

The picture was taken with a DSLR and it's an uncompressed 8-bit color photo measuring 3000 by 3000 pixels. There's a fair amount of focus blur, chromatic aberration and similar distortions. But at this resolution, a clear pattern can be seen when zooming into the grooves.

[Image: Close-up shot of a miniature vinyl record, with a detail view of the grooves.]

This pattern superficially resembles a variable-area optical audio track seen in old film prints, and that's why I previously tried to decode it as such. But it didn't produce satisfactory results, and there is no physical reason it even should. In fact, I'm not even sure as to which physical parameter the audio is encoded in – does the needle move vertically or horizontally? How would this feature manifest itself in the photograph? Do the bright blobs represent crests in the groove, or just areas that happen to be oriented the right way in this particular lighting?

Unwrapping the record
To make the grooves a little easier to follow I first unwrapped the circular record into a linear image. I did this by remapping the image space from polar to 9000-wide Cartesian coordinates and then resampling it with a windowed sinc kernel:

[Image: The photo of the circular record unwrapped into a long linear strip.]
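Something similar can be done with SciPy's map_coordinates, though with spline interpolation instead of a windowed sinc kernel. The function name and argument layout are my own:

```python
import numpy as np
from scipy.ndimage import map_coordinates

def unwrap_polar(img, center_y, center_x, r_max, width=9000):
    """Remap a circular image into an (r_max, width) strip:
    columns sweep one full revolution, rows sweep the radius."""
    theta = np.linspace(0.0, 2.0 * np.pi, width, endpoint=False)
    r = np.arange(r_max, dtype=float)
    rr, tt = np.meshgrid(r, theta, indexing="ij")
    ys = center_y + rr * np.sin(tt)
    xs = center_x + rr * np.cos(tt)
    return map_coordinates(img, [ys, xs], order=3)
```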

Mapping the groove path

It's not easy to automatically follow the groove. As one would imagine, it's not a mathematically perfect spiral. Sometimes the groove disappears into darkness, or blurs into the adjacent track. But it wasn't overly tedious to draw a guiding path manually. Most of the work was just copy-pasting from a previous groove and making small adjustments.

I opened the unwrapped image in Inkscape and drew a colored polyline over all obvious grooves. I tried to make sure a polyline at the left image border would neatly continue where the previous one ended on the right side.

The grooves were alternately labeled as 'a' and 'b', since I knew this record had two different sound effects on interleaved tracks.

[Image: A zoomed-in view of the unwrapped grooves labeled and highlighted with colored lines.]

This polyline was then exported from Inkscape and loaded by a script that extracted a 3-7 pixel high column from the unwrapped original, centered around the groove, for further processing.

Pixels to audio

I had noticed another information-carrying feature besides just the transverse area of the groove: its displacement from center. The white blobs sometimes appear below or above the imaginary center line.

[Image: Parts of a few grooves shown greatly magnified. They appear either as horizontal stripes, or horizontally organized groups of distinct blobs.]

I had my script calculate the brightness mass center (weighted y average) relative to the track polyline at all x positions along the groove. This position was then directly used as a PCM sample value, and the whole groove was written to a WAV file. A noise reduction algorithm was also applied, based on sample noise from the silent end of the groove.
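In Python terms, the centroid calculation amounts to a brightness-weighted average over each pixel column. The function name here is hypothetical, and the noise reduction step is omitted:

```python
import numpy as np

def groove_to_samples(strip):
    """strip: 2-D brightness array for one groove (rows = y,
    columns = positions along the groove). The brightness-weighted
    y centroid of each column becomes one PCM sample, which is then
    centered around zero and normalized to [-1, 1]."""
    ys = np.arange(strip.shape[0])[:, None]
    weight = np.maximum(strip.sum(axis=0), 1e-9)  # avoid division by zero
    centroid = (strip * ys).sum(axis=0) / weight
    samples = centroid - centroid.mean()
    return samples / np.max(np.abs(samples))
```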

The results are much better than what I previously obtained (see video below, or mp3 here):

Future ideas

Several factors limit the fidelity and dynamic range obtained by this method. For one, the relationship between the white blobs and needle movement is not known. The results could possibly still benefit from more pixel resolution and color bit depth. The blob central displacement (insofar as it is the most useful feature) could also be more accurately obtained using a Gaussian fit or similar algorithm.

The groove guide could be drawn more carefully, as some track slips can be heard in the recovered audio.

Opening up the robot for another photograph would be risky, since I already broke a plastic tab before. But other ways to optically capture the signal would be using a USB microscope or a flatbed scanner. These methods would still be only slightly more complicated than just using a microphone! The linear light source of the scanner would possibly cause problems with the circular groove. I would imagine the problem of the disappearing grooves would still be there, unless some sort of carefully controlled lighting was used.

Virtual music box

A little music project I was writing required a melody be played on a music box. However, the paper-programmable music box I had (pictured) could only play notes on the C major scale. I couldn't easily find a realistic-sounding synthesizer version either. They all seemed to be missing something. Maybe they were too perfectly tuned? I wasn't sure.

Perhaps, if I digitized the sound myself, I could build a flexible virtual instrument to generate just the perfect sample for the piece!

[Image: A paper programmable music box.]

I haven't really made a sampled instrument before, short of perhaps using Impulse Tracker clones with terrible single-sample instruments. So I proceeded in an improvised manner. Below I'll post some interesting findings and sound samples of how the instrument developed along the way. There won't be any source code for now.

By the way, there is a great explanatory video by engineerguy about the workings of music boxes that will explain some terminology ("pins" and "teeth") used in this post.

Recording samples

[Image: A recording setup with a microphone.]

The first step was, obviously, to record the sound to be used as samples. I damped my room using towels and mattresses to minimize room echo; this could be added later if desired, but for now it would only make it harder to cleanly splice the audio. The microphone used was the Audio Technica AT2020, and I digitized the signal using the Behringer Xenyx 302 USB mixer.

I perforated a paper roll to play all the possible notes in succession, and rolled the paper through. The sound of the paper going through the mechanism posed a problem at first, but I soon learned to stop the paper at just the right moment to make way for the sound of the tooth.

Now I had pretty decent recordings of the whole two-octave range. I used Audacity to extract the notes from the recording, and named the files according to the actual playing MIDI pitch. (The music box actually plays a G# major scale, contrary to what's marked on the blank paper rolls.)

The missing notes

Next, we'll need to generate the missing notes that don't belong in the scale of this music box. Because pitch is proportional to the speed of vibration, this could be done by simply speeding up or slowing down an adjacent note by just the right factor. In equal temperament tuning, this factor would be the 12th root of 2, or roughly 1.05946. Such scaling is straightforward to do on the command line using SoX, for instance (sox c1.wav c_sharp1.wav speed 1.05946).

[Image: Musical notation explaining transposition by multiplication by the 12th root of 2.]

This method can also be used to generate whole new octaves; for example, a transposition of +8 semitones would have a ratio of (¹²√2)⁸ ≈ 1.5874. Inter-note variance could be retained by using a random source file for each resampled note. But large-interval transpositions would probably not sound very good due to coloring in the harmonic series.

Here's a table of some intervals and the corresponding speed ratios in equal temperament:

–3    (¹²√2)⁻³    ≈ 0.840896
–2    (¹²√2)⁻²    ≈ 0.890899
–1    (¹²√2)⁻¹    ≈ 0.943874
+1    (¹²√2)¹     ≈ 1.059463
+2    (¹²√2)²     ≈ 1.122462
+3    (¹²√2)³     ≈ 1.189207
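All of these come from one formula: shifting by n semitones multiplies the playback speed by 2^(n/12).

```python
def speed_ratio(semitones):
    """Equal-temperament playback-speed factor for a pitch shift
    of the given number of semitones (negative shifts slow down)."""
    return 2.0 ** (semitones / 12.0)
```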

First test!

Now I could finally write a script to play my melody!

It sounds pretty good already - there's no obvious noise and the samples line up seamlessly even though they were just naively glued together sample by sample. There's a lot of power in the lower harmonics, probably because of the big cardboard box I used, but this can easily be changed by EQ if we want to give the impression of a cute little music box.

Adding errors

The above sound still sounded quite artificial, I think mostly because simultaneous notes start on the same exact millisecond. There seems to be a small timing variance in music boxes that is an important contributor to their overall delicate sound. In the below sample I added a timing error from a normal distribution with a standard deviation of 11 milliseconds. It sounds a lot better already!
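A sketch of that jitter, with note-on times in seconds; the function name is my own:

```python
import random

def humanize(onsets, sigma=0.011, seed=None):
    """Add a normally distributed timing error (standard deviation
    11 ms) to each note-on time, never moving a note before zero."""
    rng = random.Random(seed)
    return [max(0.0, t + rng.gauss(0.0, sigma)) for t in onsets]
```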

Other sounds from the teeth

If you listen to recordings of music boxes you can occasionally hear a high-pitched screech as well. It sounds a bit like stopping a tuning fork or guitar string with a metal object. That's why I thought it must be the sound of the pin stopping a vibrating tooth just before playing another note on the same tooth.

[Image: Spectrogram of the beginning of a note with the characteristic screech, centered around 12 kilohertz.]

Sure enough, this sound could always be heard by playing the same note twice in quick succession. I recorded this sound for each tooth and added it to my sound generator. The sound will be generated only if the previous note sample is still playing, and its volume will be scaled in proportion to the tooth's envelope amplitude at that moment. The screech will also silence the previous note. The amount of silence between the screech and the next note will depend on a tempo setting.

Adding this resonance definitely brings about a more organic feel:

The wind-up mechanism

For a final touch I recorded sounds from the wind-up mechanism of another music box, since this box doesn't have one. It's all stitched up from small pieces, so the number of wind-ups in the beginning and the speed of the whirring sound can all be adjusted. I was surprised at the smoothness of the background sound; it's a three-second loop with no cross-fading involved. You can also hear the box lid being closed in the end.

Notation format
[Image: VIM screenshot of a text file containing music box markup.]

The native notation of a music box is some kind of a perforated tape or drum, so I ended up using a similar format. There's a tempo marking and tuning information in the beginning, followed by notation, one eighth note per line. Arpeggios are indicated by a pointy bracket (>). I also wrote a script to convert MIDI files into this format; but the number of notes in a music box loop is usually so small that it's not very hard to write manually.

This format could include additional information as well, perhaps controlling the motor sound or box size and shape (properties of the EQ filter).

This format could also potentially be useful when producing or transcribing music from music drums.

Future developments

Currently the music box generator has a hastily written "engineer's UI", which means I probably won't remember how to use it in a couple months' time. Perhaps it could be integrated into some music software, as a plugin.

Possibilities for live performances are limited, I think. It wouldn't work exactly like a keyboard instrument usually does. At least there should be a way to turn on the background noise, and the player should take into account the 300-millisecond delay caused by the pin slowly rotating over the tooth. But it could be used to play a roll in an endless loop and the settings could be modified on the fly.

As such, the tool performs best at pre-rendering notated music. And I'm happy with the results!