Descrambling split-band voice inversion with deinvert

Voice inversion is a primitive method of rendering speech unintelligible to prevent eavesdropping of radio or telephone calls. I wrote about some simple ways to reverse it in a previous post. I've since written a software tool, deinvert (on GitHub), that does all this for us. It can also descramble a slightly more advanced scrambling method called split-band inversion. Let's see how that happens behind the scenes.

Simple voice inversion

Voice inversion works by inverting the audio spectrum at a set maximum frequency called the inversion carrier. Frequencies near this carrier will thus become frequencies near zero Hz, and vice versa. The resulting audio is unintelligible, though familiar sentences can easily be recognized.

Deinvert comes with 8 preset carrier frequencies that can be activated with the -p option. These correspond to a list of carrier frequencies I found in an actual scrambler's manual, dubbed "the most commonly used inversion carriers".

The algorithm behind deinvert can be divided into three phases: 1) pre-filtering, 2) mixing, and 3) post-filtering. Mixing means multiplying the signal by an oscillation at the selected carrier frequency. This produces two sidebands, or mirrored copies of the signal, with the lower one frequency-inverted. Pre-filtering is necessary to prevent this lower sideband from aliasing when its highest components would go below zero Hertz. Post-filtering removes the upper sideband, leaving just the inverted audio. Both filters can be realized as low-pass FIR filters.
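In rough C++, the whole pipeline is only a few lines. This is just an illustrative sketch with a naive convolution, not deinvert's actual code:

#include <cmath>
#include <cstddef>
#include <vector>

// Naive FIR low-pass: each output sample is the dot product of the kernel
// with the preceding input samples. (Sketch only, not deinvert's real filter.)
static std::vector<float> lowpass(const std::vector<float>& in,
                                  const std::vector<float>& kernel) {
  std::vector<float> out(in.size(), 0.0f);
  for (std::size_t n = 0; n < in.size(); n++)
    for (std::size_t k = 0; k < kernel.size() && k <= n; k++)
      out[n] += kernel[k] * in[n - k];
  return out;
}

// Invert the spectrum of 'audio' below 'carrier_hz'.
std::vector<float> invert(std::vector<float> audio, float carrier_hz,
                          float samplerate,
                          const std::vector<float>& kernel) {
  // 1) Pre-filter: remove content above the carrier so the lower sideband
  //    can't alias below 0 Hz after mixing.
  audio = lowpass(audio, kernel);

  // 2) Mix: multiply by a cosine at the carrier frequency. This produces two
  //    sidebands; the lower one is the frequency-inverted audio.
  const double two_pi = 2.0 * std::acos(-1.0);
  for (std::size_t n = 0; n < audio.size(); n++)
    audio[n] *= static_cast<float>(std::cos(two_pi * carrier_hz * n / samplerate));

  // 3) Post-filter: remove the upper sideband.
  return lowpass(audio, kernel);
}

Here the kernel would be a low-pass FIR design with its cutoff near the carrier.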

[Image: A spectrogram in four steps, where the signal is first cut at 3 kHz, then shifted up, producing two sidebands, the upper of which is then filtered out.]

This operation is its own inverse, like ROT13; by applying the same inversion again we get intelligible speech back. Indeed, deinvert can also be used as a scrambler by just running unscrambled audio through it. The same inversion carrier should be used in both directions.

Split-band inversion

The split-band scrambling method adds another carrier frequency that I call the split point. It divides the spectrum into two parts that are inverted separately and then combined, preventing ordinary inverters from fully descrambling it.

A single filter-inverter pair may already bring back the low end of the spectrum. Descrambling it fully amounts to running the inversion algorithm twice, with different settings for the filters and mixer, and adding the results together.

The problem here is to find these two frequencies. But let's take a look at an example from audio scrambled using the CML CMX264 split-band inverter (from a video by GBPPR2).

[Image: A spectrogram showing a narrow band of speech-like harmonics, but with a constant dip in the middle of the band.]

In this case the filter roll-off is clearly visible in the spectrogram and it's obvious where the split point is. The higher carrier is probably at the upper limit of the full band or slightly above it. Here the full bandwidth seems to be around 3200 Hz and the split point is at 1200 Hz. This could be initially descrambled using deinvert -f 3200 -s 1200; if the result sounds shifted up or down in frequency this could be refined accordingly.

Performance

On a single core of an i7-based laptop from 2013, deinvert processes a 44.1 kHz WAV file at 60x realtime speed (120x for simple inversion). Most of the CPU cycles are spent doing filter convolution, i.e. calculating the signal's vector dot product with the low-pass filter kernels:

[Image: A graph of the time spent in various parts of the call tree of the program, with the subtree leading to the dot product operation highlighted. It takes well over 80 % of the tree.]

For this reason deinvert has a quality setting (0 to 3) for controlling the number of samples in the convolution kernels. A filter with a shorter kernel is linearly faster to compute, but has a gentler roll-off and will leave more unwanted harmonics.

A quality setting of 0 turns filtering off completely, and is very fast. For simple inversion this should be fine, as long as the original doesn't contain much power above the inversion carrier. It's easy to ignore the upper sideband because of its high frequency. In split-band descrambling this leaves some nasty folded harmonics in the speech band though.

Here's a descramble of the above CMX264 split-band audio using all the different quality settings in deinvert. You will first hear it scrambled, and then descrambled with increasing quality setting.

The default quality level is 2. This should be enough for real-time descrambling of simple inversion on a Raspberry Pi 1, still leaving cycles for an FM receiver for instance:

       Simple inversion    Split-band inversion
-q 0   16x realtime        5.8x realtime
-q 1   6.5x realtime       3.0x realtime
-q 2   2.8x realtime       1.3x realtime
-q 3   1.2x realtime       0.4x realtime

The memory footprint is less than four megabytes.

Future developments

There's a variant of split-band inversion where the inversion carrier changes constantly, called variable split-band. The transmitter informs the receiver about this sequence of frequencies via short bursts of data every couple of seconds or so. This data seems to be FSK, but it shall be left to another time.

I've also thought about ways to automatically estimate the inversion carrier frequency. Shifting speech up or down in frequency breaks the relationships of the harmonics. Perhaps this fact could be exploited to find a shift that would minimize this error?

Gramophone audio from photograph, revisited

"I am the atomic powered robot. Please give my best wishes to everybody!"

Those are the words uttered by Tommy, a childhood toy robot of mine. I've taken a look at his miniature vinyl record sound mechanism a few times before (#1, #2), in an attempt to recover the analog audio signal using only a digital camera. Results were noisy at best. The blog posts resurfaced in a recent IRC discussion which inspired me to try my luck with a slightly improved method.

Source photo

I'm using a photo of Tommy's internal miniature record I already had from previous adventures. This way, Tommy is spared from another invasive operation, though it also means I don't have control over the photographing environment.

The picture was taken with a DSLR and it's an uncompressed 8-bit color photo measuring 3000 by 3000 pixels. There's a fair amount of focus blur, chromatic aberration and similar distortions. But at this resolution, a clear pattern can be seen when zooming into the grooves.

[Image: Close-up shot of a miniature vinyl record, with a detail view of the grooves.]

This pattern superficially resembles a variable-area optical audio track seen in old film prints, and that's why I previously tried to decode it as such. But it didn't produce satisfactory results, and there is no physical reason it even should. In fact, I'm not even sure as to which physical parameter the audio is encoded in – does the needle move vertically or horizontally? How would this feature manifest itself in the photograph? Do the bright blobs represent crests in the groove, or just areas that happen to be oriented the right way in this particular lighting?

Unwrapping

To make the grooves a little easier to follow I first unwrapped the circular record into a linear image. I did this by remapping the image from polar coordinates to a 9000-pixel-wide Cartesian image and then resampling it with a windowed sinc kernel:

[Image: The photo of the circular record unwrapped into a long linear strip.]
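One way to do the remapping, shown here with OpenCV's cv::remap purely for illustration (the actual script may have worked differently; the center, radius and strip height are assumed to be known):

#include <opencv2/imgproc.hpp>
#include <cmath>

// Unwrap a circular record image into a width x height Cartesian strip:
// x = angle around the record, y = radius (outer edge at the top).
cv::Mat unwrap(const cv::Mat& src, cv::Point2f center, float radius,
               int width = 9000, int height = 800) {
  cv::Mat map_x(height, width, CV_32FC1);
  cv::Mat map_y(height, width, CV_32FC1);
  const float two_pi = 2.0f * static_cast<float>(std::acos(-1.0));

  for (int y = 0; y < height; y++) {
    // Map row y to a radius, outermost groove first.
    float r = radius * (1.0f - static_cast<float>(y) / height);
    for (int x = 0; x < width; x++) {
      float theta = two_pi * x / width;
      map_x.at<float>(y, x) = center.x + r * std::cos(theta);
      map_y.at<float>(y, x) = center.y + r * std::sin(theta);
    }
  }

  cv::Mat dst;
  // Windowed-sinc (Lanczos) resampling, as described above.
  cv::remap(src, dst, map_x, map_y, cv::INTER_LANCZOS4);
  return dst;
}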

Mapping the groove path

It's not easy to automatically follow the groove. As one would imagine, it's not a mathematically perfect spiral. Sometimes the groove disappears into darkness, or blurs into the adjacent track. But it wasn't overly tedious to draw a guiding path manually. Most of the work was just copy-pasting from a previous groove and making small adjustments.

I opened the unwrapped image in Inkscape and drew a colored polyline over all obvious grooves. I tried to make sure a polyline at the left image border would neatly continue where the previous one ended on the right side.

The grooves were alternately labeled as 'a' and 'b', since I knew this record had two different sound effects on interleaved tracks.

[Image: A zoomed-in view of the unwrapped grooves labeled and highlighted with colored lines.]

This polyline was then exported from Inkscape and loaded by a script that extracted a 3-7 pixel high column from the unwrapped original, centered around the groove, for further processing.

Pixels to audio

I had noticed another information-carrying feature besides just the transverse area of the groove: its displacement from center. The white blobs sometimes appear below or above the imaginary center line.

[Image: Parts of a few grooves shown greatly magnified. They appear either as horizontal stripes, or horizontally organized groups of distinct blobs.]

I had my script calculate the brightness mass center (weighted y average) relative to the track polyline at all x positions along the groove. This position was then directly used as a PCM sample value, and the whole groove was written to a WAV file. A noise reduction algorithm was also applied, based on a noise profile sampled from the silent end of the groove.
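In other words, each output sample is the brightness-weighted mean of the pixel rows in the strip, measured from its center line. A minimal sketch of that step (OpenCV types are used for illustration only, and the 8-bit grayscale strip layout is an assumption):

#include <opencv2/core.hpp>
#include <vector>

// 'strip' is a narrow grayscale image (3-7 rows) centered on the groove.
// Returns one sample per column: the brightness center of mass, offset
// from the strip's vertical center.
std::vector<float> grooveToSamples(const cv::Mat& strip) {
  std::vector<float> samples(strip.cols, 0.0f);
  const float mid = (strip.rows - 1) / 2.0f;

  for (int x = 0; x < strip.cols; x++) {
    float sum = 0.0f, weighted = 0.0f;
    for (int y = 0; y < strip.rows; y++) {
      float v = strip.at<unsigned char>(y, x);
      sum += v;
      weighted += v * y;
    }
    // Center of mass relative to the middle row; zero if the column is dark.
    samples[x] = (sum > 0.0f) ? (weighted / sum - mid) : 0.0f;
  }
  return samples;
}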

The results are much better than what I previously obtained (see video below, or mp3 here):

Future ideas

Several factors limit the fidelity and dynamic range obtained by this method. For one, the relationship between the white blobs and needle movement is not known. The results could possibly still benefit from more pixel resolution and color bit depth. The blob central displacement (insofar as it is the most useful feature) could also be more accurately obtained using a Gaussian fit or similar algorithm.

The groove guide could be drawn more carefully, as some track slips can be heard in the recovered audio.

Opening up the robot for another photograph would be risky, since I already broke a plastic tab before. But other ways to optically capture the signal would be using a USB microscope or a flatbed scanner. These methods would still be only slightly more complicated than just using a microphone! The linear light source of the scanner would possibly cause problems with the circular groove. I would imagine the problem of the disappearing grooves would still be there, unless some sort of carefully controlled lighting was used.

Virtual music box

A little music project I was writing required a melody be played on a music box. However, the paper-programmable music box I had (pictured) could only play notes on the C major scale. I couldn't easily find a realistic-sounding synthesizer version either. They all seemed to be missing something. Maybe they were too perfectly tuned? I wasn't sure.

Perhaps, if I digitized the sound myself, I could build a flexible virtual instrument to generate just the perfect sample for the piece!

[Image: A paper programmable music box.]

I haven't really made a sampled instrument before, apart from perhaps using Impulse Tracker clones with terrible single-sample instruments. So I proceeded in an improvised manner. Below I'll post some interesting findings and sound samples of how the instrument developed along the way. There won't be any source code for now.

By the way, there is a great explanatory video by engineerguy about the workings of music boxes that will explain some terminology ("pins" and "teeth") used in this post.

Recording samples

[Image: A recording setup with a microphone.]

The first step was, obviously, to record the sound to be used as samples. I damped my room using towels and mattresses to minimize room echo; this could be added later if desired, but for now it would only make it harder to cleanly splice the audio. The microphone used was the Audio Technica AT2020, and I digitized the signal using the Behringer Xenyx 302 USB mixer.

I perforated a paper roll to play all the possible notes in succession, and rolled the paper through. The sound of the paper going through the mechanism posed a problem at first, but I soon learned to stop the paper at just the right moment to make way for the sound of the tooth.

Now I had pretty decent recordings of the whole two-octave range. I used Audacity to extract the notes from the recording, and named the files according to the actual playing MIDI pitch. (The music box actually plays a G# major scale, contrary to what's marked on the blank paper rolls.)

The missing notes

Next, we'll need to generate the missing notes that don't belong in the scale of this music box. This could be done by simply speeding up or slowing down an adjacent note by just the right factor. In equal temperament tuning, this factor would be the 12th root of 2, or roughly 1.05946. Such scaling is straightforward to do on the command line using SoX, for instance (sox c1.wav c_sharp1.wav speed 1.05946).

[Image: Musical notation explaining transposition by multiplication by the 12th root of 2.]

This method can also be used to generate whole new octaves; for example, a transposition of +8 semitones would have a ratio of (2^(1/12))^8 ≈ 1.5874. Inter-note variance could be retained by using a random source file for each resampled note. But large-interval transpositions would probably not sound very good due to coloring in the harmonic series.

First test!

Now I could finally write a script to play my melody!

It sounds pretty good already - there's no obvious noise and the samples line up seamlessly even though they were just naively glued together sample by sample. There's a lot of power in the lower harmonics, probably because of the big cardboard box I used, but this can easily be changed by EQ if we want to give the impression of a cute little music box.

Adding errors

The above sound still sounded quite artificial, I think mostly because simultaneous notes start at the exact same millisecond. There seems to be a small timing variance in music boxes that is an important contributor to their overall delicate sound. In the below sample I added a timing error from a normal distribution with a standard deviation of 11 milliseconds. It sounds a lot better already!
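A sketch of the jitter step, assuming a hypothetical Note structure holding the nominal start times:

#include <random>
#include <vector>

struct Note {
  double start_seconds;  // nominal start time from the score
  int midi_pitch;
};

// Add a normally distributed timing error (sigma in seconds) to each note.
void humanize(std::vector<Note>& notes, double sigma = 0.011) {
  std::mt19937 rng(std::random_device{}());
  std::normal_distribution<double> jitter(0.0, sigma);
  for (Note& note : notes)
    note.start_seconds += jitter(rng);
}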

Other sounds from the teeth

If you listen to recordings of music boxes you can occasionally hear a high-pitched screech as well. It sounds a bit like stopping a tuning fork or guitar string with a metal object. That's why I thought it must be the sound of the pin stopping a vibrating tooth just before playing another note on the same tooth.

[Image: Spectrogram of the beginning of a note with the characteristic screech, centered around 12 kilohertz.]

Sure enough, this sound could always be heard by playing the same note twice in quick succession. I recorded this sound for each tooth and added it to my sound generator. The sound will be generated only if the previous note sample is still playing, and its volume will be scaled in proportion to the tooth's envelope amplitude at that moment. It will also silence the still-ringing previous note. The amount of silence between the screech and the next note will depend on a tempo setting.
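The trigger logic is roughly the following; the per-tooth bookkeeping here is hypothetical, not the generator's actual data structures:

// Hypothetical per-tooth state in the generator.
struct Tooth {
  bool ringing = false;    // is the previous note still audible?
  float envelope = 0.0f;   // current amplitude of the decaying note
};

// Called when a new note is scheduled for this tooth.
// Returns the volume of the screech sample to mix in, or 0 for none.
float screechVolume(Tooth& tooth) {
  if (!tooth.ringing)
    return 0.0f;
  // Scale the screech by how loudly the tooth is still vibrating,
  // and silence the old note (the pin damps it before plucking again).
  float volume = tooth.envelope;
  tooth.ringing = false;
  tooth.envelope = 0.0f;
  return volume;
}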

Adding this resonance definitely brings about a more organic feel:

The wind-up mechanism

For a final touch I recorded sounds from the wind-up mechanism of another music box, even though this one didn't have one. It's all stitched up from small pieces, so the number of wind-ups in the beginning and the speed of the whirring sound can all be adjusted. I was surprised at the smoothness of the background sound; it's a three-second loop with no cross-fading involved. You can also hear the box lid being closed in the end.

Notation

[Image: VIM screenshot of a text file containing music box markup.]

The native notation of a music box is some kind of a perforated tape or drum, so I ended up using a similar format. There's a tempo marking and tuning information in the beginning, followed by the notation, one eighth note per line. Arpeggios are indicated by a pointy bracket (>).

This format could include additional information as well, perhaps controlling the motor sound or box size and shape (properties of the EQ filter).

This format could also potentially be useful when producing or transcribing music from music box drums.

There could also be a tool to convert MIDI files into this format. But the number of notes in a music box loop is usually so small that it's not very hard to write manually.


Future developments

Currently the music box generator has a hastily written "engineer's UI", which means I probably won't remember how to use it in a couple of months' time. Perhaps it could be integrated into some music software as a plugin.

Possibilities for live performances are limited, I think. It wouldn't work exactly like a keyboard instrument usually does. At least there should be a way to turn on the background noise, and the player should take into account the 300-millisecond delay caused by the pin slowly rotating over the tooth. But it could be used to play a roll in an endless loop and the settings could be modified on the fly.

As such, the tool performs best at pre-rendering notated music. And I'm happy with the results!

CTCSS fingerprinting: a method for transmitter identification

Identifying unknown radio transmitters by their signals is called radio fingerprinting. It is usually based on rise-time signatures, i.e. characteristic differences in how the transmitter frequency fluctuates at carrier power-up. Here, instead, I investigate the fingerprintability of another feature in hand-held FM transceivers, known as CTCSS or Continuous Tone-Coded Squelch System.

Motivation & data

I came across a long, losslessly compressed recording of some walkie-talkie chatter and wanted to know more about it, things like the number of participants and who's talking with who. I started writing a transcript – a fun pastime – but some voices sounded so similar I wondered if there was a way to tell them apart automatically.

[Image: Screenshot of Audacity showing an audio file over eleven hours long.]

The file comprises several thousand short transmissions as FM demodulated audio lowpass filtered at 4500 Hz. Signal quality is variable; most transmissions are crisp and clear but some are buried under noise. Passages with no signal are squelched to zero.

I considered several potentially fingerprintable features, many of them unrealistic:

  • Carrier power-up; but many transmissions were missing the very beginning because of squelch
  • Voice identification; but it would probably require pretty sophisticated algorithms (too difficult!) and longer samples
  • Mean audio power; but it's not consistent enough, as it depends on text, tone of voice, etc.
  • Maximum audio power; but it's too sensitive to peaks in FM noise

I then noticed all transmissions had a very low tone at 88.5 Hz. It turned out to be CTCSS, an inaudible signal that enables handsets to silence unwanted transmissions on the same channel. This gave me an idea inspired by mains frequency analysis: Could this tone be measured to reveal minute differences in crystal frequencies and modulation depths? Also, knowing that these were recorded using a cheap DVB-T USB stick – would it have a stable enough oscillator to produce consistent measurements?

Measurements

I used the liquid-dsp library for signal processing. It has several methods for measuring frequencies. I decided to use a phase-locked loop, or PLL; I could have also used FFT with peak interpolation.

In my fingerprinting tool, the recording is first split into single transmissions. The CTCSS tone is bandpass filtered and a PLL starts tracking it. When the PLL frequency stops fluctuating, i.e. the standard deviation is small enough, it's considered locked and its frequency is averaged over this time. The average RMS power is measured similarly.
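The actual tool uses liquid-dsp, but the idea can be sketched in plain C++: a small second-order PLL tracks the bandpass-filtered tone, and once a short window of its frequency estimates has a small enough standard deviation, the lock is declared and frequency and power are averaged. The loop gains and thresholds below are ad hoc guesses, not the values used in the tool:

#include <cmath>
#include <cstddef>
#include <deque>
#include <numeric>
#include <vector>

struct ToneMeasurement {
  double frequency_hz = 0.0;  // average tone frequency while locked
  double rms_power = 0.0;     // average RMS power while locked
  bool locked = false;
};

// 'tone' is one bandpass-filtered transmission, 'fs' the sample rate.
ToneMeasurement measureCtcss(const std::vector<float>& tone, double fs) {
  const double two_pi = 2.0 * std::acos(-1.0);
  double phase = 0.0;
  double freq = two_pi * 88.5 / fs;          // initial guess, rad/sample
  const double alpha = 0.05, beta = 2e-4;    // ad hoc loop gains
  const std::size_t window = static_cast<std::size_t>(fs / 4);  // 250 ms

  std::deque<double> recent;                 // recent frequency estimates, Hz
  double freq_sum = 0.0, power_sum = 0.0;
  std::size_t n_locked = 0;
  ToneMeasurement m;

  for (float x : tone) {
    // Phase detector: input times the NCO's quadrature component,
    // followed by a second-order loop update.
    double err = x * -std::sin(phase);
    freq += beta * err;
    phase += freq + alpha * err;
    if (phase > two_pi) phase -= two_pi;

    double hz = freq * fs / two_pi;
    recent.push_back(hz);
    if (recent.size() > window) recent.pop_front();

    // Standard deviation of the recent estimates (recomputed naively
    // every sample; fine for a sketch).
    double mean = std::accumulate(recent.begin(), recent.end(), 0.0) / recent.size();
    double var = 0.0;
    for (double v : recent) var += (v - mean) * (v - mean);
    double stdev = std::sqrt(var / recent.size());

    if (recent.size() == window && stdev < 0.05) {  // "PLL locked"
      m.locked = true;
      freq_sum += hz;
      power_sum += static_cast<double>(x) * x;
      n_locked++;
    }
  }
  if (n_locked > 0) {
    m.frequency_hz = freq_sum / n_locked;
    m.rms_power = std::sqrt(power_sum / n_locked);
  }
  return m;
}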

Here's one such transmission:

[Image: A graph showing frequency and power, first fluctuating but then both stabilize for a moment, where text says 'PLL locked'. Caption says 'No, I did not copy'.]

Results

When all transmissions are plotted according to their CTCSS power and frequency relative to 88.5 Hz, we get this:

[Image: A plot of RMS power versus frequency, with dots scattered all over, but mostly concentrated in a few clusters.]

At least three clusters are clearly distinguishable by eye. Zooming in to one of the clusters reveals it's made up of several smaller clusters. Perhaps the larger clusters correspond to three different models of radios in use, and these smaller ones are the individual transmitters?

A heat map reveals even more structure:

[Image: The same clusters presented in a gradual color scheme and numbered from 1 to 12.]

It seems at least 12 clusters, i.e. potential individual transmitters, can be distinguished.

Even though most transmissions are part of some cluster, there are many outliers as well. These appear to correspond to very noisy or very short transmissions. (Could the FFT have produced better results with these?)

Use as transcription aid

My goal was to make these fingerprints useful as labels aiding transcription. This way, a human operator could easily distinguish parties of a conversation and add names or call signs accordingly.

I experimented with automated k-means clustering, but that didn't immediately produce appealing results. Then I manually assigned 12 anchor points at apparent cluster centers and had a script calculate the nearest anchor point for all transmissions. Prior to distance calculations the axes were scaled so that the data seemed uniformly distributed around these points.
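The labeling step itself is just a nearest-neighbour search over the 12 anchors; a sketch, with the axis scale factors as placeholder parameters:

#include <cstddef>
#include <vector>

struct Point { double freq_offset_hz; double rms_power; };

// Return the index of the nearest anchor point after per-axis scaling.
// freq_scale and power_scale stretch the axes so the clusters look roughly
// round; in practice they'd be chosen by eye from the scatter plot.
std::size_t nearestAnchor(const Point& p, const std::vector<Point>& anchors,
                          double freq_scale, double power_scale) {
  std::size_t best = 0;
  double best_d2 = 1e300;
  for (std::size_t i = 0; i < anchors.size(); i++) {
    double df = (p.freq_offset_hz - anchors[i].freq_offset_hz) * freq_scale;
    double dp = (p.rms_power - anchors[i].rms_power) * power_scale;
    double d2 = df * df + dp * dp;
    if (d2 < best_d2) { best_d2 = d2; best = i; }
  }
  return best;
}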

This automatic labeling proved quite sensitive to errors. It could be useful when listing possible transmitters for an unknown transmission with no context; distances to previous transmissions positively mentioning call signs could be used. Instead I ended up printing the raw coordinates and colouring them with a continuous RGB scale:

[Image: A few lines from a conversation between Boa 1 and Cobra 1. Numbers in different colors are printed in front of each line.]

Here the colours make it obvious which party is talking. Call signs written in a darker shade are deduced from the context. One sentence, most probably by "Cobra 1", gets lost in noise and the RMS power measurement becomes inaccurate (463e-6). The PLL frequency is still consistent with the conversation flow, though.

Countermeasures

If CTCSS is not absolutely required in your network, i.e. there are no unwanted conversations on the frequency, then it can be disabled to prevent this type of fingerprinting. In Motorola radios this is done by setting the CTCSS code to 0. (In the menus it may also be called a PT code or Interference Eliminator code.) In many other consumer radios it doesn't seem to be that easy.

Conclusions

CTCSS is a suitable signal for fingerprinting transmitters, reflecting minute differences in crystal frequencies and, possibly, FM modulation indices. Even a cheap receiver can recover these differences. It can be used when the signal is already FM demodulated or otherwise not suitable for more traditional rise-time fingerprinting.

Redsea 0.7, a lightweight RDS decoder

I've written about redsea, my RDS decoder project, many times before. It has changed a lot lately; it even has a version number, 0.7.6 as of this writing. What follows is a summary of its current state and possible future developments.

Input formats

Redsea can decode several types of data streams. The command-line switches to activate these can be found in the readme.

Its main use, perhaps, is to demodulate an FM multiplex carrier, as received using a cheap rtl-sdr radio dongle and demodulated using rtl_fm. The multiplex is an FM demodulated signal sampled at 171 kHz, a convenient multiple of the RDS data rate (1187.5 bps) and the subcarrier frequency (57 kHz). There's a convenience shell script that starts both redsea and the rtl_fm receiver. For example, ./rtl-rx.sh -f 88.0M would start reception on 88.0 MHz.

It can also decode an "ASCII binary" stream (--input-ascii):

0001100100111001000101110000101110011000010010110010011001000000100001
1010010000011010110100010000000100000001101110000100010111000010111001
1001000010110000111111011101101011001010101110100011111101000011100010
100000011010010001011100001

Or hex-encoded RDS groups one per line (--input-hex), which is the format used by RDS Spy:

6201 01D8 E704 594C
6201 01D9 2217 4520
6201 E1C1 594C 6202
6201 01DA 1139 594B
6201 21DC 2020 2020

Output formats

The default output has changed drastically. There used to be no strict format to it, rather it was just a human-readable terminal display. This sort of output format will probably return at some point, as an option. But currently redsea outputs line-delimited JSON, where every group is a JSON object on a separate line. It is quite verbose but machine readable and well-suited for post-processing:

{"pi":"0x6201","group":"0A","tp":false,"prog_type":"Serious classical","ta":tru
e,"is_music":true,"alt_freqs":[87.9,88.5,89.2,89.5,89.8,90.9,93.2],"ps":"YLE YK
SI"}
{"pi":"0x6201","group":"14A","tp":false,"prog_type":"Serious classical","other_
network":{"pi":"0x6205","tp":false,"has_linkage":false}}
{"pi":"0x6201","group":"0A","tp":false,"prog_type":"Serious classical","ta":tru
e,"is_music":true,"partial_ps":"YL      "}
{"pi":"0x6201","group":"2A","tp":false,"prog_type":"Serious classical","partial
_radiotext":"Yöklassinen."}
{"pi":"0x6201","group":"0A","tp":false,"prog_type":"Serious classical","ta":tru
e,"is_music":true,"partial_ps":"YLE     "}
{"pi":"0x6201","group":"0A","tp":false,"prog_type":"Serious classical","ta":tru
e,"is_music":true,"partial_ps":"YLE YK  "}
{"pi":"0x6201","group":"2A","tp":false,"prog_type":"Serious classical","partial
_radiotext":"Yöklassinen."}
{"pi":"0x6201","group":"0A","tp":false,"prog_type":"Serious classical","ta":tru
e,"is_music":true,"alt_freqs":[87.9,88.5,89.2,89.5,89.8,90.9,93.2],"ps":"YLE YK
SI"}

Someone on GitHub hinted about jq, a command-line tool that can color and filter JSON, among other things:

> ./rtl-rx.sh -f 87.9M | jq -c
{"pi":"0x6201","group":"0A","tp":false,"prog_type":"Serious classical","ta":true,"is_music":true,"partial_ps":"YL      "}
{"pi":"0x6201","group":"14A","tp":false,"prog_type":"Serious classical","other_network":{"pi":"0x6202","tp":false}}
{"pi":"0x6201","group":"0A","tp":false,"prog_type":"Serious classical","ta":true,"is_music":true,"partial_ps":"YLE     "}
{"pi":"0x6201","group":"0A","tp":false,"prog_type":"Serious classical","ta":true,"is_music":true,"partial_ps":"YLE YK  "}
{"pi":"0x6201","group":"1A","tp":false,"prog_type":"Serious classical","prog_item_started":{"day":9,"time":"23:10"},"has_linkage":false}
^C

> ./rtl-rx.sh -f 87.9M | grep "\"radiotext\"" | jq ".radiotext"
"Yöklassinen."
"Yöklassinen."
"Yöklassinen."
"Yöklassinen."
"Yöklassinen."
"Yöklassinen."
"Yöklassinen."

The output can be timestamped using the ts utility from moreutils.

Additionally, redsea can output hex-encoded groups in the same format mentioned above.

Fast and lightweight

I've made an effort to make redsea fast and lightweight, so that it could be run real-time on cheap single-board computers like the Raspberry Pi 1. I rewrote it in C++ and chose liquid-dsp as the DSP library, which seems to work very well for the purpose.

Redsea now uses around 40% CPU on the Pi 1. Enough cycles will be left for the FM receiver, rtl_fm, which has a similar CPU demand. On my laptop, redsea has negligible CPU usage (0.9% of a single core). Redsea only runs a single thread and takes up 1500 kilobytes of memory.

Sensitivity

I've gotten several reports that redsea requires a stronger signal than other RDS decoders. This has been improved in recent versions, but I think it still has problems even with many local stations.

Let's examine how a couple of test signals go through the demodulator in Subcarrier::demodulateMoreBits() and list possible problems. The test signals shall be called the good one (green) and the noisy one (red). They were recorded on different channels using different antenna setups. Here are their average demodulated power spectra:

[Image: Spectrum plots of the two signals superimposed.]

The noise floor around the RDS subcarrier is roughly 23 dB higher in the noisy signal. Redsea recovers 99.9 % of transmitted blocks from the good signal and 60.1 % from the noisy one.

Below, redsea locks onto our good-quality signal. Time is in seconds.

[Image: A graph of several signal properties against time.]

Out of the noisy signal, redsea could recover a majority of blocks as well, even though the PLL and constellations are all over the place:

[Image: A graph of several signal properties against time.]

1) PLL

There's some jitter in the 57 kHz PLL, especially pronounced when the signal is noisy. One would expect a PLL to slowly converge on a frequency, but instead it just fluctuates around it. The PLL is from the liquid-dsp library (internal PLL of the NCO object).

  • Is this an issue?
  • What could affect this? Loop filter bandwidth?
  • What about the gain, i.e. the multiplier applied to the phase error?

2) Symbol synchronizer

  • Is liquid's symbol synchronizer being used correctly?
  • What should be the correct values for bandwidth, delay, excess bandwidth factor?
  • Do we really need a separate PLL and symbol synchronizer? Couldn't they be combined somehow? After all, the PLL already gives us a multiple of the symbol speed (57,000 / 48 = 1187.5).

3) Pilot tone

The PLL could potentially be made to lock onto the pilot tone instead. It would yield a much higher SNR.

  • According to the specs, the RDS subcarrier is phase-locked to the pilot, but can we trust this? Also, the phase difference is not defined in the standard.
  • What about mono stations with no pilot tone?
  • Perhaps a command-line option?

4) rtl_fm

  • Are the parameters for rtl_fm (gain, filter) optimal?
  • Is there a poor-quality resampling phase somewhere, such as the one mentioned in the rtl_fm guide? Probably not, since we don't specify -r
  • Is the bandwidth (171 kHz) right?

Other features (perhaps you can help!)

Besides the basic RDS features (program service name, radiotext, etc.) redsea can decode some Open Data applications as well. It receives traffic messages from the TMC service and prints them in English. These are partially encrypted in some areas. It can also decode RadioText+, a service used in some parts of Germany to transmit such information as artist/title tags, studio hotline numbers and web links.

If there's an interesting service in your area you'd like redsea to support, please tell me! I've heard eRT (Enhanced RadioText) being in use somewhere in the world, and RASANT is used to send DGPS corrections in Germany, but I haven't seen any good data on those.

A minute or two of example data would be helpful; you can get hex output by adding the -x switch to the redsea command in rtl-rx.sh.

Barcode recovery using a priori constraints

[Image: A hand holding a Finnish driver's license card with blurred details and the text 'specimen' across it.]

Barcodes can be quite resilient to redaction. Not only is the pattern a strong visual signal, but so is the encoded string that often has a rigidly defined structure. Here I present a method for recovering the data from a blurred, pixelated, or even partially covered barcode using prior knowledge of this higher-layer structure. This goes beyond so-called "deblurring" or blind deconvolution in that it can be applied to non-convolutional distortions as well.

It has also been a fun exercise in OpenCV matrix operations.

As example data, specimen pictures of Finnish driver's licenses shall be used. The card contains a Code 39 barcode encoding the cardholder's national identification number. This is a fixed-length string with well-defined structure and rudimentary error-detection, so it fits our purpose well. High-resolution samples with fictional data are available at government websites. Redacted and low-quality pictures of real cards are also widely available online, from social media sites to illustrations for news stories.

In Finland, knowledge of a name and this code often suffices as proof of identity on the phone, yet nothing on the card indicates that the barcode contains sensitive information. Consequently, it's not hard to find pictures of cards with the barcode completely untouched either, even if all the other information has been carefully removed.

All cards and codes used in this post are simulated.

Image rectification

We'll start by aligning the barcode with the pixel grid and moving it into a known position. Its vertical position on the driver's license is pretty standard, so finding the card's corners and doing a reverse perspective projection should do the job.

Finding the blue EU flag seemed like a good starting point for automating the transform. However, JPEG is quite harsh on high-contrast edges and extrapolating the card boundary from the flag corners wasn't too reliable. A simpler solution is to use manual adjustments: an image viewer is opened and clicking on the image moves the corners of a quadrilateral on top of the image. cv::findHomography() and cv::warpPerspective() are then used to map this quadrilateral to an 857×400 rectangular image, giving us a rectified image of the card.

[Image: The card cut out from the previous image and warped to cancel the effects of perspective.]

Reduction & filtering

The bottom 60 pixel rows, now containing our barcode of interest, are then reduced to a single 1D column sum signal using cv::reduce(). In this waveform, wide bars (black) will appear as valleys and wide spaces (white) as peaks.

In Code 39, all characters are of equal width and consist of nine elements, three of which are wide (hence the name). Only the positions of the wide elements need to be determined to be able to decode the characters. A 15-pixel convolution kernel – cv::GaussianBlur() – is applied to smooth out any narrow lines.
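These two steps map directly to OpenCV calls. A sketch, with the 60-row crop and the 15-pixel kernel taken from the text and everything else assumed:

#include <opencv2/imgproc.hpp>

// Collapse the barcode area of the rectified 857x400 grayscale card image
// into a smoothed 1D waveform: bars appear as valleys, spaces as peaks.
cv::Mat barcodeWaveform(const cv::Mat& card_gray) {
  // The bottom 60 rows contain the barcode.
  cv::Mat band = card_gray(cv::Range(card_gray.rows - 60, card_gray.rows),
                           cv::Range::all());

  // Column sums, collapsing the rows into a single row (CV_32F to avoid
  // 8-bit overflow).
  cv::Mat wave;
  cv::reduce(band, wave, 0, cv::REDUCE_SUM, CV_32F);

  // 15-pixel Gaussian blur smooths out the narrow elements, leaving only
  // the wide bars and spaces clearly visible.
  cv::GaussianBlur(wave, wave, cv::Size(15, 1), 0);
  return wave;
}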

[Image: A blurred barcode on top of a graph depicting its gray level fluctuations.]

A rectangular kernel matched to the bar width would possibly be a better choice, but the exact bar width is unknown at this point.

Constraints

The format of the driver's license barcode will always be *DDMMYY-NNNC*, where

  • The asterisks * are start and stop characters in Code 39
  • DDMMYY is the cardholder's date of birth
  • NNN is a number from 001 to 899, its least significant bit denoting gender
  • C is a modulo-31 checksum character; Code 39 doesn't provide its own checksum

These constraints will be used to limit the search space at each string position. For example, at positions 0 and 12, the asterisk is the only allowed character, whereas in position 1 we can have either the number 0, 1, 2, or 3 as part of a day of month.

If text on the card is readable then the corresponding barcode characters can be marked as already solved by narrowing the search space to a single character.

Decoding characters

It's a learning adventure so the decoder is implemented as a type of matched filter bank using matrix operations. Perhaps it could be GPU-friendly, too.

Each row of the filter matrix represents an expected 1D convolution output of one character, addressed by its ASCII code. A row is generated by creating an all-zeroes vector with just the peak elements set to plus/minus unity. These rows are then convolved with a horizontal Lanczos kernel.

[Image: A matrix about 40 lines high and 40 columns wide. Each cell has a gray level. The cells of each line form parts of blurred barcodes. Lines are marked with numeric codes and ASCII characters.]

The exact positions of the peaks depend on the barcode's wide-to-narrow ratio, as Code 39 allows anything from 2:1 to 3:1. Experiments have shown it to be 2.75:1 in most of these cards.

The 1D wave in the previous section is divided into character-length pieces which are then multiplied per-element by the above matrix using cv::Mat::mul(). The result is reduced to a row sum vector.

This vector now contains a "score", a kind of matched filter output, for each character in the search space. The best matching character is the one with the highest score; this maximum is found using cv::minMaxLoc(). Constraints are passed to the command as a binary mask matrix.
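Scoring one character slice against the whole bank then looks roughly like the sketch below; the matrix names and sizes are assumptions, and the slice is expected to have been replicated to the bank's height (for example with cv::repeat()) beforehand:

#include <opencv2/core.hpp>

// bank:  one expected 1D waveform per row, addressed by ASCII code
// slice: the character-length piece of the measured waveform, replicated
//        to the same number of rows as 'bank'
// mask:  CV_8U column vector marking the allowed characters at this position
char bestCharacter(const cv::Mat& bank, const cv::Mat& slice,
                   const cv::Mat& mask, double* score) {
  // Per-element product, then row sums: one matched-filter score per character.
  cv::Mat scores;
  cv::reduce(bank.mul(slice), scores, 1, cv::REDUCE_SUM);

  // Pick the best-scoring character among those allowed by the constraints.
  double min_val;
  cv::Point min_loc, max_loc;
  cv::minMaxLoc(scores, &min_val, score, &min_loc, &max_loc, mask);
  return static_cast<char>(max_loc.y);  // row index = ASCII code
}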

Barcode alignment and length

To determine the left and right boundaries of the barcode, an exhaustive search is run through the whole 1D signal (around 800 milliseconds). On each iteration the total score is calculated as a sum of character scores, and the alignment with the best total score is returned. This also readily gives us the best decoded string.

[Image: A graph depicting gray level variations. On top of that, decoded characters with associated floating point values. The code says *010185-710P*.]

We can also enable checksum calculation and look for the best string with a valid checksum. This allows for errors elsewhere in the code.

Results

The barcodes in these images were fully recovered using the method presented above:

[Image: Three barcodes each distorted in a different way - pixelated, blurred, or smudged.]

It might be possible to further develop the method to recover even more blurred images. Possible improvements could include fine-tuning the Lanczos kernel used to generate the filter bank, or coming up with a better way to score the matches.

Countermeasures

The best way to redact a barcode seems to be to draw a solid rectangle over it, preferably even slightly bigger than the barcode itself, and make sure it really gets rendered into the bitmap.

Printing an unlabeled barcode with sensitive data seems like a bad idea to begin with, but of course there could be a logical reason behind it.

Pea whistle steganography

[Image: Acme Thunderer 60.5 whistle]

Would anyone notice if a referee's whistle transmitted a secret data burst?

I do really follow the game. But every time the pea whistle sounds to start the jam I can't help but think of the possibility of embedding data in the frequency fluctuation. I'm sure it's alternating between two distinct frequencies. Is it really that binary? How random is the fluctuation? Could it be synthesized to contain data, and could that be read back?

I found a staggeringly detailed Wikipedia article about the physics of whistles – but not a single word there about the effects of adding a pea inside, which is obviously the cause of the frequency modulation.

To investigate this I bought a metallic pea whistle, the Acme Thunderer 60.5, pictured here. Recording its sound wasn't straightforward, as the laptop microphone couldn't capture it without clipping. The sound is incredibly loud indeed – I borrowed a sound pressure meter and it showed a peak level of 106.3 dB(A) at a distance of 70 cm, which translates to 103 dB at the standard 1 m distance. (For some reason I suddenly didn't want to make another measurement to get the distance right.)

[Image: Display of a sound pressure meter showing 106.3 dB max.]

Later I found a microphone that was happy about the decibels and got this spectrogram of a 500-millisecond whistle.

[Image: Spectrogram showing a tone with frequency shifts.]

The whistle seems to contain a sliding beginning phase, a long steady phase with frequency shifts, and a short sliding end phase. The "tail" after the end slide is just a room reverb and I'm not going to need it just yet. A slight amplitude modulation can be seen in the oscillogram. There's also noise on somewhat narrow bands around the harmonics.

The FM content is most clearly visible in the second and third harmonics. And it seems like it could very well fit FSK data!

Making it sound right

I'm no expert on synthesizers, so I decided to write everything from scratch (whistle-encode.pl). But I know the start phase of a sound, called the attack, is pretty important in identification. It's simple to write the rest of the fundamental tone as a simple FSK modulator; at every sample point, a data-dependent increment is added to a phase accumulator, and the signal is the cosine of the accumulator. I used a low-pass IIR filter before frequency modulation to make the transitions smoother and more "natural".

Adding the harmonics is just a matter of measuring their relative powers from the spectrogram, multiplying the fundamental phase angle by the index of the harmonic, and then multiplying the cosine of that phase angle by the relative power of that harmonic. SoX takes care of the WAV headers.
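The synthesis core, stripped of the attack, smoothing and noise, could be sketched like this (whistle-encode.pl is Perl; this is an illustrative C++ rendering, and the two frequencies and harmonic powers are made-up placeholders):

#include <cmath>
#include <cstddef>
#include <vector>

// Generate an FSK "whistle": the fundamental alternates between two
// frequencies according to the bits, and harmonics are stacked on top.
std::vector<float> synthesize(const std::vector<int>& bits, double fs = 44100,
                              double f0 = 2900, double f1 = 3050,
                              double bitrate = 100) {
  const double two_pi = 2.0 * std::acos(-1.0);
  // Relative powers of the fundamental and a few harmonics (made up here;
  // in practice they'd be measured from the spectrogram).
  const double harmonics[] = {1.0, 0.4, 0.25, 0.1};

  const std::size_t samples_per_bit = static_cast<std::size_t>(fs / bitrate);
  std::vector<float> out;
  out.reserve(bits.size() * samples_per_bit);

  double phase = 0.0;  // phase accumulator for the fundamental
  for (int bit : bits) {
    double freq = bit ? f1 : f0;
    for (std::size_t n = 0; n < samples_per_bit; n++) {
      phase += two_pi * freq / fs;   // data-dependent increment
      double sample = 0.0;
      for (int h = 0; h < 4; h++)    // fundamental plus harmonics
        sample += harmonics[h] * std::cos(phase * (h + 1));
      out.push_back(static_cast<float>(0.2 * sample));
    }
  }
  return out;
}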

Getting the noise to sound right was trickier. I ended up generating white noise (a simple rand()), lowpass filtering it, and then mixing a copy of it around every harmonic frequency. I gave the noise harmonics a different set of relative powers than for the cosine harmonics. It still sounds a bit too much like digital quantization noise.

Embedding data

There's a limit to the number of bits that can be sent before the result starts to sound unnatural; nobody has lungs that big. A data rate of 100 bps sounded similar to the Acme Thunderer, which is still quite a lot. I preceded the burst with two bytes for bit and byte sync (0xAA 0xA7), and one byte for the packet size.

Here's "OHAI!":

Sounds legit to me! Here's a slightly longer one, encoding "Help me, I'm stuck inside a pea whistle":

Homework

  1. Write a receiver for the data. It should be as simple as receiving FSK. The frequency can be determined using atan2, a zero-crossing detector, or FFT, for instance; a minimal zero-crossing sketch follows after this list. The synchronization bytes are meant to help decode such a short burst; the alternating 0s and 1s of 0xAA probably give us enough transitions to get a bit lock, and the 0xA7 serves as a recognizable pattern to lock the byte boundaries on.
  2. Build a physical whistle that does this!
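As a starting point for homework item 1, here's a minimal zero-crossing frequency detector sketch (everything downstream of it, like bit slicing and sync search, is left out):

#include <cstddef>
#include <vector>

// Estimate the instantaneous frequency of the (bandpass-filtered)
// fundamental from the spacing of upward zero crossings.
std::vector<double> zeroCrossingFrequency(const std::vector<float>& x,
                                          double samplerate) {
  std::vector<double> freqs;
  std::size_t last_crossing = 0;
  for (std::size_t n = 1; n < x.size(); n++) {
    if (x[n - 1] < 0.0f && x[n] >= 0.0f) {      // upward crossing
      if (last_crossing > 0)
        freqs.push_back(samplerate / (n - last_crossing));
      last_crossing = n;
    }
  }
  // Slicing this at the bit rate and thresholding between the two
  // frequencies would give the bits; then look for 0xAA 0xA7.
  return freqs;
}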