Apr 14, 2015

Trackers and bank accounts

A Finnish online bank used to include a third-party analytics and tracking script in all of its pages. Ospi first wrote about it (in Finnish) in February 2015, and this caused a bit of a fuss.

The bank responded to users' worries by claiming that all information is collected anonymously:

But is it true?

As Ospi notes, a plethora of information is sent along with the HTTP request for the tracker script. This includes, of course, the IP address of the user, but also the full URL the user is browsing. The bank's URLs reveal quite a bit about what the user is doing (e.g. continuousSavingsContractStep1.do).

I logged in to the bank using well-known test credentials to record one such tracking request. The URL sent to the third-party tracker contains a cleartext transaction archive code that could easily be used to match a transaction between two bank accounts, since it's identical for both users. But there's also a hex string called "accountId" (highlighted in red).

Remote Address: 80.***.***.***:443
Request URL:    https://www.google-analytics.com/collect?v=1&_v=j33&a=870588619&t
                =pageview&_s=1&dl=https%3A%2F%2Fonline.********.fi%2Febank%2Facco
                unt%2FinitTransactionDetails.do%3FbackLink%3Dreset%26accountId%3D
                69af881eca98b7042f18e975e00f9d49d5d5ee64%26rowNo%3D0%26type%3Dtra
                ns%26archivecode%3D20150220123456780002&ul=en-us&de=windows-1252&
                dt=Tilit%C2%A0%7C%C2%A0Verkkopankki%20%7C%20S-Pankki&sd=24-bit&sr
                =1440x900&vp=1440x150&je=1&fl=16.0%20r0&_u=QACAAQQBI~&jid=&cid=18
                39557247.1424801770&uid=&tid=UA-37407484-1&cd1=&cd2=demo_accounts
                &cd3=%2Ffi%2F&z=2098846672
Request Method: GET
Status Code:    200 OK

It's 40 hex characters long, which is 160 bits. This happens to be the length of an SHA-1 hash.
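To make the leak concrete, here is a minimal Python sketch that pulls the page URL out of the tracker request and lists its parameters. The URL in the code is an abbreviated copy of the recorded request (only the v, t, dl and tid parameters are kept), and the function name is my own choice for illustration.

```python
from urllib.parse import parse_qs, urlsplit

# Abbreviated copy of the recorded request; most parameters are left out.
TRACKER_URL = (
    "https://www.google-analytics.com/collect?v=1&t=pageview"
    "&dl=https%3A%2F%2Fonline.********.fi%2Febank%2Faccount%2F"
    "initTransactionDetails.do%3FbackLink%3Dreset%26accountId%3D"
    "69af881eca98b7042f18e975e00f9d49d5d5ee64%26rowNo%3D0%26type%3Dtrans"
    "%26archivecode%3D20150220123456780002&tid=UA-37407484-1"
)

def leaked_page_params(tracker_url):
    """Return the query parameters of the page URL carried in 'dl'."""
    outer = parse_qs(urlsplit(tracker_url).query)
    page_url = outer["dl"][0]  # parse_qs has already percent-decoded it
    return parse_qs(urlsplit(page_url).query)

params = leaked_page_params(TRACKER_URL)
account_id = params["accountId"][0]
print(account_id, len(account_id) * 4, "bits")
print(params["archivecode"][0])
```

Everything the bank puts in its own query string ends up readable by whoever receives the dl parameter.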

Could it really be a simple hash of the user's bank account number? Let's try. The test account's IBAN code is FI96 3939 0001 0006 03, but this doesn't give us the above hash. However, if we remove the country code, the IBAN checksum, and all whitespace, it turns out we have a match!

~ » echo -n "FI96 3939 0001 0006 03" | shasum
dcf04c4fd3b6e29b4b43a8bf43c2713ac9be1de2  -
~ » echo -n "FI9639390001000603" | shasum
3e3658e4c2802dd5c21b1c6c1ed55fc1f39c8830  -
~ » echo -n "39390001000603" | shasum
69af881eca98b7042f18e975e00f9d49d5d5ee64  -
~ » 

Hashes for bank account numbers are easy to brute-force, especially if the bank is already known. I wrote the following C program that reversed the above hash to the correct account number in 0.5 seconds.

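#include <openssl/sha.h>
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
 
int main(void) {
  /* SHA-1 digest taken from the accountId parameter */
  const unsigned char target[SHA_DIGEST_LENGTH] = {
    0x69, 0xaf, 0x88, 0x1e, 0xca, 0x98, 0xb7, 0x04, 0x2f, 0x18,
    0xe9, 0x75, 0xe0, 0x0f, 0x9d, 0x49, 0xd5, 0xd5, 0xee, 0x64
  };
  char test_str[15];  /* 14 digits plus the terminating NUL */
  unsigned char test_hash[SHA_DIGEST_LENGTH];
 
  /* Try every 14-digit number that starts with the known prefix 3939.
     "tili" is Finnish for "account". */
  for (int k = 0; k <= 9999; k++) {
    for (int tili = 0; tili <= 999999; tili++) {
      snprintf(test_str, sizeof test_str, "3939%04d%06d", k, tili);
      SHA1((const unsigned char *)test_str, 14, test_hash);
      if (memcmp(test_hash, target, SHA_DIGEST_LENGTH) == 0) {
        printf("found %s\n", test_str);
        return 0;
      }
    }
  }
  return 1;  /* no match in the search space */
}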
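Stripping the country code and checksum before hashing hides nothing either, because the two check digits are computed from the account number itself. A minimal Python sketch of the standard IBAN mod-97 calculation follows; the function name is hypothetical.

```python
def finnish_iban(bban):
    """Rebuild a Finnish IBAN from its 14-digit national account number."""
    # Standard IBAN check digits: append the country code as numbers
    # (F=15, I=18) followed by "00", then take 98 minus the remainder
    # of that number modulo 97.
    rearranged = bban + "1518" + "00"
    check = 98 - int(rearranged) % 97
    return f"FI{check:02d}{bban}"

print(finnish_iban("39390001000603"))  # prints FI9639390001000603
```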

In conclusion, the third party is provided with the user's IP address, bank account number, addresses of subpages they visit, and unique identifiers of all transactions they make. The analytics company should also have no difficulty matching the user with its own database collected from other sites.

The script was eventually removed from the site.

Feb 8, 2015

Receiving RDS with the RTL-SDR

redsea is a command-line RDS decoder. I originally wrote it as a script to decode RDS from demultiplexed FM stereo sound. Since then I've experimented with other ways to read the bits, and the latest addition is support for the RTL-SDR television receiver via the rtl_fm tool.

Redsea is on GitHub. It has minimal dependencies (Perl core modules, C standard library, rtl-sdr command-line tools) and has been tested to work on OS X and Linux with good enough FM reception. All test results, ideas, and pull requests are welcome.

What it says

The program prints out decoded RDS groups, one group per line. Each group will contain a PI code identifying the station plus varying other data, depending on the group type. The below picture explains the types of data you'll probably most often encounter.

A more verbose output can be enabled with the -l option (it contains the same information though). The -t option prefixes all groups with an ISO timestamp.

How it works

The DSP side of my program, named rtl_redsea, is written in C99. It's a synchronous DBPSK receiver that first bandpass filters ① the multiplex signal. A PLL locks onto the 19 kHz stereo pilot tone; its third harmonic (57 kHz) is used to regenerate the RDS subcarrier. Dividing the 19 kHz pilot frequency by 16 also gives us the 1187.5 Hz clock frequency. Phase offsets of these derived signals are adjusted separately.

The local 57 kHz carrier is synchronized so that the constellation lines up on the real axis, so we can work on the real part only ②. Biphase symbols are multiplied by the square-wave clock and integrated ③ over a clock period, and then dumped into a delta decoder ④, which outputs the binary data as bit strings into stdout ⑤.
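The integrate-and-dump and delta decoding steps can be sketched in a few lines of Python. This is an illustration of the idea only, not the C99 code in rtl_redsea; the function names, the eight samples per symbol and the toy signal are all assumptions.

```python
def integrate_and_dump(samples, clock, samples_per_symbol):
    """Multiply by the square-wave clock and sum over each symbol period."""
    symbols = []
    last_start = len(samples) - samples_per_symbol
    for start in range(0, last_start + 1, samples_per_symbol):
        acc = sum(samples[start + i] * clock[i]
                  for i in range(samples_per_symbol))
        symbols.append(acc)
    return symbols

def delta_decode(symbols):
    """Differential decoding: a sign change between symbols is a 1 bit."""
    bits = []
    prev = symbols[0] > 0
    for value in symbols[1:]:
        cur = value > 0
        bits.append(int(cur != prev))
        prev = cur
    return bits

# Toy biphase signal: each symbol is a clock period, possibly inverted.
clock = [1, 1, 1, 1, -1, -1, -1, -1]
inverted = [-x for x in clock]
signal = clock + clock + inverted + inverted + clock

symbols = integrate_and_dump(signal, clock, 8)
print(symbols)                # [8, 8, -8, -8, 8]
print(delta_decode(symbols))  # [0, 1, 0, 1]
```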

Signal quality is estimated a couple of times per second by counting the number of "suspicious" integrated biphase symbols, i.e. symbols with halves of opposite signs. The symbols are being sampled with a 180° phase shift as well, and we can switch to that stream if it seems to produce better results.
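A rough Python sketch of that quality estimate, under the assumption that a good symbol has both halves on the same side of zero once it has been multiplied by the clock. The names and the toy signal are hypothetical; the real estimator in rtl_redsea may well differ in detail.

```python
def suspicious_fraction(samples, clock, samples_per_symbol):
    """Fraction of symbols whose two integrated halves disagree in sign."""
    half = samples_per_symbol // 2
    suspicious = total = 0
    last_start = len(samples) - samples_per_symbol
    for start in range(0, last_start + 1, samples_per_symbol):
        product = [samples[start + i] * clock[i]
                   for i in range(samples_per_symbol)]
        if sum(product[:half]) * sum(product[half:]) < 0:
            suspicious += 1
        total += 1
    return suspicious / total if total else 0.0

clock = [1, 1, 1, 1, -1, -1, -1, -1]
inverted = [-x for x in clock]
signal = clock + inverted + clock + inverted

aligned = suspicious_fraction(signal, clock, 8)
shifted = suspicious_fraction(signal[4:], clock, 8)  # half a symbol late
print(aligned, shifted)  # 0.0 1.0
```

Comparing the two fractions is enough to decide which of the two sampling phases to trust.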

This low-throughput binary string data is then handled by redsea.pl via a pipe. Synchronization and error detection/correction happen there, as well as decoding. Group data is then displayed on the terminal, in semi-human-readable form.
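For readers unfamiliar with RDS error detection, here is a minimal Python sketch of the block check. Each 26-bit block carries 16 information bits and a 10-bit checkword, which is a CRC with an offset word added that identifies the block's position in the group. The generator polynomial and offset words are the ones I know from the RDS standard; the function names are hypothetical and this is not how redsea.pl is necessarily written.

```python
GENERATOR = 0x5B9  # x^10 + x^8 + x^7 + x^5 + x^4 + x^3 + 1
OFFSETS = {"A": 0x0FC, "B": 0x198, "C": 0x168, "C'": 0x350, "D": 0x1B4}

def crc10(info):
    """Remainder of info(x) * x^10 divided by the generator polynomial."""
    reg = info << 10
    for bit in range(25, 9, -1):
        if reg & (1 << bit):
            reg ^= GENERATOR << (bit - 10)
    return reg & 0x3FF

def make_block(info, offset_name):
    """Build a 26-bit block from a 16-bit information word."""
    return (info << 10) | (crc10(info) ^ OFFSETS[offset_name])

def block_type(block):
    """Return the offset name if the checkword matches, else None."""
    info, check = block >> 10, block & 0x3FF
    residue = crc10(info) ^ check
    for name, offset in OFFSETS.items():
        if residue == offset:
            return name
    return None

block = make_block(0x1234, "A")
print(block_type(block))      # A
print(block_type(block ^ 1))  # None: a flipped check bit is detected
```

Sliding a 26-bit window over the bit stream until consecutive blocks report A, B, C, D is one simple way to find group synchronization.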

Future

My ultimate goal is to have a tool useful for FM DX, i.e. pretty good noise resistance.

Jan 16, 2015

My chip collection

Old IC (integrated circuit) packages are fun and I collect them. This involves going to flea markets to look for cheap vintage electronics like telephones, answering machines, radios or toys, and then desoldering and salvaging all the ICs and other interesting parts. Selected packages from my disorganized pile of chips follow. Most are POTS-related.

Sony CXA1619BS

A "one-chip-wonder", this is an FM/AM radio in a small package. It takes an RF signal (from the antenna) and an IF oscillator frequency as inputs and outputs demodulated monaural audio.

Sanyo LA2805

This chip does general answering machine related tasks. It has a tape preamp for recording and playback; voice detector logic; beep detection using zero-crossing comparison; power amplifier; line amplifier; and pins for interfacing with a microcontroller.

Unicorn Microelectronics UM91215C

The UM91215C is a tone/pulse dialer. A telephone keyboard matrix is connected to the input pins, and the chip outputs DTMF-encoded audio or pulsed digits, depending on the selected dialing mode. An external oscillator needs to be connected as well. It can do a one-key redial of the last dialed number, and it can also flash the phone line.

Holtek HT9170

A DTMF receiver, reversing the operation of UM91215C above. The chip, employing filters and zero-crossing detectors, is fed an external oscillator frequency and telephone line audio, and it outputs a four-bit code corresponding to the DTMF digit present in the signal. The use of external components is minimal, but a crystal oscillator is needed in this case as well.
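What these two chips do in hardware can be imitated in software. The Python sketch below generates the two-tone DTMF signal for a key and detects it again with the Goertzel algorithm. Note that Goertzel is a swapped-in technique: the HT9170 uses filters and zero-crossing detectors instead. The sample rate, block length and names are assumptions.

```python
import math

ROWS = (697, 770, 852, 941)      # low-group frequencies in Hz
COLS = (1209, 1336, 1477, 1633)  # high-group frequencies in Hz
KEYS = ("123A", "456B", "789C", "*0#D")

def dtmf_tone(key, rate=8000, n=205):
    """Sum of the row and column sine waves for one key."""
    for r, row in enumerate(KEYS):
        if key in row:
            f1, f2 = ROWS[r], COLS[row.index(key)]
            return [0.5 * math.sin(2 * math.pi * f1 * i / rate) +
                    0.5 * math.sin(2 * math.pi * f2 * i / rate)
                    for i in range(n)]
    raise ValueError("not a DTMF key: " + key)

def goertzel_power(samples, freq, rate):
    """Signal power at a single frequency (Goertzel algorithm)."""
    coeff = 2 * math.cos(2 * math.pi * freq / rate)
    s1 = s2 = 0.0
    for x in samples:
        s1, s2 = x + coeff * s1 - s2, s1
    return s1 * s1 + s2 * s2 - coeff * s1 * s2

def detect_key(samples, rate=8000):
    """Pick the strongest row and column frequency."""
    row = max(range(4), key=lambda i: goertzel_power(samples, ROWS[i], rate))
    col = max(range(4), key=lambda i: goertzel_power(samples, COLS[i], rate))
    return KEYS[row][col]

print(detect_key(dtmf_tone("5")))  # 5
```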

SGS-Thomson TDA1154

A speed regulator for DC motors, this chip can keep a motor running at a very stable speed under varying load conditions. In an answering machine, it is needed to keep distortions in tape audio to a minimum.

Toshiba TC8835AN

This chip can store and play back a total of 16 audio recordings of 512 kilobits in size. It also contains a lot of command logic, explained in a 40-page datasheet. Type of audio encoding is not specified, but the bitrate can be chosen between 22 kbps and 16 kbps. The analog output must be filtered prior to playback.

Intel 8049

This monster of a chip is a 6 MHz, 8-bit microcontroller with 17 registers, 2 kilobytes ROM, 128 bytes RAM, and an instruction set of 90 codes. It's used in many older devices, from telephones to digital multimeters.

Oct 30, 2014

Visualizing hex dumps with Unicode emoji

Memorizing SSH public key fingerprints can be difficult; they're just long random numbers displayed in base 16. There are some terminal-friendly solutions, like OpenSSH's randomart. But because I use a Unicode terminal, I like to map the individual bytes into characters in the Miscellaneous Symbols and Pictographs block.

What's happening here? First we create a 256-element array containing a hand-picked collection of emoji. Naturally, they're all assigned an index from 0x00 to 0xff. Then we'll loop through standard input and look for lines containing colon-separated hex bytes. Each hex value is replaced with an emoji from the array.
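The steps described above can be sketched in Python roughly as follows. This is a reconstruction of the idea, not the original script: instead of a hand-picked collection it simply assumes the first 256 code points of the Miscellaneous Symbols and Pictographs block, and the names are hypothetical.

```python
import re
import sys

# Assumption: 256 consecutive code points starting at U+1F300 stand in
# for the hand-picked emoji collection.
EMOJI = [chr(0x1F300 + i) for i in range(256)]

# Two or more colon-separated hex bytes, e.g. "ab:cd:ef".
HEX_BYTES = re.compile(r"\b[0-9a-fA-F]{2}(?::[0-9a-fA-F]{2})+\b")

def emojify(line):
    """Replace each colon-separated hex byte with its emoji."""
    def replace(match):
        return "".join(EMOJI[int(byte, 16)]
                       for byte in match.group(0).split(":"))
    return HEX_BYTES.sub(replace, line)

def main():
    """Filter standard input, e.g. the output of ssh-keygen -l."""
    for line in sys.stdin:
        sys.stdout.write(emojify(line))
```

Calling main() turns the sketch into a pipe filter; text without colon-separated bytes passes through unchanged.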

Here's the output:

The script could easily be extended to support output from other hex-formatted sources as well, such as xxd:

kissofoni; tassun kynsi neulana / musa korvista kajahtaa (Finnish, roughly: "cat-o-phone; a paw's claw for a needle / music blaring from the ears")

Jul 14, 2014

Mapping microwave relay links from video

Radio networks are often at least partially based on microwave relay links. They're those little mushroom-like appendages growing out of cell towers and building-mounted base stations. Technically, they're carefully directed dish antennas linking such towers together over a line-of-sight connection. I'm collecting a little map of nearby link stations, trying to find out how they're interconnected and which network they belong to.

Circling around

We can find a rough direction for any link antenna by approximating a tangent for the dish shroud surface from position-stamped video footage taken while circling the tower. Optimally we would have a drone make a full circle around the tower at a constant distance and elevation to map all antennas at once; but if our DJI Phantom has run out of battery, a GPS-positioned still camera at ground level will also do.

The rest can be done manually, or using pattern recognition from OpenCV. In these pictures, the ratio of the diameters of the concentric circles is a sinusoidal function of the angle between the antenna direction and the camera direction. At its maximum, we're looking straight at the beam. (The ratio won't max out at unity in this case, because we're looking at the antenna slightly from below.) We can select the frame with the maximum ratio from high-speed footage, or we can interpolate a smooth sinusoid to get an even better value.
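Interpolating the sinusoid can be done with an ordinary least-squares fit. The Python sketch below assumes we already have, for each frame, the bearing from the tower to the camera and the measured diameter ratio, and that the ratio follows A·cos(bearing − azimuth) with no constant offset. The names and sample values are hypothetical.

```python
import math

def fit_azimuth(bearings_deg, ratios):
    """Fit ratio = a*cos(t) + b*sin(t); return (peak bearing, amplitude)."""
    scc = scs = sss = src = srs = 0.0
    for deg, ratio in zip(bearings_deg, ratios):
        c = math.cos(math.radians(deg))
        s = math.sin(math.radians(deg))
        scc += c * c
        scs += c * s
        sss += s * s
        src += ratio * c
        srs += ratio * s
    # Solve the 2x2 normal equations with Cramer's rule.
    det = scc * sss - scs * scs
    a = (src * sss - srs * scs) / det
    b = (srs * scc - src * scs) / det
    return math.degrees(math.atan2(b, a)) % 360, math.hypot(a, b)

# Synthetic measurements from a partial arc around the tower.
bearings = list(range(250, 340, 10))
ratios = [0.9 * math.cos(math.radians(t - 290)) for t in bearings]

azimuth, amplitude = fit_azimuth(bearings, ratios)
print(round(azimuth, 1), round(amplitude, 2))  # 290.0 0.9
```

The fit uses every frame instead of only the best one, so it should be less sensitive to a noisy measurement near the peak.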

This particular antenna is pointing west-northwest with an azimuth of 290°.

What about distance?

Because of the line-of-sight requirement, we also know the maximum possible distance to the linked tower, using the formula 7140 × √(4 / 3 × h), where h is the height of the antenna above ground in metres and the result is in metres. (The constant corresponds to both antennas standing at the same height, with 4/3 as the usual effective Earth radius factor for radio waves.) If the beam happens to hit a previously mapped tower closer than this distance, we can assume they're connected!

This antenna is communicating with a tower no farther than 48 km away. Judging from the building it's standing on, it belongs to a government trunked radio network.
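The line-of-sight formula is easy to evaluate in Python. The sketch assumes heights and distances in metres; the 34 m antenna height is a hypothetical value chosen only because it lands near the 48 km figure.

```python
import math

def max_link_distance_m(height_m, k=4 / 3):
    """Maximum line-of-sight distance between two antennas of equal height."""
    return 7140 * math.sqrt(k * height_m)

def height_for_distance_m(distance_m, k=4 / 3):
    """Inverse: antenna height needed to reach a given distance."""
    return (distance_m / 7140) ** 2 / k

print(round(max_link_distance_m(34) / 1000, 1), "km")
print(round(height_for_distance_m(48000), 1), "m")
```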

Jun 16, 2014

Headerless train announcements

The Finnish state railway company just changed their automatic announcement voice, discarding old recordings from trains. It's a good time for some data dumpster diving for the old ones, don't you think?

What turns up is a 67-megabyte ISO 9660 image that once belonged to an older-type onboard announcement device. It contains a file system of 58 directories with five-digit names, and one called "yleis" (Finnish for "general").

Each directory contains files with three-digit file names. For each number, there's 001.inf, 001.txt and 001.snd. The .inf and .txt files seem to contain parts of announcements as ISO 8859 encoded strings, such as "InterCity train" and "to Helsinki". The .snd files obviously contain the corresponding audio announcements. There's a total of 1950 sound files.

Directory structure

The file system seems to be structurally pointless; there's nothing apparent that differentiates all files in /00104 from files in /00105. Announcements in different languages are numerically separated, though (/001xx = Finnish, /002xx = Swedish, /003xx = English). Track numbers and time readouts are stored sequentially, but there are out-of-place announcements and test files in between. The logic connecting numbers to their meanings is probably programmed into the device for every train route.

Everything can be spliced together from almost single words. But many common announcements are also recorded as whole sentences, probably to make them sound more natural.

Audio format

The audio files are headerless; there is no explicit information about the format, sample rate or sample size anywhere.

The byte histogram (left) and Poincaré plot (right) from baudline suggest a 4-bit sample size; this, along with the fact that all files start with 0x80, is indicative of an adaptive differential PCM encoding scheme.

Unfortunately there are as many variations to ADPCM as there are manufacturers of encoder chips. None of the decoders known by SoX produce clean results. But with the right settings for the OKI-ADPCM decoder we can already hear some garbled speech under heavy Brownian noise.
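For reference, this is roughly what an OKI-style (Dialogic VOX) ADPCM decoder does, as a minimal Python sketch. The step table and index adjustments are the commonly published ones; the nibble order (high nibble first) and the zero initial state are assumptions, and the announcement device evidently uses some other variant.

```python
STEPS = [
    16, 17, 19, 21, 23, 25, 28, 31, 34, 37, 41, 45, 50, 55, 60, 66,
    73, 80, 88, 97, 107, 118, 130, 143, 157, 173, 190, 209, 230, 253,
    279, 307, 337, 371, 408, 449, 494, 544, 598, 658, 724, 796, 876,
    963, 1060, 1166, 1282, 1411, 1552,
]
INDEX_ADJUST = [-1, -1, -1, -1, 2, 4, 6, 8]

def decode_oki_adpcm(data):
    """Decode 4-bit ADPCM bytes into a list of 12-bit signed samples."""
    sample, index = 0, 0
    out = []
    for byte in data:
        for nibble in (byte >> 4, byte & 0x0F):  # high nibble first
            step = STEPS[index]
            diff = step >> 3
            if nibble & 4:
                diff += step
            if nibble & 2:
                diff += step >> 1
            if nibble & 1:
                diff += step >> 2
            sample += -diff if nibble & 8 else diff
            sample = max(-2048, min(2047, sample))
            index = max(0, min(48, index + INDEX_ADJUST[nibble & 7]))
            out.append(sample)
    return out

print(decode_oki_adpcm(b"\x80"))  # [-2, 0]
print(decode_oki_adpcm(b"\x7f"))  # [30, -33]
```

Every byte yields two samples, so the decoded stream has twice as many samples as the file has bytes.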

For unknown reasons, the output signal from SoX is spectrum-inverted. Luckily it's trivial to fix (see my previous post on frequency inversion). The pitch sounds roughly natural when a 19,000 Hz sampling rate is assumed. A test tone found in one file comes out as a 1000 Hz sine when the sampling rate is further refined to 18,930 Hz.
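Both corrections are one-liners. The Python sketch below flips the spectrum by negating every other sample, which mirrors each frequency f to half the sampling rate minus f, and refines the sampling rate from the measured pitch of the test tone. The function names are hypothetical and the measured frequency is derived from the figures above, not an actual measurement.

```python
import math

def invert_spectrum(samples):
    """Negate every other sample, mirroring f to rate/2 - f."""
    return [-s if n % 2 else s for n, s in enumerate(samples)]

def refine_rate(assumed_rate, measured_hz, true_hz=1000.0):
    """Sampling rate at which the test tone lands on its true frequency."""
    return assumed_rate * true_hz / measured_hz

# A 1 kHz cosine sampled at 8 kHz turns into a 3 kHz cosine.
rate = 8000
tone = [math.cos(2 * math.pi * 1000 * n / rate) for n in range(64)]
flipped = invert_spectrum(tone)
expected = [math.cos(2 * math.pi * 3000 * n / rate) for n in range(64)]
print(max(abs(a - b) for a, b in zip(flipped, expected)) < 1e-9)  # True

# At an assumed 19,000 Hz the tone would read about 1003.7 Hz.
print(round(refine_rate(19000, 1000 * 19000 / 18930)))  # 18930
```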

This is what we get after frequency inversion, spectral equalization, and low-pass filtering:

There's still a high noise floor due to the mismatch between OKI-ADPCM and the unknown algorithm used by the announcement device, but it's starting to sound alright!

Peculiarities

There seems to be an announcement for every conceivable situation, such as:

  • "Ladies and Gentlemen, as due to heavy snowfall, we are running slightly late. Please accept our apologies."
  • "Ladies and Gentlemen, an animal has been run over by the train. We have to wait a while before continuing the journey."
  • "Ladies and Gentlemen, the arrival track of the train having been changed, the platform is on your left hand side."
  • "Ladies and Gentlemen, we regret to inform you that today the restaurant-car is exceptionally closed."

Also, there is an English recording of most announcements, even though only Finnish and Swedish are usually heard on commuter trains.

One file contains a long instrumental country song.

In an eerily out-of-place sound file, a small child reads out a list of numbers.

Final words

This is something I've wanted to do with this almost melodically intonated announcement about ticket selling compartments.

Jun 9, 2014

Time-coding audio files

One day you'll need to include real-time UTC timestamps in audio. It's useful when reconstructing events from long, unsupervised surveillance microphone recordings, or when constantly monitoring and logging radio channels.

There's no standard method for doing this with WAV or FLAC files. One method would be to log the start time in the filename and calculate the time based on audio position. However, this is not possible with voice-activated or squelched recorders. It also relies on the accuracy and stability of the ADC clock.

I'll take a look at some ways to include an accurate timestamp in-band, directly in the audio.

Least significant bit

Time information can be encoded in the least significant bit (LSB) of the 16-bit PCM samples. This "steganographic" method requires a lossless file format and lossless conversions. The script below truncates all samples of a raw single-channel signed-integer PCM stream to 15 bits and inserts a 20-byte ISO 8601 timestamp in ASCII roughly every second, preceded by a "mark" start bit. When played back, the LSB can be zeroed out to get rid of the timestamps. The WAV can also be played as such; the "ticking" sound will be practically inaudible at an amplitude of −96 dB. The outgoing PCM stream is then sent to SoX for WAV encoding.

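use strict;
use warnings;
use DateTime;
 
my $snum    = 0;  # samples counted while no timestamp is being written
my $writing = 0;  # true while timestamp bits are going out
my $bit     = 0;  # value for the LSB of the current sample
my ($sample, $code, $pos);
 
binmode STDIN;
open my $out, "|-", "sox -t .raw -e unsigned-integer -b 16 -r 44100 ".
                    "-c 1 - stamped.wav" or die "can't run sox: $!";
binmode $out;
 
# Read one signed 16-bit sample at a time
while ((read(STDIN, $sample, 2) // 0) == 2) {
 
  $sample = unpack "s", $sample;
 
  if ($writing) {
    # Timestamp bits go out LSB first. The very last bit is forced to
    # 0, which is harmless because the timestamp is 7-bit ASCII.
    $bit = (ord(substr($code, $pos >> 3, 1)) >> ($pos % 8)) & 1;
    if (++$pos >= length($code) << 3) {
      $writing = 0;
      $bit     = 0;
    }
  } elsif ($snum++ % 44100 == 0) {
    # "Mark" start bit; the timestamp follows in the next samples
    $writing = 1;
    $pos     = 0;
    $bit     = 1;
    $code    = DateTime->now()->iso8601();
  }
 
  # Signed to unsigned (offset 0x8000), clear the LSB, insert our bit
  print $out pack("S", ($sample + 0x8000) & 0xFFFE | $bit);
  
}
close $out;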

Note that the start bit of the timestamp will mark the moment the sample reached this script, and it could differ by hundreds of milliseconds from the actual moment of reception at the microphone. Also, the timestamp does not mark the start of a second, but is rather timed by an arbitrary sample counter. One could also poll and write the timestamps in a continuous manner.

The above script could be modified to interface with my squelch script, by only inserting timestamps when squelch is not active. The resulting audio could then be efficiently encoded as FLAC.

lsb-time-read.pl reads back the timestamps, also printing the sample position of each. Below is a sound sample of a clean signal followed by a timestamped one.
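Reading the timestamps back is a matter of scanning the LSBs for a mark bit and collecting the bits that follow. The Python sketch below is not lsb-time-read.pl; it is a self-contained illustration with its own toy encoder, hypothetical names, and a 19-character example timestamp whose length is passed in explicitly.

```python
def embed(samples, stamp, at):
    """Clear all LSBs, then write a mark bit and the stamp, LSB first."""
    out = [s & ~1 for s in samples]
    out[at] |= 1
    data = stamp.encode("ascii")
    for pos in range(len(data) * 8):
        out[at + 1 + pos] |= (data[pos >> 3] >> (pos % 8)) & 1
    return out

def extract(samples, n_bytes):
    """Return (sample position, timestamp) pairs found in the LSBs."""
    stamps = []
    i = 0
    while i < len(samples):
        if samples[i] & 1 and i + n_bytes * 8 < len(samples):
            data = bytearray(n_bytes)
            for pos in range(n_bytes * 8):
                data[pos >> 3] |= (samples[i + 1 + pos] & 1) << (pos % 8)
            stamps.append((i, data.decode("ascii", errors="replace")))
            i += 1 + n_bytes * 8
        else:
            i += 1
    return stamps

audio = list(range(-100, 400))   # stand-in for signed PCM samples
stamp = "2014-06-09T12:00:00"    # hypothetical example timestamp
stamped = embed(audio, stamp, at=10)
print(extract(stamped, len(stamp)))  # [(10, '2014-06-09T12:00:00')]
```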

Lossy-friendly approach

Lossy compression, by definition, does not retain the numeric values of samples, so they can't be treated as bit fields. Instead, we can use an analog modulation scheme like binary FSK. MP3 and Ogg Vorbis encoders will, at a reasonable bit rate, retain the structure of a sufficiently slow FSK burst. This method will work even if the timestamping phase is followed by an analog conversion.

Using the ultrasonic part of the spectrum comes to mind; but unfortunately such high frequencies are mostly removed by a low-pass filter at the encoder. However, we can use the higher end of the remaining spectrum and filter it out afterwards, if the recording consists of narrow-band speech. In the case of squelched conversation, we could write the timestamp only in the beginning of each transmission. This way it could even be in the speech frequencies.

fsk-timestamp.pl embeds the timestamps into PCM data; they can be read back using minimodem --rx --mark 11000 --space 13000 --file stamped.wav -q 1200.
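To show what such a burst looks like, here is a minimal Python sketch of a continuous-phase binary FSK generator using the mark and space frequencies and baud rate from the command above. It is not fsk-timestamp.pl. The 8-N-1 framing (one start bit, eight data bits LSB first, one stop bit) is my assumption about what minimodem expects by default, and I haven't verified that minimodem decodes this exact output.

```python
import math

def fsk_burst(text, mark=11000.0, space=13000.0, baud=1200, rate=44100):
    """Return float samples of an FSK burst carrying ASCII text."""
    bits = []
    for byte in text.encode("ascii"):
        bits.append(0)                                  # start bit (space)
        bits.extend((byte >> i) & 1 for i in range(8))  # data, LSB first
        bits.append(1)                                  # stop bit (mark)
    n_samples = math.ceil(len(bits) * rate / baud)
    phase = 0.0
    out = []
    for n in range(n_samples):
        bit = bits[min(n * baud // rate, len(bits) - 1)]
        # Advancing a single phase accumulator keeps the waveform
        # continuous across bit boundaries.
        phase += 2 * math.pi * (mark if bit else space) / rate
        out.append(0.1 * math.sin(phase))
    return out

burst = fsk_burst("2014-06-09T12:00:00")
print(len(burst), "samples,", round(len(burst) / 44100 * 1000), "ms")
```

The burst would be mixed into the recording at a low level and removed later with a low-pass filter.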

A sound sample follows.