Character recognition, the simple way

Recently I solved a small problem and found it fun enough to write a post about it.

I had a PDF document with numbers in it (ones and zeros). I needed the numbers for a program, but they were embedded as a scanned picture instead of text. Copying them by hand would be boring and error-prone. I wouldn't want any typos, seeing as the numbers themselves were supposed to be part of an error-correcting code.

[Image: A big matrix of ones and zeros.]

Then I thought: perhaps Perl could do this for me! I came up with this:

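#!/usr/bin/perl
use strict;
use warnings;

# The 700x435 image is a grid of 27x27-pixel squares.
my $cell = 27;
my @dark;   # $dark[$col][$row] = number of dark pixels in that square

# Let ImageMagick decode the PNG into raw grayscale, one byte per pixel
# (-depth 8 forces 8-bit samples even if the PNG is stored deeper).
open(my $pixels, '-|', 'convert', 'bitit.png', '-depth', '8', 'gray:-')
  or die "Cannot run convert: $!";
binmode($pixels);
for my $y (0..434) {
  for my $x (0..699) {
    read($pixels, my $pixel, 1) or die "Unexpected end of pixel data";
    $dark[int($x / $cell)][int($y / $cell)]++ if ord($pixel) < 127;
  }
}
close($pixels);

for my $row (0..15) {
  my $value = 0;   # the row's 26 bits, leftmost square first
  for my $col (0..25) {
    my $count = $dark[$col][$row] // 0;
    $value <<= 1;
    if    ($count < 10) { print "  "; }             # empty square, bit stays 0
    elsif ($count < 90) { print "1 "; $value++; }   # little ink: a "1"
    else                { print "0 "; }             # lots of ink: a "0"
  }
  printf(" 0x%07x\n", $value);
}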

Running the script produces:

$ perl bitit.pl 
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1 1 1  0x2000077
0 1                           0 1 0 1 1 1 0 0 1 1 1  0x10002e7
0   1                         0 1 1 1 0 1 0 1 1 1 1  0x08003af
0     1                       0 1 1 0 0 0 0 1 0 1 1  0x040030b
0       1                     0 1 1 0 1 0 1 1 0 0 1  0x0200359
0         1                   0 1 1 0 1 1 1 0 0 0 0  0x0100370
0           1                 0 0 1 1 0 1 1 1 0 0 0  0x00801b8
0             1               0 0 0 1 1 0 1 1 1 0 0  0x00400dc
0               1             0 0 0 0 1 1 0 1 1 1 0  0x002006e
0                 1           0 0 0 0 0 1 1 0 1 1 1  0x0010037
0                   1         0 1 0 1 1 0 0 0 1 1 1  0x00082c7
0                     1       0 1 1 1 0 1 1 1 1 1 1  0x00043bf
0                       1     0 1 1 0 0 0 0 0 0 1 1  0x0002303
0                         1   0 1 1 0 1 0 1 1 1 0 1  0x000135d
0                           1 0 1 1 0 1 1 1 0 0 1 0  0x0000b72
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 1 1 0 0 1  0x00005b9

How does it work? The script has ImageMagick dump the image as raw grayscale bytes, reads them pixel by pixel, divides the picture into 27x27-pixel squares, and counts the dark pixels in every square. A "0" obviously has more black in it than a "1"; the threshold between them (90 dark pixels) was found by experimenting. An empty square has almost no dark pixels at all (fewer than 10) and stands for a zero in this example. Once every square is a bit, calculating a hex value for each row is simple.
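
To see the thresholding on its own, here is a minimal sketch that skips the image reading entirely and works on made-up pixel counts. The helper names classify_cell and row_value and the sample counts are my own inventions for illustration; only the two thresholds (10 and 90) come from the script above, and they would need re-tuning for any other scan.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Thresholds taken from the script above; they were tuned by hand for
# 27x27-pixel squares.
my $EMPTY_BELOW = 10;   # fewer dark pixels than this: blank square
my $ONE_BELOW   = 90;   # fewer dark pixels than this: the glyph "1"

# classify_cell (hypothetical helper): dark-pixel count -> ' ', '1' or '0'.
sub classify_cell {
    my ($dark_pixels) = @_;
    return ' ' if $dark_pixels < $EMPTY_BELOW;
    return '1' if $dark_pixels < $ONE_BELOW;
    return '0';
}

# row_value (hypothetical helper): pack one row of counts into an integer,
# leftmost square first; blank squares count as 0 bits.
sub row_value {
    my @counts = @_;
    my $value = 0;
    for my $count (@counts) {
        $value = ($value << 1) | (classify_cell($count) eq '1' ? 1 : 0);
    }
    return $value;
}

# Made-up counts for a four-square row: "1", blank, "0", "1".
my @counts = (45, 2, 130, 60);
print join(' ', map { classify_cell($_) } @counts), "\n";
printf("0x%x\n", row_value(@counts));
```

For the sample row, the bits come out as 1, 0, 0, 1, so the second line printed is 0x9. Keeping the classification separate from the pixel loop makes it easy to try other thresholds without touching the image at all.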

I ended up having to write only slightly more characters than the image contained! :)