Character recognition, the simple way

Recently I solved a small problem and found the solution amusing enough to write a post about it.

I had a PDF document with numbers in it (ones and zeros). I needed the numbers for a program, but they were embedded as a scanned picture instead of text. Copying them by hand would be boring and error-prone. I wouldn't want any typos, seeing as the numbers themselves were supposed to be part of an error-correcting code.

[Image: A big matrix of ones and zeros.]

Then I thought: perhaps Perl could do this for me! I came up with this:

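#!/usr/bin/perl

# Read the image as raw 8-bit grayscale and count dark pixels per 27x27 square.
open(S,"convert bitit.png gray:-|") or die "cannot run convert: $!";
binmode(S);
for $y (0..434) {
  for $x (0..699) {
    read(S,$a,1);
    $b[int($x / 27)][int($y / 27)] ++ if (ord($a) < 127);
  }
}
close(S);

# Classify every square by its dark-pixel count; build one hex value per row.
for $y (0..15) {
  $byte = 0;
  for $x (0..25) {
    $byte <<= 1;
    $n = $b[$x][$y] // 0;
    if    ($n < 10) { print "  "; }            # empty square, counts as 0
    elsif ($n < 90) { print "1 "; $byte++; }   # little ink: a "1"
    else            { print "0 "; }            # lots of ink: a "0"
  }
  printf (" 0x%07x\n",$byte);
}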

Running the script produces:

$ perl bitit.pl 
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 0 1 1 1  0x2000077
0 1                           0 1 0 1 1 1 0 0 1 1 1  0x10002e7
0   1                         0 1 1 1 0 1 0 1 1 1 1  0x08003af
0     1                       0 1 1 0 0 0 0 1 0 1 1  0x040030b
0       1                     0 1 1 0 1 0 1 1 0 0 1  0x0200359
0         1                   0 1 1 0 1 1 1 0 0 0 0  0x0100370
0           1                 0 0 1 1 0 1 1 1 0 0 0  0x00801b8
0             1               0 0 0 1 1 0 1 1 1 0 0  0x00400dc
0               1             0 0 0 0 1 1 0 1 1 1 0  0x002006e
0                 1           0 0 0 0 0 1 1 0 1 1 1  0x0010037
0                   1         0 1 0 1 1 0 0 0 1 1 1  0x00082c7
0                     1       0 1 1 1 0 1 1 1 1 1 1  0x00043bf
0                       1     0 1 1 0 0 0 0 0 0 1 1  0x0002303
0                         1   0 1 1 0 1 0 1 1 1 0 1  0x000135d
0                           1 0 1 1 0 1 1 1 0 0 1 0  0x0000b72
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 1 0 1 1 1 0 0 1  0x00005b9
$ █
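
The hex column can be cross-checked independently: each printed row is a 26-bit binary number, most significant bit first, with blank squares counting as zero. Here is a minimal sketch of that check. The helper name row_to_hex is made up for illustration and is not part of the script, and the rows are rebuilt with the repetition operator so that the cell counts are explicit.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: turn one printed row into the hex value shown next to it.
# Every cell is two characters wide ("1 ", "0 " or "  "); blanks count as 0.
sub row_to_hex {
    my ($row) = @_;
    my $value = 0;
    for my $i (0 .. 25) {
        my $cell = substr($row, 2 * $i, 1);
        $value = ($value << 1) | ($cell eq "1" ? 1 : 0);
    }
    return sprintf("0x%07x", $value);
}

# First row of the output: a 1, eighteen 0s, then 1 1 1 0 1 1 1.
my $first  = "1 " . ("0 " x 18) . "1 1 1 0 1 1 1";
# Second row: 0 1, thirteen blank cells, then eleven digits.
my $second = "0 1 " . ("  " x 13) . "0 1 0 1 1 1 0 0 1 1 1";

print row_to_hex($first),  "\n";   # prints 0x2000077
print row_to_hex($second), "\n";   # prints 0x10002e7
```

Both values match the ones the script printed for those rows.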

How does it do that? The script reads the image pixel by pixel via ImageMagick, divides it into squares of 27×27 pixels, and counts the dark pixels in every square. The character "0" obviously has more black than "1"; the thresholds (fewer than 10 dark pixels for an empty square, fewer than 90 for a "1") were found by experimenting. An empty square has nearly no black pixels at all and stands for a zero in this example. The hex value of each row is built by shifting in one bit per square.
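
The counting idea can be tried without an image or ImageMagick. What follows is only a sketch under made-up assumptions: a tiny 12×4 "image" written as text, 4×4-pixel cells instead of 27×27, and thresholds scaled down to match. The names classify_cells, $cell, $empty_max and $one_max are illustrative and do not appear in the script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: count dark pixels ('#') in every square cell and
# classify each cell as blank (' '), "1" (little ink) or "0" (lots of ink).
sub classify_cells {
    my ($rows, $cell, $empty_max, $one_max) = @_;
    my $height = @$rows;
    my $width  = length $rows->[0];

    # Tally dark pixels per cell.
    my @count;
    for my $y (0 .. $height - 1) {
        for my $x (0 .. $width - 1) {
            next unless substr($rows->[$y], $x, 1) eq '#';
            $count[int($y / $cell)][int($x / $cell)]++;
        }
    }

    # Turn the tallies into one string per row of cells.
    my @result;
    for my $cy (0 .. int(($height - 1) / $cell)) {
        my $line = '';
        for my $cx (0 .. int(($width - 1) / $cell)) {
            my $n = $count[$cy][$cx] // 0;
            $line .= $n < $empty_max ? ' ' : $n < $one_max ? '1' : '0';
        }
        push @result, $line;
    }
    return @result;
}

# Three cells side by side: a thin "1" (4 dark pixels), an empty cell,
# and a ring-shaped "0" (12 dark pixels).
my @image = (
    ".#......####",
    ".#......#..#",
    ".#......#..#",
    ".#......####",
);

print "[$_]\n" for classify_cells(\@image, 4, 2, 8);   # prints [1 0]
```

The brackets in the output only make the blank middle cell visible. With real scans the two thresholds would have to be tuned by hand, as they were for the script.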

I ended up having to write only slightly more characters than the image contained! :)

2 comments:

  1. Guess how the old bank checks with the magnetic ink were read... There was no fancy OCR or pixel-by-pixel recognition. Each digit was shaped in a weird 1970s sci-fi computer font that had a different detectable amount of ink for each digit from 0 through 9. The different amounts of ink produced different magnetic flux levels and therefore different voltages, giving a 10-level logic signal that was then sampled and converted to binary for the computer.
