How to Improve Dot Matrix OCR Performance

A while ago I needed to scan a faded report that was printed in 1989 on a 9-pin dot-matrix printer. Unfortunately, current optical character recognition (OCR) software seems to work best on freshly produced laser-printed documents. After much trial and error, I developed a method of processing the images of the scanned pages to improve the accuracy of the ReadIris Pro 8 OCR application when recognizing faded, dot-matrix printouts.

I scanned the dot-matrix printouts at 300dpi and saved them as greyscale images with 256 shades of grey. I then used Photoshop to connect the matrix of dots so that each letter more closely resembles the solid letters that ReadIris Pro recognizes more accurately.

I connected the dots in four steps:

  1. Scan the document;
  2. Find the edges of the dots that make up each letter;
  3. Threshold the greyscale image to produce a black & white image; and
  4. Blur the black & white image to connect the dots.

I then used ReadIris Pro to recognize the text in the image produced in step 4, using the following settings:

  • Settings > Font Type Dot Matrix
  • Settings > Character Pitch Automatic

The following sections show the result of applying each step to the top-left portion of a scan of a faded, dot-matrix printout.

Step 1: Scan the Document

Scan the document at 300dpi and save it as a greyscale image with 256 levels of grey.

Dot Matrix OCR Step 1

Step 2: Find the Edges

Select Filter > Stylize > Find Edges to apply the Find Edges filter to the image scanned in at step 1.

Dot Matrix OCR Step 2
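
For readers without Photoshop, the idea behind this step can be sketched in plain Python. This is only an approximation: the exact algorithm behind Photoshop's Find Edges filter is not stated here, so the sketch assumes a Sobel gradient whose magnitude is inverted to give dark edges on a white background. The function name `find_edges` and the list-of-rows image representation are hypothetical choices made for the example.

```python
import math


def find_edges(img):
    """Return an inverted Sobel edge map of a greyscale image.

    `img` is a list of rows, each a list of ints in 0..255 (an assumed,
    illustrative representation). Flat areas come out white (255) and
    strong edges come out black (0).
    """
    h, w = len(img), len(img[0])

    def px(y, x):
        # Clamp coordinates so border pixels reuse their nearest neighbour.
        return img[min(max(y, 0), h - 1)][min(max(x, 0), w - 1)]

    out = []
    for y in range(h):
        row = []
        for x in range(w):
            # Horizontal and vertical Sobel responses.
            gx = (px(y - 1, x + 1) + 2 * px(y, x + 1) + px(y + 1, x + 1)
                  - px(y - 1, x - 1) - 2 * px(y, x - 1) - px(y + 1, x - 1))
            gy = (px(y + 1, x - 1) + 2 * px(y + 1, x) + px(y + 1, x + 1)
                  - px(y - 1, x - 1) - 2 * px(y - 1, x) - px(y - 1, x + 1))
            magnitude = min(255, int(math.hypot(gx, gy)))
            row.append(255 - magnitude)  # invert: dark edges on white
        out.append(row)
    return out


# A white-to-black vertical step produces a dark line along the boundary.
step = [[255, 255, 255, 0, 0, 0] for _ in range(5)]
edges = find_edges(step)
```

On a real scan, each printed dot would become a small dark outline, which the later steps turn into solid strokes.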

Step 3: Threshold the Greyscale Image

Select Image > Adjust > Threshold to apply a threshold of 200 to the image produced at step 2.

Dot Matrix OCR Step 3
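
The thresholding itself is simple enough to sketch in a few lines of Python. The level of 200 comes from the step above; the function name `threshold`, the list-of-rows image representation, and the choice to treat a pixel exactly equal to the level as white are assumptions made for the example.

```python
def threshold(img, level=200):
    """Convert a greyscale image to pure black and white.

    `img` is a list of rows of ints in 0..255 (an assumed representation).
    Pixels at or above `level` become white (255); all others become black (0).
    """
    return [[255 if p >= level else 0 for p in row] for row in img]


# Light greys survive as white; anything darker than 200 becomes black.
sample = [[0, 199, 200, 255]]
result = threshold(sample)
```

A high level such as 200 pushes even faint grey edge pixels to black, which helps with faded printouts.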

Step 4: Blur the Black & White Image

Select Filter > Blur > Blur More to apply the Blur More filter to the image produced at step 3.

Dot Matrix OCR Step 4
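
To see why a small blur connects neighbouring dots, here is a rough Python sketch. Photoshop's Blur More kernel is not specified here, so the sketch substitutes a plain 3x3 box blur; the function name `box_blur` and the list-of-rows image representation are hypothetical.

```python
def box_blur(img):
    """Apply a 3x3 box blur to a greyscale image.

    `img` is a list of rows of ints in 0..255 (an assumed representation).
    Each output pixel is the integer average of its 3x3 neighbourhood,
    with coordinates clamped at the image borders.
    """
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            total = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    yy = min(max(y + dy, 0), h - 1)
                    xx = min(max(x + dx, 0), w - 1)
                    total += img[yy][xx]
            row.append(total // 9)
        out.append(row)
    return out


# Two 2-pixel black dots separated by a 1-pixel white gap.
dots = [[255, 0, 0, 255, 0, 0, 255]]
blurred = box_blur(dots)
# blurred == [[170, 85, 85, 85, 85, 85, 170]]
```

In this small example the white gap ends up as dark as the dots on either side, so the two dots read as one continuous stroke.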

I did not measure the improvement (for example, the increase in the number of characters correctly identified), so I cannot quantify the performance of my method. However, it did reduce the number of incorrect characters.
