How to increase accuracy of Tesseract (2024)


What is OCR?

OCR stands for “Optical Character Recognition”. It is a technology that recognizes text within an image, and it is commonly used on scanned documents and photographs. OCR software can convert a physical paper document or an image into an accessible electronic version with selectable text.

Why OCR?

OCR is a business solution for automating data extraction: it takes printed or written text from a scanned document or image file and converts it into a machine-readable form that can be used for data processing such as editing or searching.

What is Tesseract OCR?

Tesseract OCR is an optical character recognition engine originally developed at HP Laboratories starting in 1985 and open sourced in 2005. From 2006, its development was sponsored by Google.
Tesseract has Unicode (UTF-8) support and can recognize more than 100 languages “out of the box”, so it can also be used to build scanning software for different languages.
Tesseract 4 added a new neural net (LSTM) based OCR engine focused on line recognition, while still supporting the legacy Tesseract OCR engine, which works by recognizing character patterns.

How does OCR work?
Generally, OCR works as follows:

  • Preprocess the image data, for example: convert to grayscale, smooth, de-skew, filter.
  • Detect lines, words and characters.
  • Produce a ranked list of candidate characters based on the trained data set (in Tess4J, the setDatapath() method sets the path to the trained data).
  • Post-process the recognized characters: choose the best characters based on the confidence from the previous step and on language data. Language data includes a dictionary, grammar rules, etc.


OCR accuracy on unclear images

In most cases the input image is noisy, and so the OCR output is noisy too. To deal with this, the image has to be cleaned up first, a step known as image processing.
Tesseract performs some image processing implicitly by default, but that is not enough to obtain high accuracy on a noisy image.
That is why we need to apply explicit image processing techniques such as:

  1. Fix the DPI if needed (300 DPI is the minimum).
  2. Fix the text size (e.g. 12 pt should be fine).
  3. Straighten the text lines (de-skew and de-warp the text).
  4. Even out the illumination of the image (e.g. no dark areas).
  5. Convert the image to grayscale.
  6. Binarize the grayscale image and remove noise.

1. Scaling the image to the right size

For better accuracy, images should be scaled to at least 300 DPI (dots per inch). A DPI below 200 tends to give unclear and incomprehensible results, while a DPI above 600 unnecessarily increases the size of the output file without improving its quality. A DPI of around 300 therefore works best for this purpose.

Set DPI of image

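Java

import net.sourceforge.tess4j.Tesseract;

// Tell Tesseract which resolution to assume for the input image
Tesseract it = new Tesseract();
it.setTessVariable("user_defined_dpi", "300");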
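The setting above only tells Tesseract which resolution to assume. If the source image itself has too few pixels, it may also help to resample it. The following is a minimal sketch using only the JDK; the class name DpiScaler, the method scaleToDpi and the idea that the caller knows the source DPI are assumptions made for illustration, not part of the original article.

```java
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper: resamples an image from an assumed source DPI to a target DPI.
final class DpiScaler {

    static BufferedImage scaleToDpi(BufferedImage src, double sourceDpi, double targetDpi) {
        if (sourceDpi <= 0 || targetDpi <= 0) {
            throw new IllegalArgumentException("DPI values must be positive");
        }
        // Pixel dimensions grow by the ratio of the two resolutions
        double factor = targetDpi / sourceDpi;
        int newWidth = Math.max(1, (int) Math.round(src.getWidth() * factor));
        int newHeight = Math.max(1, (int) Math.round(src.getHeight() * factor));

        BufferedImage dst = new BufferedImage(newWidth, newHeight, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        try {
            // Bicubic interpolation usually keeps glyph edges smoother than the default
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BICUBIC);
            g.drawImage(src, 0, 0, newWidth, newHeight, null);
        } finally {
            g.dispose();
        }
        return dst;
    }

    public static void main(String[] args) {
        // Synthetic 100 x 50 image, assumed to have been scanned at 150 DPI
        BufferedImage lowRes = new BufferedImage(100, 50, BufferedImage.TYPE_INT_RGB);
        BufferedImage scaled = scaleToDpi(lowRes, 150, 300);
        System.out.println(scaled.getWidth() + " x " + scaled.getHeight());
    }
}
```

Upscaling cannot recover detail that was never captured, so rescanning at a higher resolution is preferable when the original document is available.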

2. Increasing contrast of image

Low contrast can result in poor OCR. Increasing the contrast between the text and its background makes the characters stand out more clearly and improves the output.
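One simple way to raise contrast is a linear stretch that maps the darkest pixel to 0 and the brightest to 255. The sketch below assumes the input is already a TYPE_BYTE_GRAY image; the class name ContrastStretch and the choice of a min-max stretch are illustrative assumptions, since the article does not name a specific technique.

```java
import java.awt.image.BufferedImage;
import java.awt.image.WritableRaster;

// Hypothetical helper: linear (min-max) contrast stretch for 8-bit grayscale images.
final class ContrastStretch {

    static BufferedImage stretch(BufferedImage gray) {
        if (gray.getType() != BufferedImage.TYPE_BYTE_GRAY) {
            throw new IllegalArgumentException("Expected a TYPE_BYTE_GRAY image");
        }
        int width = gray.getWidth();
        int height = gray.getHeight();
        WritableRaster in = gray.getRaster();

        // First pass: find the darkest and brightest sample
        int min = 255;
        int max = 0;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int v = in.getSample(x, y, 0);
                min = Math.min(min, v);
                max = Math.max(max, v);
            }
        }

        // Second pass: map [min, max] onto [0, 255]; a flat image is copied unchanged
        BufferedImage out = new BufferedImage(width, height, BufferedImage.TYPE_BYTE_GRAY);
        WritableRaster dst = out.getRaster();
        int range = max - min;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int v = in.getSample(x, y, 0);
                int stretched = (range == 0) ? v : (v - min) * 255 / range;
                dst.setSample(x, y, 0, stretched);
            }
        }
        return out;
    }
}
```

A plain min-max stretch is sensitive to a single very dark or very bright pixel, so on real scans it is often combined with noise removal or percentile clipping.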

3. Image Binarization

Binarization is the process of converting a grayscale image into a pure black-and-white image.
Getting the RGB content of the image

// Sample the packed RGB value of the pixel at the centre of the image
double d = ipimage.getRGB(ipimage.getTileWidth() / 2, ipimage.getTileHeight() / 2);

From the sampled RGB value we can choose the scale factor and offset that RescaleOp later applies to the pixel values of the image.
Creating a 2D drawing surface on the buffered image for the new image

Drawing the image using a Graphics2D object

Java

// Target canvas of 1050 x 1024 pixels with the same pixel type as the input
BufferedImage opimage = new BufferedImage(1050, 1024, ipimage.getType());
Graphics2D graphic = opimage.createGraphics();

Draw the input image onto the new canvas, starting at (0, 0) and scaled to 1050 x 1024 (a zoomed image); null is passed as the ImageObserver.

Java

graphic.drawImage(ipimage, 0, 0, 1050, 1024, null);
graphic.dispose();

Now use a RescaleOp object to rescale the pixel values (brightness and contrast) of the image

Writing the output image

Java

// Each sample is multiplied by scaleFactor and then has offset added (both are floats)
RescaleOp rescale = new RescaleOp(scaleFactor, offset, null);
BufferedImage fopimage = rescale.filter(opimage, null);
ImageIO.write(fopimage, "IMAGE_FILE_TYPE", new File("OUTPUT_IMAGE_FILE_PATH"));

Code for converting an RGB buffered image to a grayscale image.

Program to convert the image to grayscale before binarization

Java

import java.awt.Color;
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import javax.imageio.ImageIO;

// Converts the image to grayscale in place, saves it, and returns the saved copy
public static BufferedImage RGBToGrayScale(BufferedImage image) throws IOException {
    File output = new File("FILE_PATH");
    int width = image.getWidth();
    int height = image.getHeight();
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) {
            Color c = new Color(image.getRGB(x, y));
            // Weighted luminance: green contributes most, blue least
            int gray = (int) (c.getRed() * 0.299 + c.getGreen() * 0.587 + c.getBlue() * 0.114);
            image.setRGB(x, y, new Color(gray, gray, gray).getRGB());
        }
    }
    // Let I/O failures propagate instead of silently swallowing them
    ImageIO.write(image, "jpg", output);
    return ImageIO.read(output);
}
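The listing above stops at grayscale conversion, so the thresholding step itself is still missing. A minimal sketch of that step follows, using Otsu's method to pick a global threshold automatically. Otsu's method is a swapped-in technique chosen for illustration, not something the article prescribes, and the class name OtsuBinarizer and the TYPE_BYTE_GRAY input requirement are assumptions.

```java
import java.awt.image.BufferedImage;
import java.awt.image.WritableRaster;

// Hypothetical helper: global binarization of an 8-bit grayscale image with Otsu's threshold.
final class OtsuBinarizer {

    /** Returns the threshold in 0..255 that maximizes the between-class variance. */
    static int otsuThreshold(BufferedImage gray) {
        WritableRaster raster = gray.getRaster();
        int[] histogram = new int[256];
        for (int y = 0; y < gray.getHeight(); y++) {
            for (int x = 0; x < gray.getWidth(); x++) {
                histogram[raster.getSample(x, y, 0)]++;
            }
        }
        long total = (long) gray.getWidth() * gray.getHeight();
        double sumAll = 0;
        for (int i = 0; i < 256; i++) {
            sumAll += (double) i * histogram[i];
        }

        double sumBackground = 0;
        long weightBackground = 0;
        double bestVariance = -1;
        int bestThreshold = 0;
        for (int t = 0; t < 256; t++) {
            weightBackground += histogram[t];
            if (weightBackground == 0) continue;     // no pixels at or below t yet
            long weightForeground = total - weightBackground;
            if (weightForeground == 0) break;        // every pixel is at or below t
            sumBackground += (double) t * histogram[t];
            double meanBackground = sumBackground / weightBackground;
            double meanForeground = (sumAll - sumBackground) / weightForeground;
            double diff = meanBackground - meanForeground;
            double betweenClassVariance = (double) weightBackground * weightForeground * diff * diff;
            if (betweenClassVariance > bestVariance) {
                bestVariance = betweenClassVariance;
                bestThreshold = t;
            }
        }
        return bestThreshold;
    }

    /** Pixels brighter than the threshold become white (255), all others black (0). */
    static BufferedImage binarize(BufferedImage gray) {
        if (gray.getType() != BufferedImage.TYPE_BYTE_GRAY) {
            throw new IllegalArgumentException("Expected a TYPE_BYTE_GRAY image");
        }
        int threshold = otsuThreshold(gray);
        BufferedImage out = new BufferedImage(gray.getWidth(), gray.getHeight(),
                BufferedImage.TYPE_BYTE_GRAY);
        WritableRaster src = gray.getRaster();
        WritableRaster dst = out.getRaster();
        for (int y = 0; y < gray.getHeight(); y++) {
            for (int x = 0; x < gray.getWidth(); x++) {
                dst.setSample(x, y, 0, src.getSample(x, y, 0) > threshold ? 255 : 0);
            }
        }
        return out;
    }
}
```

A single global threshold works best on evenly lit pages. For scans with shadows or uneven illumination, an adaptive (local) threshold is usually a better fit.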

4. Remove Noise

Noise can drastically reduce the overall quality of the OCR process. It can be present in the background or foreground and can result from poor scanning or the poor original quality of the data.
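A common way to remove isolated specks ("salt and pepper" noise) is a small median filter. The sketch below is one possible approach, not the article's own method; the class name MedianFilter, the 3 x 3 window and the TYPE_BYTE_GRAY input requirement are assumptions made for illustration.

```java
import java.awt.image.BufferedImage;
import java.awt.image.WritableRaster;
import java.util.Arrays;

// Hypothetical helper: 3 x 3 median filter for 8-bit grayscale images.
final class MedianFilter {

    static BufferedImage median3x3(BufferedImage gray) {
        if (gray.getType() != BufferedImage.TYPE_BYTE_GRAY) {
            throw new IllegalArgumentException("Expected a TYPE_BYTE_GRAY image");
        }
        int width = gray.getWidth();
        int height = gray.getHeight();
        WritableRaster src = gray.getRaster();
        BufferedImage out = new BufferedImage(width, height, BufferedImage.TYPE_BYTE_GRAY);
        WritableRaster dst = out.getRaster();

        int[] window = new int[9];
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                int n = 0;
                for (int dy = -1; dy <= 1; dy++) {
                    for (int dx = -1; dx <= 1; dx++) {
                        // Clamp coordinates so border pixels reuse their nearest neighbours
                        int sx = Math.min(width - 1, Math.max(0, x + dx));
                        int sy = Math.min(height - 1, Math.max(0, y + dy));
                        window[n++] = src.getSample(sx, sy, 0);
                    }
                }
                Arrays.sort(window);
                dst.setSample(x, y, 0, window[4]);   // middle of 9 sorted values
            }
        }
        return out;
    }
}
```

The median keeps edges sharper than an averaging blur, but a window that is too large can erase thin strokes such as punctuation, so small windows are the safer starting point for text.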

5. De-skewing of image

De-skewing means rotating the image so that the text lines are horizontal rather than tilted at an angle. If the image is skewed to either side, correct it by rotating it clockwise or anticlockwise by the skew angle.
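Once the skew angle is known, the correction is a plain rotation about the image centre. The sketch below covers only that rotation step and assumes the angle has already been measured by some other means; the class name Deskew, the method name and the white background fill are illustrative assumptions.

```java
import java.awt.Color;
import java.awt.Graphics2D;
import java.awt.RenderingHints;
import java.awt.image.BufferedImage;

// Hypothetical helper: rotates an image about its centre to undo a known skew.
final class Deskew {

    /**
     * @param clockwiseSkewDegrees how far the text appears tilted clockwise;
     *                             the image is rotated by the same amount anticlockwise
     */
    static BufferedImage rotateToLevel(BufferedImage src, double clockwiseSkewDegrees) {
        int width = src.getWidth();
        int height = src.getHeight();
        BufferedImage dst = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        Graphics2D g = dst.createGraphics();
        try {
            // Fill first so the corners uncovered by the rotation stay white, not black
            g.setColor(Color.WHITE);
            g.fillRect(0, 0, width, height);
            g.setRenderingHint(RenderingHints.KEY_INTERPOLATION,
                    RenderingHints.VALUE_INTERPOLATION_BILINEAR);
            // In image coordinates (y grows downward) a positive angle looks clockwise,
            // so the correction uses the negated skew
            g.rotate(Math.toRadians(-clockwiseSkewDegrees), width / 2.0, height / 2.0);
            g.drawImage(src, 0, 0, null);
        } finally {
            g.dispose();
        }
        return dst;
    }
}
```

The output keeps the original dimensions, so content near the corners can be clipped for large angles. For the few degrees of skew typical of scans this is rarely a problem.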

6. Set Tesseract engine to read particular characters only

Java

// Restrict recognition to digits, Latin letters, '/' and the space character
it.setTessVariable("tessedit_char_whitelist", "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz/ ");

With the code above, Tesseract reads only the letters a to z and A to Z, the digits 0 to 9, ‘/’, and ‘ ’ (the space character).

Author: Velia Krajcik
