Tesserocr vs. Pytesseract (2024)

Tesserocr is a Python wrapper around the Tesseract C++ API, whereas Pytesseract is a wrapper around the tesseract-ocr command-line interface (CLI).

Therefore, with Tesserocr you can load the model once at the beginning of your program and then run it as many times as needed (for example, in a loop over video frames). With Pytesseract, each call to image_to_string launches the tesseract executable, which loads the model and then processes the image, so it is slower for video processing.
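The load-once pattern described above can be sketched as follows. This is a minimal illustration, not a benchmark: ocr_frames and ocr_image_files are hypothetical helper names, and the api argument is only assumed to behave like tesserocr.PyTessBaseAPI (its SetImage and GetUTF8Text methods).

```python
def ocr_frames(api, frames):
    """OCR every frame with a single, already-initialised engine.

    `api` is assumed to behave like tesserocr.PyTessBaseAPI
    (SetImage + GetUTF8Text); `frames` is any iterable of PIL images.
    """
    texts = []
    for frame in frames:
        api.SetImage(frame)              # hand the frame to the loaded engine
        texts.append(api.GetUTF8Text())  # recognise it; the model is not reloaded
    return texts


def ocr_image_files(paths):
    """Hypothetical driver: open one engine and reuse it for every path."""
    # Imported here so ocr_frames stays importable without the OCR stack.
    import tesserocr
    from PIL import Image

    with tesserocr.PyTessBaseAPI() as api:  # the model is loaded once, here
        return ocr_frames(api, (Image.open(p) for p in paths))
```

With Pytesseract, the equivalent loop would call pytesseract.image_to_string(frame) for each frame, and every call would start a new tesseract process.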


For my Uniform Commercial Code (UCC) web scraping jobs, I have to convert over 200,000 TIFF files to text. I had used Pytesseract for years, and it was fast enough until my most recent customer, whose job needed something faster. After some research, I came across Tesserocr.

However, I ran into a problem: some of my image files contain multiple pages. I was able to handle this with Pytesseract, but the same solution failed with Tesserocr. My workaround uses ImageMagick (if someone has a better solution, I would like to hear it). I use ImageMagick to create a separate PNG file for each page in the TIFF file, sort the PNG files into page order, and then process each page to produce one text file per TIFF image. Don't forget to set Image.MAX_IMAGE_PIXELS = None to handle very large image files; otherwise, Pillow issues a DecompressionBombWarning. That is pretty much it.

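import os
import re
import subprocess
from datetime import datetime
from glob import glob

import pandas as pd
import tesserocr
from PIL import Image

# Disable Pillow's pixel-count limit so very large scans load without a warning.
Image.MAX_IMAGE_PIXELS = None

# Load the Tesseract model once and reuse it for every page.
api = tesserocr.PyTessBaseAPI()

os.makedirs('scratch', exist_ok=True)
files = glob('*.tif')
filesProcessed = []


def page_number(path):
    # ImageMagick names multi-page output base-0.png, base-1.png, ...
    # Sorting on the number keeps page 10 after page 9.
    match = re.search(r'-(\d+)\.png$', path)
    return int(match.group(1)) if match else 0


for f, file in enumerate(files):
    if f >= 0:  # always true as written; raise the threshold to skip earlier files
        try:
            start = datetime.now()
            text = ''
            base = file[:-4]
            # Split the TIFF into one PNG per page in the scratch directory.
            subprocess.run(['magick', 'convert', '-verbose', file, f'scratch/{base}.png'], check=True)
            pngs = sorted(glob('scratch/*.png'), key=page_number)
            print(len(pngs))
            # OCR each page in order and concatenate the results.
            for png in pngs:
                pil_image = Image.open(png)
                api.SetImage(pil_image)
                text = text + api.GetUTF8Text()
            with open(base + '.txt', 'w', encoding='utf-8') as n:
                n.write(text)
            # Clear the scratch directory before the next TIFF.
            for png in pngs:
                os.remove(png)
            end = datetime.now()
            filesProcessed.append([file, len(pngs), start, end, end - start])
        except Exception as e:
            # Report the actual error; not every failure is a corrupt file.
            print(f'{file} could not be processed: {e}')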

df =…
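As one possible alternative to the ImageMagick step, Pillow itself can step through the pages of a multi-page TIFF using the image's n_frames attribute and seek method. The following is only a sketch under assumptions: iter_pages and ocr_multipage are hypothetical helper names, the api argument is assumed to behave like tesserocr.PyTessBaseAPI, converting each page to grayscale ("L") is an assumption about what the engine accepts comfortably, and the approach has not been tested against the files described above.

```python
def iter_pages(image):
    """Yield each page of a multi-frame image (e.g. a multi-page TIFF) in order."""
    for index in range(getattr(image, "n_frames", 1)):
        image.seek(index)         # move to page `index`
        yield image.convert("L")  # independent grayscale copy of that page


def ocr_multipage(api, image):
    """Concatenate the OCR text of every page, reusing one loaded engine.

    `api` is assumed to expose SetImage and GetUTF8Text, as
    tesserocr.PyTessBaseAPI does.
    """
    texts = []
    for page in iter_pages(image):
        api.SetImage(page)
        texts.append(api.GetUTF8Text())
    return "".join(texts)
```

Usage would look something like ocr_multipage(api, Image.open(file)) inside the loop above, with no scratch directory or PNG cleanup needed.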

Article information

Author: Roderick King
