Published in · 2 min read · Aug 11, 2022
Tesserocr is a Python wrapper around the Tesseract C++ API, whereas Pytesseract is a wrapper for the tesseract-ocr CLI.
With Tesserocr you can therefore load the model once at the beginning of your program and then run it repeatedly (for example, in a loop that processes video frames). With Pytesseract, each call to the image_to_string function starts a new tesseract process, which loads the model and then processes the image, so it is slower for video processing.
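To make the difference concrete, here is a minimal sketch of the load-once pattern. The helper names `ocr_all` and `ocr_files` are hypothetical. `ocr_all` accepts any object that exposes `SetImage` and `GetUTF8Text`, the two `PyTessBaseAPI` methods used in the script further down, so the caller creates the model once and the loop reuses it.

```python
def ocr_all(api, images):
    """Run OCR on each image with one already-initialized API object.

    `api` is expected to behave like tesserocr.PyTessBaseAPI: the model is
    loaded when the object is created, so this loop pays that cost only once.
    """
    texts = []
    for image in images:
        api.SetImage(image)              # hand the next image to the loaded engine
        texts.append(api.GetUTF8Text())  # recognize it and collect the text
    return texts


def ocr_files(paths):
    """Hypothetical usage: open each path with Pillow and OCR it."""
    # Imported here so ocr_all stays importable without these packages.
    import tesserocr
    from PIL import Image

    with tesserocr.PyTessBaseAPI() as api:  # model is loaded once, here
        images = (Image.open(path) for path in paths)
        return ocr_all(api, images)
```

Passing the API object in, rather than creating it inside the loop, is the whole point: the per-image work is only `SetImage` plus `GetUTF8Text`.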
For my Uniform Commercial Code web scraping jobs, I have to convert over 200,000 TIFF files to text. I had used Pytesseract for years, and it was fast enough until my most recent customer. After some research, I found Tesserocr.
However, I ran into a problem: some of my image files contain multiple pages. I was able to handle this with Pytesseract, but the same solution failed with Tesserocr. My workaround uses ImageMagick (if someone has a better solution, I would like to hear it). I use ImageMagick to create a separate PNG file for each page in the TIFF file, sort the PNG files into page order, then process each page and produce one text file per TIFF image. Don’t forget to set Image.MAX_IMAGE_PIXELS = None to handle very large image files; otherwise, you receive a warning. That is pretty much it.
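import os
import subprocess
from datetime import datetime
from glob import glob

import pandas as pd
import tesserocr
from PIL import Image

# Disable Pillow's decompression-bomb size check so very large scans open without a warning.
Image.MAX_IMAGE_PIXELS = None

# Load the Tesseract model once and reuse it for every page.
api = tesserocr.PyTessBaseAPI()
os.makedirs('scratch', exist_ok=True)
files = glob('*.tif')
filesProcessed = []
for f, file in enumerate(files):
    if f >= 0:  # always true as written; raise the threshold to skip the first files in the list
        try:
            start = datetime.now()
            text = ''
            base = file[:-4]
            # Split the TIFF into one PNG per page (ImageMagick names them base-0.png, base-1.png, ...).
            subprocess.run(['magick', 'convert', '-verbose', file, f'scratch/{base}.png'], check=True)
            # Sort by length, then name, so base-10.png follows base-9.png instead of base-1.png.
            pngs = sorted(glob('scratch/*.png'), key=lambda p: (len(p), p))
            print(len(pngs))
            for png in pngs:
                with Image.open(png) as pil_image:
                    api.SetImage(pil_image)
                    text = text + api.GetUTF8Text()
            filename = base + '.txt'
            with open(filename, 'w', encoding='utf-8') as n:
                n.write(text)
            for png in pngs:
                os.remove(png)
            end = datetime.now()
            filesProcessed.append([file, len(pngs), start, end, end - start])
        except Exception as e:
            # Treat any failure as a bad input file and move on to the next one.
            print(f'{file} is a corrupt file ({e})')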
df =…
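On the open question of a better solution than ImageMagick, one possibility is to read the pages directly with Pillow, which exposes the pages of a multi-page TIFF as frames through `n_frames` and `seek`. The following is only a sketch under assumptions: the helper names `iter_pages` and `ocr_tiff` are hypothetical, the conversion to RGB is an assumption about which image modes work well with `SetImage`, and it has not been checked against the files described in this post.

```python
def iter_pages(image):
    """Yield an independent copy of every frame (page) in a Pillow image."""
    # Single-frame images may have no n_frames attribute, so default to 1.
    for index in range(getattr(image, "n_frames", 1)):
        image.seek(index)   # move to page `index`
        yield image.copy()  # copy so the page survives the next seek


def ocr_tiff(api, path):
    """Hypothetical helper: OCR all pages of one TIFF with a loaded API object."""
    from PIL import Image  # imported lazily; requires Pillow

    text = ""
    with Image.open(path) as tiff:
        for page in iter_pages(tiff):
            # Assumption: converting to RGB sidesteps unusual TIFF modes.
            api.SetImage(page.convert("RGB"))
            text += api.GetUTF8Text()
    return text
```

If this works for your files, it would replace the ImageMagick call, the scratch PNG files, and the sort, because the frames come out in page order.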