Comparing Tesseract OCR with Google Vision OCR for Text Recognition in Invoices

Ellen Schellekens

Apr 21, 2022

At IxorThink, we develop document processing algorithms specialised in invoices and receipts. We automatically extract key information from the documents, such as the supplier and the total cost. The first step of this process is to recognise and extract the text from the documents using Optical Character Recognition (OCR). Improving the quality of the extracted text is important to improve the whole document processing pipeline. With this use case in mind, we compare two OCR engines in this blog: Tesseract and Google Vision API.


Tesseract is the most prominent open-source OCR engine. Originally developed by Hewlett-Packard, it is now sponsored by Google. An active community continually works to improve it, and new versions are released regularly. Because it is open source and runs locally, the only cost of using Tesseract is the compute resources of the machine it runs on, and neither the document nor the results have to be sent over the internet. The LSTM model it uses can also be trained further on additional languages or fonts.

Tesseract can be run from the command line and also has a C++ API. Numerous wrappers make the API available in other programming languages, for example Python. In this comparison, we use the command-line interface.
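As a rough illustration of the command-line route, the sketch below calls the `tesseract` binary from Python and parses its TSV output into word-level boxes. The `eng` language setting and the helper names `run_tesseract_tsv` and `parse_words` are assumptions made for this example, not details of our pipeline.

```python
import csv
import io
import subprocess


def run_tesseract_tsv(image_path, lang="eng"):
    """Run the tesseract CLI and return its TSV output as a string.

    Assumes the `tesseract` binary is installed and on the PATH.
    """
    result = subprocess.run(
        ["tesseract", image_path, "stdout", "-l", lang, "tsv"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


def parse_words(tsv_text):
    """Keep only word-level rows (level 5) that contain text."""
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    words = []
    for row in reader:
        text = (row.get("text") or "").strip()
        if row["level"] != "5" or not text:
            continue
        left, top = int(row["left"]), int(row["top"])
        width, height = int(row["width"]), int(row["height"])
        words.append(
            {
                "text": text,
                # Box stored as (x0, y0, x1, y1) in pixels.
                "box": (left, top, left + width, top + height),
                "conf": float(row["conf"]),
            }
        )
    return words
```

Calling `parse_words(run_tesseract_tsv("invoice.png"))` on a hypothetical `invoice.png` would then give one dictionary per recognised word.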

The Google Vision API is part of Google Cloud and offers text detection among many other services. In contrast to Tesseract, there is a service cost: $1.50 per 1,000 units for the first 5 million units per month, and $0.60 per 1,000 units beyond that. The service can be used as an API, either by providing the image or PDF encoded as a base64 string, or by storing the file on Google Cloud Storage and providing the bucket information.
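To make the pricing concrete, here is a minimal sketch of the monthly cost under the two price tiers quoted above. The function name is hypothetical, and the calculation assumes that only these two tiers apply (any free allowance or later price change is ignored).

```python
def vision_monthly_cost(units):
    """Estimate the monthly cost in USD from the two tiers quoted in the text.

    Assumption: only the $1.50 and $0.60 per 1,000 unit tiers apply.
    """
    tier_limit = 5_000_000
    first_tier = min(units, tier_limit)
    second_tier = max(units - tier_limit, 0)
    return first_tier / 1000 * 1.50 + second_tier / 1000 * 0.60
```

For example, `vision_monthly_cost(200_000)` returns `300.0`, so 200,000 units in a month would cost about $300 under these assumptions.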

There are two annotation features: TEXT_DETECTION is designed to extract text from any image, while DOCUMENT_TEXT_DETECTION is optimised for dense text and documents. We found that while DOCUMENT_TEXT_DETECTION is a little better at capturing clean text lines and columns in the document, it also splits words on punctuation marks such as . , - and : . In our case, structured information such as invoice numbers and VAT numbers often contains punctuation marks, and we prefer to detect each of these as one word. This is why we use the TEXT_DETECTION feature, which does not split words excessively.
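The choice between the two features comes down to a single field in the request. The sketch below builds the JSON body for the REST `images:annotate` method with a base64-encoded image. The helper name is hypothetical, and authentication and the actual HTTP call are left out on purpose.

```python
import base64
import json


def build_annotate_request(image_bytes, feature="TEXT_DETECTION"):
    """Build the JSON body for a Vision API images:annotate call.

    Pass feature="DOCUMENT_TEXT_DETECTION" to switch to the dense-text mode.
    """
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {
        "requests": [
            {
                "image": {"content": encoded},
                "features": [{"type": feature}],
            }
        ]
    }


if __name__ == "__main__":
    # Placeholder bytes stand in for a real scanned invoice.
    body = build_annotate_request(b"not a real image")
    print(json.dumps(body, indent=2))
```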

Methodology
To compare the two OCR engines, we first created a test set of invoices. It includes hard cases such as skewed scans and documents with very small text. On each document, we first compare all word-level bounding boxes found by the two engines to find matches. When two bounding boxes overlap by at least 50%, we consider them to correspond to the same word. For the matched bounding boxes, we compute the Hamming distance between the recognised words. Besides quality, we also evaluate the time each engine needs.
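The matching step can be sketched as follows. Two points are assumptions on our part for the sake of the example: the 50% overlap is measured as intersection over union, and the Hamming distance counts a difference in length as extra mismatches so that it is defined for words of unequal length. The greedy matching strategy and all names are also illustrative.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def hamming(a, b):
    """Differing positions, plus the length difference (an assumption)."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def match_words(words_a, words_b, threshold=0.5):
    """Greedily pair each word in A with its best-overlapping unused word in B."""
    matches, unmatched_a, used = [], [], set()
    for wa in words_a:
        best_j, best_score = None, threshold
        for j, wb in enumerate(words_b):
            if j in used:
                continue
            score = iou(wa["box"], wb["box"])
            if score >= best_score:
                best_j, best_score = j, score
        if best_j is None:
            unmatched_a.append(wa)
        else:
            used.add(best_j)
            wb = words_b[best_j]
            matches.append((wa, wb, hamming(wa["text"], wb["text"])))
    unmatched_b = [wb for j, wb in enumerate(words_b) if j not in used]
    return matches, unmatched_a, unmatched_b
```

Each word is expected as a dictionary with a `"text"` string and a `"box"` tuple. The function returns the matched pairs with their distance, plus the words that only one of the two engines found.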

Results
One clear result is that Tesseract is faster than Google Vision, by 0.689 seconds on average. But do the recognised words differ a lot? The figure below shows a histogram of the Hamming distances between the matched words, where the ‘-1’ and ‘-2’ categories are unmatched words found only by Tesseract and only by Google Vision, respectively. Around 60% of the words are recognised exactly the same by both engines. For another 11%, there are small differences of one or two characters in the word. However, as many as 18% of the recognised words are found only by Google Vision and are not matched to any word found by Tesseract.

[Figure: histogram of the Hamming distances between matched words]
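The percentages in such a histogram can be tallied with a few lines of code. This is only a sketch with hypothetical names. It uses the same ‘-1’ and ‘-2’ categories for words found only by Tesseract and only by Google Vision.

```python
from collections import Counter


def distance_shares(matched_distances, only_tesseract, only_google):
    """Return the share of words per category, as fractions of all words.

    Categories: Hamming distance for matched words, -1 for words found
    only by Tesseract, -2 for words found only by Google Vision.
    """
    counts = Counter(matched_distances)
    counts[-1] += only_tesseract
    counts[-2] += only_google
    total = sum(counts.values())
    if total == 0:
        return {}
    return {key: counts[key] / total for key in sorted(counts)}
```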

When we look more closely at this category, we find that the problem lies with the small text that often sits at the bottom of invoices. The figure below shows an example. Google Vision is better at recognising these small letters, even when additional preprocessing is applied before Tesseract to improve the contrast of the letters. Since this small text sometimes contains useful information, for example about the supplier of the invoice, this is a dealbreaker for using Tesseract in this application.

[Figure: example of small text at the bottom of an invoice]

Tesseract OCR has many strengths, such as its low cost and high speed. Being in full control of the model and being able to train or fine-tune it further are additional advantages. However, the quality of the Google Vision OCR is still better, especially on difficult cases such as very small text. Since quality matters most to us, the Google Vision OCR wins the comparison for our use case.

At IxorThink, the machine learning practice of Ixor, we are constantly trying to improve our methods to create state-of-the-art solutions. As a software company, we can provide stable products from proof-of-concept to deployment. Feel free to contact us for more information.