Using OCR: How Accurate is Your Data? | TDWI (2024)

Using OCR: How Accurate is Your Data?

With effective use of confidence scoring, you can use OCR to automate some types of data ingestion.

  • By Greg Council
  • March 5, 2018

All organizations want accurate data to run their operations. Getting data into a useful format is the focus of significant industry attention, whether that data comes from social media, structured databases, or unstructured documents.

Leveraging Your Document Data

One popular technology used to process documents (the scanned variety) is optical character recognition (OCR). It has been around for decades, and its most common use is to convert an image into searchable text. Obviously, the accuracy of the conversion is important, and most OCR software provides 98 to 99 percent accuracy, measured at the page level. This means that in a page of 1,000 characters, 980 to 990 characters will be accurate. In most cases, this level of accuracy is acceptable.

What about putting data from documents to good use by extracting specific data and tagging it so it can be added to a database or be used as metadata describing a specific document? Operations such as accounting rely upon accurate data from invoices (such as the invoice number, date, quantities of items purchased, and taxes).

Does the 98 to 99 percent accuracy of full-page OCR translate to an adequate level of accuracy on data extraction from these documents? Absolutely not.

Accuracy Guarantees: What They Mean

If you need 99 percent accuracy at the data field level, relying on 99 percent page-level accuracy could lead to disaster. Take our 1,000-character page: an OCR engine might achieve 99 percent accuracy at the page level, but what if each of the 10 erroneous characters falls in a different one of the 20 data fields the business requires?

Suddenly, 99 percent page-level accuracy becomes 50 percent field-level accuracy, because 10 of the 20 fields contain an error. This is where field-level accuracy comes into play, using what's known as the field-level confidence score.
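The arithmetic in this example can be checked with a short Python sketch. The function names are illustrative only, and the three placements of the misread characters are hypothetical scenarios chosen to show how widely field-level accuracy can vary for the same page-level figure.

```python
def char_accuracy(total_chars: int, wrong_chars: int) -> float:
    """Share of characters on the page that were recognized correctly."""
    return (total_chars - wrong_chars) / total_chars


def field_accuracy(total_fields: int, fields_with_errors: int) -> float:
    """Share of fields that contain no misread character at all."""
    return (total_fields - fields_with_errors) / total_fields


# Figures from the example: 1,000 characters, 10 misreads, 20 required fields.
page_level = char_accuracy(1000, 10)

# Hypothetical placements of the same 10 misreads.
none_in_fields = field_accuracy(20, 0)   # every misread falls outside the fields
all_in_one = field_accuracy(20, 1)       # every misread falls in a single field
one_per_field = field_accuracy(20, 10)   # each misread spoils a different field

print(f"page level:  {page_level:.0%}")
print(f"field level: {none_in_fields:.0%}, {all_in_one:.0%}, {one_per_field:.0%}")
```

The page-level figure stays at 99 percent in all three scenarios, while field-level accuracy ranges from 100 percent down to 50 percent depending on where the errors land.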

Also keep in mind that page-level accuracy rates are often based upon good-quality scans. If your organization has to deal with faxed documents or documents with hard-to-read type, such as dot-matrix printer output, page-level accuracy will be much lower.

Why Confidence Scores Matter

In using a field-level confidence score, the main objective is to identify a "threshold" that separates good data from bad data. Good data is a "correct answer," meaning an accurate, literal transcription of the field as represented on the page. If the input document has a date of birth as 1/1/1970, the field into which the data is transcribed should contain 1/1/1970 as well.
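As a minimal sketch of how such a threshold might be applied, the Python below routes each extracted field either to automatic acceptance or to manual review. The `FieldResult` type, the `route_fields` function, the sample values, and the threshold of 80 are all assumptions made for illustration; they do not come from any particular OCR product.

```python
from dataclasses import dataclass


@dataclass
class FieldResult:
    name: str
    value: str
    confidence: int  # engine-specific score, not a probability


def route_fields(fields, threshold):
    """Split fields into those accepted automatically and those needing review."""
    accepted = [f for f in fields if f.confidence >= threshold]
    needs_review = [f for f in fields if f.confidence < threshold]
    return accepted, needs_review


# Hypothetical engine output for one document.
results = [
    FieldResult("date_of_birth", "1/1/1970", 92),
    FieldResult("invoice_number", "INV-00417", 60),
]

accepted, needs_review = route_fields(results, threshold=80)
print([f.name for f in accepted])      # fields at or above the threshold
print([f.name for f in needs_review])  # fields sent to a human
```

Only the fields below the threshold reach a person, which is what makes partial automation of data entry possible.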

A confidence score is assigned and output by the OCR engine for each field answer. The field-level confidence score uses the "raw" OCR character- and word-level scores and synthesizes them with other available information to arrive at a final score. This other information can be, for example, the expected data type (numerals, letters) and format (phone number versus credit card number).
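OCR vendors do not generally publish how they combine these signals, so the following Python is only one plausible sketch. It assumes the field score starts from the weakest character score and is then halved when the value does not match the expected format. The `field_confidence` function, the penalty factor, the date pattern, and the character scores are all hypothetical.

```python
import re

# Assumed format for a date field such as 12/5/2008.
DATE_PATTERN = re.compile(r"\d{1,2}/\d{1,2}/\d{4}")


def field_confidence(char_scores, value, pattern, mismatch_penalty=0.5):
    """Combine raw character scores with a format check (illustrative rule only)."""
    if not char_scores:
        return 0.0
    # A field is only as reliable as its least certain character.
    score = float(min(char_scores))
    # Lower the score when the value does not look like the expected data type.
    if pattern.fullmatch(value) is None:
        score *= mismatch_penalty
    return score


# Clean read: every character is confident and the value looks like a date.
clean = field_confidence([95, 97, 99, 90, 99, 96, 94, 98, 93], "12/5/2008", DATE_PATTERN)

# Doubtful read: the "5" was read as "S", so one score is low and the format fails.
doubtful = field_confidence([95, 97, 99, 62, 99, 96, 94, 98, 93], "12/S/2008", DATE_PATTERN)

print(clean, doubtful)
```

The format check pushes the doubtful answer further away from the clean one than the raw character scores alone would, which makes a single threshold easier to place between them.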

For instance, the OCR engine might output the "date of birth" value as 12/5/2008 along with a confidence score of 60. The field confidence scoring for each data element should output a consistent range of scores for correct answers; these scores should be higher than the scores for incorrect answers. Although confidence scores are used to distinguish likely correct answers from likely incorrect answers, confidence scores are not probabilistic -- a score of 60 does not mean that there is a 60 percent likelihood that the answer is correct.
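Because the scores are not probabilities, one common way to interpret them is to measure them against a sample of fields whose correct values are already known. The Python sketch below assumes such a labeled sample exists. The scores and correctness flags are invented for illustration, and the function names are hypothetical.

```python
def accuracy_at_threshold(samples, threshold):
    """Observed accuracy among answers scoring at or above the threshold."""
    accepted = [is_correct for score, is_correct in samples if score >= threshold]
    if not accepted:
        return None
    return sum(accepted) / len(accepted)


def lowest_threshold_meeting(samples, target):
    """Lowest observed score that, used as a threshold, meets the target accuracy."""
    for threshold in sorted({score for score, _ in samples}):
        accuracy = accuracy_at_threshold(samples, threshold)
        if accuracy is not None and accuracy >= target:
            return threshold
    return None


# Invented sample: (confidence score, whether the answer matched the page).
samples = [
    (95, True), (90, True), (85, True), (80, True), (75, False),
    (70, True), (60, True), (55, False), (40, False), (30, False),
]

print(accuracy_at_threshold(samples, 60))       # accuracy of answers scoring 60+
print(lowest_threshold_meeting(samples, 0.99))  # threshold for 99 percent accuracy
```

In this invented sample, answers scoring 60 or higher are correct about 86 percent of the time rather than 60 percent, and a threshold of 80 is needed before the accepted answers reach 99 percent accuracy.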
