Figure 1. The Production Capture Process
A production automated data entry application is composed of several separate modules. These modules can be run on a single workstation or, in high-volume environments, can be run on multiple stations on a LAN. The following diagram shows a typical batch processing system.
Phases of Data Capture
Operation | Definition | Benefits
Document Preparation | Sort documents, remove staples, prepare batches, etc. | Protects scanning equipment. Marks the beginning of the organizational process.
Scanning | Converts paper documents into electronic files, typically PDF or Group 4 TIFF images. | Improves efficiency.
Recognition | Automatically extracts data (from a form) or index information (from a document). | Automated recognition can reduce, or in some cases eliminate, labor-intensive manual keying costs by automatically extracting data from a form or document. The validation process is simplified because validation operators check the results of the recognition process instead of keying information from scratch.
Verification | Data Capture: Validates the results of automated recognition performed on a form. | Cost saving methods: Use bar codes, optical character recognition, intelligent character recognition, optical mark recognition and advanced scripting techniques to automate the extraction of form data. For manual keying or data validation, make sure input screens are designed for efficient "heads up" operation and can keep up with professional keyboard operators.
Release | Export images to long-term storage and data to a database or back-end workflow or document management system. | Cost saving methods: Capture software should support release of documents to standard optical systems and common SQL databases. Integration with popular workflow and document management applications should be quick and easy.
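The phases above can be pictured as a batch being handed from one module to the next. The following is a minimal sketch only: the Batch class, the STAGES list and the stage names are hypothetical illustrations and do not come from any particular capture product.

```python
from dataclasses import dataclass, field

# Stage names mirror the phases in the table above (hypothetical identifiers).
STAGES = ["preparation", "scanning", "recognition", "verification", "release"]


@dataclass
class Batch:
    """A hypothetical batch record that is passed from module to module."""
    name: str
    pages: list = field(default_factory=list)
    stage_index: int = 0  # position of the current phase in STAGES

    @property
    def stage(self) -> str:
        return STAGES[self.stage_index]

    def advance(self) -> str:
        """Hand the batch to the next module; release is the final phase."""
        if self.stage_index == len(STAGES) - 1:
            raise ValueError(f"batch {self.name!r} is already in the release phase")
        self.stage_index += 1
        return self.stage


batch = Batch("batch-001")
while batch.stage != "release":
    print(batch.name, "->", batch.advance())
```

Because each module only needs the batch record and its current stage, the same idea works whether the modules run on one workstation or on several stations that share the batch over a LAN.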
Scanning and Importing
The Scan module is used to create batches, scan and import documents, process bar codes and patch codes, and perform page-based image cleanup and image enhancement. Users can also edit the contents of a batch before releasing it to the next process.
Recognition
If a document or form has well-defined fields, it is possible to reduce manual keying by using automated recognition techniques such as OCR, ICR, OMR or bar codes to read zones on the document and automatically convert them into data.
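As an illustration of reading a zone and converting it into data, the sketch below treats a page as a grid of 0 (white) and 1 (dark) pixels and decides whether a checkbox zone is filled, which is the simplest form of mark recognition. The Zone class, the function names and the 0.3 threshold are assumptions made for this example, not settings from a real recognition engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Zone:
    """A hypothetical rectangular zone on a page, measured in pixels."""
    name: str
    left: int
    top: int
    width: int
    height: int


def crop(page, zone):
    """Return the part of the page (a list of rows of 0/1) covered by the zone."""
    rows = page[zone.top:zone.top + zone.height]
    return [row[zone.left:zone.left + zone.width] for row in rows]


def is_marked(page, zone, threshold=0.3):
    """Report the zone as filled when its share of dark pixels reaches threshold."""
    pixels = [p for row in crop(page, zone) for p in row]
    if not pixels:
        raise ValueError(f"zone {zone.name!r} lies outside the page")
    return sum(pixels) / len(pixels) >= threshold


page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]
zones = [Zone("yes", 1, 1, 2, 2), Zone("no", 4, 1, 2, 2)]
data = {zone.name: is_marked(page, zone) for zone in zones}
print(data)
```

OCR, ICR and bar code zones follow the same pattern: the zone definition says where to look, and a recognition routine turns the cropped pixels into a field value.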
Verification
Data is then verified for accuracy and for unreadable characters. Data validation is the most critical step in the data capture process. Several methods are used to reduce operator errors and speed the validation process:
• Custom verification scripts can be configured to fill fields on the data entry form with default values.
• Validation scripts can also be used to detect errors in both manually keyed data and OCR, ICR, OMR or bar code data. For example, if a data field is a telephone number, a validation script can require that every character be a digit, which catches cases where OCR or ICR has mistaken the digit 1 for the letter l. Validation scripts can also be used to verify the value of a data field against an external database.
• For data fields in which 100% accuracy is essential, secondary verification can be specified. After a batch of documents has been validated, the batch is routed to a second operator, who re-keys the specified data fields. Any data fields that do not match are flagged as errors and must be re-keyed. This method of double key entry is the most reliable way to ensure the accuracy of document and form data.
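Two of the methods above, the digits-only telephone check and double key entry, can be sketched in a few lines. This is an illustration under stated assumptions: the function names, the set of separator characters that are ignored, and the dictionary layout of a keyed record are all hypothetical.

```python
import re


def validate_phone(value):
    """Return a list of error messages; an empty list means the field passed."""
    # Assumption: spaces, parentheses, dots and hyphens are allowed separators.
    compact = re.sub(r"[ ().-]", "", value)
    bad = sorted({ch for ch in compact if ch not in "0123456789"})
    if bad:
        return ["non-numeric characters: " + "".join(bad)]
    return []


def double_key_mismatches(first_pass, second_pass, fields):
    """Return the names of fields whose two independently keyed values differ."""
    return [name for name in fields
            if first_pass.get(name) != second_pass.get(name)]


# The letter l read in place of the digit 1 is caught by the digits-only rule.
print(validate_phone("555-0l23"))
print(validate_phone("(555) 0123"))

first = {"phone": "5550123", "zip": "02139"}
second = {"phone": "5550123", "zip": "02189"}
print(double_key_mismatches(first, second, ["phone", "zip"]))
```

Fields returned by double_key_mismatches would be the ones flagged for re-keying.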
Types of Data Extraction
• Form ID is used to identify a particular form, resulting in specific fields being automatically recognized and specific image cleanup being applied. This allows the index/validation operator to simply check the accuracy of the automated recognition results rather than manually typing the required data on the data entry form.
• OCR is used to automatically fill data fields, thus allowing index operators to simply check the accuracy of OCR fields rather than manually typing the required data on the data entry form.
• ICR is similar to OCR but recognizes hand-printed characters. It is generally less accurate than OCR but can produce good results if the characters are constrained within boxes or if they are limited to numeric characters.
• OMR (Optical Mark Recognition) is used to automatically recognize checkboxes, bubbles, and other filled in marks on a form.
• Bar code recognition is a highly reliable method for extracting data from documents. Bar codes can be recognized either in a predefined zone or on the entire page. If page level bar code recognition is used, bar codes do not have to be present at specific places. Instead, every bar code on the page is recognized and then associated with data fields in the order in which they are read.
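Page-level bar code recognition, as described in the last item, amounts to pairing decoded values with data fields in read order. The sketch below assumes the bar codes have already been decoded into a list of strings; the function name and the handling of missing or extra bar codes are assumptions for illustration.

```python
from itertools import zip_longest


def assign_barcodes(values, field_names):
    """Pair decoded bar code values with data fields in the order they were read.

    Assumptions: fields with no bar code are left as None, and more bar codes
    than fields is treated as an error for an operator to review.
    """
    if len(values) > len(field_names):
        raise ValueError(
            f"{len(values)} bar codes read but only {len(field_names)} fields defined"
        )
    return dict(zip_longest(field_names, values))


fields = ["account_number", "document_type", "page_count"]
print(assign_barcodes(["A-10442", "INVOICE"], fields))
```

Because the association depends only on read order, the bar codes can appear anywhere on the page, but the order in which they are printed and read must stay consistent from document to document.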
Image Cleanup
There are several techniques that can make images more readable and increase OCR accuracy. The most effective ones include:
• Deskew: This technique straightens pages that have been scanned slightly crooked due to mechanical tolerances in the scanner's document feeder. Deskewing can increase recognition accuracy by 15-20% or more, which can make the difference between using expensive manual keying and automated recognition technology.
• Deshade: Recognition engines are unable to process words against the gray shaded backgrounds that are common on forms. Removing shading allows you to recognize zones that are otherwise unreadable.
• Despeckle and streak removal: These techniques remove small speckles and streaks caused by dirt in the scanner feeder or noise in the scanner.
• Line removal: On typewritten forms, words are frequently typed so that they cross over the lines on the form, which makes them unreadable to automated recognition processes. Line removal erases the lines on the image and then reconstructs the characters so they can be recognized.
• Edge enhancement: This is a set of filters that sharpens the edges of characters. The results are usually invisible to the eye, but they can increase recognition accuracy by as much as 5-10%.
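To give a feel for how one of these techniques works, here is a minimal despeckle sketch. It treats the image as a grid of 0 (white) and 1 (dark) pixels with rows of equal length, and clears any connected group of dark pixels that is no larger than a size limit. The function name, the 4-connected neighbor rule and the default limit are assumptions for illustration; commercial cleanup filters are more refined.

```python
def despeckle(image, max_size=2):
    """Return a copy of the image with small connected dark groups cleared."""
    height = len(image)
    width = len(image[0]) if image else 0
    result = [row[:] for row in image]
    seen = [[False] * width for _ in range(height)]

    for y in range(height):
        for x in range(width):
            if image[y][x] != 1 or seen[y][x]:
                continue
            # Flood fill (up, down, left, right) to collect one dark group.
            component, stack = [], [(y, x)]
            seen[y][x] = True
            while stack:
                cy, cx = stack.pop()
                component.append((cy, cx))
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and image[ny][nx] == 1 and not seen[ny][nx]):
                        seen[ny][nx] = True
                        stack.append((ny, nx))
            # Groups at or below the size limit are treated as speckles.
            if len(component) <= max_size:
                for cy, cx in component:
                    result[cy][cx] = 0
    return result


scanned = [
    [1, 0, 0, 0, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1],
]
for row in despeckle(scanned):
    print(row)
```

In this sample the two isolated pixels in the corners are cleared, while the four-pixel block, standing in for part of a character, is kept.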