Using OCR to generate searchable text from scanned documents

April 19, 2012

Digitizing a magazine article or a printed contract is a common need. We could spend hours retyping the text and then correcting typing errors, or we could convert all the required material into digital form in a few minutes using a scanner and Optical Character Recognition (OCR) software. Although scanning pages can be an expensive and time-consuming undertaking, the benefits are huge.

OCR is the process of converting different types of documents, such as scanned paper documents, PDF files, or images captured by a digital camera, into text or word processing files that can be easily edited and stored.

OCR is also a field of research in pattern recognition, artificial intelligence, and machine vision. It has been used to enter data automatically into a computer for dissemination and processing, and it allows printed material to be stored in much less space than the hard copy originals. OCR technology has made a huge impact on the way information is stored, shared, and edited. Before OCR, if someone wanted to turn a book into a word processing file, each page had to be typed word for word.

OCR technology requires both hardware and software. A scanner creates an image, or snapshot, of the document that is nothing more than a collection of black and white or color dots, known as a raster image or bitmap. This image, however, cannot be edited as text. To extract and re-purpose data from scanned documents, camera images, or image-only PDFs, you need OCR software that singles out the letters in the image, assembles them into words, and then assembles the words into sentences. In this way the bitmap is translated into computer text, and you can access and edit the content of the original document.
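To make the idea of singling out letters in a raster image more concrete, here is a minimal Python sketch. It is only an illustration, not how any particular OCR product works: the tiny `RASTER` grid, the `TEMPLATES` table, and the function names are all made up for this example, and real OCR software uses far more robust segmentation and recognition than exact template matching.

```python
# A toy raster image: '#' is a dark dot, '.' is a light dot.
# It holds two hypothetical glyphs separated by blank columns.
RASTER = [
    "#.#..###",
    "#.#...#.",
    "###...#.",
    "#.#...#.",
    "#.#..###",
]

# Hypothetical glyph templates (illustrative only), keyed by pixel rows.
TEMPLATES = {
    ("#.#", "#.#", "###", "#.#", "#.#"): "H",
    ("###", ".#.", ".#.", ".#.", "###"): "I",
}


def segment_columns(raster):
    """Return (start, end) column ranges that contain at least one dark dot."""
    width = len(raster[0])
    has_ink = [any(row[x] == "#" for row in raster) for x in range(width)]
    spans, start = [], None
    for x, ink in enumerate(has_ink):
        if ink and start is None:
            start = x                      # a glyph begins here
        elif not ink and start is not None:
            spans.append((start, x))       # a blank column ends the glyph
            start = None
    if start is not None:
        spans.append((start, width))       # glyph touching the right edge
    return spans


def recognize(raster):
    """Cut the raster into glyphs and look each one up in TEMPLATES."""
    chars = []
    for start, end in segment_columns(raster):
        glyph = tuple(row[start:end] for row in raster)
        chars.append(TEMPLATES.get(glyph, "?"))  # '?' marks an unknown glyph
    return "".join(chars)


print(recognize(RASTER))
```

The sketch splits the image wherever a whole column is blank and then compares each piece with known shapes. Anything it cannot match is marked with `?` rather than guessed.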

OCR has never achieved a read rate of 100%. While it has made huge advances in recent years, it still does not perform well in recognizing handwriting or fonts that resemble handwriting. Of even greater concern is the problem of misreading a character. There will continue to be a need to improve the accuracy of reading through OCR.
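One way to put a number on accuracy is to compare the OCR output with a hand-checked transcription of the same page. The Python sketch below uses edit distance (the number of single-character insertions, deletions, and substitutions) for that comparison. This is a commonly used measure, but it is an assumption made here for illustration and not a definition taken from any specific OCR product; the function names are also made up for this example.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # keep or substitute
        prev = curr
    return prev[-1]


def character_accuracy(truth, ocr_output):
    """Fraction of the reference text that was read correctly (0.0 to 1.0)."""
    if not truth:
        return 1.0 if not ocr_output else 0.0
    errors = edit_distance(truth, ocr_output)
    return max(0.0, 1.0 - errors / len(truth))


# "m" misread as "rn" is a typical character confusion.
print(character_accuracy("form", "forrn"))
```

In this example the word "form" is read as "forrn", which costs two edits against four reference characters, so the score is 0.5. A misread like this also matters for search: a query for "form" would not find the page.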

iPCNetworking