This lesson explains how you can use Optical Character Recognition (OCR) to turn scanned (bitmapped) documents into documents whose text can be searched, selected and copied. OCR can also be used to restore text on otherwise normal (vector) PDF documents whose textual contents, although clearly visible on the page, have been lost or garbled. In fact, PDF Nomad will OCR any PDF document (for which you have printing and copying privileges), although this feature may be of limited use in the case of normal PDF documents whose textual contents are still intact.
The OCR option is available under the Tools menu. When you select it, a setup dialog appears, where you can set up the OCR operation and decide how the results are to be used.
At the top left is the thumbnail pane. Here you can browse the document's pages. Click a thumbnail to select the page to be shown in the preview canvas. The preview canvas shows pages as they will be presented to the OCR engine. Depending on the quality and characteristics of the original page, you may need to adjust some settings to achieve optimal results.
Click the Preparation button to bring up a popover that allows changes to the brightness, contrast and background level. Brightness is useful, for instance, when the text on the page is relatively light. In that case it may be best to lower the brightness, so that the text becomes darker and clearer. If a page has a lightish but uneven background, increasing the contrast often evens out the background, yielding better results during the recognition process. The background level is a value between 1 and 255 that indicates up to what level of lightness pixels are considered background. Lowering the background level often has the effect of making the text heavier and fuller, while increasing the background level tends to make text lighter and thinner. Depending on the original, adjusting this setting may yield better results.
For some documents you may need to use an interplay between the three settings to achieve optimal results. Generally speaking, if the text in the preview canvas looks clear and crisp, the results tend to be good. If the original is of poor quality, you may want to perform a test run on a single page, and redo the OCR with adjusted settings until you achieve satisfactory results.
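To see why these adjustments help, here is a minimal Python sketch of brightness and contrast applied to single gray values (0 is black, 255 is white). It is an illustration only: the function name adjust and the formula are assumptions, not PDF Nomad's actual implementation, and the background level is left out because its exact mapping is not described here.

```python
def adjust(pixel, brightness=0, contrast=1.0):
    """Return a 0-255 gray value after a hypothetical brightness/contrast step.

    contrast stretches values away from mid-gray (128);
    brightness then shifts the result up (lighter) or down (darker).
    """
    value = (pixel - 128) * contrast + 128 + brightness
    # Clamp to the valid gray range.
    return max(0, min(255, int(round(value))))


# Relatively light text becomes darker when the brightness is lowered.
light_text = 150
print(adjust(light_text, brightness=-40))

# A lightish but uneven background flattens out when the contrast is raised,
# while a dark text pixel is pushed further toward black.
uneven_background = [200, 215, 230]
print([adjust(p, contrast=2.0) for p in uneven_background])
print(adjust(60, contrast=2.0))
```

In this toy model the three different background values all end up at the same white level, which is the effect described above for uneven backgrounds.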
PDF Nomad currently offers more than 60 languages with its OCR engine (plus fraktur variants for Danish and German). You can select one or more languages to be used during the OCR step. Selecting multiple languages will slow down the OCR engine, so make sure to only select the languages that are present in the document.
The OCR Options popover is where you decide what parts of the document's pages to recognize, and what to do with the results of the OCR.
You can:
• Make pages searchable: the recognized text will be added in a layer behind the original pages, so that the appearance of the page remains intact, but text can be searched, selected and copied. You can choose one of three settings for the resolution of the final page. The low resolution option creates the searchable pages with nominal screen resolution (72 dpi). The medium resolution option results in pages with double the nominal screen resolution (144 dpi), and the high resolution option creates pages with four times the nominal screen resolution (288 dpi). The higher the resolution, the larger the final file size.
• Replace original with recognized text: the original pages are replaced with pages containing the freshly drawn recognized text. This results in a significantly decreased file size, but the appearance of the text will be altered.
• Export recognized text to RTF document: the pages of the document will not be touched. Instead, at the end of the OCR process, you can save the recognized text to an RTF document, which you can then import into a text editor or word processor.
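The effect of the resolution setting on file size can be estimated with a little arithmetic. The Python sketch below assumes a US Letter page (612 × 792 points, with 72 points per inch) and the dpi values of 72, 144 and 288 quoted for the three settings; the function name pixel_size is made up for this example. It only counts raw pixels, so the real file size will also depend on how the page images are compressed.

```python
POINTS_PER_INCH = 72  # PDF page sizes are measured in points


def pixel_size(width_pt, height_pt, dpi):
    """Return the (width, height) in pixels of a page rendered at dpi."""
    scale = dpi / POINTS_PER_INCH
    return (round(width_pt * scale), round(height_pt * scale))


# Assumed example page: US Letter, 612 x 792 points.
base_w, base_h = pixel_size(612, 792, 72)
for name, dpi in [("low", 72), ("medium", 144), ("high", 288)]:
    w, h = pixel_size(612, 792, dpi)
    factor = (w * h) / (base_w * base_h)
    print(f"{name}: {w} x {h} pixels ({factor:g}x the pixels of the low setting)")
```

Because both the width and the height grow with the resolution, the pixel count grows with the square of the dpi value.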
You can ask PDF Nomad to OCR all pages in the document, only the pages selected in the page list, or only the current page in the main document's preview canvas.
Areas to recognize: here you have the option to apply the OCR only to specific regions of the pages. By default, the whole page (Media box) is OCRed. You can select one or more of the secondary display boxes to limit the OCR to the areas they encompass. If you set the boxes before OCRing, you can limit the OCR to very specific areas of the page.
Allow corrections before finalizing: if this option is not checked, the OCR engine will go straight from the recognition phase to finalizing the pages. This may be useful if the original is of high quality and you are confident the results will be satisfactory. With this option checked, you will be offered a preview of the OCR results, with the ability to correct mistakes, before the pages are finalized.
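As a rough illustration of limiting recognition to a box, the Python sketch below clips a display box to the Media box. Boxes are written as (left, bottom, right, top) in points, and the function name clip_to_media_box is hypothetical. The PDF specification treats a display box that extends past the Media box as reduced to its intersection with it; whether PDF Nomad computes the OCR area exactly this way is an assumption.

```python
def clip_to_media_box(media_box, display_box):
    """Return the part of display_box that lies inside media_box, or None.

    Both boxes are (left, bottom, right, top) tuples in points.
    """
    left = max(media_box[0], display_box[0])
    bottom = max(media_box[1], display_box[1])
    right = min(media_box[2], display_box[2])
    top = min(media_box[3], display_box[3])
    if left >= right or bottom >= top:
        return None  # the boxes do not overlap, so there is nothing to OCR
    return (left, bottom, right, top)


media = (0, 0, 612, 792)          # assumed US Letter Media box
header_area = (36, 700, 576, 900)  # example box reaching past the top edge
print(clip_to_media_box(media, header_area))
```

Here the example box is trimmed at the top edge of the page, so only the strip that actually lies on the page would be recognized.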
Click the OK button to start the OCR process.
If you allow corrections before finalizing, PDF Nomad shows the recognized text overlaid on the original, with a pane next to it that shows the text as recognized. You can use the pane on the right to quickly find mistakes, which can then be corrected in the main preview canvas.
The Scope popup allows corrections to whole lines, individual words, or single characters (symbols). You can drag items around to reposition them (or nudge selected items with the arrow keys), and you can double-click words or symbols to edit them. Press the backspace key to delete selected items. Control-click (or right-click) on the page to see a popup menu with options that can be applied to the selected items.
Once you have finished checking the document for mistakes, and made any necessary corrections, click the Finalize button to finalize the OCR process.