Optical Character Recognition
KeyView performs Optical Character Recognition (OCR) on raster image files to attempt to filter text that might be visible in the image. OCR is available only on 64-bit Windows.
You can enable or disable OCR by calling the setOcr method of the Filter class.
Optimize OCR Performance
The default settings for OCR attempt to detect as much text as possible. For example, KeyView attempts to detect text in multiple languages and alphabets, and rotated text in increments of 90 degrees from upright. This increases the amount of text that can be detected, prioritizing recall over processing time.
If you know what you will be processing in advance, you can specify OCR options to improve performance.
To configure OCR through the Java API, call the filter.setOcr method.
For example, if the input is scanned pages that contain only English or only Japanese text, the following configuration could result in a performance improvement. However, it may fail to recognize text in some images such as landscape pages where the text is not upright.
filter.setOcr(new OCROptions("en ja", OCROptions.Orientation.UPRIGHT, OCROptions.DetectAlphabet.LISTED));
Languages
OCR supports many different languages. For a list of supported languages, see OCR Supported Languages. If you know that your files only contain text in a certain language or a small number of languages, you can improve both processing speed and accuracy by configuring OCR with this information.
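As a minimal sketch, the helper below builds a language list string from a collection of language codes. It assumes that the languages are passed as a space-separated string of lowercase codes, as in the "en ja" example above. The class OcrLanguageList and its join method are hypothetical names used for illustration and are not part of the KeyView API.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Hypothetical helper, not part of the KeyView API. */
class OcrLanguageList {

    /**
     * Joins language codes into a space-separated string, dropping blanks
     * and duplicates while keeping the original order.
     * Assumption: codes are lowercase, as in the "en ja" example.
     */
    static String join(List<String> codes) {
        Set<String> unique = new LinkedHashSet<>();
        for (String code : codes) {
            String normalized = code.trim().toLowerCase(Locale.ROOT);
            if (!normalized.isEmpty()) {
                unique.add(normalized);
            }
        }
        if (unique.isEmpty()) {
            throw new IllegalArgumentException("At least one language code is required");
        }
        return String.join(" ", unique);
    }
}
```

Keeping the list as short as possible matters more than how it is built, because each additional language adds processing time.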
Orientation
By default, OCR attempts to detect text that appears rotated, in 90-degree increments from upright. This means that KeyView can filter text from an image, even if it has been rotated or was scanned upside-down. If you know that your images contain only upright text, you can improve processing speed by disabling this feature.
Alphabet Detection
Sometimes, if you do not know the language of the input text in advance of processing, you might specify multiple languages. OCR requires more processing time for each additional language, especially when the languages span multiple alphabets (Latin, Cyrillic, Chinese, Arabic, and so on).
You can configure OCR to detect the alphabet for each image, before attempting to recognize characters. You can choose one of the following options.
Off. By default, OCR does not detect the alphabet. Use this option when you have specified a single language or multiple languages that use the same alphabet. Micro Focus also recommends this option when you expect an image to use multiple alphabets (for example, when there is English and Arabic text on the same page).
Listed. OCR detects the alphabet, but only considers alphabets that are represented in your chosen list of languages. This option can reduce the time required to recognize characters, because languages that do not match the detected alphabet are ignored. For example, if you set languages="en ja ko" (English, Japanese, and Korean) and OCR detects the Latin alphabet, OCR ignores the Japanese and Korean languages. Micro Focus recommends using this option when each source image uses a single alphabet, and the list of possible languages is known but spans multiple alphabets.
Any. OCR detects the alphabet that is used, and considers all alphabets. This option can reduce the time required to recognize characters, because languages that do not match the detected alphabet are ignored. If none of your chosen languages match the detected alphabet, OCR does not recognize characters and there is no output. Micro Focus recommends using this option instead of Listed when you want to reject images that do not match any of the specified languages.
If your input contains Chinese, Japanese, or Korean text with some ASCII characters, you can safely set this parameter to any of the available options, because OCR includes ASCII characters for those languages.
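The effect of the three alphabet detection options can be illustrated with a small toy model. The sketch below does not call KeyView. The class AlphabetDetectionSketch, its Mode enum, and the language-to-alphabet table are all assumptions made for illustration, and the model ignores the ASCII handling for Chinese, Japanese, and Korean described above. Given the chosen languages, a mode, and the alphabet that a detector reported, it returns the languages that recognition would still consider.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Toy model of the alphabet detection options; not part of the KeyView API. */
class AlphabetDetectionSketch {

    enum Mode { OFF, LISTED, ANY }

    // Illustrative language-to-alphabet table (an assumption, not KeyView data).
    static final Map<String, String> ALPHABET_OF = Map.of(
            "en", "Latin", "fr", "Latin", "ru", "Cyrillic",
            "ar", "Arabic", "ja", "Japanese", "ko", "Korean");

    /**
     * Returns the languages that would still be considered after alphabet detection.
     *
     * @param languages space-separated language codes, for example "en ja ko"
     * @param mode      the alphabet detection option
     * @param detected  the alphabet reported by the (hypothetical) detector
     */
    static List<String> languagesToTry(String languages, Mode mode, String detected) {
        List<String> chosen = List.of(languages.trim().split("\\s+"));
        if (mode == Mode.OFF) {
            // No detection: every chosen language is tried.
            return chosen;
        }
        List<String> matching = new ArrayList<>();
        for (String lang : chosen) {
            if (detected.equals(ALPHABET_OF.get(lang))) {
                matching.add(lang);
            }
        }
        if (mode == Mode.LISTED && matching.isEmpty()) {
            // LISTED only considers alphabets from the chosen languages,
            // so the detector cannot report an alphabet outside that set.
            throw new IllegalArgumentException(
                    "LISTED cannot detect an alphabet outside the chosen languages: " + detected);
        }
        // For ANY, an empty result models the "no output" case.
        return matching;
    }
}
```

In this model, languagesToTry("en ja ko", Mode.LISTED, "Latin") keeps only "en", which mirrors the example given for the Listed option, while Mode.ANY with a detected alphabet that matches none of the chosen languages returns an empty list.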