Export has special configuration options that allow greater control over the conversion of PDF files. These options can improve the fidelity and accuracy of the HTML output.
Two graphic-based PDF readers are available. The readers display PDFs by converting each page of the PDF to an image. If you do not want to redistribute the Acrobat Reader with your application, you can use a graphic-based reader instead.
The two readers support different features. Choose the appropriate reader depending on your requirements:
The kppdfrdr reader supports highlighting, annotation, and several other features but also has several graphical limitations.
The kppdf2rdr reader produces high-fidelity raster images but is a viewer only and does not support highlighting or other features.
The kppdfrdr graphic-based reader has the following features:
supports vector images
supports rotation and scaling
supports multibyte and bidirectional text
allows you to search text in the output
The kppdfrdr reader has the following limitations:
Embedded fonts in a PDF file are not translated correctly. They are usually displayed using the question mark (?) replacement character.
If an unsupported font is encountered during conversion, the default font, Times New Roman, is substituted.
For raster images, supports 180-degree rotation only.
Supports the following color spaces: DeviceRGB, DeviceGray, DeviceCMYK, CalGray, and CalRGB. Indexed color spaces are supported as long as they are used with a supported basic color space.
Does not support hyperlinks.
The kppdf2rdr graphic-based reader produces high-fidelity raster images. However, it has the following limitations:
Does not support anything beyond viewing, such as text searching.
Does not support PDFs that contain XFA forms content.
By default, the Acrobat control is used to convert PDF documents. Use the following procedure to specify that one of the graphic-based readers be used to convert PDF documents.
To specify the graphic-based reader
Open the formats_e.ini file with a text editor. The file is installed in the root of the Windows directory.
In the [HiFi] section, set the following parameter to the graphic-based reader you want to use. Set one of the following values:
For the kppdfrdr reader:
200=kppdfrdr
This is the default setting.
For the kppdf2rdr reader:
200=kppdf2rdr
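The manual edit above can also be scripted. The following Java sketch is an illustration only: it rewrites the 200 entry in the [HiFi] section of INI text held in memory. The class and method names (HiFiReaderConfig, setHiFiReader) are hypothetical and are not part of the Export API; only the section name, the key, and the reader names come from the procedure above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the Export API.
class HiFiReaderConfig {

    // Returns a copy of the INI lines with "200=<reader>" set in the [HiFi]
    // section. The section and the key are added if they are missing.
    static List<String> setHiFiReader(List<String> iniLines, String reader) {
        List<String> out = new ArrayList<>();
        boolean inHiFi = false;
        boolean written = false;
        for (String line : iniLines) {
            String trimmed = line.trim();
            if (trimmed.startsWith("[") && trimmed.endsWith("]")) {
                // Leaving [HiFi] without having seen the key: add it first.
                if (inHiFi && !written) {
                    out.add("200=" + reader);
                    written = true;
                }
                inHiFi = trimmed.equalsIgnoreCase("[HiFi]");
                out.add(line);
            } else if (inHiFi && trimmed.startsWith("200=")) {
                // Replace the first 200 entry and drop any duplicates.
                if (!written) {
                    out.add("200=" + reader);
                    written = true;
                }
            } else {
                out.add(line);
            }
        }
        if (!written) {
            if (!inHiFi) {
                out.add("[HiFi]");
            }
            out.add("200=" + reader);
        }
        return out;
    }

    public static void main(String[] args) {
        // Sample content only; a real formats_e.ini file contains more entries.
        List<String> ini = List.of("[Formats]", "200=lpdf", "[HiFi]", "200=kppdfrdr");
        setHiFiReader(ini, "kppdf2rdr").forEach(System.out::println);
    }
}
```

The sketch works on a list of lines so that reading and writing the file stay under your control, for example to keep a backup of the original formats_e.ini before you overwrite it.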
Set the CFG_SETHIFIPDF field in the HtmlExport class.
Export allows you to convert each page of a PDF document to a raster image, providing a high-fidelity conversion of the document.
The output format depends on the value of setOutputRasterGraphicType in HtmlOptionInfo.
On UNIX and Linux, the conversion of PDFs to JPEG uses the Java program kvraster.class. This Java program requires some setup. See Display Vector Graphics on UNIX and Linux.
To specify the graphic-based reader for converting PDF documents
Specify the graphic-based reader you want to use.
Create an instance of the ConfigOption class. Set the OptionType argument to CFG_SETHIFIPDF, and the OptionValue argument to 1.
Call the setConfigOption method and pass the ConfigOption object.
Call a convert method. See the Javadoc in the directory install\javaapi\javadoc, where install is the path name of the Export installation directory.
The HtmlConvFileToFile sample program demonstrates how to use the setConfigOption() method. See HtmlConvFileToFile.
The PDF format is primarily designed for presentation and printing of brochures, magazines, forms, reports, and other materials with complex visual designs. Most PDF files do not contain the logical structure of the original document—the correct reading order, for example, and the presence and meaning of significant elements such as headers, footers, columns, tables, and so on.
KeyView can convert a PDF file either by using the file's internal unstructured paragraph flow, or by applying a structure to the paragraphs to reproduce the logical reading order of the visual page. Logical reading order enables KeyView to output PDF files that contain languages that read from right to left (such as Hebrew and Arabic) in the correct reading direction.
NOTE: The algorithm used to reproduce the reading order of a PDF page is based on common page layouts. The paragraph flow generated for PDFs with unique or complex page designs might not emulate the original reading order exactly.
For example, page design elements such as drop caps, callouts that cross column boundaries, and significant changes in font size might disrupt the logical flow of the output text.
By default, KeyView produces an unstructured text stream for PDF files. This means that PDF paragraphs are extracted in the order in which they are stored in the file, not the order in which they appear on the visual page. For example, a three-column article could be output with the headers and the title at the end of the output file, and the second column extracted before the first column. Although this output does not represent a logical reading order, it accurately reflects the internal structure of the PDF.
You can configure KeyView to produce a structured text stream that flows in a specified direction. This means that PDF paragraphs are extracted in the order (logical reading order) and direction (left-to-right or right-to-left) in which they appear on the page.
The following paragraph direction options are available.
The paragraph direction options control the direction of paragraphs on a page; they do not control the text direction in a paragraph. For example, suppose that a PDF file contains English paragraphs in three columns that read from left to right, but 80% of the second paragraph contains Hebrew characters. If the left-to-right logical reading order is enabled, the paragraphs are ordered logically in the output (title paragraph, then paragraph 1, 2, 3, and so on) and flow from the top left of the first column to the bottom right of the third column. However, the text direction of the second paragraph is determined independently of the page by the PDF reader, and is output from right to left.
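The difference between stored order and logical reading order can be pictured with a toy model. The following Java sketch is not KeyView's algorithm; it assumes that each paragraph already carries a hypothetical column index and vertical position, and it ignores titles, headers, and elements that span columns. The class, field, and method names are invented for illustration.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Illustration only: a toy model of paragraph ordering, not KeyView's algorithm.
class ReadingOrderSketch {

    // A paragraph with a hypothetical column index (0 = leftmost column)
    // and a vertical position (0 = top of the page).
    static final class Para {
        final String text;
        final int column;
        final int top;

        Para(String text, int column, int top) {
            this.text = text;
            this.column = column;
            this.top = top;
        }
    }

    // Orders paragraphs column by column, top to bottom within each column.
    // When rightToLeft is true, the rightmost column comes first.
    static List<String> logicalOrder(List<Para> stored, boolean rightToLeft) {
        Comparator<Para> byColumn = Comparator.comparingInt(p -> p.column);
        if (rightToLeft) {
            byColumn = byColumn.reversed();
        }
        List<Para> sorted = new ArrayList<>(stored);
        sorted.sort(byColumn.thenComparingInt(p -> p.top));

        List<String> out = new ArrayList<>();
        for (Para p : sorted) {
            out.add(p.text);
        }
        return out;
    }

    public static void main(String[] args) {
        // Stored order differs from visual order, as in an unstructured stream.
        List<Para> stored = List.of(
                new Para("column 2", 1, 0),
                new Para("column 3", 2, 0),
                new Para("column 1", 0, 0));
        System.out.println(logicalOrder(stored, false)); // left-to-right paragraph flow
        System.out.println(logicalOrder(stored, true));  // right-to-left paragraph flow
    }
}
```

Note that the sketch reorders whole paragraphs only. The text inside each paragraph is left untouched, which mirrors the distinction between paragraph direction and text direction described above.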
NOTE: Extraction of metadata is not affected by the paragraph direction setting. The characters and words in metadata fields are extracted in the correct reading direction regardless of whether logical reading order is enabled.
You can enable logical reading order by using either the API or the formats_e.ini file. Setting the direction in the API overrides the setting in the formats_e.ini file.
To enable PDF logical reading order in the Java API
Use the setPDFLogicalOrder(int orderFlag) method of the HtmlExport object, and set the orderFlag argument to one of the following flags.
For example,
objHTMLExport.setPDFLogicalOrder(Export.PDF_LOGICAL_ORDER_RTL);
The formats_e.ini file is in the directory install\OS\bin, where install is the path name of the Export installation directory and OS is the name of the operating system.
To enable logical reading order by using the formats_e.ini file
Change the PDF reader entry in the [Formats] section of the formats_e.ini file as follows:
[Formats]
200=lpdf
Optionally, add the following section to the end of the formats_e.ini file:
[pdf_flags]
pdf_direction=paragraph_direction
where paragraph_direction is one of the following:
Flag | Description
 | Left-to-right paragraph direction.
 | Right-to-left paragraph direction.
 | The PDF reader determines the paragraph direction for each PDF page, and then sets the direction accordingly. When a paragraph direction is not specified, this option is used.
 | Unstructured paragraph flow. This is the default behavior. If logical reading order is enabled, and you want to return to an unstructured paragraph flow, set this flag.
When you convert PDF files to HTML by using the basic reader (pdfsr), the table of contents is generated from "bookmarks" within the PDF file. The hyperlinked table of contents can appear either at the beginning of the HTML file or in a separate frame.
Micro Focus recommends that you configure the conversion so that the table of contents appears in a separate frame. The template pdfframe.ini demonstrates how to do this (see Set Conversion Options). Export uses absolute positioning when converting a PDF file, that is, the text appears in the same position as in the original document. Table of contents entries do not contain absolute positioning information. Therefore, if the main document and the table of contents are generated in the same output file, the table of contents entries might overlap the body text in the document.
NOTE: When PDF bookmarks are converted to a table of contents in HTML, the generated links do not lead to the exact location of the destination marker, but jump to the page on which the destination marker exists. This is similar to the behavior of the Adobe Acrobat Reader.
By default, Export converts PDF bookmarks to a table of contents in the HTML output. However, you can configure Export not to generate a table of contents based on the PDF bookmarks.
To prevent conversion of PDF bookmarks
Create an instance of the ConfigOption class. Set the OptionType argument to CFG_SUPPRESSTOCPRINTIMAGE, and the OptionValue argument to 1.
Call the setConfigOption method and pass the ConfigOption object.
Call a convert method. See the Javadoc in the directory install\javaapi\javadoc, where install is the path name of the Export installation directory.
NOTE: A table of contents is not generated when a PDF file does not contain bookmarks, or when CFG_SUPPRESSTOCPRINTIMAGE is set.
PDF documents sometimes contain invisible text. You can search this text in Adobe PDF Reader, but you cannot view it in a web browser.
You can add a JavaScript button to the upper right corner of the exported page, which you can click to toggle between invisible and regular text. When you turn on invisible text, the invisible text is displayed and the regular content is hidden; when you turn off invisible text, the invisible text is hidden.
Invisible text is hidden by default. The toggle button only appears if invisible text is detected in the PDF document.
To add an invisible text toggle button
Set the CFG_PDFINVISTEXTTOGGLE field of the HtmlExport object. The parameter passed in is the label name for the toggle button.
Invisible text often occurs in PDF documents when the PDF software processes rasterized images through optical character recognition and then inserts the text in the PDF. You might want to display both the invisible text and the rasterized image. To do so, set the invisible text opacity to an integer from 0 to 100, where 0 hides the invisible text and 100 displays it fully.
Invisible text opacity is set to 0 by default.
To set invisible text opacity
By default, rotated text is displayed in its original position, at the original font size, and at 0 degrees rotation in the HTML output. The text is not rotated in the HTML output because text rotation is not supported by HTML.
Because the text is the original size, but might be displayed in a smaller space (at 0 degrees), the text might overlap adjacent text in the HTML output. To avoid this problem, you can specify that the rotated text be removed from its original position and displayed at the bottom of the HTML page on which it appears.
To specify that rotated text be displayed at the bottom of the HTML page
Create an instance of the ConfigOption class. Set the OptionType argument to CFG_SETTEXTROTATE, and the OptionValue argument to 1.
Call the setConfigOption method and pass the ConfigOption object.
Call a convert method. See the Javadoc in the directory install\javaapi\javadoc, where install is the path name of the Export installation directory.
NOTE: When this feature is enabled, white space is added to the bottom of every HTML page to accommodate any rotated text.
There are two types of hyphens in a PDF document:
A soft hyphen is added to a word by a word processor to divide the word across two lines. This is a discretionary hyphen and is used to ensure proper text flow in justified text.
A hard hyphen is intentionally added to a word regardless of the word's position in the text flow. It is required by the rules of grammar or word usage. For example, compound words, such as "three-week vacation" and "self-confident" contain hard hyphens.
By default, KeyView maintains the source document's soft hyphens in the output HTML to more accurately represent the source document's layout. However, if you are using Export to generate text output for an indexing engine or are not concerned with maintaining the document's layout, Micro Focus recommends that you remove soft hyphens from the HTML output. To remove soft hyphens, you must enable the soft hyphen flag.
NOTE: If the soft hyphen flag is enabled, every hyphen at the end of a line is considered a soft hyphen and removed from the HTML output. If a hard hyphen appears at the end of a line, it is also removed. This might result in an intentionally hyphenated word being extracted without a hyphen.
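The effect described in the note can be reproduced on plain text. The following Java sketch is an illustration, not KeyView code: it assumes that the input is a list of text lines and treats every hyphen at the end of a line as a soft hyphen, as the note describes. The class and method names are hypothetical.

```java
import java.util.List;

// Illustration only: mimics the documented effect of the soft hyphen flag
// on plain text lines. This is not KeyView code.
class SoftHyphenSketch {

    // Joins lines into one string. A hyphen at the end of a line is dropped
    // and the word is joined directly to the next line; other lines are
    // joined with a single space.
    static String joinLines(List<String> lines) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < lines.size(); i++) {
            String line = lines.get(i);
            boolean last = (i == lines.size() - 1);
            if (line.endsWith("-") && !last) {
                sb.append(line, 0, line.length() - 1);
            } else {
                sb.append(line);
                if (!last) {
                    sb.append(' ');
                }
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // "hy-phen" is split by a soft hyphen; "self-confident" has a hard
        // hyphen that happens to fall at the end of a line.
        List<String> lines = List.of("soft hy-", "phen and a self-", "confident writer");
        System.out.println(joinLines(lines));
        // prints "soft hyphen and a selfconfident writer"
    }
}
```

The output shows both outcomes: the soft hyphen in "hyphen" is removed as intended, and the hard hyphen in "self-confident" is lost because it fell at the end of a line.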
To remove soft hyphens from the HTML output
Create an instance of the ConfigOption class. Set the OptionType argument to CFG_DELSOFTHYPHEN, and the OptionValue argument to 1.
Call the setConfigOption method and pass the ConfigOption object.
Call a convert method. See the Javadoc in the directory install\javaapi\javadoc, where install is the path name of the Export installation directory.
To extract custom metadata from your PDF files, add the custom metadata names to the pdfsr.ini file provided, and copy the modified file to the \bin directory. You can then extract metadata as you normally would.
The pdfsr.ini file is in the directory samples\pdfini, and has the following structure:
<META>
<TOTAL>total_item_number</TOTAL>
/metadata_tag_name datatype
</META>
Parameter | Description
total_item_number | The total number of metadata tags that are listed.
metadata_tag_name | The metadata tag name used in the PDF files.
datatype | The data type of the metadata field.
For example:
<META>
<TOTAL> 4 </TOTAL>
/part_number INT4
/volume INT4
/purchase_date DATETIME
/customer STRING
</META>
NOTE: Metadata cannot be extracted from PDFs when the PDF is converted to JPEG. See Convert PDF Files to Raster Images.
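Before you copy a modified pdfsr.ini file to the \bin directory, it can be useful to check that the TOTAL value matches the number of tags listed. The following Java sketch parses the structure shown above and performs that check. It is an illustration under stated assumptions: the class name and the validation rule are hypothetical, and this guide does not state how KeyView itself reacts to a mismatched total.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical checker for the pdfsr.ini structure; not part of the Export API.
class PdfsrIniSketch {

    private static final Pattern TOTAL = Pattern.compile("<TOTAL>\\s*(\\d+)\\s*</TOTAL>");
    // One entry: a slash, the tag name, white space, then the data type.
    private static final Pattern ENTRY = Pattern.compile("/(\\S+)\\s+(\\S+)");

    // Returns tag name -> data type, in file order. Throws if the TOTAL value
    // does not match the number of entries found.
    static Map<String, String> parse(String ini) {
        Matcher total = TOTAL.matcher(ini);
        if (!total.find()) {
            throw new IllegalArgumentException("missing <TOTAL> element");
        }
        int expected = Integer.parseInt(total.group(1));

        int end = ini.indexOf("</META>", total.end());
        if (end < 0) {
            throw new IllegalArgumentException("missing </META> element");
        }
        // Entries are listed between </TOTAL> and </META>.
        String body = ini.substring(total.end(), end);

        Map<String, String> tags = new LinkedHashMap<>();
        Matcher entry = ENTRY.matcher(body);
        while (entry.find()) {
            tags.put(entry.group(1), entry.group(2));
        }
        if (tags.size() != expected) {
            throw new IllegalArgumentException(
                    "TOTAL is " + expected + " but " + tags.size() + " tags are listed");
        }
        return tags;
    }

    public static void main(String[] args) {
        String ini = "<META>\n<TOTAL> 4 </TOTAL>\n/part_number INT4\n/volume INT4\n"
                + "/purchase_date DATETIME\n/customer STRING\n</META>";
        parse(ini).forEach((name, type) -> System.out.println(name + " -> " + type));
    }
}
```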