Export has special configuration options that allow greater control over the conversion of PDF files. These options can improve the fidelity and accuracy of the XML output.
The PDF format is primarily designed for presentation and printing of brochures, magazines, forms, reports, and other materials with complex visual designs. Most PDF files do not contain the logical structure of the original document—the correct reading order, for example, and the presence and meaning of significant elements such as headers, footers, columns, tables, and so on.
KeyView can convert a PDF file either by using the file's internal unstructured paragraph flow, or by applying a structure to the paragraphs to reproduce the logical reading order of the visual page. Logical reading order enables KeyView to convert PDF files that contain languages that read from right to left (such as Hebrew and Arabic) in the correct reading direction.
By default, KeyView produces an unstructured text stream for PDF files. This means that PDF paragraphs are extracted in the order in which they are stored in the file, not the order in which they appear on the visual page. For example, a three-column article could be output with the headers and the title at the end of the output file, and the second column extracted before the first column. Although this output does not represent a logical reading order, it accurately reflects the internal structure of the PDF.
You can configure KeyView to produce a structured text stream that flows in a specified direction. This means that PDF paragraphs are extracted in the order (logical reading order) and direction (left-to-right or right-to-left) in which they appear on the page.
The following paragraph direction options are available.
The paragraph direction options control the direction of paragraphs on a page; they do not control the text direction in a paragraph. For example, suppose that a PDF file contains English paragraphs in three columns that read from left to right, but 80% of the second paragraph consists of Hebrew characters. If the left-to-right logical reading order is enabled, the paragraphs are ordered logically in the output (title paragraph, then paragraph 1, 2, 3, and so on) and flow from the top left of the first column to the bottom right of the third column. However, the text direction of the second paragraph is determined independently of the page by the PDF reader, and is output from right to left.
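The difference between the unstructured stream and a structured stream can be illustrated with a small, self-contained sketch. It is not KeyView code: the `Para` class, the column index, and the `top` offset are hypothetical stand-ins for whatever layout data a PDF reader has, and the sort is only one simple way to approximate "column by column, top to bottom".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class ReadingOrderSketch {
    /** Hypothetical paragraph: text, column index (0 = leftmost), distance from page top. */
    static final class Para {
        final String text;
        final int column;
        final int top;

        Para(String text, int column, int top) {
            this.text = text;
            this.column = column;
            this.top = top;
        }
    }

    /** Paragraphs in the order they happen to be stored in a made-up file. */
    static List<Para> sample() {
        return Arrays.asList(
            new Para("B1", 1, 10),
            new Para("A2", 0, 50),
            new Para("C1", 2, 10),
            new Para("A1", 0, 10));
    }

    /** Unstructured stream: texts exactly as stored. */
    static List<String> storedOrder(List<Para> paras) {
        return texts(paras);
    }

    /** Structured stream: column by column, top to bottom within each column. */
    static List<String> logicalOrder(List<Para> paras, boolean rightToLeft) {
        Comparator<Para> byColumn = Comparator.comparingInt(p -> p.column);
        if (rightToLeft) {
            byColumn = byColumn.reversed(); // start from the rightmost column
        }
        List<Para> sorted = new ArrayList<>(paras);
        sorted.sort(byColumn.thenComparingInt(p -> p.top));
        return texts(sorted);
    }

    private static List<String> texts(List<Para> paras) {
        List<String> out = new ArrayList<>();
        for (Para p : paras) {
            out.add(p.text);
        }
        return out;
    }

    public static void main(String[] args) {
        List<Para> paras = sample();
        System.out.println("stored: " + storedOrder(paras));         // [B1, A2, C1, A1]
        System.out.println("LTR:    " + logicalOrder(paras, false)); // [A1, A2, B1, C1]
        System.out.println("RTL:    " + logicalOrder(paras, true));  // [C1, B1, A1, A2]
    }
}
```

Note that the sketch reorders whole paragraphs only; the characters inside each paragraph are untouched, which mirrors the point that paragraph direction and text direction are separate concerns.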
You can enable logical reading order by using either the API or the formats_e.ini file. Setting the direction in the API overrides the setting in the formats_e.ini file.
To enable PDF logical reading order in the Java API
Use the setPDFLogicalOrder(int orderFlag) method of the XmlExport object, and set the orderFlag argument to one of the following flags.
For example,
objXMLExport.setPDFLogicalOrder(Export.PDF_LOGICAL_ORDER_RTL);
The formats_e.ini file is in the directory install\OS\bin, where install is the path name of the Export installation directory and OS is the name of the operating system.
To enable logical reading order by using the formats_e.ini file
Change the PDF reader entry in the [Formats] section of the formats_e.ini file as follows:
[Formats]
200=lpdf
Optionally, add the following section to the end of the formats_e.ini file:
[pdf_flags]
pdf_direction=paragraph_direction
where paragraph_direction is one of the following:
| Flag | Description |
| | Left-to-right paragraph direction |
| | Right-to-left paragraph direction |
| | The PDF reader determines the paragraph direction for each PDF page, and then sets the direction accordingly. When a paragraph direction is not specified, this option is used. |
| | Unstructured paragraph flow. This is the default behavior. If logical reading order is enabled, and you want to return to an unstructured paragraph flow, set this flag. |
There are two types of hyphens in a PDF document:
A soft hyphen is added to a word by a word processor to divide the word across two lines. This is a discretionary hyphen and is used to ensure proper text flow in justified text.
A hard hyphen is intentionally added to a word regardless of the word's position in the text flow. It is required by the rules of grammar or word usage. For example, compound words such as "three-week vacation" and "self-confident" contain hard hyphens.
By default, KeyView maintains the source document's soft hyphens in the output XML to more accurately represent the source document's layout. However, if you are using Export to generate text output for an indexing engine or are not concerned with maintaining the document's layout, Micro Focus recommends that you remove soft hyphens from the XML output. To remove soft hyphens, you must enable the soft hyphen flag.
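To see why removing soft hyphens helps an indexing engine, consider the following minimal sketch. It is not how KeyView implements the option: it assumes, purely for illustration, that soft hyphens appear in the extracted text as the Unicode SOFT HYPHEN character (U+00AD), and the class and method names are hypothetical.

```java
class SoftHyphenSketch {
    // Assumption: discretionary hyphens are encoded as U+00AD (SOFT HYPHEN).
    private static final String SOFT_HYPHEN = "\u00AD";

    /** Removes soft hyphens; hard hyphens (plain '-') are left in place. */
    static String stripSoftHyphens(String text) {
        return text.replace(SOFT_HYPHEN, "");
    }

    public static void main(String[] args) {
        // "approach" was split across two lines; "self-confident" has a hard hyphen.
        String extracted = "a self-confident ap\u00ADproach";
        System.out.println(stripSoftHyphens(extracted)); // a self-confident approach
    }
}
```

With the soft hyphen removed, a search for "approach" matches the word as a single token, while the compound "self-confident" keeps its required hyphen.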
To remove soft hyphens from the XML output
Create an instance of the ConfigOption class. Set the OptionType argument to CFG_DELSOFTHYPHEN and the OptionValue argument to 1.
Call the setConfigOption method and pass the ConfigOption object.
For more information, see the Javadoc in install\javaapi\javadoc, where install is the path name of the Export installation directory.
To extract custom metadata from your PDF files, add the custom metadata names to the pdfsr.ini file provided, and copy the modified file to the \bin directory. You can then extract metadata as you normally would.
The pdfsr.ini file is in the directory samples\pdfini, and has the following structure:
<META>
<TOTAL>total_item_number</TOTAL>
/metadata_tag_name datatype
</META>
| Parameter | Description |
| total_item_number | The total number of metadata tags that are listed. |
| metadata_tag_name | The metadata tag name used in the PDF files. |
| datatype | The data type of the metadata field. |
For example:
<META>
<TOTAL> 4 </TOTAL>
/part_number INT4
/volume INT4
/purchase_date DATETIME
/customer STRING
</META>
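Before copying a modified pdfsr.ini into place, it can be useful to check that the TOTAL value matches the number of tags listed. The following sketch reads a META block in the structure shown above and returns the tag names with their data types. It is an illustration only: the class and method names are hypothetical, it assumes tag names and data types contain no whitespace, and it does not claim to reproduce the rules that the PDF reader itself applies to this file.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PdfsrIniSketch {
    /** Parses a META block into tag name -> data type, in listed order. */
    static Map<String, String> parseMeta(String ini) {
        int start = ini.indexOf("<META>");
        int end = ini.indexOf("</META>");
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("no <META> block found");
        }
        String body = ini.substring(start + "<META>".length(), end);

        // Read the declared number of tags from <TOTAL>...</TOTAL>.
        int tStart = body.indexOf("<TOTAL>");
        int tEnd = body.indexOf("</TOTAL>");
        if (tStart < 0 || tEnd < tStart) {
            throw new IllegalArgumentException("no <TOTAL> element found");
        }
        int total = Integer.parseInt(body.substring(tStart + "<TOTAL>".length(), tEnd).trim());

        // The remaining tokens are read in pairs: /tag_name datatype
        String[] tokens = body.substring(tEnd + "</TOTAL>".length()).trim().split("\\s+");
        Map<String, String> fields = new LinkedHashMap<>();
        for (int i = 0; i + 1 < tokens.length; i += 2) {
            if (!tokens[i].startsWith("/")) {
                throw new IllegalArgumentException("expected /tag_name but found: " + tokens[i]);
            }
            fields.put(tokens[i].substring(1), tokens[i + 1]);
        }
        if (fields.size() != total) {
            throw new IllegalArgumentException(
                "TOTAL is " + total + " but " + fields.size() + " tags are listed");
        }
        return fields;
    }

    public static void main(String[] args) {
        String ini = "<META>\n<TOTAL> 4 </TOTAL>\n/part_number INT4\n/volume INT4\n"
                   + "/purchase_date DATETIME\n/customer STRING\n</META>\n";
        System.out.println(parseMeta(ini));
    }
}
```

A mismatch between TOTAL and the listed tags raises an exception in this sketch, which makes a forgotten count update easy to spot after adding a new tag.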