
Eduction Performance

Eduction is very fast at grammar compilation and entity extraction. However, some expressions in the grammar patterns can increase the extraction and compilation times significantly.

The grammar files that are included in the Eduction SDK package are designed to be as fast as possible. The following section describes some ways to ensure that user-created Eduction grammars also work quickly. As a general rule, the more concise a grammar is, the faster it returns matches.

Note: Some configuration settings affect extraction speed, for example MatchCase, MatchWholeWords, AllowOverlaps, and NonGreedyMatch.

Match Entities with Reference or Copy

When you create a custom grammar, you can match a previously defined entity and either copy it, by using the (?A: syntax, or refer to it, by using the (?A^ syntax.

For very simple grammar files, it is generally faster to use (?A:, because this method creates an .ECR with slightly more efficient instructions for extraction.

For complicated Eduction grammars, copying entities by using (?A: results in a very large grammar file, which can take an extremely long time to compile. In addition, the resulting file can take longer to load and scan than the equivalent file created by using (?A^.

HPE recommends that you use (?A^ in all cases, unless your grammar file is very simple, and extraction speed is critical.

Whitespace

You can define a space in grammars by using a space character ( ), or by using the \s syntax. The \s syntax matches all types of whitespace, such as spaces, tabs, and newlines. In most practical situations, the matches you want from the input text only include spaces. In these cases, it is slightly faster to use a space, rather than \s.
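The difference in what the two forms match can be illustrated outside Eduction. The following is a minimal sketch that uses Python's re module as a stand-in, on the assumption that \s behaves comparably there. The sample text is invented, and the sketch shows matching behavior only, not speed.

```python
import re

# Invented sample text with a space, a tab, and a newline between the words.
text = "New York, New\tYork, New\nYork"

# A literal space matches only the single-space form.
literal_space = re.findall(r"New York", text)

# \s also matches the tab and newline variants.
any_whitespace = re.findall(r"New\sYork", text)

print(len(literal_space), len(any_whitespace))
```

If the input only ever separates words with spaces, both forms find the same matches, so the narrower literal space loses nothing.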

Entries and Patterns

You can extract regular expressions by using patterns (the <pattern> element), or by explicitly listing each possible match as an entry (the <entry> element). For example, the following alternatives are equivalent:

Example 1:

   <pattern>[Ee]xtract(ed|ing|s)?</pattern>

Example 2:

   <entry headword="Extract"/>
   <entry headword="Extracted"/>
   <entry headword="Extracting"/>
   <entry headword="Extracts"/>
   <entry headword="extract"/>
   <entry headword="extracted"/>
   <entry headword="extracting"/>
   <entry headword="extracts"/>

The first alternative is faster to code and maintain, and slightly faster for extraction. However, if there are fewer than about 50 entries represented by one pattern, the compilation time is faster for entries.

HPE recommends that you use patterns unless the compilation time becomes too slow, in which case you might consider replacing the simplest patterns with entries.
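To confirm that a pattern and a list of entries really are equivalent before you swap one for the other, you can enumerate both sides. The following is a rough check, assuming Python's re module treats this simple pattern the same way that Eduction does:

```python
import re

# The pattern from Example 1.
pattern = re.compile(r"[Ee]xtract(ed|ing|s)?")

# The headwords from Example 2.
headwords = [
    "Extract", "Extracted", "Extracting", "Extracts",
    "extract", "extracted", "extracting", "extracts",
]

# Every headword is matched in full by the pattern.
all_match = all(pattern.fullmatch(word) for word in headwords)

# Expand the pattern by hand: two first letters times four optional suffixes.
generated = {
    first + "xtract" + suffix
    for first in "Ee"
    for suffix in ("", "ed", "ing", "s")
}
same_set = generated == set(headwords)

print(all_match, same_set, len(generated))
```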

Quantifiers

The syntax expression{n,m} matches at least n, but at most m consecutive occurrences of the specified expression. When m is large, it can result in a large .ECR file and slow extraction.

In this situation, HPE recommends that you use {n,} unless the upper bound m is important.

Patterns that Start with Multiple Optional Phrases

The following type of pattern can be slow during extraction:

<pattern>(?A^entity_A)?(?A^entity_B)*(?A^entity_C)?(?A^entity_D)</pattern>

In this example, each time Eduction encounters a new word, it must check whether it matches entity_A, then check whether it matches entity_B, then entity_C, and then entity_D. If the word does match entity_A, Eduction must then check whether the following word is matched by entity_B, entity_C, or entity_D, and so on.

This process can be time-consuming, especially if each of the optional entities occurs regularly in the input text. When extraction speed is critical, HPE recommends that you remove any unnecessary optional entities at the beginning of a pattern.

Note: This issue does not occur if the optional phrases are not at the start of the pattern.

Components

In some cases, including components in an Eduction grammar XML file increases the time for extraction by up to 50%, even if the components are not enabled during the extraction. This occurs because the resulting .ECR is less compact than the equivalent file that does not describe components. Do not use components if you do not need them.

When you do use components, HPE recommends that you make the structure of the components as uniform as possible.

For example:

   <pattern>(?A=COMPONENT:(?A^entity_A) )(?A^entity_B) (?A^entity_C) (?A^entity_D)</pattern>
   <pattern>(?A=COMPONENT:(?A^entity_A)) (?A^entity_B)(?A^entity_D) (?A^entity_C)</pattern>
   <pattern>(?A=COMPONENT:(?A^entity_A) (?A^entity_B))(?A^entity_C) (?A^entity_E)</pattern>

might be slower than:

   <pattern>(?A=COMPONENT:(?A^entity_A) (?A^entity_B)) (?A^entity_C) (?A^entity_D)</pattern>
   <pattern>(?A=COMPONENT:(?A^entity_A) (?A^entity_B)) (?A^entity_D) (?A^entity_C)</pattern>
   <pattern>(?A=COMPONENT:(?A^entity_A) (?A^entity_B)) (?A^entity_C) (?A^entity_E)</pattern>

Score of Zero

You can set the score of an entry or pattern to zero to exclude that entry or pattern. However, this method can reduce performance during extraction.

Where possible, and when extraction speed is critical, HPE recommends that you consider ways to list matches to extract, rather than the matches to exclude.

For example, to extract mobile phone numbers that do not end in 25:

   <pattern>07[0-9]{9}</pattern>
   <pattern score="0">07[0-9]{7}25</pattern>

is slower than:

   <pattern>07[0-9]{7}[013-9][0-9]</pattern>
   <pattern>07[0-9]{7}[0-9][0-46-9]</pattern>
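When you rewrite an exclusion as a list of inclusions, it is worth checking that both forms accept the same strings. The following sketch does this with Python's re module. It models score="0" as "discard anything that this pattern matches", which is an assumption made for illustration, and the helper names are hypothetical.

```python
import re

# Exclusion style: keep every 11-digit number, then drop those ending in 25.
keep = re.compile(r"07[0-9]{9}")
drop = re.compile(r"07[0-9]{7}25")

# Inclusion style: the two positive patterns from the example above.
include = (
    re.compile(r"07[0-9]{7}[013-9][0-9]"),
    re.compile(r"07[0-9]{7}[0-9][0-46-9]"),
)

def accepted_by_exclusion(number):
    return bool(keep.fullmatch(number)) and not drop.fullmatch(number)

def accepted_by_inclusion(number):
    return any(p.fullmatch(number) for p in include)

# Only the last two digits matter, so fix the first nine characters
# and try all 100 endings.
prefix = "070000000"
numbers = [f"{prefix}{i:02d}" for i in range(100)]

mismatches = [n for n in numbers
              if accepted_by_exclusion(n) != accepted_by_inclusion(n)]
rejected = [n for n in numbers if not accepted_by_inclusion(n)]

print(mismatches, rejected)
```

The first inclusion pattern covers numbers whose second-to-last digit is not 2, and the second covers numbers whose last digit is not 5, so together they reject only the ending 25.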

Avoid Multiple Routes to Find the Same Match

The following set of entities runs quickly most of the time:

<entity name="entity1">
   <pattern>[0-9]{2,3}</pattern>
</entity>
<entity name="entity2">
   <pattern>[0-9]{3,4}</pattern>
</entity>
<entity name="entity3">
   <pattern>((?A^entity1)|(?A^entity2))+</pattern>
</entity>

However, on some data they might run extremely slowly, for example when the input text includes:

123 234 345 456 567 678 789 890 900 000

In this example, either entity can match every number, so there are 2^10 (1024) different ways that Eduction can match this phrase. It might try many of these routes while looking for a longer match.

In general, HPE recommends that you avoid having a large number of possible ways to match a given phrase.
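The growth in the number of routes can be counted directly. The following sketch is a simplified model rather than a description of how Eduction searches: it assumes that each number is assigned independently to whichever entity can match it, and the candidates function is a hypothetical helper.

```python
from itertools import product

tokens = "123 234 345 456 567 678 789 890 900 000".split()

def candidates(token):
    """Return the entities whose digit-length range covers this token."""
    names = []
    if token.isdigit() and 2 <= len(token) <= 3:   # [0-9]{2,3}
        names.append("entity1")
    if token.isdigit() and 3 <= len(token) <= 4:   # [0-9]{3,4}
        names.append("entity2")
    return names

# Each three-digit number has two candidate entities, so the number of
# combinations doubles with every additional number in the phrase.
per_token = [candidates(t) for t in tokens]
routes = list(product(*per_token))

print(len(routes))
```

If the two ranges did not overlap (for example {2,3} and {4,5}), each number would have a single candidate and there would be only one route.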

Merge Referenced Entities

When patterns refer to entities by reference, Eduction checks for a match using each entity separately. While this is usually fast in practice, the following:

<pattern>(?A^entity_ABC)</pattern>

is likely to be faster than:

<pattern>((?A^entity_A)|(?A^entity_B)|(?A^entity_C))</pattern>

where entity_ABC contains everything in entities A, B, and C.

HPE recommends that you merge any entities that can be merged, unless merging them alters what the grammar can match.
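Before you merge entities, you can check that the merged entity matches exactly what the separate entities matched. The following sketch models each entity as a plain word list, which is an assumption made for illustration. The entity contents and the alternation helper are hypothetical.

```python
import re

# Hypothetical contents for the three entities.
entity_a = ["red", "green"]
entity_b = ["blue", "green"]
entity_c = ["black"]

def alternation(words):
    """Build a non-capturing alternation that matches any of the words."""
    return "(?:" + "|".join(re.escape(w) for w in words) + ")"

# Separate references: one alternative per entity.
separate = re.compile(
    "|".join(alternation(e) for e in (entity_a, entity_b, entity_c))
)

# Merged entity: the de-duplicated union of all three lists.
entity_abc = sorted(set(entity_a) | set(entity_b) | set(entity_c))
merged = re.compile(alternation(entity_abc))

# Compare both forms on every known word plus two non-matching strings.
samples = entity_abc + ["purple", "gree"]
same_matches = all(
    bool(separate.fullmatch(s)) == bool(merged.fullmatch(s)) for s in samples
)

print(entity_abc, same_matches)
```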

