Eduction

23.3.1

Resolved Issues

  • In table mode, when processing tabular input where columns were separated with tabs or commas, Eduction sometimes encountered a memory error if the rows of a table had differing numbers of columns (for example, the header row had fewer columns than a body row). This error did not occur when using KeyView to insert the delimiters.
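As an illustration of the kind of input involved, the following minimal Python sketch pads the rows of a ragged tab-separated table so that every row has the same number of columns. It is not part of Eduction and does not describe how the fix was implemented; the function name read_ragged_table is hypothetical.

```python
import csv
import io


def read_ragged_table(text, delimiter="\t"):
    """Parse delimited text and pad short rows with empty cells.

    Hypothetical helper for illustration only: every returned row has
    as many cells as the widest row in the input.
    """
    rows = list(csv.reader(io.StringIO(text), delimiter=delimiter))
    width = max((len(row) for row in rows), default=0)
    return [row + [""] * (width - len(row)) for row in rows]


# The header row has two columns but the body row has three.
table = read_ragged_table("name\tphone\nAlice\t555\textra\n")
```

After this call, both rows in table contain three cells, with an empty string added to the header row.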

23.3.0

New Features

  • When you compile and save a grammar file, EDK now removes any data associated with inaccessible entities before it writes the grammar to disk. Previously, data with the type private was included from ECR files that were included by the grammar definition, increasing the file size of the output grammar.

    You can now use edktool to create a reduced version of a grammar file, containing only selected entities. For example:

    edktool compile -l license.dat -i original.ecr -e entity1,entity2,... -o selected.ecr

    Previously, this command produced an ECR of the same size, where the unselected entities were made private. Now, edktool removes the definitions of unreferenced entities, resulting in a reduced-footprint ECR.

  • Improvements have been made to the Eduction Table mixed mode:

    • You can now use the addTableCell and endTableRow API calls in mixed mode in the same way as for normal table mode.

    • Setting the finalRow argument to true for the addTableCell and endTableRow API calls now sets Eduction up to process another table, rather than resetting the session. This change allows you to feed in multiple tables and get the matches afterward. Eduction now also keeps track of the table numbering, where previously the session reset meant that it treated every table as the first table.

  • Speed improvements have been made within Eduction sessions, relating to match identification and Lua post-processing of identified matches.

  • Creating a session from an Eduction engine is now more efficient.

Enhancements to Eduction Grammars

  • In the PHI package, the profession agentboolean IDX has been replaced with the new profession.ecr grammar.

  • Scoring has been improved so that matches of certain patterns of Swedish no-context valid national IDs now have a score above the 0.4 threshold.

  • Contextual entities now allow an arbitrary number of spaces or tabs before a colon that separates the landmark from the nocontext entity. For example, the PII telephone context grammar (for GB) now matches the text Mobile :07928 875 419.

  • The PII medical terms grammar has new entities to match ICD10 medical condition codes and ICD10 procedure codes.

  • The PII medical terms grammar has been updated to add Turkish, Ukrainian (Cyrillic), and Russian (Cyrillic) languages.

  • A new PII internet grammar has been added, with entities to match email addresses for many countries.

  • The driving license grammar has new formats for Australian driving licenses. The grammar still supports older formats, with a 10% matching penalty, so that older formats without context now score at 0.36 (below the typical 0.4 minimum threshold).

  • Post-processing for names has been improved:

    • Eduction now rejects certain matches where a title such as Prince, King, or Queen occurs in a title_surname entity match, to reduce spurious matches of names that were partially stoplisted.

      For example, the name Prince Edward Island (a place in Canada) initially matches as Title+Forename+Surname. Island is removed because it is in the stoplist, and Edward is then promoted to surname, which previously produced a match of just EDWARD. Eduction now rejects this match.

    • Eduction now rejects matches that have at least one INITIAL component and do not have a multi-character SURNAME component. This change prevents acronyms and abbreviations such as A.I from matching (where I is a Korean surname).

    • Eduction now removes the whole component when a stoplisted word is hyphenated with another word. For example, Saint is stoplisted, so Eduction now also removes Dié in Saint-Dié Christian Pierret, resulting in a match for Christian Pierret. This change includes cases where the stoplisted word appears after the hyphen, for example Dié-Saint.

    • Eduction no longer includes ampersands (&) as part of the original matched text. For example, Extraterrestrial Life, John Wiley & Sons previously produced the match John Wiley &. It now matches John Wiley.

    • Eduction now checks stop names again after removing any leading or trailing stopwords. For example, the text "Rio Grande River" no longer matches. "Rio Grande" is a stop name and "River" is in the stoplist, so previously Eduction dropped "River" and returned "Rio Grande" as a match. Now, after dropping "River", Eduction checks the remaining text, "Rio Grande", against the stopnames and drops it, so it no longer returns a match.

  • The National ID grammar for Belgium now accepts a dot before the final two digits, in addition to what was previously accepted. For example, 85.07.30-033.28 now matches.

  • The telephone grammar has been updated with new area codes for Canada (236, 365, 367, 368, 431, 437, 474, 548, 584, 639, 672, 782, 825, 873).

  • In the PII address grammar for Brazil, the neighborhood component has been added to full address matching as an optional component, which can come before the city.

  • Precision has been improved by penalizing stoplist exception components in some cases, depending on the number of such components.

  • The PII banking grammar now supports greater variations of number groupings and delimiters for US bank account numbers.
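The Belgian national ID change described above can be sketched in Python as follows. This is an illustration only and not the Eduction grammar: the pattern simply makes the dot before the final two digits optional, and the names BE_ID, parse_be_national_id, and be_checksum_ok are hypothetical. The check shown is the standard modulo 97 rule for Belgian national numbers; whether and how the grammar applies it is an assumption.

```python
import re

# Hypothetical pattern: YY.MM.DD-SSS followed by two check digits,
# with an optional dot before the check digits.
BE_ID = re.compile(r"\b(\d{2})\.(\d{2})\.(\d{2})-(\d{3})\.?(\d{2})\b")


def parse_be_national_id(text):
    """Return the 11 digits of the first ID found in text, or None."""
    match = BE_ID.search(text)
    return "".join(match.groups()) if match else None


def be_checksum_ok(digits, born_2000_or_later=False):
    """Apply the modulo 97 check to an 11-digit national number.

    For people born in 2000 or later, a 2 is prefixed to the first
    nine digits before the remainder is taken.
    """
    base = int(("2" if born_2000_or_later else "") + digits[:9])
    return 97 - base % 97 == int(digits[9:])


digits = parse_be_national_id("National ID: 85.07.30-033.28")
valid = digits is not None and be_checksum_ok(digits)
```

Both 85.07.30-033.28 and 85.07.30-03328 parse to the same 11 digits in this sketch.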

Resolved Issues

  • The function EdkSessionSetEntityMatchLimit(), in the C API, could incorrectly return an error when the Eduction engine had been configured to match entities by calling EdkLoadResourceFile() and EdkAddTargetEntity().

  • The method session.setEntityMatchLimit(), in the Java API, could incorrectly throw an exception when the Eduction engine had been configured to match entities by calling engine.loadResourceFile() and engine.addTargetEntity().

  • Telephone numbers in the PII and PHI entity packages sometimes failed to match examples that included extra spacing around the area code (for example, "(310)    840-7089"). This format can occur, for example, when the area code and local subscriber number are extracted from distinct fields in a PDF form.

  • Post-processing for PII, PHI, and PCI names entities could fail on certain names listed in reverse format (for example "Smith, John Washington (2023)").

  • An Eduction engine configured with both Entity and HeaderEntity/CellEntity (mixed mode) failed to return results for a structured table API call (for example EdkAddTableCell).

  • The PHI validation scripts Lua table had an incorrect key for DEA entities, which caused a Lua error at runtime.

  • The combined banking grammars were missing entities for non-country-specific landmarks (IBAN and SWIFT).

  • When a file had a byte order mark (BOM) at the start of the file, Eduction did not make any matches at the start of the file (first word). The BOM is now treated as punctuation to allow these matches.

  • When using edktool, if an incorrect path to a license file was provided in the -l flag, edktool returned the misleading error Error: License key is not valid for Eduction. It now returns the same Error: Open file error as for an incorrect path to the -i input.txt or -c config.cfg.

  • In the PII national ID grammars, the Lua checksum function for Saudi Arabian IDs could raise an error unexpectedly if the calculated value to perform check digit validation was not a two-digit number.

  • In mixed table mode, setting the final flag to true on addInputText after passing a whole table in did not always correctly reset the session. In this case, the table number was not reset properly.
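The byte order mark issue described above can be illustrated with a small Python sketch. It is not Eduction's tokenizer: the tokenize function and its bom_is_punctuation switch are hypothetical, and they only show why a leading U+FEFF character can hide the first word unless it is treated as a separator.

```python
import re
import string

BOM = "\ufeff"  # U+FEFF, the decoded form of the UTF-8 bytes EF BB BF


def tokenize(text, bom_is_punctuation=True):
    """Split text on whitespace and punctuation (hypothetical tokenizer).

    When bom_is_punctuation is True, the BOM is also a separator, so it
    cannot attach itself to the first word of the input.
    """
    separators = string.whitespace + string.punctuation
    if bom_is_punctuation:
        separators += BOM
    pattern = "[" + re.escape(separators) + "]+"
    return [token for token in re.split(pattern, text) if token]


with_fix = tokenize(BOM + "John Smith")
without_fix = tokenize(BOM + "John Smith", bom_is_punctuation=False)
```

In this sketch, with_fix starts with the token John, whereas without_fix starts with a token that still carries the BOM and so would not compare equal to John.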

23.2.0

New Features

  • In table mode, Eduction now provides a zero-indexed table number for matches, to avoid ambiguity when extracting entities from an input stream that contains multiple tables.

    In the Eduction SDK, the following methods and attributes are available for obtaining the table number (which is -1 if the match was not sourced from a table):

    • C API: EdkError EdkGetMatchTableNumber(EdkSessionHandle pSession, int * pnTableNumber)

    • .NET API: IExtractionMatchTablePosition.TableNumber

    • Java API: EdkMatch.getTableNumber()

    When Eduction is in table mode, edktool and Eduction Server now output the match table number. Eduction Server now also outputs the row and column details of a match, as was already the case for edktool.

  • You can now configure both table and free text (non-table) entities at the same time. In this mixed mode, Eduction identifies tables and searches them for table entity matches, and it searches any blocks of free text for free text entity matches.

    In addition, when Eduction identifies a table but does not find a header match for a particular column, it searches the rows of that column for free text entity matches instead. In this way, Eduction can still search for entity matches even if it does not match the headers. Similarly, if you configure MaxSearchHeaderRow to search for tables beyond the first line of the input, Eduction can now search the initial rows that do not contain header matches for free text entity matches.

    You can use the new TableEntityFieldN parameter to avoid ambiguity in mixed mode. Use this parameter to configure a field for table entities where you have set EntityFieldN for the free text entities. For example:

    [Eduction]
    ResourceFiles=testfiles/simple_pii.xml
    # Free text entities
    Entity0=simple_pii/name
    EntityField0=FREE_TEXT_MATCH_NAME
    Entity1=simple_pii/weather
    EntityField1=FREE_TEXT_MATCH_WEATHER
    # Table entities
    HeaderEntity0=simple_pii/name_header
    CellEntity0=simple_pii/name
    TableEntityField0=TABLE_MATCH_NAME
    HeaderEntity1=simple_pii/number_header
    CellEntity1=simple_pii/number
    TableEntityField1=TABLE_MATCH_NUMBER

  • Two new functions, setMatchOffset and setMatchOffsetLength, have been added to the Eduction match component in Lua to allow you to set the offset for a component in your post-processing scripts. The setMatchOffset function sets the offset for the component inside the matched text in bytes. The setMatchOffsetLength function sets the offset for the component inside the matched text in codepoints. Both functions take a single integer argument.

  • You can now configure Eduction to select a higher scoring match over a longer or shorter match (depending on your NonGreedyMatch configuration) when you have set AllowMultipleResults to False or OnePerEntity.

    To use this option, set the new PrioritiseScore configuration parameter to True. The default value is False. When two entities have equal scores, Eduction uses the length as a tie breaker. You can also set this option in the C API by using the EdkSetPrioritiseScore function.
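The selection rule described above can be sketched in Python as follows. This is a simplified model and not the Eduction implementation: the Match class and select_match function are hypothetical, and the sketch assumes that NonGreedyMatch means the shorter candidate is preferred.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Match:
    text: str
    score: float


def select_match(candidates, prioritise_score=False, non_greedy=False):
    """Choose one match from a list of competing candidates.

    Length preference: longest by default, shortest when non_greedy is
    set (an assumption about NonGreedyMatch). With prioritise_score,
    the score is compared first and length only breaks ties.
    """
    def length_key(match):
        return -len(match.text) if non_greedy else len(match.text)

    if prioritise_score:
        return max(candidates, key=lambda m: (m.score, length_key(m)))
    return max(candidates, key=length_key)


candidates = [Match("John", 0.9), Match("John Smith", 0.5)]
by_length = select_match(candidates)
by_score = select_match(candidates, prioritise_score=True)
```

Here by_length is the longer match John Smith, while by_score is the higher scoring match John.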

Enhancements to Eduction Grammars

  • The psi_api_credentials.ecr grammar (in the Eduction standard grammars) has been updated with additional entities for authorization headers and JSON Web Tokens (JWT).

  • The PHI dea.ecr grammar has been updated with new entities for National Drug Codes (NDC) and NDC billing derivatives.

  • The PHI healthplan.ecr grammar has been updated with new entities for National Provider Identifiers (NPI), Medicare Beneficiary Identifiers (MBI), Health Insurance Claim Number (HICN), and Healthcare Common Procedure Codes (HCPCS) level I and II.

  • A new PII grammar, voter_id.ecr, is available, which contains entities for matching voter IDs for the UK, India, and Mexico. This grammar is also available in EJR format, and as a combined grammar, combined_voter_id.ecr.

  • The PII national_id grammar now includes national ID entities for Cambodia (kh), Honduras (hn), Vietnam (vn), and Qatar (qa).

  • The PII names grammar has the following improvements:

    • Handling of known surnames that begin with a prefix (for example, Mc) has been improved.

    • Handling of surname prefixes that have more than one part (for example van der) has been improved.

    • The ability to match speculative names for various countries has been improved, by expanding the permissible character set for those countries.

    • Stop list handling has been improved for known first names and surnames for various countries (for example Snow is acceptable as a surname for certain countries, despite it appearing in the stoplist).

    • Matching of multi-character initials (for example "Hans Chr. Schmidt" and "Alekos St. Papadopoulos") has been improved.

    • Matching of hyphenated forenames and surnames where either one of the hyphenated names is known and the other is known or unknown (for example "Jean-Léon Huens" and "Christiane Teschl-Hofmeister") has been improved.

    • Precision has been improved by reducing the score of non-CJKVT name matches that contain both uppercase and title case components (not including values that match as initials, titles, or surname prefixes), for example "AC Milan" or "ABBA Gold".

  • The PII names grammar has been expanded to include Russian (ru) and Ukrainian (ua) names.

  • The PII address grammar has been expanded to include Russian (ru) and Ukrainian (ua) addresses.

  • The PII postcode grammar has been expanded to include Russian (ru) and Ukrainian (ua) post codes.

  • Recall for US addresses (in the PHI and PII address grammars) has been improved, by adding direction and apartment data and by matching buildings as part of addresses, for example 'One Irvington Center' in the following address:

    One Irvington Center
    700 King Farm Boulevard
    Suite 125
    Rockville, MD 20850-5736
    USA

  • The PCI grammars now include combined_name.ecr, combined_name_cjkvt.ecr, and scripts/names_stoplist.lua, to allow you to find names from any of the supported countries.

  • In the GOV grammar entity_identifiers.ecr, matching of Legal Entity Identifier (LEI) numbers has been improved. There is no longer a restriction for the reserved numbers (fifth and sixth character) being 00, and there is no longer a penalty for having a prefix (first four characters) that is not in the predefined list (any four characters are allowed). With these changes, any 20-character code matches in the nocontext entities, but those with an incorrect checksum are discarded by post-processing.

  • In the GOV grammar us_dod_markings.ecr, the classification_authority_block/downgrade, classification_authority_block/declassify, and classification_authority_block/reason entities have been updated so that the normalized text is more consistent with other entities, by removing the leading label. For example, where the text/normalized text was Downgrade To: UNCLASSIFIED on 20200319 it is now UNCLASSIFIED on 20200319.

  • The PII and PHI medical_terms.ecr grammars have been updated to improve precision and recall.
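The LEI checksum mentioned in the entity_identifiers.ecr note above can be sketched in Python. This follows the ISO 7064 MOD 97-10 scheme that ISO 17442 specifies for LEIs; it is not the Eduction post-processing script, and the function names and the sample 18-character base are hypothetical.

```python
def _to_digits(code):
    """Map each character to digits: 0-9 unchanged, A=10 through Z=35."""
    return "".join(str(int(ch, 36)) for ch in code)


def lei_check_digits(base18):
    """Compute the two check digits for an 18-character LEI base."""
    remainder = int(_to_digits(base18.upper()) + "00") % 97
    return f"{98 - remainder:02d}"


def is_valid_lei(code):
    """Return True if code is 20 alphanumeric characters with a valid checksum."""
    code = code.upper()
    if len(code) != 20 or not (code.isascii() and code.isalnum()):
        return False
    if not code[18:].isdigit():
        return False
    return int(_to_digits(code)) % 97 == 1


# Made-up base for illustration; it is not a registered LEI.
base = "ABCD00EXAMPLE12345"
candidate = base + lei_check_digits(base)
```

A candidate built this way passes is_valid_lei, and changing either check digit makes it fail, which corresponds to the case that post-processing discards.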

Enhancements to Eduction Server

Eduction Server is an ACI server. For details of changes that affect all ACI servers, see ACI Server Framework.

Resolved Issues

  • The Malta EHIC format could also match valid codes for other countries, because it is very broad. The scoring of the /context/mt entity has been reduced for matches with a non-country-specific landmark (such as "EHIC").

  • In the PII names CJKVT grammar, Eduction sometimes matched characters after a title as part of the title, resulting in incorrect name matches.

  • When a component name was changed (for example SURNAME changed to FORENAME), Eduction did not respect all stoplist values and exceptions.

  • Entities for Vietnam (vn) were available in the banking.ecr grammar, rather than banking_cjkvt.ecr.

  • In the C SDK, calling the EdkFillMatches or EdkFillMatchesTimed functions could result in a memory leak. These functions were also called indirectly by the deprecated EDKProcessableMatchesCollection class in the Java Eduction SDK.

  • The example EDK C code eduction_from_config.c could leak memory.

  • If a synonym was used in a case-insensitive entity, Eduction could produce an incorrect headword when matching the alternative case.

  • In the PII and PHI grammars, after post-processing, matched text was not returned correctly for names that contained a stoplisted component.

  • In the PII and PHI grammars, stoplisted name components were removed from valid name components when Eduction updated the matched text, where the stoplist component was a substring of the valid component (for example, And Andrew Adams matched as rew Adams).

  • When the configured ResourceFile path was absolute but also contained relative elements (for example, /my/path/to/../../grammar.xml), then inclusions in that grammar failed because Eduction did not correctly resolve the parent path.

  • Eduction could add erroneous extra characters to the output string when it matched a synonym that was longer than the headword.

  • The Eduction SDK C API documentation package was incomplete.

  • When a file contained multiple tables, if a potential header row inside a table delimiter contained a comma, Eduction treated the whole table as a comma-separated values (CSV) table, and could miss matches.

    NOTE: As part of this change, files with multiple tables can now use only TSV tables.

  • During post-processing for names, if a component (such as a forename) was removed because of stoplist rules, Eduction did not adjust the offsetlength correctly. Eduction now gives the correct offsetlength for the remainder of the name.
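The stoplist substring issue listed above (And Andrew Adams matching as rew Adams) is an instance of a general pitfall that the following Python sketch reproduces. It is an illustration only and not the Eduction post-processing code; the function names and the one-word stoplist are hypothetical.

```python
STOPLIST = ["And"]


def remove_naive(text, stoplist=STOPLIST):
    """Remove stoplisted words by substring replacement (buggy approach).

    Any longer word that contains a stoplisted word is damaged too.
    """
    for word in stoplist:
        text = text.replace(word, "")
    return " ".join(text.split())


def remove_by_token(text, stoplist=STOPLIST):
    """Remove only whole tokens that equal a stoplisted word."""
    stop = {word.casefold() for word in stoplist}
    return " ".join(t for t in text.split() if t.casefold() not in stop)


broken = remove_naive("And Andrew Adams")
fixed = remove_by_token("And Andrew Adams")
```

In this sketch, broken is rew Adams because the substring And is also stripped from Andrew, whereas fixed is Andrew Adams.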

Notes

  • Deprecated functions in the Eduction C SDK have been moved into edk_deprecated.h. If you still need to use these deprecated functions you must explicitly include this header.