Certain speech-to-text language packs, such as Hebrew (HBIL), are based on a vocabulary that has been broken down into its smallest parts to keep the vocabulary size manageable. When you use one of these language packs to perform speech-to-text on audio, the results can contain word fragments that must be joined back together in the final results. The postproc module can recombine these fragments into complete words. If adjacent words in the results file include hyphens as an indication of a word break, the module treats them as prefixes or suffixes and joins them to the stem of the word.
For example, the postproc module could receive the following word sequence (shown in CTM format):
1 | A | 0.000 | 0.351 | a | 0.513
1 | A | 0.351 | 0.194 | pre- | 0.325
1 | A | 0.545 | 0.419 | exist | 0.457
1 | A | 0.964 | 0.140 | -ing | 0.621
1 | A | 1.104 | 0.855 | condition | 0.369
The module would combine the prefix “pre-”, the stem “exist”, and the suffix “-ing” into a single word, as shown in the following example:
1 | A | 0.000 | 0.351 | a | 0.513
1 | A | 0.351 | 0.753 | preexisting | 0.457
1 | A | 1.104 | 0.855 | condition | 0.369
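The recombination shown above can be sketched in a few lines of Python. This is an illustration only, not the module's actual implementation: the `CtmWord` type and the `recombine` function are hypothetical names, and the column order (file, channel, start, duration, word, confidence) is assumed from the example rows. The sketch keeps the start time of the first fragment and sums the durations, which matches the example. It also reuses the confidence of the stem, which happens to agree with the example output but is an assumption about how the module scores the combined word.

```python
from typing import List, NamedTuple


class CtmWord(NamedTuple):
    """One CTM row; the field order is assumed from the example above."""
    file_id: str
    channel: str
    start: float
    duration: float
    word: str
    confidence: float


def recombine(words: List[CtmWord]) -> List[CtmWord]:
    """Join hyphen-marked fragments into single entries (hypothetical helper)."""
    result: List[CtmWord] = []
    i = 0
    while i < len(words):
        group = [words[i]]
        # Extend the group while a trailing or leading hyphen links neighbours.
        while i + 1 < len(words) and (
            group[-1].word.endswith("-") or words[i + 1].word.startswith("-")
        ):
            i += 1
            group.append(words[i])
        i += 1
        if len(group) == 1:
            result.append(group[0])  # nothing to join, keep the row unchanged
            continue
        stems = [
            w for w in group
            if not w.word.startswith("-") and not w.word.endswith("-")
        ]
        # Assumption: reuse the stem's confidence; otherwise take the lowest.
        confidence = stems[0].confidence if stems else min(w.confidence for w in group)
        result.append(CtmWord(
            group[0].file_id,
            group[0].channel,
            group[0].start,                           # start of the first fragment
            round(sum(w.duration for w in group), 3), # total duration of the group
            "".join(w.word.strip("-") for w in group),
            confidence,
        ))
    return result


sample = [
    CtmWord("1", "A", 0.000, 0.351, "a", 0.513),
    CtmWord("1", "A", 0.351, 0.194, "pre-", 0.325),
    CtmWord("1", "A", 0.545, 0.419, "exist", 0.457),
    CtmWord("1", "A", 0.964, 0.140, "-ing", 0.621),
    CtmWord("1", "A", 1.104, 0.855, "condition", 0.369),
]

for row in recombine(sample):
    print(" | ".join(str(field) for field in row))
```

Summing the durations works here because each fragment starts exactly where the previous one ends; input with gaps between fragments would need a different rule.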
Speech-to-text results can contain errors, so some word fragments might combine to form invalid words. To avoid producing invalid words, you can supply the postproc module with a list of all valid words for a language (this list file is provided in the language pack). The module then combines word fragments only if they form a word that is in the list. All other word fragments are left uncombined.
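The effect of the valid words list can be illustrated with a minimal Python sketch. The function name `join_if_valid` is hypothetical, and the valid words are passed as an in-memory set because the format of the list file is not described here. The sketch only tests whether a whole run of fragments forms a listed word. Whether the module also tries partial combinations is not covered by this illustration.

```python
from typing import List, Set


def join_if_valid(fragments: List[str], valid_words: Set[str]) -> List[str]:
    """Join a run of hyphen-marked fragments only if the result is a listed word.

    Hypothetical helper: returns a one-element list with the joined word when
    it is valid, and the original fragments unchanged otherwise.
    """
    candidate = "".join(fragment.strip("-") for fragment in fragments)
    if candidate in valid_words:
        return [candidate]
    return list(fragments)


valid = {"preexisting", "condition"}

# The run forms a listed word, so it is combined.
print(join_if_valid(["pre-", "exist", "-ing"], valid))

# A misrecognized run forms no listed word, so it is left as fragments.
print(join_if_valid(["pre-", "exit", "-ing"], valid))
```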
To specify valid words for a language
In the postproc module section of the tasks configuration file, set the RcmpValidList parameter to the name of the .wds file supplied in the language pack.
If you do not specify a valid words list, the module combines word fragments wherever indicated by hyphens, without attempting to validate the resulting words.