HPE recommends iterative alignment if the alignment quality is poor or if large sections of audio have not been aligned. This situation can arise when aligning very long audio. In iterative alignment, the audio is aligned in two or more steps:
1. Use the TranscriptAlign task with the MatchType configuration parameter set to words to align the audio. Retrieve the alignment output in .ctm file format.
2. Convert the .ctm file into a transcript for alignment in prons mode. Converting the .ctm file involves normalization and ensuring that there is only one word on each line, for example:
Article
one
All
human
beings
are
born
free
You can optionally follow each word with a pair of numbers that specify, in seconds, the earliest start time and the latest end time at which the word can appear in the aligned output, for example:
Article 0.000 1.000
one 0.000 1.000
All 0.000 1.000
human 0.500 1.500
beings 0.500 1.500
are 1.000 2.000
born 1.000 2.000
free 1.000 2.000
This example indicates that the word Article must appear between 0.000 and 1.000 seconds in the aligned output, human must appear between 0.500 and 1.500 seconds, and so on.
IDOL Speech Server cannot perform this step automatically. HPE recommends that you subtract a small amount of time from the word start times and add a small amount of time to the word end times generated by the initial alignment. Widening each word's window in this way lets the second alignment stage make small adjustments to the word start and end points.
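You must perform the conversion yourself, and the following Python code is one possible sketch of it. It is not an HPE-supplied tool and it rests on several assumptions. First, it assumes the .ctm file uses the common NIST layout `<file> <channel> <start> <duration> <word> [<confidence>]`. Second, it treats stripping surrounding punctuation as adequate normalization; the `normalize` function is a hypothetical placeholder. Third, it uses a 0.5-second default margin. The documentation does not say how much time to subtract or add, so `margin` is yours to tune. All function and type names are illustrative.

```python
import string
from typing import List, NamedTuple


class CtmWord(NamedTuple):
    word: str
    start: float     # seconds from the start of the audio
    duration: float  # seconds


def normalize(token: str) -> str:
    """Hypothetical normalization: trim surrounding punctuation only."""
    return token.strip(string.punctuation)


def read_ctm(ctm_text: str) -> List[CtmWord]:
    """Read words from CTM text, assuming the NIST field layout
    <file> <channel> <start> <duration> <word> [<confidence>]."""
    records = []
    for line in ctm_text.splitlines():
        fields = line.split()
        # Skip blank lines, ';;' comment lines, and lines too short to hold a word.
        if len(fields) < 5 or fields[0].startswith(";;"):
            continue
        word = normalize(fields[4])
        # Drop tokens with no letters or digits left after normalization.
        if not any(ch.isalnum() for ch in word):
            continue
        records.append(CtmWord(word, float(fields[2]), float(fields[3])))
    return records


def to_plain_transcript(records: List[CtmWord]) -> str:
    """Write one word per line with no timing constraints."""
    return "".join(f"{rec.word}\n" for rec in records)


def to_constrained_transcript(records: List[CtmWord], margin: float = 0.5) -> str:
    """Write one 'word earliest_start latest_end' line per word.

    Each first-pass window is widened by `margin` seconds on both sides
    (hypothetical default), and the start time is clamped at 0.0.
    """
    if margin < 0:
        raise ValueError("margin must be non-negative")
    lines = []
    for rec in records:
        earliest = max(0.0, rec.start - margin)
        latest = rec.start + rec.duration + margin
        lines.append(f"{rec.word} {earliest:.3f} {latest:.3f}\n")
    return "".join(lines)


if __name__ == "__main__":
    sample = (
        "talk 1 0.12 0.40 Article 0.98\n"
        "talk 1 0.55 0.20 one, 0.95\n"
    )
    records = read_ctm(sample)
    print(to_plain_transcript(records), end="")
    print(to_constrained_transcript(records, margin=0.25), end="")
```

With the sample input, `to_plain_transcript` produces the one-word-per-line form (`Article`, `one`). `to_constrained_transcript(records, margin=0.25)` produces `Article 0.000 0.770` and `one 0.300 1.000`. Clamping the start at zero keeps a widened window from beginning before the audio does. You might similarly clamp end times to the length of the audio.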