A Method for constructing Korean spontaneous spoken language corpus based on an imitation of abbreviated and transformed particles


The respected Comrade Kim Jong Un said:

"The present is the era of science and technology, the era of the knowledge economy, and the national strength and the position and future of a country and nation are dependent on the level of development of science and technology."

In recent years, the targets of large-vocabulary continuous speech recognition (LVCSR) research have been extended to spontaneous speech such as telephone conversations, lectures, and meetings. Large corpora of conversational telephone speech (CTS), such as the Switchboard and Fisher corpora, have been collected, and a number of LVCSR techniques have been developed with them.

Spontaneous speech differs acoustically and linguistically from read speech or broadcast news, since the speakers are neither professional announcers nor narrators and they speak spontaneously. Acoustically, fast speaking rates, unclear articulation, and pronunciation variants are frequently observed. Linguistically, redundant expressions, ungrammatical sentences, and disfluencies such as fillers and repairs are common. An LVCSR system has to model these acoustic and linguistic phenomena in order to transcribe spontaneous speech accurately.

For some spontaneous speech recognition tasks, hundreds of hours of well-matched training data have been collected to build LVCSR systems that cover these phenomena. However, collecting corpora on this scale is usually impractical because of the cost of manual transcription. Many LVCSR systems therefore combine a corpus that represents domain-specific characteristics, such as lecture proceedings or newspapers, with a corpus that represents the characteristics of spontaneous speech, such as the Switchboard corpus. This approach still suffers from an intrinsic problem: the two kinds of characteristics cannot truly be harmonized. For example, mixture-based training cannot estimate N-gram entries that contain both topic words and fillers, because no such sequence occurs in either corpus. The resulting models also inevitably contain irrelevant N-gram entries, which may cause confusion in LVCSR and thus degrade performance.
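The gap left by mixture-based training can be illustrated with a minimal Python sketch. The two one-sentence corpora, the word lists, and the helper names are invented for illustration, and simple count merging stands in for whatever model combination a real system would use.

```python
from collections import Counter

def bigrams(sentences):
    """Count adjacent word pairs in whitespace-tokenized sentences."""
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split()
        counts.update(zip(tokens, tokens[1:]))
    return counts

# Toy corpora (invented): one domain-specific, one spontaneous.
domain_corpus = ["the acoustic model uses hidden markov states"]
spontaneous_corpus = ["uh i think the um weather is nice"]

# Mixture by merging counts (a simplification of model interpolation).
mixed = bigrams(domain_corpus) + bigrams(spontaneous_corpus)

topic_words = {"acoustic", "model", "hidden", "markov", "states"}
fillers = {"uh", "um"}

def is_cross(bigram):
    """True if the bigram pairs a topic word with a filler, in either order."""
    first, second = bigram
    return (first in topic_words and second in fillers) or (
        first in fillers and second in topic_words
    )

cross_entries = [bg for bg in mixed if is_cross(bg)]
print(cross_entries)  # prints [] because neither corpus contains such a pair
```

However the two sources are weighted, a bigram such as ("um", "acoustic") never receives an explicit entry, which is the harmonization problem described above.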

Some works have tried to address both topic adaptability and the spontaneous characteristics of the corpus: a number of Web pages covering a comprehensive range of topics were downloaded, spoken-like texts were selected from the downloaded Web data, and typical linguistic phenomena such as fillers and pauses were then added. However, these works focused only on imitating fillers and pauses, which are just one part of the vocabulary that distinguishes spoken language from written language.

To construct a Korean spontaneous spoken language model, the Laboratory of Speech Information Processing, Intelligence Science Institute, Faculty of Information Science, Kim Il Sung University, developed a method for automatically creating a spoken language corpus that imitates typical abbreviated and transformed particles, another part of the vocabulary that distinguishes spoken language from written language. We classified the particles that differ between written and spoken language according to their grammatical functions and pronunciation features, and then replaced the particles in written-style text with their abbreviated or transformed counterparts according to the correspondence between written and spoken particles, which yielded spoken-style text. In experiments, the proposed model achieved absolute reductions in word error rate (WER) of 0.46% and 0.51% compared with the baseline models based on the combined morpheme unit and the minimum morpheme unit, respectively.
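The replacement step might look roughly like the following Python sketch. It is not the authors' implementation: the article does not list the actual correspondence table, so the mappings below are hypothetical examples of common colloquial forms, and the input is assumed to be already segmented into (stem, particle) pairs by a morphological analyzer.

```python
HANGUL_BASE = 0xAC00      # code point of the first precomposed Hangul syllable
HANGUL_SYLLABLES = 11172  # number of precomposed Hangul syllables
FINAL_COUNT = 28          # final-consonant slots per syllable (slot 0 = none)

# Hypothetical written -> spoken correspondences (illustrative only).
TRANSFORMED = {"에게": "한테", "와": "하고", "과": "하고"}
# Abbreviated particles merge into the stem as a final consonant.
# Values are Unicode final-consonant indices: 4 = ㄴ, 8 = ㄹ.
ABBREVIATED = {"는": 4, "를": 8}

def ends_in_open_syllable(word):
    """True if the last character is a Hangul syllable with no final consonant."""
    offset = ord(word[-1]) - HANGUL_BASE
    return 0 <= offset < HANGUL_SYLLABLES and offset % FINAL_COUNT == 0

def to_spoken(stem, particle):
    """Return the spoken-style form of one (stem, particle) pair."""
    if particle is None:
        return stem
    if particle in ABBREVIATED and ends_in_open_syllable(stem):
        # Attach the particle as a final consonant of the last syllable.
        return stem[:-1] + chr(ord(stem[-1]) + ABBREVIATED[particle])
    # Use the transformed form if one is listed, else keep the particle.
    return stem + TRANSFORMED.get(particle, particle)

def convert(sentence):
    """Convert a list of (stem, particle) pairs into a spoken-style string."""
    return " ".join(to_spoken(stem, particle) for stem, particle in sentence)

written = [("나", "는"), ("친구", "에게"), ("책", "을"), ("주었다", None)]
print(convert(written))  # prints: 난 친구한테 책을 주었다
```

Here "나는" is abbreviated to "난" and "에게" is transformed to "한테", while "을" is left unchanged because the table has no entry for it. A real system would draw the table from the particle classification described above rather than from these examples.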

The results of this work were published under the title "A method for constructing Korean spontaneous spoken language corpus based on an imitation of abbreviated and transformed particles" (https://doi.org/10.1007/s10772-021-09937-6) in the international journal "International Journal of Speech Technology" (2021).