A study on Mongolian text-to-speech system based on deep neural network
https://tokushima-u.repo.nii.ac.jp/records/2009963
Name / File | License | Action
---|---|---
k3634_abstract.pdf (85 KB) | |
k3634_review.pdf (46.7 KB) | |
k3634_fulltext.pdf (2.36 MB) | |
Item type | Documents (1)
---|---
Publication date | 2022-06-09
Access rights | open access
Resource type identifier | http://purl.org/coar/resource_type/c_db06
Resource type | doctoral thesis
Version type | NA
Version type resource | http://purl.org/coar/version/c_be7fb7dd8ff6fe43
Title (en) | A study on Mongolian text-to-speech system based on deep neural network
Alternative title (ja) | ディープニューラルネットワークに基づくモンゴル語のテキスト音声合成システムに関する研究 (Japanese rendering of the title above)
Author | Byambadorj, Zolzaya (ビヤンバドルジ, ゾルザヤ)

Abstract (en):
There are about 7,000 languages spoken in the world today, yet most natural language processing and speech processing research has been conducted on high-resource languages such as English, Japanese, and Mandarin. Preparing large amounts of training data is expensive and time-consuming, which creates a significant hurdle when developing systems for the world's many less widely spoken languages. Mongolian is one of these low-resource languages. We propose a text-to-speech (TTS, also called speech synthesis) system for the low-resource Mongolian language, and present two studies within this system, “text normalization” and “speech synthesis,” both conducted with limited training data.

A TTS system converts written text into machine-generated synthetic speech. One of the biggest challenges in developing a TTS system for a new language is converting transcripts into a true “spoken” form, the exact words the speaker said. This important preprocessing step, known as text normalization, transforms text into a standard form and is an essential part of a speech synthesis system. Text normalization later also became important for processing social media text, owing to the rapid expansion of user-generated content on social media sites; as social media use grows, TTS systems will undoubtedly need to generate speech from social media text. We were therefore particularly interested in social media text normalization. This thesis thus consists of two main parts, text normalization and speech synthesis, and for each we experimentally demonstrate how to improve the model's output using only a small amount of training data. Brief descriptions of the two parts follow.

Text normalization: The huge increase in social media use in recent years has produced new forms of social interaction and changed our daily lives. Social media websites are a rich source of text data, but processing and analyzing social media text is challenging because written messages are usually informal and “noisy.” Increasing contact between people from different cultures, a result of globalization, has also increased the use of the Latin alphabet, so a large amount of transliterated text now appears on social media. Although a standard exists for writing Mongolian in Latin letters, the public does not generally observe it when writing on social media, so the text contains many noisy, transliterated words. For example, many Mongolian speakers write Mongolian words on social media in the Latin alphabet rather than in the Cyrillic alphabet. These messages are informal and “noisy,” however, because each writer uses their own judgement about which Latin letters to substitute for particular Cyrillic letters: the Mongolian Cyrillic alphabet has 35 letters, versus 26 in the modern Latin alphabet (not counting letters with diacritical marks such as accents and umlauts). In most research on noisy text normalization, the source and target texts are in the same language and written in the same alphabet, and normalization is difficult even when the text is not transliterated. Our first goal in this thesis is therefore to convert noisy, transliterated text into formal writing in a different alphabet.
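The ambiguity described above can be made concrete with a toy example (ours, not the thesis's): Mongolian Cyrillic distinguishes о/ө and у/ү, but informal Latin writing typically collapses both pairs into o and u, so one Latin string can decode to several Cyrillic candidates. A minimal Python sketch, using a simplified, hypothetical subset of the letter correspondences:

```python
# Toy illustration of the Latin -> Cyrillic ambiguity (not from the thesis).
# CANDIDATES is a simplified, hypothetical subset of the real correspondences.
from itertools import product

CANDIDATES = {
    "s": ["с"], "a": ["а"], "i": ["и", "й"], "n": ["н"],
    "b": ["б"], "u": ["у", "ү"], "o": ["о", "ө"],
}

def expand(latin_word: str) -> list[str]:
    """Enumerate every Cyrillic spelling a noisy Latin word could encode."""
    options = [CANDIDATES.get(ch, [ch]) for ch in latin_word]
    return ["".join(combo) for combo in product(*options)]

print(expand("sain"))  # ['саин', 'сайн'] -- only 'сайн' ("good") is a real word
print(expand("buu"))   # ['буу', 'буү', 'бүу', 'бүү'] -- four candidates from one input
```

A normalization model must select the right candidate from such sets using context, which is what the seq2seq models described next learn to do.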
This cross-alphabet setting makes the normalization task even more challenging. We propose a variety of character-level sequence-to-sequence (seq2seq) models for normalizing noisy, transliterated text written in Latin script into Mongolian Cyrillic script, for scenarios in which only a limited amount of training data is available. When training data is limited and the conventions for writing noisy, transliterated text are unconstrained, normalizing out-of-vocabulary (OOV) words becomes particularly difficult. We therefore applied performance-enhancement methods, including various beam search strategies, N-gram-based context adoption, edit distance-based correction, and dictionary-based checking, in novel ways to two basic seq2seq models. We experimentally evaluated these two basic models as well as fourteen enhanced seq2seq models, and compared their noisy-text normalization performance with that of a transliteration model and a conventional statistical machine translation (SMT) model. The proposed models improved the robustness of the basic seq2seq models in normalizing OOV words, and most of them achieved higher normalization performance than the conventional method.
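The abstract names edit distance-based correction and dictionary-based checking among the enhancements but does not spell out their implementation. A generic sketch of the underlying idea, with a hypothetical vocabulary and model output (our illustration, not the thesis code):

```python
# Sketch of edit-distance correction with dictionary checking, in the spirit
# of the enhancements described above (not the thesis implementation).

def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def correct(hypothesis: str, vocabulary: set[str], max_dist: int = 2) -> str:
    """Keep in-vocabulary outputs; otherwise snap to the nearest dictionary
    word within max_dist edits, falling back to the raw hypothesis."""
    if hypothesis in vocabulary:
        return hypothesis
    best = min(vocabulary, key=lambda w: edit_distance(hypothesis, w))
    return best if edit_distance(hypothesis, best) <= max_dist else hypothesis

# Hypothetical vocabulary and seq2seq model output:
vocab = {"сайн", "байна", "монгол"}
print(correct("саин", vocab))  # -> "сайн" (one substitution away)
```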
Speech synthesis: Deep learning techniques are currently being applied in automated TTS systems, yielding significant improvements in performance. These methods require large amounts of paired text-speech data for model training, however, and collecting this data is costly. Tacotron 2, the state-of-the-art end-to-end speech synthesis system we used, requires more than 10 hours of training data to produce good synthesized speech. Our second goal is therefore to build a single-speaker TTS system, containing both a spectrogram prediction network and a neural vocoder, for the target Mongolian language, using only 30 minutes of paired Mongolian text-speech data for training. We evaluate three methods for training the spectrogram prediction models of our TTS system, which produce mel-spectrograms from the input phoneme sequence: (1) cross-lingual transfer learning, (2) data augmentation, and (3) a combination of the two. For cross-lingual transfer learning we used two high-resource-language datasets, English (24 hours) and Japanese (10 hours). In all three methods we also used 30 minutes of target-language data for training, as well as for generating the augmented data used in methods (2) and (3). We found that using both cross-lingual transfer learning and augmented data during training produced the most natural synthesized speech in the target language. We also compared single-speaker and multi-speaker training methods, using sequential and simultaneous training respectively, and found the multi-speaker models more effective for constructing a single-speaker, low-resource TTS model. In addition, we trained two Parallel WaveGAN (PWG) neural vocoders: one on 13 hours of our augmented data plus 30 minutes of target-language data, and one on the entire 12 hours of the original target-language dataset. A subjective AB preference test indicated that the neural vocoder trained with augmented data achieved almost the same perceived speech quality as the vocoder trained on the entire target-language dataset. Our proposed TTS system, consisting of a spectrogram prediction network and a PWG neural vocoder, achieved reasonable performance using only 30 minutes of target-language training data. We also found that with 3 hours of target-language data, used both for training the model and for generating augmented data, our proposed TTS model achieved performance very similar to that of the baseline model, which was trained with 12 hours of target-language data.
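The two-stage recipe above (pretrain the spectrogram predictor on a high-resource language, then continue training on 30 minutes of Mongolian) is the standard transfer learning pattern. A minimal PyTorch sketch of that pattern follows; it is illustrative only, with a deliberately simplified stand-in model and random stand-in data (the real system used Tacotron 2 and real corpora, and every name below is hypothetical):

```python
# Generic cross-lingual transfer learning loop. Illustrative sketch only:
# SpectrogramPredictor is a toy stand-in for Tacotron 2, and toy_batches()
# stands in for real phoneme/mel data loaders.
import torch
import torch.nn as nn

class SpectrogramPredictor(nn.Module):
    """Toy phoneme-sequence -> mel-spectrogram model."""
    def __init__(self, n_phonemes=100, n_mels=80, hidden=256):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, hidden)
        self.rnn = nn.GRU(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_mels)

    def forward(self, phonemes):
        hidden_states, _ = self.rnn(self.embed(phonemes))
        return self.out(hidden_states)

def toy_batches(n_batches=4, bsz=8, seq=20, n_mels=80):
    """Random (phoneme ids, mel target) pairs standing in for a DataLoader."""
    return [(torch.randint(0, 100, (bsz, seq)), torch.randn(bsz, seq, n_mels))
            for _ in range(n_batches)]

def train(model, batches, epochs, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for phonemes, mels in batches:
            opt.zero_grad()
            loss_fn(model(phonemes), mels).backward()
            opt.step()

model = SpectrogramPredictor()

# Stage 1: pretrain on a high-resource language (e.g., the 24 h English set).
train(model, toy_batches(), epochs=3, lr=1e-3)
torch.save(model.state_dict(), "pretrained_highresource.pt")

# Stage 2: fine-tune the same weights on the small target-language set
# (30 minutes of Mongolian in the thesis), typically with a lower learning rate.
model.load_state_dict(torch.load("pretrained_highresource.pt"))
train(model, toy_batches(n_batches=1), epochs=3, lr=1e-4)
```

In the thesis the fine-tuned predictor is paired with a Parallel WaveGAN vocoder; that component is omitted from this sketch.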
Keywords (en; Scheme: Other) | Text normalization; text to speech; low resource language; noisy text; transliterated text; language model; seq2seq model; character conversion; speech synthesis; transfer learning; data augmentation
Date issued | 2022-03-23
Note | The abstract, review report, and full text are publicly available. Degree recipient's affiliation: Graduate School of Advanced Technology and Science (Systems Innovation Engineering), Tokushima University.
Language | eng
Degree grant number | 甲第3634号
Diploma number | 甲先第431号
Date of degree conferral | 2022-03-23
Degree name | Doctor of Engineering (博士(工学))
Degree grantor | Tokushima University (徳島大学)