Item type: Documents (1)
Publish date: 2022-04-21
Access rights: open access
Resource type: journal article
Resource type identifier: http://purl.org/coar/resource_type/c_6501
Publisher's version DOI
  Identifier type: DOI
  Related identifier: https://doi.org/10.1186/s13636-021-00225-4
  Language: ja
  Related name: 10.1186/s13636-021-00225-4
Publication type: VoR
Publication type resource: http://purl.org/coar/version/c_970fb48d4fbd8a85
Title (en): Text-to-speech system for low-resource language using cross-lingual transfer learning and data augmentation
Authors:
  Byambadorj, Zolzaya
  Nishimura, Ryota
  Ayush, Altangerel
  Ohta, Kengo
  Kitaoka, Norihide
Abstract (en):
Deep learning techniques are currently being applied in automated text-to-speech (TTS) systems, resulting in significant improvements in performance. However, these methods require large amounts of text-speech paired data for model training, and collecting this data is costly. Therefore, in this paper, we propose a single-speaker TTS system containing both a spectrogram prediction network and a neural vocoder for the target language, using only 30 min of target language text-speech paired data for training. We evaluate three approaches for training the spectrogram prediction models of our TTS system, which produce mel-spectrograms from the input phoneme sequence: (1) cross-lingual transfer learning, (2) data augmentation, and (3) a combination of the previous two methods. In the cross-lingual transfer learning method, we used two high-resource language datasets, English (24 h) and Japanese (10 h). We also used 30 min of target language data for training in all three approaches, and for generating the augmented data used for training in methods 2 and 3. We found that using both cross-lingual transfer learning and augmented data during training resulted in the most natural synthesized target speech output. We also compare single-speaker and multi-speaker training methods, using sequential and simultaneous training, respectively. The multi-speaker models were found to be more effective for constructing a single-speaker, low-resource TTS model. In addition, we trained two Parallel WaveGAN (PWG) neural vocoders, one using 13 h of our augmented data with 30 min of target language data and one using the entire 12 h of the original target language dataset. Our subjective AB preference test indicated that the neural vocoder trained with augmented data achieved almost the same perceived speech quality as the vocoder trained with the entire target language dataset. Overall, we found that our proposed TTS system consisting of a spectrogram prediction network and a PWG neural vocoder was able to achieve reasonable performance using only 30 min of target language training data. We also found that by using 3 h of target language data, for training the model and for generating augmented data, our proposed TTS model was able to achieve performance very similar to that of the baseline model, which was trained with 12 h of target language data.
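As a rough illustration of the training recipe the abstract describes (pretrain a spectrogram prediction network on a high-resource language, then fine-tune the same weights on a small amount of target-language data, optionally mixed with augmented data), the following Python sketch shows the two-stage loop in PyTorch. The toy model, data shapes, and hyperparameters below are illustrative assumptions, not the paper's actual spectrogram prediction network or Parallel WaveGAN configuration.

# Illustrative sketch only (not the authors' code): cross-lingual
# transfer learning for a phoneme-to-mel spectrogram predictor.
# Stage 1 pretrains on a large high-resource corpus; stage 2
# fine-tunes the same weights on a small target-language set.
import torch
import torch.nn as nn

N_PHONEMES, N_MELS = 64, 80  # assumed vocabulary and mel-bin sizes

class ToySpectrogramPredictor(nn.Module):
    """Tiny stand-in for a spectrogram prediction network."""
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(N_PHONEMES, 128)
        self.rnn = nn.GRU(128, 128, batch_first=True)
        self.proj = nn.Linear(128, N_MELS)

    def forward(self, phonemes):           # (batch, time) int ids
        h, _ = self.rnn(self.embed(phonemes))
        return self.proj(h)                # (batch, time, N_MELS)

def train(model, batches, epochs, lr):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.L1Loss()                  # L1 on mel frames, common in TTS
    for _ in range(epochs):
        for phonemes, mels in batches:
            opt.zero_grad()
            loss = loss_fn(model(phonemes), mels)
            loss.backward()
            opt.step()

def random_batches(n):  # random tensors standing in for real corpora
    return [(torch.randint(0, N_PHONEMES, (8, 50)),
             torch.randn(8, 50, N_MELS)) for _ in range(n)]

model = ToySpectrogramPredictor()
# Stage 1: pretrain on a high-resource language (e.g., 24 h of English).
train(model, random_batches(100), epochs=1, lr=1e-3)
# Stage 2: fine-tune on ~30 min of target-language data, optionally
# combined with augmented data (methods 2 and 3 in the abstract).
train(model, random_batches(5), epochs=5, lr=1e-4)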
Keywords (en, subject scheme: Other):
  Speech synthesis
  Text to speech
  Transfer learning
  Data augmentation
  Low-resource language
Bibliographic information
  Journal (en): EURASIP Journal on Audio, Speech, and Music Processing
  Volume: 2021
  Page: 42
  Issue date: 2021-12-04
Source identifier (ISSN): 1687-4722
Publisher (en): BioMed Central
Publisher (en): Springer Nature
Rights (en): This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
EID (URI): 384158
Language: eng