NaturalSpeech: End-to-End Text to Speech Synthesis with Human-Level Quality

By: Xu Tan, Jiawei Chen, Haohe Liu, Jian Cong, Chen Zhang, Yanqing Liu, Xi Wang, Yichong Leng, Yuanhao Yi, Lei He, Frank Soong, Tao Qin, Sheng Zhao, Tie-Yan Liu

Text-to-speech (TTS) has made rapid progress in both academia and industry in recent years. Several questions naturally arise: whether a TTS system can achieve human-level quality, how to define and judge that quality, and how to achieve it. In this paper, we answer these questions by first defining human-level quality based on the statistical significance of subjective measures and introducing appropriate guidelines to judge it, and then de...

Audio and Speech Processing | October 27, 2023 4:10am

Feature Embeddings from Large-Scale Acoustic Bird Classifiers Enable Few-Shot Transfer Learning

By: Burooj Ghani, Tom Denton, Stefan Kahl, Holger Klinck

Automated bioacoustic analysis aids understanding and protection of both marine and terrestrial animals and their habitats across extensive spatiotemporal scales, and typically involves analyzing vast collections of acoustic data. With the advent of deep learning models, classification of important signals from these datasets has markedly improved. These models power critical data analyses for research and decision-making in biodiversity mo...

Audio and Speech Processing | July 15, 2023 5:16pm

Unsupervised Voice Activity Detection by Modeling Source and System Information using Zero Frequency Filtering

By: Eklavya Sarkar, RaviShankar Prasad, Mathew Magimai.-Doss

Voice activity detection (VAD) is an important pre-processing step for speech technology applications. The task consists of deriving segment boundaries of audio signals which contain voicing information. In recent years, it has been shown that voice source and vocal tract system information can be extracted using zero-frequency filtering (ZFF) without making any explicit model assumptions about the speech signal. This paper investigates the p...

Audio and Speech Processing | May 24, 2023 8:58am