Eliminating Stability Hallucinations in LLM-based TTS models via Attention Guidance

ShiMing Wang1,2, ZhiHao Du2, Yang Xiang2, TianYu Zhao2, Han Zhao2, Qian Chen2, XianGang Li2, HanJie Guo1, ZhenHua Ling1

1University of Science and Technology of China

2Speech Lab, Alibaba Group, China

Mail: wsmzzz@mail.ustc.edu.cn zhling@ustc.edu.cn


Appendix

Appendix A: Visualization of Four Types of Cross-Attention Heads

Visualization of the attention probabilities for the four types of attention heads in CosyVoice2. The first row shows the original attention probabilities, while the second row presents their logarithmic version, which makes the regions attended to by each head easier to observe. The red dashed line marks the boundary between text tokens and speech tokens in the input sequence, with text tokens below the boundary and speech tokens above it.
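The side-by-side linear/logarithmic visualization described above can be sketched as follows. This is a minimal illustrative script, not the authors' plotting code; the function name `plot_attention` and the file path are assumptions.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the figure can be saved without a display
import matplotlib.pyplot as plt

def plot_attention(attn, n_text, path="attn.png", eps=1e-9):
    """Plot raw and log-compressed attention side by side.

    attn:   (T, T) attention probability matrix over the [text ; speech] sequence
    n_text: number of text tokens, i.e. the row index of the text/speech boundary
    """
    fig, axes = plt.subplots(1, 2, figsize=(8, 4))
    for ax, mat, title in zip(axes,
                              [attn, np.log(attn + eps)],  # log compression reveals small weights
                              ["attention", "log attention"]):
        ax.imshow(mat, origin="lower", aspect="auto")
        ax.axhline(n_text - 0.5, color="red", ls="--")  # text/speech boundary
        ax.set_title(title)
    fig.savefig(path)
    plt.close(fig)
    return path
```

The log compression is the key step: attention mass away from the main peak is often orders of magnitude smaller, so it is invisible on a linear color scale but clearly visible after taking the logarithm.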

In CosyVoice2, the LLM's attention heads can be categorized into four distinct types based on the distribution of their attention weights: global, alignment, mixed, and speech heads. The first of these, global attention heads, are predominantly found in the lower Transformer layers, where their primary role is to perform a global analysis and modeling of relationships across all text and speech tokens. Alignment heads are primarily located in the middle Transformer layers. They focus on the alignment between input text tokens and output speech tokens by mapping a continuous and monotonic path. This function is similar to the cross-attention mechanism in encoder-decoder architectures. Next, speech attention heads are concentrated near the output layers, where they model local details around the currently generated token. Finally, mixed attention heads are a hybrid of the alignment and speech types. They can also map the alignment between text and speech tokens, but these alignments are highly unstable.
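The distinction between head types rests on simple properties of the attention distribution: global heads spread mass broadly (high entropy), while alignment heads concentrate mass along a monotonic text-speech path. A minimal sketch of two such diagnostic statistics is given below; the function name and any thresholds one would apply to them are illustrative assumptions, not the paper's procedure.

```python
import numpy as np

def head_statistics(attn, eps=1e-9):
    """Compute two diagnostics for one attention head.

    attn: (T_speech, T_text) attention from each speech position over the text tokens.

    Returns:
      entropy:   mean per-row entropy; high for global heads, low for focused heads
      monotonic: fraction of steps where the attended text position does not move
                 backwards; close to 1.0 for a well-behaved alignment head
    """
    entropy = float(-(attn * np.log(attn + eps)).sum(-1).mean())
    peak = attn.argmax(-1)                       # most-attended text index per speech step
    monotonic = float(np.mean(np.diff(peak) >= 0))
    return entropy, monotonic
```

For a sharp diagonal alignment the entropy is near zero and the monotonicity near 1.0, whereas a uniform (global) head has entropy close to log(T_text).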

Visualization of Attention Scores in the Transformer Layer-1 of the LLM
Visualization of Attention Scores in the Transformer Layer-8 of the LLM
Visualization of Attention Scores in the Transformer Layer-24 of the LLM

Appendix B: The effectiveness of LOAS

We manually designate heads 1 to 7 in Transformer Layers 7 and 8 as alignment heads. The two figures below visualize the attention scores of Transformer Layer 8. As can be seen, applying LOAS significantly improves the alignment between text and speech tokens in the alignment heads, making it more continuous and focused. However, designating these alignment heads manually can introduce functional redundancy. Consequently, some heads fail to learn a proper alignment, which manifests as straight lines in the visualization, indicating alignment failure.
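The exact form of the LOAS objective is not reproduced in this appendix. As an illustrative stand-in for a loss that pushes designated heads toward a continuous, focused text-speech alignment, the sketch below uses a guided-attention-style diagonal penalty (in the spirit of Tachibana et al.); the function name and the width parameter `g` are assumptions.

```python
import numpy as np

def diagonal_guidance_loss(attn, g=0.2):
    """Penalize attention mass far from the text-speech diagonal.

    attn: (T_speech, T_text) attention probabilities of one designated head.
    g:    width of the diagonal band; smaller g enforces a tighter alignment.
    """
    t_s, t_t = attn.shape
    s = np.arange(t_s)[:, None] / max(t_s - 1, 1)   # normalized speech positions
    t = np.arange(t_t)[None, :] / max(t_t - 1, 1)   # normalized text positions
    weight = 1.0 - np.exp(-((t - s) ** 2) / (2 * g * g))  # 0 on the diagonal, ~1 far away
    return float((attn * weight).mean())
```

A head whose attention already follows the diagonal incurs near-zero loss, while off-diagonal mass is penalized, so adding such a term to the training objective for the designated heads encourages the continuous monotonic paths seen in the "with LOAS" figure.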

Visualization of Attention Scores in the Transformer Layer-8 of the LLM without LOAS
Visualization of Attention Scores in the Transformer Layer-8 of the LLM with LOAS