Visualization of attention probability for the four types of attention heads in CosyVoice2. The first row shows the original attention probability, while the second row presents the logarithmic version, which makes the regions attended to by each head easier to observe. The red dashed line marks the boundary between text tokens and speech tokens in the input sequence, with text tokens below the boundary and speech tokens above it.
In CosyVoice2, the LLM's attention heads can be categorized into four distinct types based on the distribution of their attention weights: global, alignment, mixed, and speech heads. Global attention heads are predominantly found in the lower Transformer layers, where their primary role is to perform a global analysis, modeling relationships across all text and speech tokens. Alignment heads are primarily located in the middle Transformer layers. They capture the alignment between input text tokens and output speech tokens by tracing a continuous, monotonic path, a function similar to the cross-attention mechanism in encoder-decoder architectures. Next, speech attention heads are concentrated near the output layers, where they model local details around the currently generated token. Finally, mixed attention heads are a hybrid of the alignment and speech types: they can also map the alignment between text and speech tokens, but these alignments are highly unstable.
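These categories can, at least roughly, be recovered from the attention maps themselves. The sketch below labels one head from its averaged attention map; the entropy/mass heuristics and all thresholds are our own illustrative choices, not CosyVoice2's.

```python
import numpy as np

def classify_head(attn, n_text):
    """Heuristically label one attention head.

    `attn` is an averaged attention map of shape (n_query, n_key),
    where the first `n_text` positions are text tokens and the rest
    are speech tokens. Category names follow the four types described
    above; the thresholds are illustrative, not from the paper.
    """
    speech_q = attn[n_text:]  # rows where the query is a speech token
    n_keys = attn.shape[1]
    eps = 1e-12
    # Mean per-query entropy, normalized to [0, 1]; diffuse attention
    # over the whole sequence suggests a global head.
    ent = -(speech_q * np.log(speech_q + eps)).sum(axis=1).mean() / np.log(n_keys)
    # Average attention mass that speech queries place on text keys.
    text_mass = speech_q[:, :n_text].sum(axis=1).mean()
    if ent > 0.8:
        return "global"       # near-uniform attention over all tokens
    if text_mass > 0.6:
        return "alignment"    # focused mostly on the text tokens
    if text_mass > 0.2:
        return "mixed"        # split between text and speech tokens
    return "speech"           # focused on nearby speech tokens
```

In practice one would average `attn` over many utterances before classifying, since a single sample's map can be noisy.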
We manually designate head-1 through head-7 in Transformer Layer-7 and Layer-8 as alignment heads. The two figures below visualize the attention scores of Transformer Layer-8. As can be seen, applying LOAS significantly improves the alignment between text and speech tokens in the alignment heads, making it more continuous and focused. However, designating alignment heads manually can introduce functional redundancy. Consequently, some heads fail to learn a proper alignment, which manifests as straight lines in the visualization.
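Such collapsed heads can also be detected automatically rather than by eye: tracing the most-attended text position for each generated speech token, a healthy alignment head sweeps monotonically across the text, while a failed head's "straight line" sticks to one position. A minimal sketch, where `alignment_diagnostics` and its two metrics are hypothetical helpers, not part of the paper:

```python
import numpy as np

def alignment_diagnostics(attn, n_text):
    """Summarize the alignment path of one candidate alignment head.

    `attn` has shape (n_query, n_key) with the first `n_text` positions
    being text tokens. For each speech-token query we take the text
    position with the highest attention, then report:
      - monotonic: fraction of steps where the path does not regress;
      - coverage:  fraction of text tokens visited by the path.
    A collapsed head still looks monotonic (the path is constant) but
    has very low coverage, so both metrics are needed.
    """
    # Most-attended text index for every speech-token query.
    path = attn[n_text:, :n_text].argmax(axis=1)
    steps = np.diff(path)
    monotonic = float((steps >= 0).mean())
    coverage = len(np.unique(path)) / n_text
    return {"monotonic": monotonic, "coverage": coverage}
```

A head whose coverage stays near `1 / n_text` attends to a single fixed text position, which is exactly the straight-line failure mode seen in the visualization.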