Abstract: To address the insufficient correlation between audio and visual streams and the limited depth of modal fusion in existing audio-visual joint speech separation algorithms, this paper proposes an improved end-to-end audio-visual joint speech separation model. The model takes a cross-modal attention module as its core fusion mechanism: speech features and input visual features are dynamically fused multiple times in the time domain, attention weights adaptively balance the contribution of each modality at every time frame, and 1-D dilated convolutions effectively enlarge the receptive field over time-domain information, strengthening the contextual information of the speech signal and the temporal correlation between the audio and visual streams, thereby improving speech separation performance. The dataset used in this paper is a two-speaker audio-visual dataset generated from LRS2. Experimental results show that the model reaches 13.6 dB and 13.9 dB on Scale-Invariant Signal-to-Noise Ratio improvement (SI-SNRi) and Signal-to-Distortion Ratio improvement (SDRi), respectively, and a Perceptual Evaluation of Speech Quality (PESQ) score of 3.59. Compared with audio-only speech separation models and conventional audio-visual fusion speech separation models, the proposed model achieves a clear performance improvement.
Keywords: end-to-end speech separation; audio-visual fusion; cross-modal attention mechanism
CLC number:
Document code:
End-to-End Speech Separation Algorithm Based on an Audio-Visual Multimodal Attention Fusion Mechanism
|
Huang Haoran, Wang Xianghui, Chen Xiaoyi
|
Shaanxi University of Science and Technology
|
Abstract: To address the problems of insufficient correlation between audio and visual streams and limited depth of modal fusion in existing audio-visual joint speech separation algorithms, this paper proposes an improved end-to-end audio-visual joint speech separation model. The model adopts a cross-modal attention module as its core fusion mechanism and realizes multiple dynamic fusions of speech features and input visual features in the time domain. The contribution of each modality at every time frame is adaptively balanced through attention weights. In addition, 1-D dilated convolution is used to effectively expand the receptive field over time-domain information, strengthening the contextual information of speech signals and the temporal correlation between audio and visual streams, thereby significantly improving speech separation performance. The dataset used in this paper is a two-speaker audio-visual dataset generated from LRS2. Experimental results show that the model achieves 13.6 dB and 13.9 dB in terms of Scale-Invariant Signal-to-Noise Ratio improvement (SI-SNRi) and Signal-to-Distortion Ratio improvement (SDRi), respectively, and the Perceptual Evaluation of Speech Quality (PESQ) score reaches 3.59. Compared with audio-only speech separation models and conventional audio-visual fusion speech separation models, the proposed model achieves a clear performance improvement.
Keywords: end-to-end speech separation; audio-visual fusion; cross-modal attention mechanism
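The two core operations named in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: function names, shapes, and the residual-style fusion are illustrative assumptions. It shows (1) cross-modal attention in which audio-frame queries attend over visual features so the weights adaptively balance the two modalities per time frame, and (2) a depthwise 1-D dilated convolution whose dilation factor widens the temporal receptive field without adding parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(audio, visual):
    """Illustrative cross-modal fusion (not the paper's exact module).
    audio: (T, d) speech features; visual: (T, d) visual features.
    Queries come from audio, keys/values from visual; the softmax
    weights balance the modalities' contribution at each time frame."""
    d = audio.shape[-1]
    scores = audio @ visual.T / np.sqrt(d)   # (T, T) frame-to-frame affinity
    weights = softmax(scores, axis=-1)       # rows sum to 1
    fused = audio + weights @ visual         # residual-style fusion (assumed)
    return fused, weights

def dilated_conv1d(x, kernel, dilation):
    """Depthwise 1-D dilated convolution along time with 'same' padding.
    x: (T, d); kernel: odd-length 1-D weights shared across channels.
    Effective receptive field = dilation * (len(kernel) - 1) + 1."""
    T, _ = x.shape
    k = len(kernel)
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for i, w in enumerate(kernel):
        out += w * xp[i * dilation : i * dilation + T]
    return out
```

Stacking such dilated convolutions with geometrically increasing dilation (1, 2, 4, ...) is the usual way to cover long temporal context cheaply, which matches the abstract's motivation of strengthening contextual and cross-stream temporal information.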