Abstract: To address the insufficient correlation between audio and visual streams and the limited depth of modal fusion in existing audio-visual joint speech separation algorithms, this paper proposes an improved end-to-end audio-visual joint speech separation model. The model takes a cross-modal attention module as its core fusion mechanism: speech features and input visual features are dynamically fused multiple times in the time domain, attention weights adaptively balance the contribution of each modality at every time frame, and 1-D dilated convolutions effectively enlarge the receptive field over time-domain information, strengthening the contextual information of the speech signal and the temporal correlation between the audio and visual streams, thereby improving speech separation performance. The dataset used in this paper is a two-speaker audio-visual dataset generated from LRS2. Experimental results show that the model reaches 13.6 dB and 13.9 dB on Scale-Invariant Signal-to-Noise Ratio improvement (SI-SNRi) and Signal-to-Distortion Ratio improvement (SDRi), respectively, and a Perceptual Evaluation of Speech Quality (PESQ) score of 3.59. Compared with audio-only speech separation models and conventional audio-visual fusion speech separation models, the proposed model achieves a clear performance improvement.
Keywords: end-to-end speech separation; audio-visual fusion; cross-modal attention mechanism
CLC number:
Document code:
End-to-End Speech Separation Algorithm Based on an Audio-Visual Multimodal Attention Fusion Mechanism
|
Huang Haoran, Wang Xianghui, Chen Xiaoyi
|
Shaanxi University of Science and Technology
|
Abstract: To address the problems of insufficient correlation between audio and visual streams and limited depth of modal fusion in existing audio-visual joint speech separation algorithms, this paper proposes an improved end-to-end audio-visual joint speech separation model. The model adopts a cross-modal attention module as its core fusion mechanism and realizes multiple dynamic fusions of speech features and input visual features in the time domain. The contribution of each modality at every time frame is adaptively balanced through attention weights. In addition, 1-D dilated convolution is used to effectively expand the receptive field over time-domain information, strengthening the contextual information of speech signals and the temporal correlation between audio and visual streams, thereby significantly improving speech separation performance. The dataset used in this paper is a two-speaker audio-visual dataset generated from LRS2. Experimental results show that the model achieves 13.6 dB and 13.9 dB in terms of Scale-Invariant Signal-to-Noise Ratio improvement (SI-SNRi) and Signal-to-Distortion Ratio improvement (SDRi), respectively, and the Perceptual Evaluation of Speech Quality (PESQ) score reaches 3.59. Compared with audio-only speech separation models and conventional audio-visual fusion speech separation models, the proposed model achieves a clear performance improvement.
Keywords: end-to-end speech separation; audio-visual fusion; cross-modal attention mechanism
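The two core operations named in the abstract can be illustrated with a minimal NumPy sketch. This is not the paper's implementation: function names, shapes, and the residual-style fusion are illustrative assumptions. It shows (1) cross-modal attention in which audio-frame queries attend over visual features so the weights adaptively balance the two modalities per time frame, and (2) a depthwise 1-D dilated convolution whose dilation factor widens the temporal receptive field without adding parameters.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_modal_attention(audio, visual):
    """Illustrative cross-modal fusion (not the paper's exact module).
    audio: (T, d) speech features; visual: (T, d) visual features.
    Queries come from audio, keys/values from visual; the softmax
    weights balance the modalities' contribution at each time frame."""
    d = audio.shape[-1]
    scores = audio @ visual.T / np.sqrt(d)   # (T, T) frame-to-frame affinity
    weights = softmax(scores, axis=-1)       # rows sum to 1
    fused = audio + weights @ visual         # residual-style fusion (assumed)
    return fused, weights

def dilated_conv1d(x, kernel, dilation):
    """Depthwise 1-D dilated convolution along time with 'same' padding.
    x: (T, d); kernel: odd-length 1-D weights shared across channels.
    Effective receptive field = dilation * (len(kernel) - 1) + 1."""
    T, _ = x.shape
    k = len(kernel)
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    out = np.zeros_like(x)
    for i, w in enumerate(kernel):
        out += w * xp[i * dilation : i * dilation + T]
    return out
```

Stacking such dilated convolutions with geometrically increasing dilation (1, 2, 4, ...) is the usual way to cover long temporal context cheaply, which matches the abstract's motivation of strengthening contextual and cross-stream temporal information.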