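| Abstract: Intelligent voice communication terminals have limited resources and cloud deployment incurs high latency, so existing emotion recognition models struggle to balance accuracy and efficiency. Building on the lightweight PS-AC1D-FIF architecture, two models for two input modalities are designed, a waveform model (AudioResNet) and a spectrogram model (SpectrogramResNet), and both are compressed by using four convolutional layers, batch normalization, and global average pooling. Experiments on the CASIA dataset show that AudioResNet reaches 73.33% accuracy with 0.13M parameters and an inference time of 3.35 ms, while SpectrogramResNet reaches 71.67% accuracy with 0.39M parameters and an inference time of 1.72 ms. Compared with ResNet18, AudioResNet has 86 times fewer parameters and runs inference 7 times faster; compared with PS-AC1D-FIF, the parameter count is reduced by 75%. Both models meet the real-time requirements of terminals, which validates the effectiveness of the lightweight architecture for Chinese speech. |
| Keywords: speech emotion recognition; lightweight model; intelligent voice communication; dual-modal input; terminal deployment |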
|
CLC number: TP391
Document code:
|
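| Funding: General Program of the National Natural Science Foundation of China (62471301); Zhengzhou Basic Research and Applied Basic Research Project (ZZSZX202438); Henan Province Brand Major Construction Project for Private Higher Education Institutions (ZLG201903). |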
|
| Lightweight Speech Emotion Recognition Method for Intelligent Voice Communication |
|
Han Fang, Zhang Xu, Xiao Da, Gu Wanting
|
Huanghe University of Science and Technology
|
| Abstract: Intelligent voice communication terminals are resource-constrained and cloud deployment incurs high latency, so existing emotion recognition models struggle to balance accuracy and efficiency. Based on the lightweight PS-AC1D-FIF architecture, a dual-modal framework consisting of a waveform-based model (AudioResNet) and a spectrogram-based model (SpectrogramResNet) is proposed. Both models are compressed by using four convolutional layers, batch normalization, and global average pooling. Experimental results on the CASIA Chinese speech emotion dataset show that AudioResNet achieves an accuracy of 73.33% with 0.13M parameters and an inference time of 3.35 ms, while SpectrogramResNet achieves an accuracy of 71.67% with 0.39M parameters and an inference time of 1.72 ms. Compared with ResNet18, AudioResNet has 86 times fewer parameters and runs inference 7 times faster. Compared with PS-AC1D-FIF, the parameter count is reduced by 75%. Both models meet the real-time requirements of terminals, which validates the effectiveness of the proposed lightweight architecture in Chinese speech scenarios. |
| Keywords: speech emotion recognition; lightweight model; intelligent voice communication; dual-modal input; terminal deployment |
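
The compression recipe named in the abstract (four convolutional layers, batch normalization, global average pooling) can be illustrated with a rough parameter count. The sketch below is not the authors' configuration: the channel widths, kernel size, frame count, and six-class output are all hypothetical values chosen for illustration, since the abstract does not list them. It shows only why a stack of this shape stays on the order of 0.1M parameters and why global average pooling keeps the classifier head small.

```python
# Rough parameter count for a 4-layer Conv1d + BatchNorm + GAP classifier.
# All sizes below are ASSUMED for illustration; they are not taken from the paper.

def conv1d_params(c_in, c_out, kernel, bias=False):
    """Weights of a 1D convolution: one kernel per (input, output) channel pair."""
    return c_in * c_out * kernel + (c_out if bias else 0)

def batchnorm_params(channels):
    """Learnable scale and shift, one pair per channel."""
    return 2 * channels

def linear_params(n_in, n_out):
    """Weight matrix plus bias vector of a fully connected layer."""
    return n_in * n_out + n_out

def count_params(channels, kernel, num_classes, in_channels=1):
    """Total learnable parameters of conv/BN blocks followed by GAP and a linear head."""
    total = 0
    c_in = in_channels
    for c_out in channels:
        total += conv1d_params(c_in, c_out, kernel)  # bias omitted, BN supplies the shift
        total += batchnorm_params(c_out)
        c_in = c_out
    # Global average pooling has no parameters and collapses the time axis,
    # so the classifier sees only c_in features regardless of input length.
    total += linear_params(c_in, num_classes)
    return total

ASSUMED_CHANNELS = (16, 32, 64, 128)  # hypothetical widths of the four layers
ASSUMED_KERNEL = 9                    # hypothetical kernel size
ASSUMED_CLASSES = 6                   # hypothetical number of emotion classes
ASSUMED_FRAMES = 250                  # hypothetical time steps left after the last layer

total = count_params(ASSUMED_CHANNELS, ASSUMED_KERNEL, ASSUMED_CLASSES)
gap_head = linear_params(ASSUMED_CHANNELS[-1], ASSUMED_CLASSES)
flat_head = linear_params(ASSUMED_CHANNELS[-1] * ASSUMED_FRAMES, ASSUMED_CLASSES)

print(f"total with GAP head: {total:,} ({total / 1e6:.2f}M)")
print(f"classifier head with GAP: {gap_head:,}")
print(f"classifier head with flatten instead of GAP: {flat_head:,}")
```

Under these assumed sizes, the pooled head is a few hundred parameters, whereas flattening the final feature map would make the head alone larger than the whole convolutional stack. The real AudioResNet and SpectrogramResNet layer sizes would have to come from the body of the paper.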