Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (1): 149-155. DOI: 10.3778/j.issn.1002-8331.2107-0053

• Pattern Recognition and Artificial Intelligence •

两级特征联合学习的情感说话人识别

刘金琳,李冬冬,王喆,蔡立志   

  1. 华东理工大学 信息科学与工程学院,上海 200237
  2. 苏州大学 江苏省计算机信息处理技术重点实验室,江苏 苏州 215006
  • Published: 2023-01-01 Online: 2023-01-01

Segment-Level Feature and Frame-Level Feature Joint Learning for Emotional Speaker Recognition

LIU Jinlin, LI Dongdong, WANG Zhe, CAI Lizhi   

  1. School of Information Science and Engineering, East China University of Science and Technology, Shanghai 200237, China
  2. Provincial Key Laboratory for Computer Information Processing Technology, Soochow University, Suzhou, Jiangsu 215006, China
  • Online:2023-01-01 Published:2023-01-01

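Abstract: To address the problem that speaker recognition performance is easily degraded by emotional factors, a joint learning method using segment-level and frame-level features is proposed. A long short-term memory (LSTM) network is trained on the speaker recognition task, and its sequential outputs are extracted as segment-level emotional speaker features, which retain the original information of the speech frame features while strengthening the expression of emotional information. A fully connected network then further learns the speaker information of each feature frame within the segment-level feature, strengthening the speaker representation ability of the frame-level features. Finally, the segment-level and frame-level features are concatenated to obtain the final speaker feature, enhancing its representational ability. Experiments on the Mandarin emotional speech corpus (MASC) verify the effectiveness of the proposed method and also examine how the number of speech frames contained in a segment-level feature and different emotional states affect emotional speaker recognition.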

Key words: emotional speaker recognition, long short-term memory network, deep neural network

Abstract: The performance of speaker recognition is easily degraded by emotional factors. This paper proposes a joint learning method that uses segment-level and frame-level features. To retain the original speaker information of each frame while fully expressing the emotional information, a long short-term memory (LSTM) network is used to extract the sequence output as a segment-level emotional speaker embedding. Each frame of the segment-level feature is then learned by a fully connected network to improve the representation ability of the frame-level feature. Finally, the segment-level and frame-level features are concatenated to form the final speaker embedding, which further improves the expressive ability of the feature. Experiments on the Mandarin emotional speech corpus (MASC) verify the effectiveness of the proposed method. The paper also examines the suitable number of frames in a segment-level feature and the effects of different emotional states on emotional speaker recognition.

Key words: emotional speaker recognition, long short-term memory, deep neural network
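
The pipeline described in the abstract (LSTM sequence output as the segment-level feature, a fully connected layer applied to each frame, then concatenation) can be illustrated with a small forward-pass sketch. This is only a shape-level illustration under assumptions that are not stated in the abstract: the dimensions `T`, `F`, `H`, `D`, the random untrained weights, the ReLU activation, and the choice to concatenate per frame are all hypothetical, and the helper names (`lstm_sequence`, `joint_embedding`, `init_params`) are invented for this example rather than taken from the paper.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_sequence(frames, W, U, b):
    """Run a single-layer LSTM over frames (T, F) and return every hidden state (T, H)."""
    T = frames.shape[0]
    H = U.shape[0]
    h = np.zeros(H)
    c = np.zeros(H)
    outputs = np.empty((T, H))
    for t in range(T):
        z = frames[t] @ W + h @ U + b      # pre-activations for all four gates, shape (4H,)
        i, f, g, o = np.split(z, 4)        # input gate, forget gate, candidate, output gate
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outputs[t] = h
    return outputs

def joint_embedding(frames, params):
    """Concatenate segment-level (LSTM) and frame-level (dense) features per frame."""
    # Segment-level feature: the full LSTM output sequence over the segment.
    segment = lstm_sequence(frames, params["W"], params["U"], params["b"])
    # Frame-level feature: one dense layer applied to each frame of the segment
    # (ReLU is an assumption, not taken from the paper).
    frame_level = np.maximum(0.0, segment @ params["W_fc"] + params["b_fc"])
    # Per-frame concatenation is an assumption about how the two levels are joined.
    return np.concatenate([segment, frame_level], axis=1)

def init_params(F, H, D, seed=0):
    """Random, untrained weights; a real system would learn these on speaker labels."""
    rng = np.random.default_rng(seed)
    return {
        "W": rng.normal(0.0, 0.1, (F, 4 * H)),
        "U": rng.normal(0.0, 0.1, (H, 4 * H)),
        "b": np.zeros(4 * H),
        "W_fc": rng.normal(0.0, 0.1, (H, D)),
        "b_fc": np.zeros(D),
    }

# Hypothetical sizes: 20 frames per segment, 13 acoustic coefficients per frame.
T, F, H, D = 20, 13, 32, 16
params = init_params(F, H, D)
frames = np.random.default_rng(1).normal(size=(T, F))
embedding = joint_embedding(frames, params)
print(embedding.shape)  # (20, 48)
```

Here `T` plays the role of the number of frames contained in one segment-level feature, the quantity whose effect the paper studies; changing it alters only the first dimension of the result.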