Segment-Level Feature and Frame-Level Feature Joint Learning for Emotional Speaker Recognition

doi:10.3778/j.issn.1002-8331.2107-0053

Abstract

Abstract: Speaker recognition performance degrades easily under emotional variation. This paper proposes a joint learning method that combines segment-level and frame-level features. A long short-term memory (LSTM) network is trained on the speaker recognition task, and its sequence output is taken as the segment-level emotional speaker embedding, which retains the original speaker information of each frame while strengthening the expression of emotional information. A fully connected network then learns from each frame of the segment-level feature to improve the representation ability of the frame-level feature. Finally, the segment-level and frame-level features are concatenated into the final speaker embedding, further improving its representational power. Experiments on the Mandarin emotional speech corpus (MASC) verify the effectiveness of the proposed method. The paper also examines how many frames a segment-level feature should contain and how different emotional states affect emotional speaker recognition.
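As a rough illustration of the data flow described in the abstract, the Python sketch below runs a single-layer LSTM over a short segment, applies a fully connected layer to each per-frame output, and concatenates the two. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names, the dimensions, the random untrained weights, the omission of bias terms, the ReLU activation, and the choice to concatenate per frame along the feature axis.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(weights, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]


def rand_matrix(rows, cols, rng):
    """Small random weights; a stand-in for trained parameters."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def init_params(input_dim, hidden_dim, fc_dim, seed=0):
    """Hypothetical parameter set: four LSTM gate matrices plus one FC matrix."""
    rng = random.Random(seed)
    gates = {name: rand_matrix(hidden_dim, input_dim + hidden_dim, rng)
             for name in ("input", "forget", "output", "cell")}
    return {"gates": gates, "fc": rand_matrix(fc_dim, hidden_dim, rng)}


def lstm_sequence(frames, gates, hidden_dim):
    """Return the LSTM hidden state at every frame (the sequence output)."""
    h = [0.0] * hidden_dim
    c = [0.0] * hidden_dim
    outputs = []
    for x in frames:
        z = list(x) + h  # concatenated [x_t, h_{t-1}]; biases omitted
        i = [sigmoid(a) for a in matvec(gates["input"], z)]
        f = [sigmoid(a) for a in matvec(gates["forget"], z)]
        o = [sigmoid(a) for a in matvec(gates["output"], z)]
        g = [math.tanh(a) for a in matvec(gates["cell"], z)]
        c = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h = [ok * math.tanh(ck) for ok, ck in zip(o, c)]
        outputs.append(h)
    return outputs


def joint_embedding(frames, params, hidden_dim):
    """Concatenate segment-level and frame-level features for each frame."""
    # Segment-level feature: the full LSTM sequence output.
    segment = lstm_sequence(frames, params["gates"], hidden_dim)
    # Frame-level feature: FC layer (ReLU assumed) applied to each frame.
    frame_level = [[max(0.0, a) for a in matvec(params["fc"], h)]
                   for h in segment]
    # Assumed concatenation: along the feature axis, frame by frame.
    return [s + f for s, f in zip(segment, frame_level)]


# Toy segment: 5 frames of 4-dimensional acoustic features.
rng = random.Random(1)
frames = [[rng.gauss(0, 1) for _ in range(4)] for _ in range(5)]
params = init_params(input_dim=4, hidden_dim=3, fc_dim=2)
embedding = joint_embedding(frames, params, hidden_dim=3)
print(len(embedding), len(embedding[0]))  # 5 frames, each of dimension 3 + 2
```

In this sketch the number of frames per segment is the length of `frames`, which corresponds to the segment length the abstract says the paper investigates.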

Key words: emotional speaker recognition, long short-term memory, deep neural network

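Abstract (Chinese version): To address the problem that speaker recognition performance is easily affected by emotional factors, a method for joint learning of segment-level and frame-level features is proposed. A long short-term memory network is used for the speaker recognition task, and its sequence output is extracted as the segment-level emotional speaker feature, which preserves the original information of the speech frame features while strengthening the expression of emotional information. A fully connected network then further learns the speaker information of each feature frame in the segment-level feature to enhance the speaker representation ability of the frame-level feature. Finally, the segment-level and frame-level features are concatenated to obtain the final speaker feature, strengthening its representational power. Experiments on the Mandarin emotional speech corpus (MASC) verify the effectiveness of the proposed method and also explore how the number of speech frames in a segment-level feature and different emotional states affect emotional speaker recognition.

Key words (Chinese version): emotional speaker recognition, long short-term memory network, deep neural network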

LIU Jinlin, LI Dongdong, WANG Zhe, CAI Lizhi. Segment-Level Feature and Frame-Level Feature Joint Learning for Emotional Speaker Recognition[J]. Computer Engineering and Applications, 2023, 59(1): 149-155.

刘金琳, 李冬冬, 王喆, 蔡立志. 两级特征联合学习的情感说话人识别[J]. 计算机工程与应用, 2023, 59(1): 149-155.
