Speech Emotion Recognition for Imbalanced Datasets

doi:10.3778/j.issn.1002-8331.2209-0099

Abstract

Abstract: Class balance matters in machine learning: in an imbalanced dataset, a class with few samples may nevertheless be the more important one. This paper studies speech emotion recognition on imbalanced datasets. First, the imbalanced baseline datasets EMODB and IEMOCAP are augmented with different noises at different signal-to-noise ratios to construct the noisy datasets EMODBM and IEMOCAPM. Second, six techniques, namely SMOTE, RandomOverSampler, SMOTEENN, ADASYN, TomekLinks, and SMOTETomek, are used to resample both the baseline datasets and the noisy datasets so that the classes are balanced. Third, 21-dimensional low-level descriptor features are extracted from the baseline datasets and the augmented datasets. Finally, the newly proposed model MA-CapsNet is used to validate the effectiveness of the resampling techniques. Experiments show that the emotion classes are roughly balanced after resampling, which makes the model's learning fairer and more objective, and that the model is more robust on the resampled datasets.

Key words: speech emotion recognition, resampling, capsule network, data augmentation
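
As a rough illustration of two steps named in the abstract, the sketch below mixes noise into a signal at a chosen signal-to-noise ratio and then balances classes by random oversampling. It is not the authors' pipeline: the function names `mix_at_snr` and `random_oversample`, the toy tone, the Gaussian noise, and the 6:2 class split are all assumptions made for the example. The technique names in the abstract match classes in the imbalanced-learn library, but the abstract does not state which implementation was used, so the oversampler here is a plain-Python stand-in for the simplest of the six methods only.

```python
import math
import random
from collections import Counter, defaultdict


def mix_at_snr(clean, noise, snr_db):
    """Add `noise` to `clean`, scaling the noise to reach the target SNR in dB."""
    p_clean = sum(x * x for x in clean) / len(clean)
    p_noise = sum(x * x for x in noise) / len(noise)
    # SNR_dB = 10 * log10(p_clean / (g^2 * p_noise)); solve for the noise gain g.
    gain = math.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return [c + gain * n for c, n in zip(clean, noise)]


def random_oversample(features, labels, seed=0):
    """Duplicate random minority samples until every class matches the largest."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(features, labels):
        by_class[y].append(x)
    target = max(len(xs) for xs in by_class.values())
    out_x, out_y = [], []
    for y, xs in by_class.items():
        # Draw with replacement from the class's own samples (no synthesis).
        extra = [rng.choice(xs) for _ in range(target - len(xs))]
        out_x.extend(xs + extra)
        out_y.extend([y] * target)
    return out_x, out_y


# Toy data (hypothetical): a 440 Hz tone stands in for speech.
rng = random.Random(1)
clean = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(1600)]
noise = [rng.gauss(0.0, 1.0) for _ in range(1600)]
noisy = mix_at_snr(clean, noise, snr_db=10)

# Toy features with a 6:2 class imbalance.
features = [[float(i)] for i in range(8)]
labels = ["neutral"] * 6 + ["fear"] * 2
bal_x, bal_y = random_oversample(features, labels)
print(Counter(bal_y))
```

Random oversampling only repeats existing samples, whereas SMOTE and ADASYN synthesize new minority samples by interpolation, and TomekLinks removes samples instead of adding them, so results with those methods would differ from this sketch.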

ZHANG Huiyun, HUANG Heming. Speech Emotion Recognition for Imbalanced Datasets[J]. Computer Engineering and Applications, 2024, 60(4): 122-132.
