Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (4): 122-132. DOI: 10.3778/j.issn.1002-8331.2209-0099

• Pattern Recognition and Artificial Intelligence •

Speech Emotion Recognition for Imbalanced Datasets

ZHANG Huiyun, HUANG Heming   

  1. School of Computer Science, Qinghai Normal University, Xining 810008, China
  2. State Key Laboratory of Tibetan Intelligent Information Processing and Application, Xining 810008, China
  • Online: 2024-02-15 Published: 2024-02-15

Abstract: Sample balance is crucial for machine learning. In an imbalanced dataset, a class with few samples may nevertheless be of high importance. This paper studies speech emotion recognition on imbalanced datasets. Firstly, the imbalanced baseline datasets EMODB and IEMOCAP are augmented with different noises at different signal-to-noise ratios to construct the noisy datasets EMODBM and IEMOCAPM. Secondly, six techniques, namely SMOTE, RandomOverSampler, SMOTEENN, ADASYN, TomekLinks and SMOTETomek, are adopted to resample the baseline and noisy datasets so that the classes are balanced. Thirdly, 21-dimensional low-level descriptor features are extracted from the baseline and augmented datasets. Finally, the newly proposed model MA-CapsNet is used to validate the effectiveness of the resampling techniques. The results show that the emotion classes are basically balanced after resampling, which makes the learning of the model fairer and more objective. In addition, the model is more robust on the resampled datasets.

Key words: speech emotion recognition, resampling, capsule network, data augmentation
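
The first two steps of the abstract (mixing noise at a chosen signal-to-noise ratio, then resampling to balance the classes) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `mean_power`, `mix_at_snr` and `random_oversample` are hypothetical, the waveform and noise are plain Python lists, and only the simplest of the six listed techniques (random oversampling) is written out by hand. The abstract does not say which resampling implementation was used; the six technique names do match classes in the imbalanced-learn library.

```python
import math
import random
from collections import Counter


def mean_power(samples):
    """Average power of a waveform given as a list of floats."""
    return sum(s * s for s in samples) / len(samples)


def mix_at_snr(signal, noise, snr_db):
    """Add noise to signal, scaled so the mixture has the requested SNR in dB."""
    if len(signal) != len(noise):
        raise ValueError("signal and noise must have the same length")
    p_noise = mean_power(noise)
    if p_noise == 0:
        raise ValueError("noise has zero power")
    # SNR_dB = 10 * log10(P_signal / (scale^2 * P_noise)), solved for scale.
    scale = math.sqrt(mean_power(signal) / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(signal, noise)]


def random_oversample(features, labels, seed=0):
    """Duplicate randomly chosen minority-class samples until every class
    has as many samples as the largest class (assumed hand-written stand-in
    for a library random oversampler)."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(features), list(labels)
    for cls, n in counts.items():
        pool = [x for x, y in zip(features, labels) if y == cls]
        extra = rng.choices(pool, k=target - n)
        out_x.extend(extra)
        out_y.extend([cls] * len(extra))
    return out_x, out_y
```

Random oversampling only repeats existing samples. SMOTE and ADASYN instead synthesize new minority samples by interpolating between neighbours, and TomekLinks removes samples rather than adding them, so the sketch above should be read as an illustration of class balancing in general rather than of those methods.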