[1] KUMAR A, VEPA J. Gated mechanism for attention based multimodal sentiment analysis[C]//Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Barcelona, Spain, 2020: 4477-4481.
[2] 张亚洲, 戎璐, 宋大为, 等. 多模态情感分析研究综述[J]. 模式识别与人工智能, 2020, 33(5): 426-438.
ZHANG Y Z, RONG L, SONG D W, et al. A review of multimodal sentiment analysis[J]. Pattern Recognition and Artificial Intelligence, 2020, 33(5): 426-438.
[3] MENG Y, HUANG J X, ZHANG Y, et al. Generating training data with language models: towards zero-shot language understanding[J/OL]. (2022-10-12)[2023-01-10]. https://arxiv.org/abs/2202.04538v2.
[4] ZHANG F, LI X C, LIM C P, et al. Deep emotional arousal network for multimodal sentiment analysis and emotion recognition[J]. Information Fusion, 2022, 5(7): 88-91.
[5] YANG L, NA J C, YU J F. Cross-modal multitask Transformer for end-to-end multimodal aspect-based sentiment analysis[J]. Information Processing and Management, 2022, 4(8): 59-64.
[6] YANG M P, LI Y Y, ZHANG H. GME-Dialogue-NET: gated multi-modal sentiment analysis model based on fusion mechanism[J]. Academic Journal of Computing & Information Science, 2021, 5(3): 4-12.
[7] XIAO G R, TU G, ZHENG L, et al. Multimodality sentiment analysis in social Internet of things based on hierarchical attentions and CSAT-TCN with MBM network[J]. IEEE Internet of Things Journal, 2021, 6(5): 8-24.
[8] HUDDAR M G, SANNAKKI S, RAJPUROHIT V. Attention-based multi-modal sentiment analysis and emotion detection in conversation using RNN[J]. International Journal of Interactive Multimedia and Artificial Intelligence, 2021, 7(8): 6-12.
[9] HUDDAR M G, SANNAKKI S, RAJPUROHIT V. Multi-level context extraction and attention-based contextual inter-modal fusion for multimodal sentiment analysis and emotion classification[J]. International Journal of Multimedia Information Retrieval, 2019, 2(3): 9-11.
[10] WANG Y K, CHEN X H, CAO L L, et al. Multimodal token fusion for vision Transformers[C]//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 2022: 12176-12185.
[11] 杨杨, 詹德川, 姜远, 等. 可靠多模态学习综述[J]. 软件学报, 2021, 32(4): 1067-1081.
YANG Y, ZHAN D C, JIANG Y, et al. A survey of reliable multimodal learning[J]. Journal of Software, 2021, 32(4): 1067-1081.
[12] ARJMAND M, DOUSTI M, MORADI H. TEASEL: a Transformer-based speech-prefixed language model[J/OL]. (2021-09-12)[2022-11-13]. https://arxiv.org/abs/2109.05522v1.
[13] TAN H H, BANSAL M. LXMERT: learning cross-modality encoder representations from Transformers[C]//Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, 2019: 5100-5111.
[14] YU W M, XU H, YUAN Z Q, et al. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2021: 10790-10797.
[15] RAHMAN W, HASAN M, LEE S W, et al. Integrating multimodal information in large pretrained Transformers[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020: 2359-2369.
[16] GUO X D, WANG Y D, MIAO Z J, et al. ER-MRL: emotion recognition based on multimodal representation learning[C]//Proceedings of the 2022 12th International Conference on Information Science and Technology (ICIST), Kaifeng, China, 2022: 421-428.
[17] BERREBBI D, SHI J T, YAN B, et al. Combining spectral and self-supervised features for low-resource speech recognition and translation[J/OL]. (2022-04-18)[2022-11-15]. https://arxiv.org/abs/2204.02470v2.
[18] KIKUTSUJI T, MORI Y, OKAZAKI K, et al. Explaining reaction coordinates of alanine dipeptide isomerization obtained from deep neural networks using explainable artificial intelligence[J/OL]. (2022-04-01)[2022-11-18]. https://arxiv.org/abs/2202.07276v3.
[19] BAEVSKI A, ZHOU H, MOHAMED A R, et al. Wav2vec 2.0: a framework for self-supervised learning of speech representations[J/OL]. (2020-10-22)[2022-12-13]. https://arxiv.org/abs/2006.11477.
[20] AKHTAR M S, CHAUHAN D S, EKBAL A. A deep multi-task contextual attention framework for multi-modal affect analysis[J]. ACM Transactions on Knowledge Discovery from Data, 2020, 5(9): 14-17.
[21] LUPPINO L T, HANSEN M A, KAMPFFMEYER M, et al. Code-aligned autoencoders for unsupervised change detection in multimodal remote sensing images[J]. IEEE Transactions on Neural Networks and Learning Systems, 2024, 35(1): 60-72.
[22] HUANG J, LIN Z H, YANG Z G, et al. Temporal graph convolutional network for multimodal sentiment analysis[C]//Proceedings of the 2021 International Conference on Multimodal Interaction, New York, NY, USA, 2021: 239-247.
[23] HAZARIKA D, ZIMMERMANN R, PORIA S. MISA: modality-invariant and specific representations for multimodal sentiment analysis[C]//Proceedings of the 28th ACM International Conference on Multimedia, New York, NY, USA, 2020: 1122-1131.
[24] CHAUHAN D S, EKBAL A, BHATTACHARYYA P. An efficient fusion mechanism for multimodal low-resource setting[C]//Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, New York, NY, USA, 2022: 2583-2588.
[25] FU Z W, LIU F, XU Q, et al. NHFNET: a non-homogeneous fusion network for multimodal sentiment analysis[C]//Proceedings of the 2022 IEEE International Conference on Multimedia and Expo (ICME), Taipei, China, 2022: 1-6.
[26] AL-AZANI S, EL-ALFY E S M. Enhanced video analytics for sentiment analysis based on fusing textual, auditory and visual information[J]. IEEE Access, 2020, 8: 136843-136857.
[27] 杜鹏飞, 李小勇, 高雅丽. 多模态视觉语言表征学习研究综述[J]. 软件学报, 2021, 32(2): 327-348.
DU P F, LI X Y, GAO Y L. A survey of multimodal visual language representation learning[J]. Journal of Software, 2021, 32(2): 327-348.
[28] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]//Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, Minnesota, 2019: 4171-4186.
[29] 朱张莉, 饶元, 吴渊, 等. 注意力机制在深度学习中的研究进展[J]. 中文信息学报, 2019, 33(6): 1-11.
ZHU Z L, RAO Y, WU Y, et al. Research progress of attention mechanism in deep learning[J]. Journal of Chinese Information Processing, 2019, 33(6): 1-11.
[30] XU H F, GENABITH J V, XIONG D Y, et al. Learning source phrase representations for neural machine translation[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020: 386-396.
[31] VASWANI A, SHAZEER N M, PARMAR N, et al. Attention is all you need[J/OL]. (2017-12-06)[2022-12-08]. https://arxiv.org/abs/1706.03762v5.
[32] ZADEH A, ZELLERS R, PINCUS E, et al. MOSI: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos[J/OL]. (2016-08-12)[2022-12-06]. https://arxiv.org/abs/1606.06259.
[33] HAN W, CHEN H, PORIA S. Improving multimodal fusion with hierarchical mutual information maximization for multimodal sentiment analysis[C]//Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, Punta Cana, 2021: 9180-9192.
[34] HAN W, CHEN H, GELBUKH A, et al. Bi-bimodal modality fusion for correlation-controlled multimodal sentiment analysis[C]//Proceedings of the 2021 International Conference on Multimodal Interaction, New York, NY, USA, 2021: 6-15.
[35] WANG Z L, WAN Z H, WAN X J. TransModality: an End2End fusion method with Transformer for multimodal sentiment analysis[C]//Proceedings of The Web Conference 2020, New York, NY, USA, 2020: 2514-2520.
[36] SUN H, WANG H Y, LIU J Q, et al. CubeMLP: an MLP-based model for multimodal sentiment analysis and depression estimation[C]//Proceedings of the 30th ACM International Conference on Multimedia, New York, NY, USA, 2022: 3722-3729.
[37] YANG K C, XU H, GAO K. CM-BERT: cross-modal BERT for text-audio sentiment analysis[C]//Proceedings of the 28th ACM International Conference on Multimedia, New York, NY, USA, 2020: 521-528.