[1] GEBRU I D, BA S, LI X, et al. Audio-visual speaker diarization based on spatiotemporal Bayesian fusion[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 40(5): 1086-1099.
[2] ZADEH A, CHEN M H, PORIA S, et al. Tensor fusion network for multimodal sentiment analysis[J]. arXiv:1707.07250, 2017.
[3] MA D H, LI S J, ZHANG X D, et al. Interactive attention networks for aspect-level sentiment classification[C]//Proceedings of the 26th International Joint Conference on Artificial Intelligence, 2017: 4068-4074.
[4] LIN H, MA Z H, JI R R, et al. Boosting crowd counting via multifaceted attention[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 19596-19605.
[5] ZADEH A, LIANG P P, PORIA S, et al. Multi-attention recurrent network for human communication comprehension[C]//Proceedings of the 32nd AAAI Conference on Artificial Intelligence, 2018: 5642-5649.
[6] GU Y, YANG K, FU S, et al. Multimodal affective analysis using hierarchical attention strategy with word-level alignment[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics, 2018: 2225-2235.
[7] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017: 6000-6010.
[8] MORENCY L P, MIHALCEA R, DOSHI P. Towards multimodal sentiment analysis: harvesting opinions from the web[C]//Proceedings of the 13th International Conference on Multimodal Interfaces. New York: ACM, 2011: 169-176.
[9] WILLIAMS J, KLEINEGESSE S, COMANESCU R, et al. Recognizing emotions in video using multimodal DNN feature fusion[C]//Proceedings of the Grand Challenge and Workshop on Human Multimodal Language. Stroudsburg: ACL, 2018: 11-19.
[10] LAN Z Z, BAO L, YU S I, et al. Multimedia classification and event detection using double fusion[J]. Multimedia Tools and Applications, 2014, 71(1): 333-347.
[11] KHALIGH-RAZAVI S M, KRIEGESKORTE N. Deep supervised, but not unsupervised, models may explain IT cortical representation[J]. PLoS Computational Biology, 2014, 10(11): e1003915.
[12] GÜÇLÜ U, VAN GERVEN M A J. Deep neural networks reveal a gradient in the complexity of neural representations across the ventral stream[J]. Journal of Neuroscience, 2015, 35(27): 10005-10014.
[13] DWIVEDI K, BONNER M F, CICHY R M, et al. Unveiling functions of the visual cortex using task-specific deep neural networks[J]. PLoS Computational Biology, 2021, 17(8): e1009267.
[14] BERSCH D, DWIVEDI K, VILAS M, et al. Net2Brain: a toolbox to compare artificial vision models with human brain responses[J]. arXiv:2208.09677, 2022.
[15] SHAFER R L, SOLOMON E M, NEWELL K M, et al. Visual feedback during motor performance is associated with increased complexity and adaptability of motor and neural output[J]. Behavioural Brain Research, 2019, 376: 112214.
[16] DICARLO J J, ZOCCOLAN D, RUST N C. How does the brain solve visual object recognition?[J]. Neuron, 2012, 73(3): 415-434.
[17] TEUFEL C, NANAY B. How to (and how not to) think about top-down influences on visual perception[J]. Consciousness and Cognition, 2017, 47: 17-25.
[18] PARASKEVOPOULOS G, GEORGIOU E, POTAMIANOS A. MMLatch: bottom-up top-down fusion for multimodal sentiment analysis[C]//Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing. Piscataway: IEEE, 2022: 4573-4577.
[19] OTMAKHOVA Y, SHIN H. Do we really need lexical information? Towards a top-down approach to sentiment analysis of product reviews[C]//Proceedings of the Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg: ACL, 2015: 1559-1568.
[20] QIU Y, LIU Y, YANG H, et al. A simple saliency detection approach via automatic top-down feature fusion[J]. Neurocomputing, 2020, 388: 124-134.
[21] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[J]. arXiv:1810.04805, 2018.
[22] HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780.
[23] ZADEH A, ZELLERS R, PINCUS E, et al. Multimodal sentiment intensity analysis in videos: facial gestures and verbal messages[J]. IEEE Intelligent Systems, 2016, 31(6): 82-88.
[24] ZADEH B A, LIANG P P, PORIA S, et al. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2018: 2236-2246.
[25] ZADEH A, LIANG P P, MAZUMDER N, et al. Memory fusion network for multi-view sequential learning[C]//Proceedings of the 32nd AAAI Conference on Artificial Intelligence, 2018: 5634-5641.
[26] SAHAY S, OKUR E, KUMAR S H, et al. Low rank fusion based transformers for multimodal sequences[J]. arXiv:2007.02038, 2020.
[27] HAZARIKA D, ZIMMERMANN R, PORIA S. MISA: modality-invariant and -specific representations for multimodal sentiment analysis[C]//Proceedings of the 28th ACM International Conference on Multimedia. New York: ACM, 2020: 1122-1131.
[28] YU W M, XU H, YUAN Z Q, et al. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2021: 10790-10797.
[29] SUN H, WANG H Y, LIU J Q, et al. CubeMLP: an MLP-based model for multimodal sentiment analysis and depression estimation[C]//Proceedings of the 30th ACM International Conference on Multimedia. New York: ACM, 2022: 3722-3729.