[1] CHEEMA G S, HAKIMOV S, MÜLLER-BUDACK E, et al. A fair and comprehensive comparison of multimodal tweet sentiment analysis methods[C]//Proceedings of the Workshop on Multi-Modal Pre-Training for Multimedia Understanding. New York: ACM, 2021: 37-45.
[2] JIANG L, YU M, ZHOU M, et al. Target-dependent Twitter sentiment classification[C]//Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 2011: 151-160.
[3] XU N, MAO W J, CHEN G D. Multi-interactive memory network for aspect-based multimodal sentiment analysis[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2019: 371-378.
[4] YU J F, JIANG J, XIA R. Entity-sensitive attention and fusion network for entity-level multimodal sentiment classification[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020, 28: 429-439.
[5] YU J F, JIANG J. Adapting BERT for target-oriented multimodal sentiment classification[C]//Proceedings of the 28th International Joint Conference on Artificial Intelligence, 2019: 5408-5414.
[6] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Advances in Neural Information Processing Systems, 2017: 6000-6010.
[7] KHAN Z, FU Y. Exploiting BERT for multimodal target sentiment classification through input space translation[C]//Proceedings of the 29th ACM International Conference on Multimedia. New York: ACM, 2021: 3034-3042.
[8] LI J N, LI D X, SAVARESE S, et al. BLIP-2: bootstrapping language-image pre-training with frozen image encoders and large language models[C]//Proceedings of the 40th International Conference on Machine Learning, 2023: 19730-19742.
[9] RADFORD A, KIM J W, HALLACY C, et al. Learning transferable visual models from natural language supervision[C]//Proceedings of the International Conference on Machine Learning, 2021: 8748-8763.
[10] JIA C, YANG Y, XIA Y, et al. Scaling up visual and vision-language representation learning with noisy text supervision[C]//Proceedings of the International Conference on Machine Learning, 2021: 4904-4916.
[11] LI J, SELVARAJU R, GOTMARE A, et al. Align before fuse: vision and language representation learning with momentum distillation[C]//Advances in Neural Information Processing Systems, 2021: 9694-9705.
[12] WANG P, YANG A, MEN R, et al. OFA: unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework[J]. arXiv:2202.03052, 2022.
[13] ALAYRAC J B, DONAHUE J, LUC P, et al. Flamingo: a visual language model for few-shot learning[C]//Advances in Neural Information Processing Systems, 2022: 23716-23736.
[14] WANG W H, BAO H B, DONG L, et al. Image as a foreign language: BEiT pretraining for all vision and vision-language tasks[J]. arXiv:2301.05781, 2023.
[15] LI J, LI D, XIONG C, et al. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation[C]//Proceedings of the International Conference on Machine Learning, 2022: 12888-12900.
[16] YANG R, WANG S, SUN Y Z, et al. Multimodal fusion remote sensing image audio retrieval[J]. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2022, 15: 6220-6235.
[17] LIU X R, WANG Z J, WANG L. Multimodal fusion for image and text classification with feature selection and dimension reduction[J]. Journal of Physics: Conference Series, 2021, 1871(1): 012064.
[18] HUANG F R, ZHANG X M, ZHAO Z H, et al. Image text sentiment analysis via deep multimodal attentive fusion[J]. Knowledge-Based Systems, 2019, 167: 26-37.
[19] ZHU Q, YEH M C, CHENG K T. Multimodal fusion using learned text concepts for image categorization[C]//Proceedings of the 14th ACM International Conference on Multimedia. New York: ACM, 2006: 211-220.
[20] WILKINSON T, BRUN A. Semantic and verbatim word spotting using deep neural networks[C]//Proceedings of the 15th International Conference on Frontiers in Handwriting Recognition. Piscataway: IEEE, 2016: 307-312.
[21] HE K M, ZHANG X Y, REN S Q, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2016: 770-778.
[22] PASZKE A, GROSS S, CHINTALA S, et al. Automatic differentiation in PyTorch[C]//Proceedings of the 31st Conference on Neural Information Processing Systems, 2017.
[23] LOSHCHILOV I, HUTTER F. Fixing weight decay regularization in Adam[J]. arXiv:1711.05101, 2017.
[24] WANG Y Q, HUANG M L, ZHU X Y, et al. Attention-based LSTM for aspect-level sentiment classification[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2016: 606-615.
[25] CHEN P, SUN Z Q, BING L D, et al. Recurrent attention network on memory for aspect sentiment analysis[C]//Proceedings of the Conference on Empirical Methods in Natural Language Processing. Stroudsburg: ACL, 2017: 452-461.
[26] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional Transformers for language understanding[J]. arXiv:1810.04805, 2018.
[27] LIU Y, OTT M, GOYAL N, et al. RoBERTa: a robustly optimized BERT pretraining approach[J]. arXiv:1907.11692, 2019.
[28] AN J Y, WAN ZAINON W M N, et al. Improving targeted multimodal sentiment classification with semantic description of images[J]. Computers, Materials & Continua, 2023, 75(3): 5801-5815.
[29] YU J F, CHEN K, XIA R. Hierarchical interactive multimodal Transformer for aspect-based multimodal sentiment analysis[J]. IEEE Transactions on Affective Computing, 2023, 14(3): 1966-1978.