Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (2): 102-109. DOI: 10.3778/j.issn.1002-8331.2202-0051

• Pattern Recognition and Artificial Intelligence •

Research on Text Classification Based on Knowledge Graph and Multimodal Information

JING Li, YAO Ke   

  1. School of Computer and Information Engineering, Henan University of Economics and Law, Zhengzhou 450046, China
  • Online: 2023-01-15  Published: 2023-01-15

Abstract: Traditional text classification methods are mainly empirical statistical learning methods driven by single-modal data; they lack the ability to understand the data and have poor robustness, and a single-modal model input can hardly analyze the increasingly rich multimodal data on the Internet effectively. To address this problem, two methods for improving classification ability are proposed: introducing multimodal information into the model input to compensate for the limitations of single-modal information, and introducing knowledge graph entity information into the model input to enrich the semantic information of the text and improve the model's generalization ability. The model uses BERT to extract text features, an improved ResNet to extract image features, and TransE to extract text entity features, which are fed into the BERT model for classification through early fusion. On the MM-IMDB multi-label classification dataset the F1 score reaches 66.5%, and on the Twitter15&17 sentiment analysis dataset the accuracy reaches 71.1%, both outperforming other models. Experimental results show that introducing multimodal information and entity information can improve the text classification ability of the model.
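
The paper's implementation is not reproduced on this page, so the following is only a minimal PyTorch sketch of the early-fusion idea described in the abstract. It assumes a plain ResNet-50 backbone standing in for the authors' improved ResNet, 100-dimensional pre-trained TransE entity vectors supplied by the caller, and the Hugging Face BertModel; the class and parameter names (EarlyFusionClassifier, entity_dim, etc.) are illustrative, not the authors' code.

```python
# Minimal early-fusion sketch (PyTorch + transformers + torchvision), illustrative only.
import torch
import torch.nn as nn
from transformers import BertModel
from torchvision.models import resnet50


class EarlyFusionClassifier(nn.Module):
    """Fuses text, image, and knowledge-graph entity features at the input of BERT:
    image and entity vectors are projected to BERT's hidden size and prepended to
    the word embeddings as extra "tokens" (early fusion)."""

    def __init__(self, num_labels, entity_dim=100, hidden=768):
        super().__init__()
        self.bert = BertModel.from_pretrained("bert-base-uncased")
        # Image encoder: ResNet-50 with its classification head removed
        # (load pretrained weights in practice; random init keeps the sketch offline).
        backbone = resnet50(weights=None)
        self.image_encoder = nn.Sequential(*list(backbone.children())[:-1])
        self.image_proj = nn.Linear(2048, hidden)
        # TransE entity vectors are assumed to be pre-trained and passed in.
        self.entity_proj = nn.Linear(entity_dim, hidden)
        self.classifier = nn.Linear(hidden, num_labels)

    def forward(self, input_ids, attention_mask, image, entity_emb):
        # Word-embedding lookup only; BERT adds position embeddings internally.
        # Text should be truncated so that the fused length stays within 512 tokens.
        text_tok = self.bert.embeddings.word_embeddings(input_ids)       # (B, L, H)
        img_tok = self.image_proj(self.image_encoder(image).flatten(1))  # (B, H)
        ent_tok = self.entity_proj(entity_emb)                           # (B, H)
        # Early fusion: prepend the two modality "tokens" to the text tokens.
        fused = torch.cat([img_tok.unsqueeze(1), ent_tok.unsqueeze(1), text_tok], dim=1)
        extra = attention_mask.new_ones(attention_mask.size(0), 2)
        mask = torch.cat([extra, attention_mask], dim=1)
        out = self.bert(inputs_embeds=fused, attention_mask=mask)
        return self.classifier(out.pooler_output)                        # (B, num_labels)
```

For a multi-label dataset such as MM-IMDB the logits would typically be trained with BCEWithLogitsLoss, while single-label sentiment classification on Twitter15&17 would use CrossEntropyLoss.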

Key words: natural language processing (NLP), knowledge graph, multimodal, text classification, bidirectional encoder representations from transformers (BERT)