Survey of Open Information Extraction Research

doi:10.3778/j.issn.1002-8331.2212-0251

Abstract

Abstract: Open information extraction (OpenIE) aims to generate structured representations of the information in natural language text, in the form of relational phrases and their arguments, and provides basic support for downstream tasks such as automatic knowledge base construction, open-domain question answering, and explicit reasoning. In recent years, as research and applications in this field have deepened, researchers have extended OpenIE in multiple directions and proposed many effective approaches and models, including many based on neural networks. Starting from the definition, datasets, and benchmark metrics of OpenIE, this paper surveys and compares traditional OpenIE models and neural network-based models in detail. First, for traditional methods, learning-based and rule-based models are introduced by category, the evaluation methods of different models are examined, and the differences between model categories are analyzed. Second, neural network-based models are divided into two types according to how they extract predicates, joint extraction and stepwise extraction, and the models of each type are reviewed and compared. Then, the datasets commonly used for OpenIE and the main evaluation benchmarks are summarized and compared. Finally, work on OpenIE is summarized from three perspectives (training, improvement, and application), and future directions are discussed.

Key words: natural language processing, open information extraction (OpenIE), neural network

HU Hangle, CHENG Chunlei, YE Qing, PENG Lin, SHEN Youzhi. Survey of Open Information Extraction Research[J]. Computer Engineering and Applications, 2023, 59(16): 31-49.
