Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (8): 97-104. DOI: 10.3778/j.issn.1002-8331.2205-0518

• Pattern Recognition and Artificial Intelligence •

Research on COVID-19 Text Entity Relation Extraction and Dataset Construction Methods

YANG Chongluo, SHENG Long, WEI Zhongcheng, WANG Wei   

  1. College of Information and Electrical Engineering, Hebei University of Engineering, Handan, Hebei 056038, China
  2. Hebei Key Laboratory of Security Protection Information Sensing and Processing, Hebei University of Engineering, Handan, Hebei 056038, China
  • Online: 2023-04-15 Published: 2023-04-15

新冠文本实体关系抽取及数据集构建方法研究

杨崇洛,生龙,魏忠诚,王巍   

  1. 河北工程大学 信息与电气工程学院,河北 邯郸 056038
  2. 河北省安防信息感知与处理重点实验室,河北 邯郸 056038

Abstract: Entity relation extraction can effectively obtain key information from text, and the key information in COVID-19 texts can help cut off transmission routes and trace the source of an outbreak. However, no suitable public annotated dataset exists in this field. To address this problem, the semantic representation and structural characteristics of COVID-19 texts are analyzed, and an entity and relation definition for COVID-19 texts is proposed. The collected data are annotated with entities and relations according to this definition, and after annotation a COVID-19 text entity relation extraction dataset is generated through data preprocessing and other operations. Compared with public datasets, the dataset in this field has a denser distribution of entities and relations, and a single neural network model has poor feature extraction capability. Therefore, multiple neural network models are combined to construct a named entity recognition model and a relation extraction model. The dataset is verified experimentally using the results of these models, and the experimental results show that it can be applied to the entity relation extraction task in this field.

Key words: dataset, entity and relationship definition, data labeling, bidirectional recurrent neural network, convolutional neural network

摘要: 实体关系抽取可有效地获取文本中的关键信息,利用新冠文本中的关键信息有助于切断疫情传播途径,发掘疫情传播源头。但该领域没有适合的公开有标注的数据集,针对该问题,通过分析新冠文本的语义表示和结构特点,提出一种针对新冠文本的实体关系定义,并根据实体关系定义对收集的数据进行实体标注和关系标注,在标注完成后,通过数据预处理等操作生成新冠文本实体关系抽取数据集。与公开数据集相比,该领域的数据集本文实体和关系分布较为密集,单一神经网络模型特征抽取能力较差,因此采用多种神经网络模型拼接的方法构建命名实体识别模型和关系抽取模型。通过模型的结果对数据集进行实验验证,实验结果证明该数据集可以应用于该领域的实体关系抽取任务。

关键词: 数据集, 实体关系定义, 数据标注, 双向循环神经网络, 卷积神经网络
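
The abstract describes turning entity and relation annotations into a dataset through preprocessing. As a rough illustration only, the sketch below shows one common way such a step can look: span annotations are converted into BIO tags for named entity recognition and into (head, relation, tail) triples for relation extraction. The class names, the labels `PER`, `LOC`, and `visited`, and the sample sentence are all hypothetical; the paper's actual entity and relation types and its preprocessing pipeline are not given in the abstract.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Entity:
    start: int   # index of the first token in the span (inclusive)
    end: int     # index one past the last token in the span (exclusive)
    label: str   # hypothetical entity type, e.g. "PER" or "LOC"


@dataclass
class Relation:
    head: int    # index of the head entity in the entity list
    tail: int    # index of the tail entity in the entity list
    label: str   # hypothetical relation type, e.g. "visited"


def to_bio(tokens: List[str], entities: List[Entity]) -> List[str]:
    """Convert span annotations into one BIO tag per token."""
    tags = ["O"] * len(tokens)
    for ent in entities:
        tags[ent.start] = "B-" + ent.label
        for i in range(ent.start + 1, ent.end):
            tags[i] = "I-" + ent.label
    return tags


def to_triples(tokens: List[str], entities: List[Entity],
               relations: List[Relation]) -> List[Tuple[str, str, str]]:
    """Build (head text, relation label, tail text) triples."""
    def span_text(ent: Entity) -> str:
        return " ".join(tokens[ent.start:ent.end])

    return [(span_text(entities[r.head]), r.label, span_text(entities[r.tail]))
            for r in relations]


# Hypothetical annotated sentence, not taken from the paper's dataset.
tokens = ["Patient", "A", "visited", "Handan", "Market"]
entities = [Entity(0, 2, "PER"), Entity(3, 5, "LOC")]
relations = [Relation(0, 1, "visited")]

bio_tags = to_bio(tokens, entities)
triples = to_triples(tokens, entities, relations)
```

For Chinese text the same idea is usually applied per character rather than per whitespace-separated token, so `tokens` would be a list of characters.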