Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (22): 322-328. DOI: 10.3778/j.issn.1002-8331.2207-0471

• Engineering and Applications •

Research on College Academic Text Named Entity Recognition and Dataset Construction

HE Chen, YUAN Yingchun, WANG Kejian, TAO Jia   

  1. College of Information Science and Technology, Hebei Agricultural University, Baoding, Hebei 071001, China
  2. Hebei Key Laboratory of Agricultural Big Data, Baoding, Hebei 071001, China
  • Online: 2023-11-15 Published: 2023-11-15

高校学业文本命名实体识别及数据集构建研究

何晨,苑迎春,王克俭,陶佳   

  1. 河北农业大学 信息科学与技术学院,河北 保定 071001
  2. 河北省农业大数据重点实验室,河北 保定 071001

Abstract: In recent years, the number of students in Chinese universities who fail to graduate because of academic problems has risen year by year, putting great pressure on teaching management. Using knowledge graph technology to answer students' academic questions quickly and automatically has therefore become an important problem to solve. Accurate entity recognition can effectively extract key information from academic management texts, but no publicly available annotated dataset exists for this field, so a general-purpose named entity recognition dataset for university academic management is urgently needed. Drawing on the domain knowledge of academic management experts, a data construction standard with 8 entity categories is developed for more than 130 000 characters of academic text from one university, and the annotation is completed according to this standard and the characteristics of the text. Four recognition models, including BiLSTM-CRF, are tested on public datasets and on the constructed dataset. The results show that the constructed dataset can be applied to named entity recognition in the university academic domain and that the construction method is generally applicable. Moreover, recognition on the category-labeled dataset improves markedly over the unclassified dataset, which further verifies the effectiveness of the classification standard.

Key words: college academic, named entity recognition, dataset construction, entity labeling, BiLSTM-CRF

摘要: 近年来,我国高校因学业问题无法顺利毕业的学生数量逐年上升,给高校教学管理工作带来极大压力。利用知识图谱技术快速自动解答学业困惑成为亟待解决的重要问题。实体精准识别可有效提取学业管理文本中的关键信息,但该领域尚未存在公开适用的标注数据集,因此开展面向具有普遍性和通识性的高校学业命名实体识别数据集变得极为迫切。依据学业管理专家的领域知识,对某高校13万余字学业文本制定了8类学业数据构建标准,并根据构建标准以及文本特性完成了标注工作。将BiLSTM-CRF等4种识别模型在公开数据集和构建数据集上进行实验测试,结果表明构建的数据集可以应用于高校学业领域的命名实体识别任务,构建方法具有普适性,而且分类标注后的数据集识别效果相较未分类数据集有明显提升,进一步验证了该分类标准的有效性。

关键词: 高校学业, 命名实体识别, 数据集构建, 实体标注, BiLSTM-CRF