Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (7): 222-232. DOI: 10.3778/j.issn.1002-8331.2311-0310

• Pattern Recognition and Artificial Intelligence •

Named Entity Recognition in Electromechanical Field Based on Data Enhancement and Loss Balancing

LIN Na, YUE Xi, TANG Dan

  1. School of Software Engineering, Chengdu University of Information Technology, Chengdu 610225, China
  2. Sichuan Province Engineering Technology Research Center of Support Software of Informatization Application, Chengdu 610225, China
  • Online: 2025-04-01 Published: 2025-04-01

Abstract: Named entity recognition (NER) in the electromechanical field is the most fundamental step in information retrieval for electromechanical innovation design. At present, little data is available for NER tasks in this field, and most of it is imbalanced. This paper constructs an electromechanical NER dataset, designs a multidimensional data enhancement (augmentation) method based on the structural characteristics of the dataset's text, and proposes BERT-BiGRU-CRF (BL), an NER model based on an improved loss. First, electromechanical text is crawled from the Internet and annotated to form the dataset. Next, according to how each technique affects the dataset, multidimensional data enhancement is carried out along four dimensions: same-type entity replacement, synonym replacement, corpus trimming, and corpus splicing. The enhanced samples are then added to the dataset in a certain ratio to increase its richness. Finally, to address the data imbalance in the dataset, a model is designed that uses Weigh loss to balance the weights of focal loss and CRF loss. The model uses BERT to encode the word vectors, BiGRU to extract features from the text vectors, and CRF for label constraint and decoding. Experiments show that the multidimensional data enhancement method significantly improves model performance, and that the improved model performs best on both the original and the enhanced datasets, with F1 values of 78.23% and 83.3%, respectively.

Key words: electromechanical field, named entity recognition, data enhancement, focal loss, Weigh loss
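
The abstract names same-type entity replacement as one of the four enhancement techniques but does not describe how it is carried out. The following is a minimal Python sketch of one plausible reading, not the authors' implementation. It assumes BIO-tagged token sequences and a per-type entity lexicon; the tag name `COMP`, the lexicon contents, and the function names `extract_entities` and `replace_same_type` are all hypothetical and chosen only for illustration.

```python
import random


def extract_entities(tokens, tags):
    """Return (start, end, type) spans from BIO tags; end is exclusive."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing entity
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def replace_same_type(tokens, tags, lexicon, rng):
    """Swap each entity for a randomly chosen lexicon entry of the same type.

    `lexicon` maps an entity type to a list of token lists (assumed format).
    Tags are regenerated so the output stays a valid BIO sequence even when
    the replacement has a different length than the original entity.
    """
    new_tokens, new_tags, cursor = [], [], 0
    for start, end, etype in extract_entities(tokens, tags):
        # Copy the non-entity text that precedes this entity unchanged.
        new_tokens += tokens[cursor:start]
        new_tags += tags[cursor:start]
        original = tokens[start:end]
        candidates = [e for e in lexicon.get(etype, []) if e != original]
        chosen = rng.choice(candidates) if candidates else original
        new_tokens += chosen
        new_tags += ["B-" + etype] + ["I-" + etype] * (len(chosen) - 1)
        cursor = end
    new_tokens += tokens[cursor:]
    new_tags += tags[cursor:]
    return new_tokens, new_tags


# Hypothetical example data: one sentence and a small lexicon for type COMP.
tokens = ["The", "servo", "motor", "drives", "the", "gear", "box"]
tags = ["O", "B-COMP", "I-COMP", "O", "O", "B-COMP", "I-COMP"]
lexicon = {
    "COMP": [
        ["servo", "motor"],
        ["stepper", "motor"],
        ["planetary", "gear", "reducer"],
    ]
}
augmented_tokens, augmented_tags = replace_same_type(
    tokens, tags, lexicon, random.Random(0)
)
```

Regenerating the tags from the replacement's length, rather than copying the original tags, keeps tokens and labels aligned when entities of different lengths are swapped. Passing a seeded `random.Random` makes the enhancement reproducible across runs.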