Computer Engineering and Applications (计算机工程与应用) ›› 2025, Vol. 61 ›› Issue (2): 219-226. DOI: 10.3778/j.issn.1002-8331.2309-0214

• Pattern Recognition and Artificial Intelligence •

Chinese Legal Document Named Entity Recognition Based on MVBCN-FLW

YANG Shuxin (杨书新), LIU Tianyang (刘天扬), HUANG Weidong (黄伟东)

  1. School of Information Engineering, Jiangxi University of Science and Technology, Ganzhou, Jiangxi 341000, China
  • Online: 2025-01-15 Published: 2025-01-15

Abstract: Named entity recognition (NER) in Chinese legal documents is a foundational task for intelligent justice. Existing work has made some progress, but most methods rely on labeled legal corpora, make little effective use of unlabeled legal corpora, and do not capture the features of legal documents in depth. To address these problems, this paper proposes an NER framework for Chinese legal documents. First, the framework uses a Transformer model based on a bidirectional encoder to learn vector representations of Chinese legal documents, and uses a bidirectional long short-term memory (BiLSTM) language model that incorporates legal-term features to capture contextual feature vectors of legal document sequences. Second, the framework fuses the vector representations with the contextual feature vectors and feeds the fused feature vectors into a module composed of a bidirectional gated recurrent unit (BiGRU), a self-attention mechanism, and a conditional random field (CRF) for training. In addition, so that the framework can be trained more fully when labeled legal corpora are scarce, unlabeled legal corpora are used for self-training: the framework generates newly labeled legal corpora, merges them with the initially labeled corpora, and improves its performance through iterative training. Experimental results show that the framework outperforms other NER models based on mainstream neural networks.

Key words: legal documents, named entity recognition, semi-supervised learning
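
The self-training procedure described in the abstract (label the unlabeled corpus, merge the new labels with the initial labeled corpus, retrain, and repeat) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `train`, `predict`, and `self_train`, the per-token counting model that stands in for the full framework, the `rounds` limit, and the confidence `threshold` are all hypothetical. The abstract does not say how newly labeled examples are selected, so the threshold rule here is only one common choice.

```python
from collections import Counter, defaultdict


def train(labeled):
    """Toy stand-in for the NER model: count how often each token gets each tag."""
    counts = defaultdict(Counter)
    for tokens, tags in labeled:
        for tok, tag in zip(tokens, tags):
            counts[tok][tag] += 1
    return counts


def predict(model, tokens):
    """Return (tags, confidence); confidence is the lowest per-token vote share."""
    tags, conf = [], 1.0
    for tok in tokens:
        votes = model.get(tok)
        if not votes:
            # Unseen token: tag as "O" and mark the sentence as unreliable.
            tags.append("O")
            conf = 0.0
            continue
        tag, n = votes.most_common(1)[0]
        tags.append(tag)
        conf = min(conf, n / sum(votes.values()))
    return tags, conf


def self_train(labeled, unlabeled, rounds=3, threshold=0.9):
    """Iteratively pseudo-label confident sentences and retrain on the merged set."""
    labeled = list(labeled)
    pool = list(unlabeled)
    model = train(labeled)
    for _ in range(rounds):
        newly_labeled, remaining = [], []
        for tokens in pool:
            tags, conf = predict(model, tokens)
            if conf >= threshold:
                newly_labeled.append((tokens, tags))
            else:
                remaining.append(tokens)
        if not newly_labeled:
            break  # nothing confident enough to add; stop early
        labeled.extend(newly_labeled)  # merge with the initial labeled corpus
        pool = remaining
        model = train(labeled)  # retrain on the enlarged corpus
    return model, labeled


seed = [(["Zhang", "sued", "Li"], ["B-PER", "O", "B-PER"])]
raw = [["Li", "sued", "Zhang"], ["Wang", "sued", "Zhang"]]
model, merged = self_train(seed, raw)
```

In this toy run only the first unlabeled sentence is added, because the second contains a token the model has never seen. A real system would replace `train` and `predict` with the neural tagger and its own confidence estimate.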