Computer Engineering and Applications, 2022, Vol. 58, Issue (21): 156-162. DOI: 10.3778/j.issn.1002-8331.2103-0531

• Pattern Recognition and Artificial Intelligence •

Research on Domain-Specific Word Vector Generation Based on BERT-CRF

GUO Zhendong, LIN Min, LI Chengcheng, ZHAO Jiapeng   

  1. College of Computer Science and Technology, Inner Mongolia Normal University, Hohhot 010022, China
  2. School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100089, China
  3. Institute of Information Engineering, Chinese Academy of Sciences, Beijing 100089, China
  • Online: 2022-11-01 Published: 2022-11-01

Abstract: Obtaining high-quality domain-specific word vectors from the character vectors of Chinese BERT, for use in text analysis tasks that build on domain word segmentation, is a pressing open problem. This paper proposes a BERT-based method for generating domain-specific word vectors. A BERT-CRF domain word segmenter is built: starting from the pre-trained BERT character vectors, it is fine-tuned on domain text and trained for domain word segmentation. Domain-specific word vectors are then derived from the decoded segmentation results. Experiments show that the method can learn a word segmenter that meets the requirements of domain tasks from only a small amount of domain text, and that it yields higher-quality domain-specific word vectors than the original BERT.

Key words: bidirectional encoder representations from transformers (BERT), domain word segmenter, domain-specific word vector, conditional random field, word vector visualization
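
The abstract does not state how word vectors are derived from the decoded segmentation, so the following is only a minimal sketch of one plausible reading. It assumes the segmenter emits one B/M/E/S tag per character and that a word vector is the mean of its character vectors; the tag scheme, the mean pooling, and the names `tags_to_spans`, `mean_pool`, and `word_vectors` are all illustrative assumptions, not the authors' implementation.

```python
from typing import List, Tuple

Vector = List[float]


def tags_to_spans(tags: List[str]) -> List[Tuple[int, int]]:
    """Convert per-character B/M/E/S tags into half-open (start, end) spans.

    Assumed scheme: B = word begin, M = middle, E = end, S = single-character word.
    Malformed sequences are repaired by closing any open word early.
    """
    spans: List[Tuple[int, int]] = []
    start = None  # index where the currently open word began, if any
    for i, tag in enumerate(tags):
        if tag in ("S", "B"):
            if start is not None:  # previous word never saw its E tag
                spans.append((start, i))
                start = None
            if tag == "S":
                spans.append((i, i + 1))
            else:
                start = i
        elif tag == "M":
            if start is None:  # M without a preceding B opens a word
                start = i
        elif tag == "E":
            if start is None:  # E without a preceding B is a one-char word
                start = i
            spans.append((start, i + 1))
            start = None
        else:
            raise ValueError(f"unknown tag {tag!r} at position {i}")
    if start is not None:  # sequence ended inside a word
        spans.append((start, len(tags)))
    return spans


def mean_pool(vectors: List[Vector]) -> Vector:
    """Average a non-empty list of equal-length vectors component-wise."""
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / len(vectors) for d in range(dim)]


def word_vectors(
    chars: List[str], char_vecs: List[Vector], tags: List[str]
) -> List[Tuple[str, Vector]]:
    """Group characters into words by tag and pool their vectors (assumed: mean)."""
    if not (len(chars) == len(char_vecs) == len(tags)):
        raise ValueError("chars, char_vecs and tags must have the same length")
    return [
        ("".join(chars[s:e]), mean_pool(char_vecs[s:e]))
        for s, e in tags_to_spans(tags)
    ]


if __name__ == "__main__":
    # Toy 2-dimensional "character vectors"; real BERT vectors are much larger.
    chars = list("领域词向量")
    vecs = [[1.0, 0.0], [3.0, 2.0], [0.0, 3.0], [3.0, 0.0], [6.0, 3.0]]
    tags = ["B", "E", "B", "M", "E"]
    for word, vec in word_vectors(chars, vecs, tags):
        print(word, vec)
```

In practice the character vectors would come from the fine-tuned BERT encoder and the tags from CRF decoding; other pooling choices (for example, max pooling or taking the first character's vector) would fit the same interface.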