Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (2): 129-134. DOI: 10.3778/j.issn.1002-8331.2107-0102

• Pattern Recognition and Artificial Intelligence •

BERT Mongolian Word Embedding Learning

WANG Yurong, LIN Min, LI Yanling   

  1. College of Computer Science and Technology, Inner Mongolia Normal University, Hohhot 010022, China
  • Online: 2023-01-15  Published: 2023-01-15

Abstract: Static Mongolian word embedding methods represented by Word2Vec collapse a word's multiple senses across different contexts into a single vector; such context-independent text representations offer only limited improvement on downstream tasks. By further training the multilingual BERT pre-trained model combined with a CRF, and adopting two subword fusion strategies, a new dynamic Mongolian word embedding learning method is proposed, which avoids conflating distinct word senses into one vector. To verify the effectiveness of the method, synonym comparison experiments with different models are carried out on datasets drawn from the education and literature domains of master's and doctoral dissertations of Inner Mongolia Normal University, Mongolian words are cluster-analyzed with the K-means algorithm, and the method is finally validated on an embedded topic word mining task. The experimental results show that the word embeddings learned by BERT are of higher quality than those of Word2Vec: embeddings of similar words lie close together in the vector space while those of dissimilar words lie far apart, and the topic words obtained in the topic word mining task are closely related.

Key words: Mongolian, word embedding, bidirectional encoder representations from transformers (BERT), conditional random field
