BERT Mongolian Word Embedding Learning

doi:10.3778/j.issn.1002-8331.2107-0102

Abstract

Abstract: Static Mongolian word embedding methods, represented by Word2Vec, merge the several senses that a word takes in different contexts into a single embedding. Such context-independent text representations bring only limited improvement to downstream tasks. This paper proposes a new dynamic Mongolian word embedding learning method in which a multilingual BERT pre-trained model is trained a second time, combined with a CRF, and used with two subword fusion methods; the method is intended to solve the problem of lexical aggregation. To verify its effectiveness, synonym comparison experiments with different models are carried out on education-domain and literature-domain data sets drawn from the Mongolian master's and doctoral dissertations of Inner Mongolia Normal University. Mongolian words are then clustered with the K-means algorithm, and the method is finally verified on an embedded topic word mining task. The experimental results show that the word vectors learned by BERT are of higher quality than those learned by Word2Vec: the embeddings of similar words lie very close together in the vector space, while those of dissimilar words lie far apart. The topic words obtained in the topic word mining task are closely related.
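The subword fusion step and the "close in the vector space" claim can be illustrated with a minimal sketch. The abstract does not say which two fusion methods are used, so the two shown here (element-wise averaging and taking the first subword) are assumptions, as are the function names and the toy 3-dimensional vectors, which merely stand in for BERT output vectors.

```python
import math

def fuse_mean(subword_vectors):
    """Assumed fusion method 1: average the subword vectors element-wise."""
    n = len(subword_vectors)
    return [sum(column) / n for column in zip(*subword_vectors)]

def fuse_first(subword_vectors):
    """Assumed fusion method 2: use the first subword's vector for the word."""
    return list(subword_vectors[0])

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors with made-up values; each inner list is one subword vector.
word_a = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # a word split into two subwords
word_b = [[1.0, 1.0, 0.0]]                   # a word kept as one subword
word_c = [[0.0, 0.0, 1.0]]                   # an unrelated word

vec_a = fuse_mean(word_a)
vec_b = fuse_mean(word_b)
vec_c = fuse_mean(word_c)

print(cosine(vec_a, vec_b))  # similar words: similarity close to 1
print(cosine(vec_a, vec_c))  # dissimilar words: similarity 0
```

Either fusion function yields one fixed-length vector per word, which is what a similarity comparison or a clustering algorithm such as K-means would take as input.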

Keywords: Mongolian, word embedding, bidirectional encoder representations from transformers (BERT), conditional random field

摘要： 以Word2Vec为代表的静态蒙古文词向量学习方法，将处于不同语境的多种语义词汇综合表示成一个词向量，这种上下文无关的文本表示方法对后续任务的提升非常有限。通过二次训练多语言BERT预训练模型与CRF相结合，并采用两种子词融合方式，提出一种新的蒙古文动态词向量学习方法。为验证方法的有效性，在内蒙古师范大学蒙古文硕博论文的教育领域、文学领域数据集上用不同的模型进行了同义词对比实验，并利用K-means聚类算法对蒙古文词语进行聚类分析，最后在嵌入式主题词挖掘任务中进行了验证。实验结果表明，BERT学出的词向量质量高于Word2Vec，相近词的向量在向量空间中的距离非常近，不相近词的向量较远，在主题词挖掘任务中获取的主题词有密切的关联。

关键词: 蒙古文, 词向量, BERT, 条件随机场

WANG Yurong, LIN Min, LI Yanling. BERT Mongolian Word Embedding Learning[J]. Computer Engineering and Applications, 2023, 59(2): 129-134.

王玉荣, 林民, 李艳玲. BERT蒙古文词向量学习[J]. 计算机工程与应用, 2023, 59(2): 129-134.
