Computer Engineering and Applications ›› 2020, Vol. 56 ›› Issue (17): 167-172. DOI: 10.3778/j.issn.1002-8331.1906-0332

• Pattern Recognition and Artificial Intelligence •

Construction of Khmer-Chinese Bilingual Word Embedding Based on Multiple CCA Algorithms

JIANG Yafang, YAN Xin, LI Siyuan, XU Guangyi, ZHOU Feng

  1. School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650504, China
  2. Yunnan Nantian Electronic Information Industry Co., Ltd., Kunming 650051, China
  • Published: 2020-09-01 Online: 2020-08-31

Abstract:

Existing methods for learning bilingual word embeddings require large amounts of bilingual parallel text, which is in short supply for the Khmer-Chinese language pair. English, as a lingua franca, has comparatively abundant and easily obtained English-Chinese and English-Khmer parallel text. This paper therefore extends the canonical correlation analysis (CCA) cross-lingual word embedding model and proposes a method for constructing Khmer-Chinese bilingual word embeddings based on multiple applications of CCA, with English as the intermediate language. First, English and Chinese word embeddings are projected into a Chinese-English vector space, and English and Khmer word embeddings are projected into a Khmer-English vector space; CCA then yields English-Chinese and English-Khmer bilingual word embeddings, respectively. Second, using English words as pivots, together with part of a Khmer-Chinese bilingual electronic dictionary built in the authors' laboratory, the English-Khmer and English-Chinese bilingual word embeddings from the first step are projected into a common third vector space, and CCA is applied again to obtain the projection matrices for Khmer and Chinese in that space. The result is a set of Khmer-English-Chinese multilingual word embeddings, which contains the Khmer-Chinese bilingual word embeddings. Compared with traditional methods, the proposed method addresses the scarcity of initial Khmer-Chinese parallel text faced by other models and yields Khmer-Chinese bilingual word embeddings of higher quality.

Key words: bilingual word embedding, canonical correlation analysis (CCA), Khmer-Chinese bilingual, multiple canonical correlation analysis (CCA) algorithm
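
The two-stage procedure described in the abstract can be sketched in NumPy as follows. This is an illustrative reading of the abstract under stated assumptions, not the authors' implementation: the function names (`fit_cca`, `project`, `pivot_cca`, `toy_embeddings`, `cosine_nn`), the ridge term `reg`, the choice to pair every English word's two first-stage projections as pivot rows, and the synthetic embeddings and dictionaries are all hypothetical and introduced only for the example.

```python
import numpy as np


def fit_cca(X, Y, k, reg=1e-8):
    """Fit CCA on row-aligned X (n, dx) and Y (n, dy); keep the top-k pairs.

    Returns two (mean, weights) models, one per view.
    """
    mx, my = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mx, Y - my
    n = X.shape[0]
    # Covariances, with a small ridge (an assumption) for numerical stability.
    Cxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / (n - 1)
    Lx, Ly = np.linalg.cholesky(Cxx), np.linalg.cholesky(Cyy)
    # Whitened cross-covariance T = Lx^{-1} Cxy Ly^{-T}; its SVD gives the
    # canonical directions in whitened coordinates.
    T = np.linalg.solve(Ly, np.linalg.solve(Lx, Cxy).T).T
    U, _, Vt = np.linalg.svd(T, full_matrices=False)
    A = np.linalg.solve(Lx.T, U[:, :k])
    B = np.linalg.solve(Ly.T, Vt[:k].T)
    return (mx, A), (my, B)


def project(V, model):
    """Apply a (mean, weights) model returned by fit_cca."""
    mean, W = model
    return (V - mean) @ W


def pivot_cca(en, zh, km, en_zh_pairs, en_km_pairs, km_zh_pairs, k):
    """Each *_pairs array has shape (n, 2) and holds row indices."""
    # Stage 1: one CCA per language pair that has plentiful resources.
    i, j = en_zh_pairs.T
    en_to_ez, zh_to_ez = fit_cca(en[i], zh[j], k)
    i, j = en_km_pairs.T
    en_to_ek, km_to_ek = fit_cca(en[i], km[j], k)
    en_ez, zh_ez = project(en, en_to_ez), project(zh, zh_to_ez)
    en_ek, km_ek = project(en, en_to_ek), project(km, km_to_ek)
    # Stage 2: align the two spaces. Pivot rows pair the same English word
    # in both spaces; seed rows come from the small Khmer-Chinese dictionary.
    a, b = km_zh_pairs.T
    view_zh = np.vstack([en_ez, zh_ez[b]])
    view_km = np.vstack([en_ek, km_ek[a]])
    ez_to_shared, ek_to_shared = fit_cca(view_zh, view_km, k)
    return project(km_ek, ek_to_shared), project(zh_ez, ez_to_shared)


def toy_embeddings(latent, dim, rng, noise=0.01):
    """Synthetic monolingual embeddings: a linear view of shared 'meaning'."""
    q, _ = np.linalg.qr(rng.normal(size=(dim, latent.shape[1])))
    return latent @ q.T + noise * rng.normal(size=(latent.shape[0], dim))


def cosine_nn(queries, keys):
    """Index of the most cosine-similar key for each query row."""
    qn = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    kn = keys / np.linalg.norm(keys, axis=1, keepdims=True)
    return (qn @ kn.T).argmax(axis=1)


# Toy data: row i of en, zh and km are translations of the same concept.
rng = np.random.default_rng(0)
latent = rng.normal(size=(400, 8))
en, zh, km = (toy_embeddings(latent, d, rng) for d in (16, 14, 12))
idx = np.arange(400)
en_zh = np.stack([idx[:300], idx[:300]], axis=1)   # large En-Zh dictionary
en_km = np.stack([idx[100:], idx[100:]], axis=1)   # large En-Km dictionary
km_zh = np.stack([idx[:20], idx[:20]], axis=1)     # small Km-Zh seed dictionary

km_shared, zh_shared = pivot_cca(en, zh, km, en_zh, en_km, km_zh, k=8)
# Evaluate Khmer -> Chinese retrieval on words outside the seed dictionary.
accuracy = (cosine_nn(km_shared[20:], zh_shared) == idx[20:]).mean()
print(f"Khmer->Chinese top-1 accuracy on toy data: {accuracy:.2f}")
```

In this sketch, the Khmer-Chinese seed dictionary only adds rows to the second-stage CCA; most of the alignment signal comes from the English pivot rows, which mirrors the motivation given in the abstract. With real embeddings, the dictionaries, the dimensionality `k`, and the evaluation protocol would need to be chosen to match the data.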