Computer Engineering and Applications ›› 2020, Vol. 56 ›› Issue (16): 165-168. DOI: 10.3778/j.issn.1002-8331.1906-0128

• Pattern Recognition and Artificial Intelligence •

Research of Technology on Building China English New Words Corpus

LIU Yongfang, HAO Xiaoyan, LIU Rong

  1. College of Information and Computer, Taiyuan University of Technology, Taiyuan 030000, China
  2. Foreign Language College, Taiyuan University of Technology, Taiyuan 030000, China
  • Published: 2020-08-15 Online: 2020-08-11

Abstract:

New China English words are appearing in large numbers, and the lack of a dedicated corpus has become the main obstacle to studying China English. New word identification is the main technical problem in building such a corpus. Existing identification algorithms based on Pointwise Mutual Information (PMI) and Branch Entropy (BE) suffer from two problems: the words they extract have low internal cohesion, and a single PMI threshold both admits many invalid phrases with high PMI and fails to identify new phrases with low PMI. To address these problems, a China English new word identification algorithm based on improved multi-word PMI and BE is proposed, which identifies new words using multi-word PMI and a double PMI threshold. Experimental results show that, on the same data and in the same experimental environment, the proposed method improves the accuracy rate, recall rate, and F-value, and is effective and feasible for corpus construction.

Key words: China English, corpus of China English new words, identification of new words, Pointwise Mutual Information (PMI), double threshold
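
The abstract names three ingredients (multi-word PMI, branch entropy, and a double PMI threshold) but gives no formulas, so the following Python sketch is only an illustration under stated assumptions, not the authors' algorithm. It assumes that multi-word PMI is the minimum PMI over every two-part split of a candidate phrase (one common generalization of two-word PMI), that probabilities are estimated as count divided by the number of tokens, and that the double threshold accepts high-PMI candidates with a lenient entropy check and mid-PMI candidates only with a stricter one. All function names and the parameters `pmi_low`, `pmi_high`, `be_min`, and `be_strict` are hypothetical.

```python
import math
from collections import Counter


def ngram_counts(tokens, max_n):
    """Count every n-gram of length 1..max_n, keyed by a tuple of words."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def multiword_pmi(gram, counts, total):
    """Assumed multi-word PMI: minimum PMI over all two-part splits of `gram`.

    `gram` must have at least two words and must occur in the corpus;
    `counts` must cover n-grams up to len(gram); `total` is the token count.
    """
    p_whole = counts[gram] / total
    scores = []
    for k in range(1, len(gram)):
        p_left = counts[gram[:k]] / total
        p_right = counts[gram[k:]] / total
        scores.append(math.log2(p_whole / (p_left * p_right)))
    return min(scores)


def branch_entropy(gram, tokens):
    """Return (left, right) entropy of the words adjacent to `gram`."""
    n = len(gram)
    left, right = Counter(), Counter()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == gram:
            if i > 0:
                left[tokens[i - 1]] += 1
            if i + n < len(tokens):
                right[tokens[i + n]] += 1

    def entropy(neighbors):
        total = sum(neighbors.values())
        if total == 0:
            return 0.0
        return sum(v / total * math.log2(total / v) for v in neighbors.values())

    return entropy(left), entropy(right)


def is_new_word(pmi, left_h, right_h, pmi_low, pmi_high, be_min, be_strict):
    """Hypothetical double-threshold rule (an assumption, not from the paper).

    High-PMI candidates still need a minimum branch entropy, which filters
    invalid high-scoring phrases; mid-PMI candidates pass only with a
    stricter entropy, which recovers some low-scoring new phrases.
    """
    be = min(left_h, right_h)
    if pmi >= pmi_high:
        return be >= be_min
    if pmi >= pmi_low:
        return be >= be_strict
    return False
```

In use, a corpus would be tokenized into a word list, `ngram_counts` would be built once, and each candidate phrase would be scored with `multiword_pmi` and `branch_entropy` before being passed to `is_new_word`. The threshold values would have to be tuned on labeled data; none are given in the abstract.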