Research of Technology on Building China English New Words Corpus

doi:10.3778/j.issn.1002-8331.1906-0128

Computer Engineering and Applications, 2020, Vol. 56, Issue 16: 165-168. DOI: 10.3778/j.issn.1002-8331.1906-0128

Research of Technology on Building China English New Words Corpus

LIU Yongfang, HAO Xiaoyan, LIU Rong

1. College of Information and Computer, Taiyuan University of Technology, Taiyuan 030000, China
2. Foreign Language College, Taiyuan University of Technology, Taiyuan 030000, China

Online: 2020-08-15 Published: 2020-08-11

Abstract:

Dedicated corpora of China English new words are scarce, which hinders systematic study of the growing number of such words, and new word identification is the main technical problem in building such a corpus. Existing new word identification algorithms based on Pointwise Mutual Information (PMI) and Branch Entropy (BE) suffer from low internal cohesion of the identified words. In addition, a single PMI threshold admits many invalid phrases with high PMI values while missing genuine new phrases with low PMI values. To address these problems, an identification algorithm for China English new words based on improved multi-word PMI and BE is proposed, which identifies new words using multi-word PMI together with a double PMI threshold. Experimental results show that, on the same data and in the same experimental environment, the proposed method improves precision, recall and the F value, and is effective and feasible for corpus construction.

Key words: China English, corpus of China English new words, identification of new words, Pointwise Mutual Information (PMI), double threshold
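
The abstract does not give the formulas or the decision rule, so the following is only a minimal sketch of how multi-word PMI, branch entropy and a double PMI threshold might be combined. Everything specific here is an assumption made for illustration: multi-word PMI is taken as the smallest PMI over all binary splits of an n-gram, branch entropy as the smaller of the left and right neighbour entropies, and the rule "reject below the low threshold, accept above the high threshold, defer to branch entropy in between" is a guess rather than the authors' method. The function names, parameter names and default values are hypothetical.

```python
import math
from collections import Counter, defaultdict


def ngram_counts(tokens, max_n):
    """Count every n-gram of length 1..max_n in a list of word tokens."""
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def multiword_pmi(gram, counts, total):
    """Assumed definition: the smallest PMI over all binary splits of the n-gram."""
    p_whole = counts[gram] / total
    scores = []
    for k in range(1, len(gram)):
        p_left = counts[gram[:k]] / total
        p_right = counts[gram[k:]] / total
        scores.append(math.log2(p_whole / (p_left * p_right)))
    return min(scores)


def entropy(neighbors):
    """Shannon entropy (in bits) of a Counter of neighbouring words."""
    n = sum(neighbors.values())
    if n == 0:
        return 0.0
    return -sum(c / n * math.log2(c / n) for c in neighbors.values())


def branch_entropy(tokens, gram_len):
    """Map each n-gram to the smaller of its left and right neighbour entropies."""
    left, right = defaultdict(Counter), defaultdict(Counter)
    for i in range(len(tokens) - gram_len + 1):
        gram = tuple(tokens[i:i + gram_len])
        if i > 0:
            left[gram][tokens[i - 1]] += 1
        if i + gram_len < len(tokens):
            right[gram][tokens[i + gram_len]] += 1
    return {g: min(entropy(left[g]), entropy(right[g])) for g in set(left) | set(right)}


def extract_new_words(tokens, max_n=3, pmi_low=1.0, pmi_high=4.0, be_min=1.0, min_count=2):
    """Hypothetical double-threshold rule (not taken from the paper)."""
    counts = ngram_counts(tokens, max_n)
    total = len(tokens)
    accepted = []
    for n in range(2, max_n + 1):
        be = branch_entropy(tokens, n)
        for gram, c in counts.items():
            if len(gram) != n or c < min_count:
                continue
            pmi = multiword_pmi(gram, counts, total)
            if pmi < pmi_low:
                continue  # below the low threshold: reject
            # Above the high threshold: accept; in between: require free boundaries.
            if pmi >= pmi_high or be.get(gram, 0.0) >= be_min:
                accepted.append(" ".join(gram))
    return accepted


demo_tokens = ("this paper tiger is weak but every paper tiger can roar and "
               "one paper tiger may bite so no paper tiger will win").split()
print(extract_new_words(demo_tokens))                              # ['paper tiger']
print(extract_new_words(demo_tokens, pmi_low=4.0, pmi_high=4.0))   # []
```

In this toy text, "paper tiger" has a PMI of about 2.5 bits, so a single threshold of 4.0 discards it, while the two-band rule keeps it because its left and right contexts vary freely. Real thresholds would have to be tuned on the actual corpus.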

LIU Yongfang, HAO Xiaoyan, LIU Rong. Research of Technology on Building China English New Words Corpus[J]. Computer Engineering and Applications, 2020, 56(16): 165-168.

