Substring reduction algorithm based on independence statistic

doi:10.3778/j.issn.1002-8331.2010.24.039

Computer Engineering and Applications, 2010, Vol. 46, Issue (24): 129-131. DOI: 10.3778/j.issn.1002-8331.2010.24.039

• Database, Signal and Information Processing •

ZHOU Lang¹,², FENG Chong², HUANG He-yan², WANG Ping-yao³

1. School of Computer Science and Technology, Nanjing University of Science and Technology, Nanjing 210094, China
2. Research Center of Computer & Language Information Engineering, Chinese Academy of Sciences, Beijing 100097, China
3. Department of Computer, Ningbo Polytechnic, Ningbo, Zhejiang 315800, China

Received: 2009-02-10  Revised: 2009-04-01  Online: 2010-08-21  Published: 2010-08-21
Contact: ZHOU Lang

Abstract

Most substring reduction algorithms in current use target only substrings that have the same frequency as their parent string, handled in a one-to-one manner. However, segmenting text with a morphological analysis tool unavoidably produces many segmentation fragments, which in turn form many meaningless substrings. Based on an analysis of the one-to-many relationship between a meaningless substring and its parent strings, a substring reduction algorithm based on an independence statistic is proposed to filter out these meaningless substrings. Finally, the algorithm is applied in a Chinese multi-word term extraction system, improving the precision of the extraction results from 91.3% to 93.32%.
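The contrast drawn in the abstract can be illustrated with a small sketch. The abstract does not define the independence statistic, so the ratio used here (the share of a candidate's occurrences that are not covered by any longer candidate) is only an assumption made for illustration, as are the function names, the toy corpus, and the 0.5 threshold. It is not the authors' formula.

```python
def occurrences(tokens, gram):
    """Start indices at which the token tuple `gram` occurs in `tokens`."""
    n = len(gram)
    return [i for i in range(len(tokens) - n + 1) if tuple(tokens[i:i + n]) == gram]


def frequency(gram, sentences):
    """Total number of occurrences of `gram` in the segmented corpus."""
    return sum(len(occurrences(s, gram)) for s in sentences)


def same_frequency_reduction(candidates, sentences):
    """One-to-one rule: drop a candidate if some longer candidate
    contains it and has exactly the same frequency."""
    freq = {c: frequency(c, sentences) for c in candidates}

    def contains(parent, sub):
        return len(parent) > len(sub) and bool(occurrences(parent, sub))

    return {c for c in candidates
            if not any(contains(p, c) and freq[p] == freq[c] for p in candidates)}


def independence(sub, candidates, sentences):
    """ASSUMED statistic (not from the paper): fraction of occurrences of
    `sub` that do not lie inside an occurrence of any longer candidate."""
    parents = [p for p in candidates if len(p) > len(sub)]
    total = free = 0
    for sent in sentences:
        # Half-open spans [start, end) of every parent occurrence in this sentence.
        spans = [(j, j + len(p)) for p in parents for j in occurrences(sent, p)]
        for i in occurrences(sent, sub):
            total += 1
            if not any(a <= i and i + len(sub) <= b for a, b in spans):
                free += 1
    return free / total if total else 0.0


# Toy segmented corpus: "b" stands in for a segmentation fragment.
sentences = [["a", "b", "c"], ["a", "b", "d"], ["a", "b", "c"], ["e", "b"]]
candidates = {("b",), ("a", "b", "c"), ("a", "b", "d")}

THRESHOLD = 0.5  # arbitrary cut-off chosen for this example
kept = {c for c in candidates if independence(c, candidates, sentences) >= THRESHOLD}
```

In this toy corpus the fragment `("b",)` occurs four times while its two parents occur two and one times, so the same-frequency rule keeps it. Under the assumed one-to-many statistic, only one of its four occurrences is free of a parent, so it falls below the threshold and is filtered.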

CLC Number: TP391

ZHOU Lang, FENG Chong, HUANG He-yan, WANG Ping-yao. Substring reduction algorithm based on independence statistic[J]. Computer Engineering and Applications, 2010, 46(24): 129-131.
