A Chinese Text Segmentation Method Based on Quantified Word Relations from a Dictionary

doi:10.3778/j.issn.1002-8331.2008.21.007

Computer Engineering and Applications, 2008, Vol. 44, Issue (21): 25-29. DOI: 10.3778/j.issn.1002-8331.2008.21.007


Research on Chinese text segmentation based on quantified conceptual relations extracted from Chinese dictionary

ZHONG Mao-sheng (钟茂生)^1,2, HU Yi (胡熠)^1, LIU Lei (刘磊)^1

1. Department of Computer Science and Engineering, Shanghai Jiaotong University, Shanghai 200240, China
2. School of Information Engineering, East China Jiaotong University, Nanchang 330013, China

Received: 2008-04-30 Revised: 2008-06-02 Online: 2008-07-21 Published: 2008-07-21
Contact: ZHONG Mao-sheng

Abstract

With the rapid growth of Internet information resources, processing massive amounts of unstructured text has become a major challenge. Text segmentation is an important preprocessing step in text processing, and its performance directly affects downstream tasks such as information retrieval, text summarization, and question answering. Text segmentation must solve two fundamental problems: how to measure topical relevance, and how to decide where segment boundaries fall. To address them, this paper proposes a method for measuring the relevance between sentences based on Quantified Conceptual Relations (QCR) between words extracted from the Modern Chinese Standard Dictionary (MCSD), and builds a mathematical model that computes a segmentation value for each gap point between sentences, so that Chinese text is segmented at the sentence level rather than the paragraph level. Experiments on three test sets drawn from the national Chinese corpus show that both the average and the lowest boundary-identification error probability Pk of this method are better than those of other existing Chinese text segmentation methods.

Key words: text segmentation, quantified conceptual relations, inter-sentence relevance measure, gap point, segmentation value
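
The abstract reports results as the error probability Pk without defining it. As a point of reference, the following is a minimal sketch of the commonly used windowed Pk measure for linear text segmentation. It is not taken from the paper: the function names `segment_ids` and `pk`, the representation of a segmentation as a list of segment lengths in sentences, and the default window size (half the mean reference segment length, floored) are all assumptions made for illustration.

```python
from typing import List, Optional


def segment_ids(segment_lengths: List[int]) -> List[int]:
    """Expand segment lengths such as [2, 3] into per-sentence ids [0, 0, 1, 1, 1]."""
    ids: List[int] = []
    for seg_index, length in enumerate(segment_lengths):
        ids.extend([seg_index] * length)
    return ids


def pk(reference: List[int], hypothesis: List[int], k: Optional[int] = None) -> float:
    """Windowed segmentation error Pk (lower is better).

    `reference` and `hypothesis` are lists of segment lengths in sentences
    (an assumed representation). A probe of width k slides over the text;
    an error is counted whenever the two segmentations disagree on whether
    sentences i and i + k belong to the same segment.
    """
    ref = segment_ids(reference)
    hyp = segment_ids(hypothesis)
    if len(ref) != len(hyp):
        raise ValueError("both segmentations must cover the same number of sentences")
    n = len(ref)
    if k is None:
        # Assumed convention: half the mean reference segment length, at least 1.
        k = max(1, n // (2 * len(reference)))
    if k >= n:
        raise ValueError("window size k must be smaller than the text length")

    errors = 0
    for i in range(n - k):
        same_in_ref = ref[i] == ref[i + k]
        same_in_hyp = hyp[i] == hyp[i + k]
        if same_in_ref != same_in_hyp:
            errors += 1
    return errors / (n - k)
```

For example, with a six-sentence text whose reference segmentation is `[3, 3]`, an identical hypothesis gives `pk([3, 3], [3, 3]) == 0.0`, while a hypothesis that finds no boundary gives `pk([3, 3], [6], k=2) == 0.5`, because two of the four probe positions straddle the missed boundary.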

Cite this article: ZHONG Mao-sheng, HU Yi, LIU Lei. Research on Chinese text segmentation based on quantified conceptual relations extracted from Chinese dictionary[J]. Computer Engineering and Applications, 2008, 44(21): 25-29.


