Computer Engineering and Applications, 2009, Vol. 45, Issue (18): 135-138. DOI: 10.3778/j.issn.1002-8331.2009.18.041

• Database and Information Processing •

Research on comparison of three topic segmentation approaches

SHI Jing¹, LI Wan-long¹,²

  1. College of Computer Science and Engineering, Changchun University of Technology, Changchun 130012, China
  2. College of Computer Science and Technology, Jilin University, Changchun 130012, China
  • Received: 2008-09-19 Revised: 2009-03-10 Online: 2009-06-21 Published: 2009-06-21
  • Contact: SHI Jing

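三种主题分割方法的对比研究 (A comparative study of three topic segmentation methods)

石 晶¹, 李万龙¹,² (SHI Jing, LI Wan-long)

  1. College of Computer Science and Engineering, Changchun University of Technology, Changchun 130012
  2. College of Computer Science and Technology, Jilin University, Changchun 130012
  • Corresponding author: 石 晶 (SHI Jing)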

Abstract: Text segmentation is important in many fields, including information retrieval, summarization, language modeling, and anaphora resolution. Segmentation based on PLSA and LDA associates latent topics with observable word-sentence pairs, whereas segmentation based on the small world model relies on that model's high clustering and short path lengths. The three approaches are compared in terms of model theory, segmentation strategy, and experimental results. The analysis shows that LDA-based segmentation is more stable than PLSA-based segmentation and has a lower error rate. Small-world-based segmentation is better suited to texts that exhibit pronounced small-world features.

Key words: text segmentation, Probabilistic Latent Semantic Analysis (PLSA) model, Latent Dirichlet Allocation (LDA) model, small world model
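
The two small-world properties named in the abstract, high clustering and short path length, can be measured directly on a word graph. The following is a minimal sketch only, assuming a graph whose nodes are words and whose edges join words that co-occur in a sentence. The abstract does not say how the paper builds its graph or how segment boundaries are derived from these measures, so the construction and the function names (`cooccurrence_graph`, `clustering_coefficient`, `average_path_length`) are hypothetical illustrations, not the authors' method.

```python
from collections import deque
from itertools import combinations


def cooccurrence_graph(sentences):
    """Build an undirected graph linking words that share a sentence (assumed construction)."""
    graph = {}
    for sentence in sentences:
        words = set(sentence.lower().split())
        for word in words:
            graph.setdefault(word, set())
        for a, b in combinations(words, 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph


def clustering_coefficient(graph):
    """Average local clustering coefficient; nodes with fewer than 2 neighbours count as 0."""
    total = 0.0
    for neighbours in graph.values():
        k = len(neighbours)
        if k < 2:
            continue
        # Count edges that exist among this node's neighbours.
        links = sum(1 for a, b in combinations(neighbours, 2) if b in graph[a])
        total += 2.0 * links / (k * (k - 1))
    return total / len(graph) if graph else 0.0


def average_path_length(graph):
    """Mean shortest-path length over all connected ordered node pairs, via BFS."""
    total, pairs = 0, 0
    for source in graph:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in graph[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


# Toy input, invented for illustration only.
sentences = [
    "topic models assign words to topics",
    "topics group related words",
    "graphs link related words",
]
g = cooccurrence_graph(sentences)
print(clustering_coefficient(g), average_path_length(g))
```

A text whose word graph shows a high clustering coefficient together with a short average path length would count as having pronounced small-world features in the sense the abstract describes.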

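Abstract (translated from the Chinese): Text segmentation has important applications in many fields, including information extraction, automatic summarization, language modeling, and anaphora resolution. Text segmentation based on the PLSA and LDA models attempts to link the different topics hidden within segments to the word-sentence pairs on the surface of the text, whereas segmentation based on the small world model identifies segment boundaries from that model's short path lengths and high clustering. Segmentation based on the three models is compared in terms of model characteristics, segmentation strategy, and experimental results. The analysis shows that segmentation based on the LDA model is more stable than segmentation based on the PLSA model and yields better segmentation results. The segmentation strategy based on the small world model is better suited to texts with pronounced small-world characteristics.

Keywords (translated from the Chinese): text segmentation, probabilistic latent semantic analysis model, LDA model, small world model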