Computer Engineering and Applications, 2011, Vol. 47, Issue (4): 138-140. DOI: 10.3778/j.issn.1002-8331.2011.04.038

• Database, Signal and Information Processing •

Clustering algorithm in Deep Web based on Chinese word segmentation

LIU Ronghui1,2, ZHENG Jianguo1

  1. School of Management, Donghua University, Shanghai 200051, China
  2. Department of Computer Science and Engineering, Henan University of Urban Construction, Pingdingshan, Henan 467044, China
  • Received: 2009-05-18 Revised: 2009-07-06 Online: 2011-02-01 Published: 2011-02-01
  • Contact: LIU Ronghui

Abstract: With the rapid development of the Deep Web, it has become especially important to obtain high-quality data from Web databases through the query interfaces that commercial websites provide, and to analyze and process those data. In this paper, result pages are retrieved by dynamically submitting keywords through the query interface. The Chinese text in the result pages is extracted and segmented into words, and the segmentation results are analyzed statistically. Document frequency (DF) is used to reduce dimensionality and select feature terms, and TF/IDF is used to compute the weight vector matrix of those feature terms. Clustering the weight matrix then groups the documents into classes. Simulation experiments verify the soundness and feasibility of the algorithm.

Key words: Deep Web, data extraction, Chinese word segmentation, Term Frequency/Inverse Document Frequency (TF/IDF), clustering
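
The pipeline outlined in the abstract (segmented text, DF-based feature selection, TF/IDF weighting, clustering) can be illustrated with a minimal Python sketch. This is not the authors' implementation. The abstract does not name the clustering method, the DF thresholds, or the exact TF/IDF variant, so the k-means routine, the `min_df` and `max_df_ratio` parameters, the length-normalized TF with logarithmic IDF, and the toy documents below are all assumptions made for illustration. The documents are assumed to be already segmented into words; page retrieval and word segmentation are outside the sketch.

```python
import math
from collections import Counter


def select_features(docs, min_df=2, max_df_ratio=0.9):
    """Return (feature terms, DF table).

    A term is kept when min_df <= DF <= max_df_ratio * N.
    Both thresholds are illustrative assumptions, not values from the paper.
    """
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    features = sorted(t for t, c in df.items()
                      if min_df <= c <= max_df_ratio * n_docs)
    return features, df


def tfidf_matrix(docs, features, df):
    """Build one unit-length TF/IDF weight vector per document."""
    n_docs = len(docs)
    matrix = []
    for doc in docs:
        tf = Counter(doc)
        total = len(doc) or 1  # guard against empty documents
        row = [tf[t] / total * math.log(n_docs / df[t]) for t in features]
        norm = math.sqrt(sum(w * w for w in row))
        matrix.append([w / norm for w in row] if norm else row)
    return matrix


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(rows, k, max_iter=50):
    """Plain k-means with deterministic farthest-first initialization (assumed method)."""
    centroids = [list(rows[0])]
    while len(centroids) < k:
        far = max(rows, key=lambda r: min(sq_dist(r, c) for c in centroids))
        centroids.append(list(far))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sq_dist(r, centroids[c]))
                      for r in rows]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


# Hypothetical, already-segmented result records (three about phones, three about novels).
docs = [
    ["手机", "屏幕", "电池", "手机"],
    ["手机", "电池", "充电"],
    ["小说", "作者", "出版"],
    ["小说", "出版", "作者", "小说"],
    ["手机", "屏幕", "促销"],
    ["小说", "作者", "签名"],
]

features, df = select_features(docs)          # terms seen in only one record are dropped
weights = tfidf_matrix(docs, features, df)    # weight vector matrix of the feature terms
labels = kmeans(weights, k=2)                 # cluster label per record
print(features)
print(labels)
```

In practice the token lists would come from a Chinese word segmenter applied to the text extracted from the result pages, and the number of clusters `k` would have to be chosen for the data at hand.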
