Approach based on WSFT for crawling deep web

doi:10.3778/j.issn.1002-8331.1604-0039

Computer Engineering and Applications, 2017, Vol. 53, Issue 18: 236-242.

YANG Guanzhong, LI Hongxuan

School of Information Science and Engineering, Hunan University, Changsha 410082, China

Online: 2017-09-15 Published: 2017-09-29

Abstract

Ajax technology is widely used in the development of deep web sites. Because an Ajax page has multiple states that are strongly correlated with one another, this paper proposes a method for building a Weighted State Fusion Tree (WSFT) model to preprocess the text information in Ajax pages. First, a text feature tree is introduced as the state fingerprint for capturing states, which optimizes the current approach to Ajax page data collection. Second, the StatusRank method is used to compute state transition weights, from which the state transition information is analyzed. Finally, the WSFT is generated. The experimental results show that the proposed method can effectively obtain the multi-state text information of Ajax pages and helps subsequent web mining extract important text content.

Key words: Ajax crawler, weighted state fusion tree, text mining, text feature tree
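
As a rough illustration of the two ideas named in the abstract, the Python sketch below fingerprints a page state by hashing a pruned tree of its text-bearing nodes, and scores states with a PageRank-style iteration over the state transition graph. The `Node` class, the pruning rule, and the `state_ranks` function are assumptions made for illustration only. The abstract does not give the paper's actual text feature tree construction or the StatusRank formula, so this is a stand-in and not the authors' method.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    """Hypothetical minimal DOM node: a tag, its own text, and child nodes."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def text_feature_tree(node):
    """Prune the tree to nodes that carry text or have a text-bearing descendant."""
    kept = [t for t in (text_feature_tree(c) for c in node.children) if t is not None]
    text = " ".join(node.text.split())  # normalize whitespace
    if not text and not kept:
        return None  # purely structural node: drop it
    return (node.tag, text, tuple(kept))


def fingerprint(node):
    """Hash the pruned tree so two states with the same text structure compare equal."""
    return hashlib.sha256(repr(text_feature_tree(node)).encode("utf-8")).hexdigest()


def state_ranks(edges: Dict[str, List[str]], damping=0.85, iterations=50):
    """PageRank-style score per state (assumed stand-in for StatusRank)."""
    states = sorted(set(edges) | {t for ts in edges.values() for t in ts})
    n = len(states)
    rank = {s: 1.0 / n for s in states}
    for _ in range(iterations):
        # States without outgoing transitions spread their score uniformly.
        dangling = sum(rank[s] for s in states if not edges.get(s))
        new = {s: (1 - damping) / n + damping * dangling / n for s in states}
        for s, targets in edges.items():
            for t in targets:
                new[t] += damping * rank[s] / len(targets)
        rank = new
    return rank
```

In a crawler built along these lines, a newly reached Ajax state would be stored only if its fingerprint has not been seen before, and the resulting scores could serve as the weights attached to transitions into each state.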

YANG Guanzhong, LI Hongxuan. Approach based on WSFT for crawling deep web[J]. Computer Engineering and Applications, 2017, 53(18): 236-242.

