Approach based on WSFT for crawling deep web

doi:10.3778/j.issn.1002-8331.1604-0039

Abstract

Abstract: Ajax technology is widely used in deep web application development. Because an Ajax page has multiple states that are strongly correlated with one another, this paper proposes building a Weighted State Fusion Tree (WSFT) model to pre-process the text information in such pages. First, a text feature tree is introduced as the state fingerprint for capturing states, which optimizes the current approach to Ajax page data collection. Second, the StatusRank method is used to calculate a transition weight for each state of the Ajax page, so that the state transition information can be analyzed. Finally, the WSFT is generated. The experimental results show that the proposed method effectively obtains the text information of a multi-state Ajax page and helps the subsequent extraction of important text in web mining.

Key words: Ajax crawler, weighted state fusion tree, text mining, text feature tree
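
The first two steps of the abstract (fingerprinting states by their text, then weighting states from the transition graph) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not define the text feature tree or the StatusRank formula, so the sketch represents a page state as a hypothetical `(tag, text, children)` tuple, hashes it into a fingerprint, and substitutes a plain PageRank-style iteration as a stand-in for StatusRank. The names `fingerprint`, `explore`, `rank_states`, and the `successors` callback are invented here and are not taken from the paper.

```python
import hashlib
from collections import deque


def fingerprint(node):
    """Hash a (tag, text, children) tree; whitespace differences are ignored."""
    tag, text, children = node
    h = hashlib.sha1()
    h.update(f"{tag}|{' '.join(text.split())}".encode("utf-8"))
    for child in children:
        h.update(b"\x00")  # separator between this node and each child hash
        h.update(fingerprint(child).encode("ascii"))
    return h.hexdigest()


def explore(start, successors):
    """Breadth-first traversal of page states.

    `successors(tree)` is a hypothetical callback that fires the events
    available in a state and returns the resulting trees. States with the
    same fingerprint are treated as one state.
    """
    states = {fingerprint(start): start}
    edges = set()
    queue = deque([start])
    while queue:
        tree = queue.popleft()
        src = fingerprint(tree)
        for nxt in successors(tree):
            dst = fingerprint(nxt)
            if dst != src:  # an event that changes no text is not a transition
                edges.add((src, dst))
            if dst not in states:
                states[dst] = nxt
                queue.append(nxt)
    return states, edges


def rank_states(states, edges, damping=0.85, iterations=50):
    """PageRank-style weights over the state graph (stand-in, not StatusRank)."""
    n = len(states)
    rank = {s: 1.0 / n for s in states}
    out = {s: [d for (a, d) in edges if a == s] for s in states}
    for _ in range(iterations):
        new = {s: (1.0 - damping) / n for s in states}
        for s in states:
            targets = out[s] or list(states)  # dangling state: spread evenly
            share = damping * rank[s] / len(targets)
            for d in targets:
                new[d] += share
        rank = new
    return rank
```

Hashing only the tag and normalized text means that two DOM snapshots differing in whitespace collapse into one state, which is the role the abstract assigns to the fingerprint. How the weighted states are then fused into a WSFT is not described in the abstract, so the sketch stops at the weights.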

YANG Guanzhong, LI Hongxuan. Approach based on WSFT for crawling deep web[J]. Computer Engineering and Applications, 2017, 53(18): 236-242.
