Web content information extraction using density of feature text

doi:10.3778/j.issn.1002-8331.2010.20.001

Computer Engineering and Applications, 2010, Vol. 46, Issue (20): 1-3. DOI: 10.3778/j.issn.1002-8331.2010.20.001

• Doctoral Forum •

WANG Shao-kang¹,², DONG Ke-jun¹, YAN Bao-ping¹

1. Computer Network Information Center, Chinese Academy of Sciences, Beijing 100190, China
2. Graduate School of Chinese Academy of Sciences, Beijing 100049, China

Received: 2010-03-17 Revised: 2010-05-17 Online: 2010-07-11 Published: 2010-07-11
Contact: WANG Shao-kang

Abstract

Web pages are becoming increasingly diverse, complex, and non-standardized, which makes information extraction more difficult. This paper proposes a method for extracting the main content of web pages based on the density of feature text. The method classifies the text in a page according to its usage and features, then constructs mathematical models to analyze text proportion and density, and thereby identifies the content text accurately. The method has low time and space complexity. Experiments show that it extracts content effectively from complex and multi-topic web pages and is widely applicable.
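The abstract does not give the classification rules or the mathematical models, so the following is only a minimal Python sketch of the general idea of density-based content extraction, not the authors' method. It assumes a two-way split of text into link (anchor) text and plain text, and the names `BlockCollector`, `density`, and `extract_content`, the set of block tags, and the thresholds `min_chars` and `min_density` are all hypothetical choices made for illustration.

```python
from html.parser import HTMLParser

# Assumption: these tags delimit candidate text blocks.
BLOCK_TAGS = {"div", "p", "td", "li", "article", "section"}


class BlockCollector(HTMLParser):
    """Collect, per block, the amount of plain text and link (anchor) text."""

    def __init__(self):
        super().__init__()
        self.blocks = []    # finished blocks, appended when their end tag is seen
        self._stack = []    # blocks that are currently open
        self._in_link = 0   # nesting depth of <a>
        self._skip = 0      # nesting depth of <script>/<style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1
        elif tag == "a":
            self._in_link += 1
        elif tag in BLOCK_TAGS:
            self._stack.append({"text": [], "plain": 0, "link": 0})

    def handle_endtag(self, tag):
        if tag in ("script", "style"):
            self._skip = max(0, self._skip - 1)
        elif tag == "a":
            self._in_link = max(0, self._in_link - 1)
        elif tag in BLOCK_TAGS and self._stack:
            self.blocks.append(self._stack.pop())

    def handle_data(self, data):
        text = data.strip()
        if not text or self._skip or not self._stack:
            return
        block = self._stack[-1]  # attribute text to the innermost open block
        block["text"].append(text)
        block["link" if self._in_link else "plain"] += len(text)


def density(block):
    """Share of a block's characters that are plain (non-link) text."""
    total = block["plain"] + block["link"]
    return block["plain"] / total if total else 0.0


def extract_content(html, min_chars=40, min_density=0.7):
    """Keep blocks with enough plain text and a high plain-text density."""
    parser = BlockCollector()
    parser.feed(html)
    parser.close()
    kept = [b for b in parser.blocks
            if b["plain"] >= min_chars and density(b) >= min_density]
    return "\n".join(" ".join(b["text"]) for b in kept)
```

Under these assumptions, navigation bars and footers, which consist mostly of link text, score a low density and are dropped, while long runs of plain text are kept. The sketch makes a single pass over the page and stores only per-block counters, in keeping with the low time and space cost the abstract claims for the real method.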

Key words: text density, text feature, information extraction, web page

CLC Number: TP393

WANG Shao-kang, DONG Ke-jun, YAN Bao-ping. Web content information extraction using density of feature text[J]. Computer Engineering and Applications, 2010, 46(20): 1-3.
