Method of expressing features used for large-scale text classification

Computer Engineering and Applications, 2007, Vol. 43, Issue (15): 170-172.

• Database and Information Processing •

HAO Chun-feng¹, WANG Zhong-min²

Computer Department, University of Science and Technology Beijing, Beijing 100083, China

Received:1900-01-01 Revised:1900-01-01 Online:2007-05-21 Published:2007-05-21
Contact: HAO Chun-feng

Abstract

With the rapid development of network and information technology, text categorization has become a key technique for processing and organizing large volumes of text. How to represent a text accurately as data that a classifier can process is a key problem that seriously limits improvements in categorization performance. Building on the classical vector space model and the tf-idf weighting formula, the authors propose p-idf, an improved term-weighting formula intended for text categorization. After comparing four typical text classifiers (Bayes, K-nearest neighbors, neural networks, and support vector machines), they build an experimental text categorization system around a support vector machine classifier. Experiments comparing the effect of three weighting formulas (tf-idf, p-idf, and LTC) on classifier performance confirm that the proposed p-idf formula is reasonable and effective.

Key words: text categorization, vector space model, p-idf, support vector machine (SVM)
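
For readers unfamiliar with the two baseline weightings named in the abstract, the following is a minimal sketch of tf-idf and LTC term weighting in the vector space model. It is not the authors' code. The LTC variant shown, log(tf + 1) × log(N / df) with cosine normalization, is one common form in the text categorization literature and is an assumption here, as are the function names and the toy corpus. The p-idf formula is not defined in the abstract, so it is not reproduced.

```python
import math
from collections import Counter


def document_frequencies(docs):
    """Count, for each term, the number of documents that contain it."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return df


def tfidf_vector(tokens, df, n_docs):
    """Plain tf-idf: raw term frequency times log(N / df)."""
    tf = Counter(tokens)
    return {t: c * math.log(n_docs / df[t]) for t, c in tf.items()}


def ltc_vector(tokens, df, n_docs):
    """LTC (assumed form): log(tf + 1) * log(N / df), cosine-normalized."""
    tf = Counter(tokens)
    raw = {t: math.log(c + 1.0) * math.log(n_docs / df[t]) for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in raw.values()))
    return {t: (w / norm if norm > 0 else 0.0) for t, w in raw.items()}


# Toy corpus (hypothetical): each document is a list of tokens.
docs = [
    ["svm", "text", "text", "kernel"],
    ["text", "bayes", "prior"],
    ["neural", "network", "text"],
]
df = document_frequencies(docs)
tfidf = tfidf_vector(docs[0], df, len(docs))
ltc = ltc_vector(docs[0], df, len(docs))
print(tfidf)
print(ltc)
```

In this toy corpus the term "text" occurs in every document, so both weightings assign it zero weight. Each resulting dictionary is a sparse feature vector of the kind a classifier such as an SVM would take as input.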

HAO Chun-feng, WANG Zhong-min. Method of expressing features used for large-scale text classification [J]. Computer Engineering and Applications, 2007, 43(15): 170-172.
