Research on extraction model of malicious domain corpus based on context semantics

doi:10.3778/j.issn.1002-8331.1612-0283

Computer Engineering and Applications, 2018, Vol. 54, Issue (9): 101-108.

HUANG Cheng (1,2), LIU Jiayong (1), LIU Liang (1), HE Xiang (1), TANG Dianhua (2)

1. College of Electronics and Information Engineering, Sichuan University, Chengdu 610065, China
2. Science and Technology on Communication Security Laboratory, Chengdu 610041, China

Online:2018-05-01 Published:2018-05-15

Abstract

To address the omissions (false negatives) and false positives that arise when whitelist filtering is used to extract malicious domains from massive text, this paper presents a model for extracting a malicious domain corpus based on contextual semantics. The model performs semantic analysis on the context words and phrases of the sentences that contain malicious domains, and uses natural language processing to automatically generate a corpus that describes those domains. Applied to publicly available advanced persistent threat (APT) analysis reports, the model extracted a large amount of malicious domain corpus data. The validity of the extracted corpus was verified on security blog articles with a random forest classifier.

Key words: malware detection, text mining, information extraction, malicious corpus
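
The abstract describes building a corpus from the words that surround malicious domains in a sentence. The Python sketch below illustrates that general idea only; it is not the authors' model. The regular expression, the window size, the naive sentence splitter, and all function names (`split_sentences`, `context_words`, `build_corpus`) are assumptions introduced for illustration, and the random forest verification step is not shown.

```python
import re
from collections import Counter

# Assumed pattern (not from the paper): matches plain and "defanged"
# domains such as example.com or evil-update[.]com.
DOMAIN_RE = re.compile(
    r"\b(?:[a-z0-9](?:[a-z0-9-]*[a-z0-9])?(?:\[\.\]|\.))+[a-z]{2,}\b",
    re.IGNORECASE,
)


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def context_words(sentence, window=3):
    """Collect words within `window` tokens of each domain in the sentence."""
    tokens = sentence.split()
    hits = []
    for i, tok in enumerate(tokens):
        if DOMAIN_RE.search(tok):
            neighbours = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
            hits.extend(w.strip(".,;:()\"'").lower() for w in neighbours)
    # Drop empty strings and any neighbouring token that is itself a domain.
    return [w for w in hits if w and not DOMAIN_RE.search(w)]


def build_corpus(text, window=3):
    """Count context words over every sentence that mentions a domain."""
    counts = Counter()
    for sentence in split_sentences(text):
        counts.update(context_words(sentence, window))
    return counts


text = (
    "The backdoor beacons to evil-update[.]com for command and control. "
    "The report was published on Monday."
)
corpus = build_corpus(text)
```

Here only the first sentence contributes to `corpus`, because the second contains no domain. A pure pattern match like `DOMAIN_RE` also fires on strings such as file names, which is the kind of false positive that a context-based corpus is meant to help filter out.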

Cite this article: HUANG Cheng, LIU Jiayong, LIU Liang, HE Xiang, TANG Dianhua. Research on extraction model of malicious domain corpus based on context semantics[J]. Computer Engineering and Applications, 2018, 54(9): 101-108.
