Research on a Biological Information Mining Model Based on a Document Set

Computer Engineering and Applications, 2016, Vol. 52, Issue (24): 102-106.

SUN Hongmin, JIANG Nannan, LI Xiang

School of Electrical and Information, Northeast Agricultural University, Harbin 150030, China

Online: 2016-12-15 Published: 2016-12-20

Abstract

Abstract: The volume of biomedical literature is growing rapidly, and manually extracting the required information from it can no longer keep pace with that growth. This paper proposes a new biological information mining model that combines natural language processing and statistical methods, builds on open-source tools such as Stanford Parser, and analyzes the model's key techniques. In a test on full-text SBQTL (Soybean Quantitative Trait Loci) documents, the model extracted parental (male and female parent) information with a precision of 93.0% and a recall of 78.4%; in a test on PubMed, precision and recall were 94.3% and 80.0%, respectively. The model helps biomedical researchers find the information they need in a large body of literature more quickly and effectively, so that biologists can uncover hidden biomedical knowledge, verify new scientific findings, and better understand biomedical phenomena.

Key words: text mining, Stanford Parser, text preprocessing, dependencies, information extraction
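The precision and recall figures quoted in the abstract follow the standard information-extraction definitions. As a minimal sketch, assuming extracted items and gold-standard annotations can each be represented as a set of comparable records, the two measures (and their harmonic mean, F1, which the abstract does not report) could be computed as follows. The function names and the toy data are hypothetical and are not taken from the paper.

```python
def precision_recall(extracted, gold):
    """Return (precision, recall) of extracted items against a gold-standard set."""
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)  # items that were extracted and are correct
    precision = true_positives / len(extracted) if extracted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical toy data, for illustration only (not from the paper):
# each string stands for one parental-information record.
gold_records = {"A", "B", "C", "D", "E"}
extracted_records = {"A", "B", "C", "X"}

p, r = precision_recall(extracted_records, gold_records)
print(f"precision={p:.3f} recall={r:.3f} f1={f1_score(p, r):.3f}")
```

High precision with lower recall, as in the reported results, means that most extracted records are correct but some gold-standard records are missed.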

SUN Hongmin, JIANG Nannan, LI Xiang. Research on biological information mining model based on document set[J]. Computer Engineering and Applications, 2016, 52(24): 102-106.
