Computer Engineering and Applications ›› 2011, Vol. 47 ›› Issue (35): 128-131.

• Database, Signal and Information Processing •

Gene name normalization method based on semantic similarity

胡运翠,林鸿飞,杨志豪   

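  1. Faculty of Electronic Information and Electrical Engineering, Dalian University of Technology, Dalian, Liaoning 116024
  • Received: 1900-01-01 Revised: 1900-01-01 Online: 2011-12-11 Published: 2011-12-11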

Gene name normalization based on extended semantic similarity

HU Yuncui, LIN Hongfei, YANG Zhihao

  1. School of Computer Science and Technology, Dalian University of Technology, Dalian, Liaoning 116024, China
  • Received:1900-01-01 Revised:1900-01-01 Online:2011-12-11 Published:2011-12-11

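Abstract: The descriptions of gene identifiers in biomedical databases are neither rich nor complete enough to distinguish the different senses of an ambiguous term. To address this problem, a gene name normalization method based on extended semantic similarity is proposed. The method uses MEDLINE abstracts and Gene Ontology descriptions to generate extended semantic information for the gene identifiers in the database. It then compares the context of an ambiguous gene name with each of its candidate semantic descriptions and assigns the name the unique gene identifier that expresses its actual meaning. On the corpus of the BioCreative II gene normalization task, the method achieves a precision of 80%, a recall of 82.4%, and an F-measure of 81.2%. The results indicate that the extended semantic similarity approach is suitable for named entity normalization in the biomedical domain.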
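As a consistency check on the reported figures, and assuming the F-measure is the balanced harmonic mean of precision $P$ and recall $R$ (the abstract does not state the weighting), the three numbers agree:

```latex
F = \frac{2PR}{P + R}
  = \frac{2 \times 0.80 \times 0.824}{0.80 + 0.824}
  = \frac{1.3184}{1.624}
  \approx 0.812
```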

Key words: gene, normalization, extended semantic similarity, disambiguation

Abstract: This paper presents a normalization method based on extended semantic similarity. It addresses the problem that the descriptions of gene symbols in biomedical databases are not rich or complete enough to choose among the candidate gene symbols for an ambiguous term. In this method, extended semantic information is extracted for each gene symbol from Gene Ontology and MEDLINE abstracts, and the unique identifier that expresses the actual meaning of a named entity is determined from the similarity between its context information and the extended semantic descriptions. An experiment on the BioCreative II gene normalization task achieves an F-measure of 81.2% (precision: 80%, recall: 82.4%). The experimental result shows that the method based on extended semantic similarity is applicable to the normalization of gene named entities.
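The disambiguation step described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the similarity measure or the text representation, so this example assumes a plain bag-of-words cosine similarity as a stand-in; the identifiers `GENE:A` and `GENE:B`, their descriptions, and the function names are all hypothetical and are not taken from the paper.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def normalize_mention(context, candidates):
    """Pick the candidate identifier whose description best matches the context.

    `candidates` maps an identifier to its extended description text.
    Returns the best identifier and the score of every candidate.
    """
    ctx = bag_of_words(context)
    scores = {gid: cosine(ctx, bag_of_words(desc))
              for gid, desc in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical identifiers and extended descriptions (assumed, not from the paper).
candidates = {
    "GENE:A": "tumor suppressor protein regulating cell cycle arrest and apoptosis",
    "GENE:B": "membrane transporter involved in glucose uptake in muscle tissue",
}
context = "The protein induces apoptosis and cell cycle arrest in tumor cells"

best, scores = normalize_mention(context, candidates)
print(best, scores)
```

In this toy setting the extended descriptions play the role of the text gathered from MEDLINE abstracts and Gene Ontology; richer descriptions give the context more terms to overlap with, which is the motivation the abstract gives for extending them.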

Key words: gene, normalization, extended semantic similarity, disambiguation