Computer Engineering and Applications ›› 2021, Vol. 57 ›› Issue (7): 115-120. DOI: 10.3778/j.issn.1002-8331.1912-0489

• Pattern Recognition and Artificial Intelligence •

Research on Extraction of Biomedical Entity Relation Based on Literature Mining

CHEN Wei (陈伟), XU Yun (徐云)

  1. School of Computer Science and Technology, University of Science and Technology of China, Hefei 230026, China
  2. Key Laboratory of High Performance Computing of Anhui Province, Hefei 230026, China
  • Online: 2021-04-01 Published: 2021-04-02

Abstract:

Biomedical researchers often search large volumes of literature for interactions between biological entities, such as drug-drug and chemical-protein interactions. With the rapid growth of biomedical literature and the development of deep learning, automatically extracting these interactions from the literature has shown great potential. Previous deep learning methods have achieved some success but have three problems: they use static word vectors, which cannot distinguish the different senses of a polysemous word; they do not weight individual words, so extraction from long sentences is poor; and they address sample imbalance by ensembling multiple models, which makes the overall model complex. To address these problems, this paper proposes a deep Multi-Channel CNN (MCCNN) model based on a residual structure. The model uses BERT (Bidirectional Encoder Representations from Transformers) to generate dynamic word vectors that represent word semantics more accurately, uses multi-head attention to capture dependencies in long sentences, and replaces model ensembles with a purpose-designed Ranking loss function to reduce the impact of sample imbalance. Experiments on several datasets show that the proposed method performs well.

Key words: biomedical literature, relation extraction, attention mechanism, multi-channel
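
The abstract names a Ranking loss as the replacement for model ensembles but does not give its form. As a rough illustration only, the sketch below implements a pairwise ranking loss of the kind commonly used in relation classification. It is an assumption, not the loss defined in the paper. The function name `ranking_loss`, the margins `m_pos` and `m_neg`, the scale `gamma`, and the convention that `gold=None` marks a "no interaction" sentence are all hypothetical choices made for this example.

```python
import math
from typing import Optional, Sequence


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def ranking_loss(
    scores: Sequence[float],
    gold: Optional[int],
    gamma: float = 2.0,   # hypothetical scaling factor
    m_pos: float = 2.5,   # hypothetical margin for the correct class
    m_neg: float = 0.5,   # hypothetical margin for the best wrong class
) -> float:
    """Pairwise ranking loss for one sentence (illustrative assumption).

    scores: one score per interaction class (no score for "no interaction").
    gold:   index of the correct class, or None for a "no interaction" sample.
    """
    loss = 0.0
    if gold is not None:
        # Push the score of the correct class above the positive margin.
        loss += softplus(gamma * (m_pos - scores[gold]))
    # Push the highest-scoring competing class below the negative margin.
    rivals = [s for i, s in enumerate(scores) if i != gold]
    if rivals:
        loss += softplus(gamma * (m_neg + max(rivals)))
    return loss


if __name__ == "__main__":
    # A confident correct prediction should cost less than a wrong one.
    print(ranking_loss([3.0, -1.0, 0.0], gold=0))
    print(ranking_loss([-1.0, 3.0, 0.0], gold=0))
    # A "no interaction" sample only penalizes high scores on real classes.
    print(ranking_loss([3.0, -1.0, 0.0], gold=None))
```

In this formulation, sentences from the majority "no interaction" class contribute only the second term, so the model never learns a dedicated score for that class. This is one way a single model can limit the effect of class imbalance without an ensemble.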