Computer Engineering and Applications ›› 2024, Vol. 60 ›› Issue (3): 177-186. DOI: 10.3778/j.issn.1002-8331.2208-0433

• Pattern Recognition and Artificial Intelligence •

Entity Extraction of Adverse Drug Reaction on Social Media Based on Tri-training

HE Zhongbo, YAN Xin, XU Guangyi, ZHANG Jinpeng, DENG Zhongying

  1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, China
  2. Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, China
  3. Kunming Nantian Computer System Co., Ltd., Yunnan Nantian Electronic Information Industry Co., Ltd., Kunming 650040, China
  4. School of Information Science and Engineering, Yunnan University, Kunming 650091, China
  5. School of Information, Yunnan University of Finance and Economics, Kunming 650221, China
  • Online: 2024-02-01 Published: 2024-02-01

Abstract: Social media data are produced in real time, so making full use of them can offset the lag of adverse drug reaction (ADR) entity extraction from traditional medical literature. However, social media text is costly to annotate and noisy, which makes it difficult for models to perform well. To address the high annotation cost of the large volume of unlabeled social media corpora, this paper adopts the semi-supervised Tri-training method for ADR entity extraction: three learners, Transformer+CRF, BiLSTM+CRF and IDCNN+CRF, annotate the unlabeled data, a consistency evaluation function is used to iteratively expand the training set, and the models' output labels are finally integrated through weighted voting. To address the informality of social media text (heavy colloquialism, typos, etc.), vectors at two granularities, character and word, are fused as the input of the model's embedding layer to extract richer semantic information. Experimental results show that the proposed model achieves good performance on a dataset collected from the “Good Doctor Online” (好大夫在线) website.

Key words: Chinese social media, adverse drug reaction, entity extraction, semi-supervised learning, Tri-training
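
The two ensemble steps named in the abstract, expanding the training set with pseudo-labels and combining the learners by weighted voting, can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the abstract does not define the consistency evaluation function, so exact agreement between two learners on every token stands in for it; the voting is done per token; and the tag names (`B-ADR`, `I-ADR`, `O`), learner keys and weights are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Tags = List[str]


def agreed_pseudo_labels(
    pred_a: Sequence[Tags], pred_b: Sequence[Tags]
) -> List[Tuple[int, Tags]]:
    """Return (sentence index, tags) for sentences where two learners agree.

    Assumption: "consistency" means identical tags on every token. In
    Tri-training, such sentences would be added to the third learner's
    training set for the next round.
    """
    return [
        (i, list(a))
        for i, (a, b) in enumerate(zip(pred_a, pred_b))
        if list(a) == list(b)
    ]


def weighted_vote(predictions: Dict[str, Tags], weights: Dict[str, float]) -> Tags:
    """For each token, pick the tag with the largest summed learner weight."""
    lengths = {len(tags) for tags in predictions.values()}
    if len(lengths) != 1:
        raise ValueError("all learners must tag the same number of tokens")
    voted: Tags = []
    for position in range(lengths.pop()):
        scores: Dict[str, float] = defaultdict(float)
        for learner, tags in predictions.items():
            scores[tags[position]] += weights[learner]
        voted.append(max(scores, key=lambda tag: scores[tag]))
    return voted


# Hypothetical predictions of the three learners for one 4-token sentence.
predictions = {
    "transformer_crf": ["O", "B-ADR", "I-ADR", "O"],
    "bilstm_crf": ["O", "B-ADR", "O", "O"],
    "idcnn_crf": ["O", "O", "O", "O"],
}
# Hypothetical weights, e.g. each learner's score on a held-out set.
weights = {"transformer_crf": 0.5, "bilstm_crf": 0.3, "idcnn_crf": 0.4}

print(weighted_vote(predictions, weights))  # ['O', 'B-ADR', 'O', 'O']
```

Per-token voting is the simplest choice but does not guarantee a valid BIO tag sequence; a sentence-level vote or a repair pass over the voted tags would be alternatives, and the abstract does not say which scheme the authors use.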