Entity Extraction of Adverse Drug Reaction on Social Media Based on Tri-training

doi:10.3778/j.issn.1002-8331.2208-0433

Abstract

Abstract: Because social media data are produced in real time, making full use of them can compensate for the lag in extracting adverse drug reaction entities from traditional medical literature. However, social media text is costly to label and noisy, which makes it difficult for models to perform well. To address the high labeling cost of the large unlabeled corpora found on social media, this paper uses the semi-supervised Tri-training method to extract adverse drug reaction entities. Three learners, Transformer+CRF, BiLSTM+CRF and IDCNN+CRF, annotate the unlabeled data, and a consistency evaluation function is then used to expand the training set iteratively. Finally, the output labels of the models are integrated through weighted voting. To address the informality of social media text (heavy colloquialism, typos, etc.), character-level and word-level vectors are fused as the input to the model's embedding layer to extract richer semantic information. Experimental results show that the proposed model performs well on a dataset collected from the “Good Doctor Online” website.

Key words: Chinese social media, adverse drug reaction, entity extraction, semi-supervised learning, Tri-training
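
The two Tri-training steps named in the abstract, expanding the training set when learners agree and integrating labels by weighted voting, can be illustrated with a minimal Python sketch. It is not the paper's implementation: the abstract does not define the consistency evaluation function or the voting weights, so the token-level `agreement` score, the `threshold` parameter, and the example weights below are hypothetical stand-ins, and the three label sequences are assumed to come from already trained taggers.

```python
def weighted_vote(predictions, weights):
    """Combine per-token label sequences from several taggers by weighted voting.

    predictions: list of label sequences, one per tagger, all the same length.
    weights: one non-negative weight per tagger (hypothetical, e.g. validation F1).
    """
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all label sequences must have the same length")
    voted = []
    for i in range(length):
        scores = {}
        for labels, weight in zip(predictions, weights):
            scores[labels[i]] = scores.get(labels[i], 0.0) + weight
        # On an exact tie, the label proposed by the earliest tagger wins.
        voted.append(max(scores, key=scores.get))
    return voted


def agreement(seq_a, seq_b):
    """Fraction of token positions on which two label sequences agree.

    Assumed stand-in for the paper's consistency evaluation function.
    """
    if not seq_a:
        return 1.0
    return sum(a == b for a, b in zip(seq_a, seq_b)) / len(seq_a)


def select_pseudo_labels(sentences, preds_j, preds_k, threshold=1.0):
    """Select unlabeled sentences to add to a third learner's training set.

    A sentence is kept when the other two learners' label sequences agree
    on at least `threshold` of its tokens; the first learner's labels are used.
    """
    selected = []
    for sentence, labels_j, labels_k in zip(sentences, preds_j, preds_k):
        if agreement(labels_j, labels_k) >= threshold:
            selected.append((sentence, labels_j))
    return selected


if __name__ == "__main__":
    preds = [
        ["B-ADR", "I-ADR", "O"],  # e.g. Transformer+CRF output
        ["B-ADR", "O", "O"],      # e.g. BiLSTM+CRF output
        ["O", "O", "O"],          # e.g. IDCNN+CRF output
    ]
    print(weighted_vote(preds, [0.4, 0.35, 0.25]))
```

In a full Tri-training loop, each learner would be retrained on the sentences selected by the other two, and the loop would repeat until no learner's training set changes; that outer loop is omitted here.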

摘要： 社交媒体因其数据的实时性，对其充分利用可以弥补传统医疗文献药物不良反应中实体抽取的迟滞性问题，但社交媒体文本面临标注数据成本高、数据噪声大等问题，使得模型难以发挥良好的效果。针对社交媒体大量未标注语料存在标注成本高的问题，采用Tri-training半监督的方法进行社交媒体药物不良反应实体抽取，通过三个学习器Transformer+CRF、BiLSTM+CRF和IDCNN+CRF对未标注数据进行标注，再利用一致性评价函数迭代地扩展训练集，最后通过加权投票整合模型输出标签。针对社交媒体的文本不正式性（口语化严重、错别字等）问题，通过融合字与词两个粒度的向量作为整个模型嵌入层的输入，来提取更丰富的语义信息。实验结果表明，提出的模型在“好大夫在线”网站获取的数据集上取得了良好表现。

关键词: 中文社交媒体, 药物不良反应, 实体抽取, 半监督学习, Tri-training

HE Zhongbo, YAN Xin, XU Guangyi, ZHANG Jinpeng, DENG Zhongying. Entity Extraction of Adverse Drug Reaction on Social Media Based on Tri-training[J]. Computer Engineering and Applications, 2024, 60(3): 177-186.

何忠玻, 严馨, 徐广义, 张金鹏, 邓忠莹. 基于Tri-training的社交媒体药物不良反应实体抽取[J]. 计算机工程与应用, 2024, 60(3): 177-186.
