Computer Engineering and Applications ›› 2012, Vol. 48 ›› Issue (27): 136-141.

• Database, Signal and Information Processing •

Matching Chinese entity names with multiple features

GONG Jun

1.Mobile Post-doctoral Station, School of Electronics Engineering and Computer Science, Peking University, Beijing 100871, China
2.Postdoctoral Centre, Shenhua Group, Beijing 100011, China

Online:2012-09-21 Published:2012-09-24

Abstract: Accurate entity name matching is widely used in information system integration. In Chinese text, however, name variations and clerical errors make exact string matching unreliable, so matching methods that tolerate such variations and errors are needed. The similarity between two Chinese entity names is computed separately at the character, word, and semantic levels, and the three similarities are combined linearly to draw on the strengths of each. To exploit more matching features, two machine learning methods are introduced: the first learns an optimized ranking and a best cut point from training data; the second uses a Support Vector Machine to decide whether two names refer to the same entity. Experiments on a real dataset of Chinese entity names show that these methods and features effectively improve matching performance.

Key words: string similarity, name disambiguation, name-matching, machine learning
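
The linear combination and the cut point described in the abstract can be illustrated with a minimal Python sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the character-level score is a normalized Levenshtein similarity, the word-level score is a Jaccard overlap over words that have already been segmented, the semantic-level score is simply passed in as a number (this page does not say how it is computed), the weights are arbitrary, and the cut point is chosen by maximizing F1 on labeled pairs. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def char_similarity(a: str, b: str) -> float:
    """Character-level score (assumed): 1 - normalized Levenshtein distance."""
    if not a and not b:
        return 1.0
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            # deletion, insertion, substitution
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return 1.0 - prev[-1] / max(len(a), len(b))


def word_similarity(a_words: Sequence[str], b_words: Sequence[str]) -> float:
    """Word-level score (assumed): Jaccard overlap of pre-segmented word sets."""
    sa, sb = set(a_words), set(b_words)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def combined_similarity(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Linear combination of per-level scores; the weights are illustrative."""
    return sum(w * s for w, s in zip(weights, scores))


def best_cut_point(scored: List[Tuple[float, bool]]) -> float:
    """Return the threshold that maximizes F1 on (score, is_match) pairs.

    Pairs scoring at or above the threshold are predicted as matches.
    F1 is an assumed criterion, not one stated in the abstract.
    """
    best_t, best_f1 = 0.0, -1.0
    for t in sorted({s for s, _ in scored}):
        tp = sum(1 for s, y in scored if s >= t and y)
        fp = sum(1 for s, y in scored if s >= t and not y)
        fn = sum(1 for s, y in scored if s < t and y)
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_t, best_f1 = t, f1
    return best_t
```

In use, each candidate name pair would get one combined score, for example `combined_similarity([char_similarity(a, b), word_similarity(a_words, b_words), semantic_score], weights)`, and `best_cut_point` would be run once on labeled training pairs to fix the threshold applied to new pairs. The second method in the abstract would instead feed the per-level scores to a Support Vector Machine as a feature vector; that step is not sketched here.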

GONG Jun. Matching Chinese entity names with multiple features[J]. Computer Engineering and Applications, 2012, 48(27): 136-141.
