Computer Engineering and Applications ›› 2014, Vol. 50 ›› Issue (5): 108-111.
• Databases, Data Mining, Machine Learning •
郭文龙
GUO Wenlong
Abstract: A Chinese address consists of an administrative-division part and a detailed part. The administrative-division part can be cleaned by building an address dictionary, applying word segmentation, and supplementing missing characteristic words, and the techniques for doing so are fairly mature. Detailed addresses, by contrast, change continually as China urbanizes, and new addresses keep emerging, which makes their cleaning and standardization very difficult. Based on a study of large volumes of Chinese address data, this paper proposes a Chinese address cleaning model: once the administrative-division part has been cleaned and standardized, the addresses are sorted, the SNM algorithm gathers the detailed addresses into a small window, and the addresses within that window are matched and cleaned. Experimental results show that the approach cleans addresses effectively.
Key words: Chinese address, administrative division, detailed address, characteristic word, cleaning
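The sort-then-window step described in the abstract can be sketched roughly as follows. SNM is read here as the sorted-neighborhood method; the abstract does not give the paper's sort key, window size, or matching rule, so everything below is an assumption: the function names `similarity` and `snm_clusters` are hypothetical, `difflib.SequenceMatcher` stands in for whatever matcher the paper uses, the `window` and `threshold` values are arbitrary, and the sample addresses are invented for illustration.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for the paper's matcher."""
    return SequenceMatcher(None, a, b).ratio()


def snm_clusters(records, window=3, threshold=0.8):
    """Group likely-duplicate addresses with one sorted-neighborhood pass.

    records: list of (admin_address, detailed_address) tuples. The
    administrative part is assumed to be cleaned and standardized already.
    Returns a list of clusters, each a list of indices into records.
    """
    # Sort indices by (administrative part, detailed part) so that similar
    # detailed addresses in the same division end up close together.
    order = sorted(range(len(records)), key=lambda i: records[i])

    # Union-find keeps the transitive closure of pairwise matches.
    parent = list(range(len(records)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Slide the window: compare each record with the window - 1 records before it.
    for pos, i in enumerate(order):
        for j in order[max(0, pos - window + 1):pos]:
            same_admin = records[i][0] == records[j][0]
            if same_admin and similarity(records[i][1], records[j][1]) >= threshold:
                parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Invented sample data: records 0 and 2 describe the same place.
records = [
    ("福建省福州市鼓楼区", "五四路128号"),
    ("福建省厦门市思明区", "湖滨南路55号"),
    ("福建省福州市鼓楼区", "五四路128号恒力城"),
    ("福建省福州市鼓楼区", "华林路3号"),
]
print(snm_clusters(records, window=3, threshold=0.8))
```

Because each record is compared only with its few sorted neighbors rather than with every other record, the number of comparisons grows roughly linearly with the data size for a fixed window, which is what makes this kind of pass attractive for large address sets. The trade-off is that duplicates sorted far apart are missed, so the choice of sort key matters.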
GUO Wenlong. Cleaning approach to large amounts of Chinese address based on SNM algorithm[J]. Computer Engineering and Applications, 2014, 50(5): 108-111.
Link to this article: http://cea.ceaj.org/CN/
http://cea.ceaj.org/CN/Y2014/V50/I5/108