Computer Engineering and Applications ›› 2014, Vol. 50 ›› Issue (5): 108-111.

Cleaning approach to large amounts of Chinese address based on SNM algorithm

GUO Wenlong   

  1. College of Electronics and Information Science, Fujian Jiangxia University, Fuzhou 350108, China
  • Online: 2014-03-01 Published: 2015-05-12

基于SNM算法的大数据量中文地址清洗方法

郭文龙   

  1. 福建江夏学院 电子信息科学学院,福州 350108

Abstract: A Chinese address consists of two parts: an administrative-division address and a detailed address. The administrative-division part can be cleaned by building an address dictionary, segmenting words, and supplementing characteristic words, and the techniques for doing so are fairly mature. Detailed addresses, by contrast, change continually as China urbanizes, and new addresses keep emerging, which makes cleaning and standardizing them extremely difficult. Based on a study of large volumes of Chinese addresses, a Chinese address cleaning model is proposed. With the administrative-division part cleaned and standardized first, the addresses are sorted, the SNM algorithm gathers the detailed addresses into a small window, and the addresses inside the window are matched and cleaned. Experimental results show that the method cleans the addresses well.
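
The sort-then-window step described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation. SNM is read here as the sorted-neighborhood method commonly used in record deduplication, and the sort key, the window size, the similarity threshold, the use of `difflib.SequenceMatcher` as the matcher, the function names, and the sample addresses are all assumptions made for illustration, since the abstract specifies none of them.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for the paper's matcher."""
    return SequenceMatcher(None, a, b).ratio()


def snm_clusters(records, window=3, threshold=0.8):
    """Group likely-duplicate addresses with a sliding window over sorted records.

    records: list of (admin_division, detailed_address) tuples; the
    administrative part is assumed to be cleaned and standardized already.
    Returns a list of clusters, each a list of indices into `records`.
    """
    # Sort by administrative division, then by detailed address, so that
    # similar detailed addresses tend to end up next to each other.
    order = sorted(range(len(records)), key=lambda i: records[i])

    # Union-find over record indices, so matched pairs merge transitively.
    parent = list(range(len(records)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for pos, i in enumerate(order):
        # Compare each record only with the next (window - 1) records.
        for j in order[pos + 1 : pos + window]:
            same_admin = records[i][0] == records[j][0]
            if same_admin and similarity(records[i][1], records[j][1]) >= threshold:
                parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


if __name__ == "__main__":
    # Made-up sample addresses, for illustration only.
    sample = [
        ("福建省福州市鼓楼区", "五四路100号"),
        ("福建省福州市鼓楼区", "五四路100号A座"),
        ("福建省福州市台江区", "工业路50号"),
    ]
    for cluster in snm_clusters(sample, window=2, threshold=0.8):
        print([sample[i] for i in cluster])
```

Because each record is compared only with its neighbors in sort order, the number of comparisons grows with the number of records times the window size instead of with the square of the number of records. The trade-off is that duplicates the sort key places far apart are never compared, so the choice of sort key and window size matters.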

Key words: Chinese address, administrative divisions, detailed address, characteristic word, cleaning

摘要: 中文地址由行政区划地址和详细地址两部分组成,行政区划地址的处理可通过构建地址词典、分词、补充特征字等方式清洗,目前技术较为成熟。详细地址则随我国城镇化的发展而不断变化,且新的地址层出不穷,导致其清洗和规范化工作极其困难。在研究大数据量中文地址的基础上,提出了中文地址清洗模型,在行政区划地址先清洗并规范的前提下,对地址进行排序,利用SNM算法将详细地址聚集在一个较小的窗口内,对窗口内的地址进行匹配和清洗,实验结果证明清洗效果良好。

关键词: 中文地址, 行政区划, 详细地址, 特征字, 清洗