Computer Engineering and Applications ›› 2016, Vol. 52 ›› Issue (23): 19-24.

Automatic identification of address description in unstructured Chinese natural language

ZHAO Weifeng1,2, ZHANG Qin1   

  1. College of Geology Engineering and Geomatics, Chang’an University, Xi’an 710054, China
  2. State Key Laboratory of Geo-information Engineering, Xi’an 710054, China
  • Online:2016-12-01 Published:2016-12-20

非结构化中文自然语言地址描述的自动识别

赵卫锋1,2,张  勤1   

  1. 长安大学 地质工程与测绘学院,西安 710054
  2. 地理信息工程国家重点实验室,西安 710054

Abstract: Natural-language address descriptions are abundant and easy to obtain on the Internet, and they carry a wealth of spatial information. Because such text is unstructured, this paper proposes a two-step approach that automatically extracts word and syntactic information from a corpus of Chinese natural-language address descriptions, as a basis for further discovery of the associated spatial knowledge. In the first step, a gazetteer-independent Chinese word segmentation algorithm is designed from the statistical regularities with which character strings co-occur in the address corpus. A predefined list of common words that indicate or restrict other words in address statements can be introduced into this algorithm to improve segmentation and to facilitate part-of-speech tagging. In the second step, a finite state machine model is built to represent the common syntactic patterns of Chinese address descriptions, and is then applied to automatically match and recognize the syntactic structures of the segmented and tagged address statements. Segmentation and syntactic recognition experiments on a large address corpus collected from the Internet demonstrate the effectiveness and usability of the approach.

Key words: address description, natural language, Chinese word segmentation, syntactic recognition
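
The first step described in the abstract, dictionary-free segmentation driven by character co-occurrence statistics plus a predefined cue-word list, might be sketched roughly as follows. The abstract does not say which statistic the authors use, so pointwise mutual information (PMI) between adjacent characters is swapped in here as one common choice. The cue-word list, the threshold, and every function name are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter

# Hypothetical cue words (spatial-relation terms); the paper's actual list is not given.
CUE_WORDS = ("对面", "附近", "旁边")

def train(corpus):
    """Count single characters and adjacent character pairs in the corpus."""
    unigrams, bigrams = Counter(), Counter()
    for line in corpus:
        unigrams.update(line)
        bigrams.update(zip(line, line[1:]))
    return unigrams, bigrams

def pmi(a, b, unigrams, bigrams):
    """PMI of the adjacent pair (a, b); -inf if the pair never occurs."""
    pair = bigrams[(a, b)]
    if pair == 0:
        return float("-inf")
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    p_pair = pair / n_bi
    return math.log(p_pair / ((unigrams[a] / n_uni) * (unigrams[b] / n_uni)))

def segment(text, unigrams, bigrams, threshold=0.0, cue_words=CUE_WORDS):
    """Cut between weakly associated characters; keep cue words whole."""
    n = len(text)
    cut = [False] * (n + 1)     # cut[i]: a boundary falls before text[i]
    locked = [False] * (n + 1)  # positions whose decision is fixed by a cue word
    for word in cue_words:
        start = text.find(word)
        while start != -1:
            end = start + len(word)
            cut[start] = cut[end] = True      # boundaries around the cue word
            for i in range(start, end + 1):
                locked[i] = True              # no statistical cut inside it
            start = text.find(word, end)
    for i in range(1, n):
        if not locked[i] and pmi(text[i - 1], text[i], unigrams, bigrams) < threshold:
            cut[i] = True
    tokens, start = [], 0
    for i in range(1, n + 1):
        if cut[i] or i == n:
            tokens.append(text[start:i])
            start = i
    return tokens
```

In this sketch the cue words do double duty: they fix boundaries that the statistics alone might miss, and a token that matches a cue word can be tagged directly, which is one plausible reading of how the predefined list facilitates part-of-speech tagging. A suitable threshold would have to be tuned on a real address corpus.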

摘要: 互联网中存在海量易获取的自然语言形式地址描述文本,其中蕴含丰富的空间信息。针对其非结构化特点,提出了自动提取中文自然语言地址描述中词语和句法信息的方法,以便深度挖掘空间知识。首先,根据地址语料中字串共现的统计规律设计一种不依赖地名词典的中文分词算法,并利用在地址文本中起指示、限定作用的常见词语组成的预定义词表改善分词效果及辅助词性标注。分词完成后,定义能够表达中文地址描述常用句法的有限状态机模型,进而利用其自动匹配与识别地址文本的句法结构。最后,基于大规模真实语料的统计分词及句法识别实验表明了该方法的可用性及有效性。

关键词: 地址描述, 自然语言, 中文分词, 句法识别
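
The second step described in the abstract, matching segmented and tagged address statements against a finite state machine, can be illustrated with a minimal sketch. The tag set (`ADMIN`, `ROAD`, `NUM`, `POI`, `REL`), the states, and the transition table below are assumptions made up for this example; the abstract does not list the syntactic patterns the authors actually model.

```python
# Hypothetical transition table: state -> {part-of-speech tag -> next state}.
TRANSITIONS = {
    "START":    {"ADMIN": "ADMIN", "ROAD": "ROAD", "POI": "POI"},
    "ADMIN":    {"ADMIN": "ADMIN", "ROAD": "ROAD", "POI": "POI"},
    "ROAD":     {"NUM": "NUMBER", "POI": "POI", "REL": "RELATION"},
    "NUMBER":   {"POI": "POI", "REL": "RELATION"},
    "POI":      {"REL": "RELATION"},
    "RELATION": {"POI": "POI"},
}

# States in which an address statement may legitimately end (assumed).
ACCEPTING = {"ROAD", "NUMBER", "POI", "RELATION"}

def recognize(tags):
    """Return the list of states visited if the tag sequence is accepted, else None."""
    state = "START"
    path = [state]
    for tag in tags:
        state = TRANSITIONS.get(state, {}).get(tag)
        if state is None:
            return None  # no transition for this tag: pattern not recognized
        path.append(state)
    return path if state in ACCEPTING else None

# A made-up tagged statement: city, road, house number.
tagged = [("西安市", "ADMIN"), ("雁塔路", "ROAD"), ("126号", "NUM")]
structure = recognize([tag for _, tag in tagged])
```

Returning the visited states rather than a bare yes/no keeps the recognized structure available for later processing, for example to pick out which token plays the role of the road or the reference landmark.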