Automatic identification of address description in unstructured Chinese natural language

Abstract

Address descriptions written in natural language are abundant and readily available on the Internet, and they carry a wealth of spatial information. Because such text is unstructured, this paper proposes a two-step approach that automatically extracts word and syntax information from a corpus of Chinese natural-language address descriptions, as a basis for further discovery of the associated spatial knowledge. In the first step, a gazetteer-independent Chinese word segmentation algorithm is designed from the statistical regularities with which character strings co-occur in the address corpus. A predefined list of common words that indicate or restrict other words in address statements can be introduced into this algorithm to improve segmentation and to facilitate part-of-speech tagging. In the second step, a finite state machine model is built to represent the common syntaxes of Chinese address descriptions and is then applied to automatically match and recognize the syntactic structures of the segmented and tagged address statements. Experiments on statistical segmentation and syntactic recognition, run on an abundant address corpus collected from the Internet, demonstrate the effectiveness and usability of the approach.

Key words: address description, natural language, Chinese word segmentation, syntactic recognition
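
The two steps summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm. As a simplifying assumption it cuts a string wherever two adjacent characters co-occur fewer than `min_count` times in the corpus. It uses the cue-word list only for tagging, not to guide segmentation. It checks the tag sequence against a small hand-written finite state machine. The toy corpus, the `CUE_TAGS` list, the `TRANSITIONS` table, the tag names and all function names are hypothetical.

```python
from collections import Counter

# Hypothetical cue-word list (not from the paper): the final character of an
# address element decides its tag.
CUE_TAGS = {"市": "CITY", "区": "DISTRICT", "路": "ROAD", "号": "NUMBER"}

# Hypothetical syntax: [CITY] DISTRICT [ROAD [NUMBER]], written as FSM transitions.
TRANSITIONS = {
    ("START", "CITY"): "CITY",
    ("START", "DISTRICT"): "DISTRICT",
    ("CITY", "DISTRICT"): "DISTRICT",
    ("DISTRICT", "ROAD"): "ROAD",
    ("ROAD", "NUMBER"): "NUMBER",
}
ACCEPTING = {"DISTRICT", "ROAD", "NUMBER"}


def bigram_counts(corpus):
    """Count how often each pair of adjacent characters occurs in the corpus."""
    counts = Counter()
    for line in corpus:
        counts.update(line[i:i + 2] for i in range(len(line) - 1))
    return counts


def segment(text, counts, min_count=2):
    """Cut between two characters whose pair is rarer than min_count."""
    if not text:
        return []
    words, start = [], 0
    for i in range(1, len(text)):
        if counts[text[i - 1:i + 1]] < min_count:
            words.append(text[start:i])
            start = i
    words.append(text[start:])
    return words


def tag(words):
    """Tag each word by its last character; unknown endings become OTHER."""
    return [(w, CUE_TAGS.get(w[-1], "OTHER")) for w in words]


def recognize(tagged):
    """Return True if the tag sequence is accepted by the state machine."""
    state = "START"
    for _, t in tagged:
        state = TRANSITIONS.get((state, t))
        if state is None:
            return False
    return state in ACCEPTING


# Toy corpus of address fragments, invented for illustration.
corpus = ["北京市海淀区", "北京市朝阳区", "海淀区学院路", "朝阳区建国路"]
counts = bigram_counts(corpus)
words = segment("北京市海淀区", counts)
print(words, recognize(tag(words)))
```

In this toy corpus the pair 市海 occurs only once, while the pairs inside 北京市 and 海淀区 each occur twice, so the cut falls between the city and the district without any gazetteer lookup. A realistic corpus would need a more robust cohesion measure than a raw count threshold.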

ZHAO Weifeng, ZHANG Qin. Automatic identification of address description in unstructured Chinese natural language[J]. Computer Engineering and Applications, 2016, 52(23): 19-24.

赵卫锋, 张勤. 非结构化中文自然语言地址描述的自动识别[J]. 计算机工程与应用, 2016, 52(23): 19-24.

