Computer Engineering and Applications ›› 2013, Vol. 49 ›› Issue (6): 143-147.

• Database, Data Mining, Machine Learning •

“N+V+N” and “V+N+N” structure phrase recognition in search engine query logs

ZHENG Li, LV Xueqiang

  1. Chinese Information Processing Research Center, Beijing Information Science & Technology University, Beijing 100101, China
  • Online: 2013-03-15 Published: 2013-03-14

Abstract: Phrase recognition is a preparatory step for phrase analysis. Based on the characteristics of “N+V+N” and “V+N+N” structure phrases in search engine query logs, this paper applies the maximum entropy method. Features are extracted from word information, part-of-speech information, syllable count, and the tags of preceding positions to build a training set, from which a machine learning model for phrase recognition is obtained. In open tests, the F-values for “N+V+N” and “V+N+N” phrases reach 85.78% and 76.47% respectively, a good level of automatic recognition. In semi-open tests, the recognition results are better still.

Key words: phrase recognition, search engine logs, “N+V+N”, “V+N+N”, maximum entropy method
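
The four feature families named in the abstract (word, part of speech, syllable count, preceding tag) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The B/I tag scheme, the one-token context window, the `<BOS>`/`<EOS>`/`<START>` placeholders, the function names, and the sample query are all assumptions made for illustration. Syllable count is approximated as the number of characters, which holds for words written entirely in Chinese characters.

```python
from typing import Dict, List, Tuple

Token = Tuple[str, str]  # (word, part-of-speech tag); the POS tag set is an assumption


def extract_features(tokens: List[Token], i: int, prev_tag: str) -> Dict[str, str]:
    """Build the feature dict for position i of a segmented, POS-tagged query."""
    word, pos = tokens[i]
    feats = {
        "word": word,
        "pos": pos,
        # Assumption: one syllable per Chinese character, so len(word) is the syllable count.
        "syllables": str(len(word)),
        "prev_tag": prev_tag,
    }
    # Context window of one token on each side (the window size is an assumption).
    if i > 0:
        feats["word-1"], feats["pos-1"] = tokens[i - 1]
    else:
        feats["word-1"], feats["pos-1"] = "<BOS>", "<BOS>"
    if i + 1 < len(tokens):
        feats["word+1"], feats["pos+1"] = tokens[i + 1]
    else:
        feats["word+1"], feats["pos+1"] = "<EOS>", "<EOS>"
    return feats


def build_training_set(tokens: List[Token], tags: List[str]) -> List[Tuple[Dict[str, str], str]]:
    """Pair each position's features with its gold tag; prev_tag is taken from the gold sequence."""
    samples = []
    prev = "<START>"
    for i, tag in enumerate(tags):
        samples.append((extract_features(tokens, i, prev), tag))
        prev = tag
    return samples


# Hypothetical query treated as a single N+V+N phrase with hypothetical B/I labels.
query = [("手机", "n"), ("充电", "v"), ("方法", "n")]
gold = ["B", "I", "I"]
training_set = build_training_set(query, gold)
```

Feature dicts of this shape could be passed to any maximum entropy trainer. A multinomial logistic regression is the same model family, so it would serve as a stand-in.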