Computer Engineering and Applications, 2015, Vol. 51, Issue (14): 120-126.

• Database, Data Mining, Machine Learning •

Organization and integration of Chinese encyclopedia knowledge based on the Semantic Web

FU Yuxin1, WANG Xin1,2, FENG Zhiyong1,2, LV Xuedong1

  1. Department of Computer Science and Technology, School of Computer Science and Technology, Tianjin University, Tianjin 300072, China
  2. Tianjin Key Laboratory of Cognitive Computing and Application, Tianjin 300072, China
  • Online: 2015-07-15 Published: 2015-08-03

Abstract: This work identifies important structural features in the large-scale knowledge data of the three largest Chinese encyclopedias (Baidu Encyclopedia, Hudong Encyclopedia, and Chinese Wikipedia), generates RDF triples from them, and integrates the resulting data into a distributed large-scale RDF data storage system, producing an RDF dataset for a Chinese encyclopedia knowledge base that meets the requirements of Linked Data. The main work includes: configuring a web crawler to crawl the pages of Baidu Encyclopedia and Hudong Encyclopedia, parsing their infoboxes and related content to generate RDF triples, and supporting dynamic insertion of those triples; downloading the required Chinese triple data from DBpedia, then integrating the triples and storing them in Jingwei, the research group's large-scale semantic data repository; and designing pages that display dynamic insertion and triple pattern queries. Experiments on a prototype system verify the effectiveness of the method.

Key words: semantic web, Resource Description Framework (RDF), Chinese encyclopedia, Linked Open Data, Nutch
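
The infobox-to-triple step and the triple pattern query mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. The `http://example.org/zhpedia/` namespace, the function names, and the sample infobox are all hypothetical. The sketch assumes an infobox has already been parsed into a dictionary of attribute names and values, and that each value is emitted as a Chinese-tagged literal in N-Triples syntax.

```python
from urllib.parse import quote

# Hypothetical namespace; the paper's actual IRI scheme is not given in the abstract.
BASE = "http://example.org/zhpedia/"


def iri(kind, name):
    """Build an N-Triples IRI, percent-encoding the (possibly Chinese) name."""
    return f"<{BASE}{kind}/{quote(name, safe='')}>"


def escape_literal(value):
    """Escape the characters that N-Triples string literals require escaping."""
    return (value.replace("\\", "\\\\")
                 .replace('"', '\\"')
                 .replace("\n", "\\n")
                 .replace("\r", "\\r"))


def infobox_to_triples(entry, infobox):
    """Turn one parsed infobox (attribute -> value) into (s, p, o) triples."""
    subject = iri("resource", entry)
    return [
        (subject, iri("property", key), f'"{escape_literal(val)}"@zh')
        for key, val in infobox.items()
    ]


def to_ntriples(triples):
    """Serialize triples as N-Triples, one statement per line."""
    return "\n".join(f"{s} {p} {o} ." for s, p, o in triples)


def match(triples, s=None, p=None, o=None):
    """Triple pattern query: None acts as a variable that matches anything."""
    return [
        t for t in triples
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    ]


if __name__ == "__main__":
    # Illustrative sample only; not taken from the paper's dataset.
    sample = infobox_to_triples("天津大学", {"所在地": "天津", "创办时间": "1895年"})
    print(to_ntriples(sample))
    print(match(sample, p=iri("property", "所在地")))
```

A real store such as the one described would index triples rather than scan a list. The linear scan in `match` is only meant to show what a triple pattern with unbound positions selects.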