Computer Science, 2021, Vol. 48, Issue (3): 174-179. doi: 10.11896/jsjkx.191200154

• Database & Big Data & Data Science •

Technology Data Analysis Algorithm Based on Relational Graph

ZHANG Han-shuo, YANG Dong-ju

  1. Beijing Key Laboratory on Integration and Analysis of Large-scale Stream Data, North China University of Technology, Beijing 100144, China
    Research Center for Cloud Computing, North China University of Technology, Beijing 100144, China
  • Received: 2019-12-25 Revised: 2020-05-28 Online: 2021-03-15 Published: 2021-03-05
  • Corresponding author: YANG Dong-ju (yangdongju@ncut.edu.cn)
  • About author: ZHANG Han-shuo (hanshuo_1994@foxmail.com), born in 1994, postgraduate. His main research interests include service computing, cloud computing and big data.
    YANG Dong-ju, born in 1975, Ph.D., associate professor, is a member of China Computer Federation. Her main research interests include service computing, data integration, cloud computing, cloud storage, and their applications in industry data centers.
  • Supported by:
    National Key Research and Development Project of China (2019YFB1405103).

Abstract: As the volume of scientific and technological data keeps growing, science and technology departments have accumulated large amounts of management data on science and technology projects. For this mass of structured data, the scattered records must be organized and analyzed so that data query and extraction services can be provided on demand. Because relational databases analyze association relationships poorly, a relational graph is introduced for data processing to improve analysis efficiency. Firstly, an entity search and localization algorithm based on word frequency is proposed to extract entities and relationships and to construct the relational graph. Secondly, the relational graph is analyzed, and a frequent item mining algorithm for graph data based on an improved FP-growth is proposed. Then, a data screening process based on graph data is designed to screen and analyze the data; a scoring matrix is defined to evaluate the data being screened, and an analysis opinion is finally given. The evaluation criteria for data screening can be customized. Finally, combined with the constructed relational graph, the algorithm is applied in practice and encapsulated as a service. Experimental results show that the proposed frequent item mining algorithm based on the improved FP-growth achieves a 10%~12% improvement in running time over the traditional FP-growth algorithm, and the accuracy of the data screening process reaches about 97%.

Key words: Construction of human relation graph, Data analysis, Data mining, Graph construction, Relational graph, Service application

CLC Number:

  • TP391
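
The abstract compares its improved algorithm with the traditional FP-growth algorithm but does not describe the improvement itself. As a point of reference only, the following is a minimal sketch of the classic FP-growth baseline, not the authors' improved variant. The names `Node`, `build_tree`, `fp_growth`, and the sample `projects` data (each list standing for the members attached to one project node in a graph) are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict


class Node:
    """One FP-tree node: an item, its count, and links to parent/children."""

    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_tree(transactions, min_support):
    """Build an FP-tree from (items, count) pairs; return header and supports."""
    support = defaultdict(int)
    for items, count in transactions:
        for item in set(items):
            support[item] += count
    frequent = {i: s for i, s in support.items() if s >= min_support}

    root = Node(None, None)
    header = defaultdict(list)  # item -> tree nodes that hold this item
    for items, count in transactions:
        # Keep frequent items only, ordered by descending support (ties by name).
        ordered = sorted(
            (i for i in set(items) if i in frequent),
            key=lambda i: (-frequent[i], i),
        )
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:
                child = Node(item, node)
                node.children[item] = child
                header[item].append(child)
            child.count += count
            node = child
    return header, frequent


def fp_growth(transactions, min_support, suffix=frozenset()):
    """Yield (itemset, support) for every frequent itemset."""
    header, frequent = build_tree(transactions, min_support)
    for item, nodes in header.items():
        itemset = suffix | {item}
        yield itemset, frequent[item]
        # Conditional pattern base: the prefix path above each node of `item`.
        conditional = []
        for node in nodes:
            path, parent = [], node.parent
            while parent is not None and parent.item is not None:
                path.append(parent.item)
                parent = parent.parent
            if path:
                conditional.append((path, node.count))
        yield from fp_growth(conditional, min_support, itemset)


# Hypothetical data: members A-D participating in four projects.
projects = [["B", "A", "C"], ["B", "A"], ["A", "C"], ["B", "A", "D"]]
result = dict(fp_growth([(p, 1) for p in projects], min_support=2))
for itemset, count in sorted(result.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(itemset), count)
```

With `min_support=2`, member `D` is pruned in the first pass, and pairs such as `{A, B}` are recovered from the conditional pattern base of the less frequent member. Any speed-up of the kind reported in the abstract would come from changes to this baseline that the abstract does not specify.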