Computer Science, 2021, Vol. 48, Issue (2): 93-99. doi: 10.11896/jsjkx.200900039

• Database & Big Data & Data Science •

Outlier Detection and Semantic Disambiguation of JSON Document for NoSQL Database

LIU Li-cheng, XU Yi-fan, XIE Gui-cai, DUAN Lei

  1. School of Computer Science, Sichuan University, Chengdu 610065, China
  • Received: 2020-08-04 Revised: 2020-09-25 Online: 2021-02-15 Published: 2021-02-04
  • Corresponding author: DUAN Lei (leiduan@scu.edu.cn)
  • Author contact: liuli_cheng@qq.com
  • About author: LIU Li-cheng, born in 1995, master candidate, is a student member of China Computer Federation. His main research interests include data mining and knowledge engineering.
    DUAN Lei, born in 1981, Ph.D., professor, Ph.D. supervisor, is a senior member of China Computer Federation. His main research interests include data mining, health informatics and evolutionary computation.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61972268).

Abstract: With the development of information technology, data in materials science and related fields have become multi-source, heterogeneous, highly extensible and explosively growing, so traditional relational databases cannot be used to store them. NoSQL databases, with their schemaless storage and high scalability, can be used to address this problem. As a common data storage format for NoSQL databases, JSON is popular for its simplicity and flexibility. However, NoSQL databases lack schema information, so JSON documents need to be validated and analyzed before being stored in the database. At present, most methods check the conformance of the JSON document format based on JSON schema, which cannot effectively solve the problems of outlier detection and semantic ambiguity in JSON documents. Therefore, a JSON document outlier detection and semantic disambiguation model for NoSQL databases, named doctorJSON, is proposed. Based on JSON schema, the model provides an outlier detection algorithm, deoutJSON, and a semantic disambiguation algorithm, disemaJSON, to detect outliers and ambiguities in the JSON documents to be stored. Experiments on real and synthetic datasets verify the effectiveness and execution efficiency of the proposed model.

Key words: JSON document, JSON schema, NoSQL database, Outlier detection, Semantic disambiguation
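
The abstract does not describe how deoutJSON works internally, so the following is only a minimal sketch of the general idea of checking a JSON document against schema-like expectations before it is stored. It is not the authors' algorithm. The function `find_structural_outliers`, the field specification format, and the sample document are all hypothetical and chosen purely for illustration; the sketch uses only the Python standard library rather than a full JSON schema validator.

```python
import json


def _matches(value, expected_type):
    # bool is a subclass of int in Python, so reject it unless bool is expected.
    if isinstance(value, bool) and expected_type is not bool:
        return False
    return isinstance(value, expected_type)


def find_structural_outliers(doc: dict, spec: dict) -> list:
    """Compare one parsed JSON object against a hypothetical field spec.

    spec maps a field name to (expected Python type or tuple of types, required flag).
    Returns a list of findings; an empty list means no deviation was found.
    """
    findings = []
    # First pass: every field named in the spec.
    for name, (expected_type, required) in spec.items():
        if name not in doc:
            if required:
                findings.append(f"missing required field: {name}")
        elif not _matches(doc[name], expected_type):
            findings.append(
                f"type mismatch: {name} (got {type(doc[name]).__name__})"
            )
    # Second pass: fields in the document that the spec does not know about.
    for name in doc:
        if name not in spec:
            findings.append(f"unexpected field: {name}")
    return findings


# Hypothetical spec and made-up sample record, for illustration only.
spec = {
    "material": (str, True),
    "density": ((int, float), True),
    "unit": (str, False),
}
doc = json.loads('{"material": "Ti-6Al-4V", "density": "4.43", "colour": "grey"}')
findings = find_structural_outliers(doc, spec)
for finding in findings:
    print(finding)
```

In this sketch a document is flagged when a required field is absent, when a value has an unexpected type, or when a field is not covered by the spec. Deciding whether an unexpected field is a genuine outlier or just a differently named variant of a known field is the kind of question the semantic disambiguation step would have to address, and the sketch makes no attempt at it.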

CLC Number:

  • TP311