计算机科学 ›› 2021, Vol. 48 ›› Issue (7): 105-111.doi: 10.11896/jsjkx.200400140
所属专题: 大数据&数据科学 虚拟专题
水泽农, 张星宇, 沙朝锋
SHUI Ze-nong, ZHANG Xing-yu, SHA Chao-feng
摘要: 离群点或异常检测是数据挖掘和机器学习等领域的研究热点之一,研究人员已提出了多种离群点检测方法,并将其应用于入侵检测和异常交易检测等问题。但多数离群点检测方法主要针对表数据或时间序列数据等,无法直接应用于离群文档检测。现有基于相近性的离群文档检测方法一般用文档与整个文档集的距离来衡量离群性,无法发现基于局部考量的离群文档,而且采用欧几里德距离可能无法刻画出文档间的语义相近性。基于概率模型的离群文档检测方法过于复杂,并且同样只从全局来定义文档的离群值。针对这些问题,文中提出了一种新的基于相近性的离群文档检测方法。该方法引入最优输运距离,基于利用文档词嵌入向量的语义信息,在文档之间使用最优输运算法以度量距离,并利用LDA主题模型对文本进行层级抽象,通过最优输运算法算出主题之间的距离后,再计算文档距离,文中基于这两种最优运输距离计算文档与它的k近邻文档之间的距离来衡量该文档的离群程度。该方法从局部视角来定义文档的离群性,所采用的文档距离能体现文档之间的语义相近性。在两个开源数据集上进行了较细致的对比实验,实验结果显示,所提方法在多个指标上优于基准离群文档检测方法;还检验了基于k近邻离群文档定义的有效性以及k值的选取对结果的影响。
中图分类号:
[1]AGGARWALC C.Outlier Analysis[M].Springer Publishing Company,Incorporated,2015:237-263. [2]LI C J,ZHAO S N,CHI Y X.Outlier Detection Algorithm Based on Spectral Embedding and Local Density[J].Computer Science,2019,46(3):260-266. [3]GOU J,MA Z T,ZHANG Z C.PODKNN:A Parallel Outlier Detection Algorithm for Large Dataset[J].Computer Science,2016,43(7):251-274. [4]YIN N,ZHANG L.Research on Application of Outlier Mining Based on Hybrid Clustering Algorithm in Anomaly Detection[J].Computer Science,2017,44 (5):116-119. [5]ZHUANG H,WANG C,TAO F,et al.Identifying semanticallydeviating outlier documents[C]//Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing,2017:2748-2757. [6]LARSON S,MAHENDRAN A,LEE A,et al.Outlier Detection for Improved Data Quality and Diversity in Dialog Systems[C]//NAACL.2019:517-527. [7]BLEI D M,NG A Y,JORDAN M I.Latent dirichlet allocation [J].Journal of Machine Learning Research,2003,3(Jan.):993-1022. [8]MIKOLOV T,SUTSKEVER I,CHEN K,et al.Distributed representations of words and phrases and their compositionality[C]//Advances in Neural Information Processing Systems.2013:3111-3119. [9]ARORA S,LIANG Y,MA T.A simple but tough-to-beat baseline for sentence embeddings[C]//5th International Conference on Learning Representations,ICLR.2017. [10]RAMASWAMY S,RASTOGI R,SHIM K.Efficient algorithms for mining outliers from large data sets[C]//Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data.2000:427-438. [11]KUSNER M,SUN Y,KOLKIN N,et al.From word embeddings to document distances[C]//International Conference on Machine Learning.2015:957-966. [12]YUROCHKIN M,CLAICI S,CHIEN E,et al.Hierarchical optimal transport for document representation[C]//Advances in Neural Information Processing Systems,2019:1599-1609. [13]LE Q,MIKOLOV T.Distributed representations of sentences and documents[C]//International Conference on Machine Learning.2014:1188-1196. [14]ROUSSEEUW P J,DRIESSEN K V.A fast algorithm for the minimum covariance determinant estimator[J].Technometrics,1999,41(3):212-223. [15]PEYRÉ G,CUTURI M.Computational optimal transport[J].Foundations and Trends in Machine Learning,2019,11(5/6):355-607. [16]CUTURI M.Sinkhorn distances:Lightspeed computation of optimal transport[C]//Advances in Neural Information Proces-sing Systems.2013:2292-2300. [17]XU X X.Research on Text Similarity Algorithm Based onWMD Distance[D].Taiyuan:Taiyuan University of Techno-logy,2019. [18]JESUS A M D,CACHOPO C.Improving methods for single-label text categorization[D].Instituto Superior Técnico,Portugal,2007:1-141. [19]LAU J H,BALDWIN T.An Empirical Evaluation of doc2vecwith Practical Insights into Document Embedding Generation[C]//Proceedings of the 1st Workshop on Representation Learning for NLP.2016:78-86. [20]PENNINGTON J,SOCHER R,MANNING C D .Glove:Globalvectors for word representation[C]//Conference on Empirical Methods in Natural Language Processing.2014:1532-1543. [21]CAMPOS G O,ZIMEK A,SANDER J,et al.On the evaluation of unsupervised outlier detection:measures,datasets,and an empirical study[J].Data Mining and Knowledge Discovery,2016,30(4):891-927. [22]XU H,WANG W,LIU W,et al.Distilled wasserstein learning for word embedding and topic modeling[C]//Advances in Neural Information Processing Systems.2018:1716-1725. [23]DIENG A B,RUIZ F J R,BLEI D M.Topic modeling in embed-ding spaces[J].arXiv:1907.04907,2019. |
[1] | 徐涌鑫, 赵俊峰, 王亚沙, 谢冰, 杨恺. 时序知识图谱表示学习 Temporal Knowledge Graph Representation Learning 计算机科学, 2022, 49(9): 162-171. https://doi.org/10.11896/jsjkx.220500204 |
[2] | 王子凯, 朱健, 张伯钧, 胡凯. 区块链与智能合约并行方法研究与实现 Research and Implementation of Parallel Method in Blockchain and Smart Contract 计算机科学, 2022, 49(9): 312-317. https://doi.org/10.11896/jsjkx.210800102 |
[3] | 曾志贤, 曹建军, 翁年凤, 蒋国权, 徐滨. 基于注意力机制的细粒度语义关联视频-文本跨模态实体分辨 Fine-grained Semantic Association Video-Text Cross-modal Entity Resolution Based on Attention Mechanism 计算机科学, 2022, 49(7): 106-112. https://doi.org/10.11896/jsjkx.210500224 |
[4] | 熊罗庚, 郑尚, 邹海涛, 于化龙, 高尚. 融合双向门控循环单元和注意力机制的软件自承认技术债识别方法 Software Self-admitted Technical Debt Identification with Bidirectional Gate Recurrent Unit and Attention Mechanism 计算机科学, 2022, 49(7): 212-219. https://doi.org/10.11896/jsjkx.210500075 |
[5] | 潘志勇, 程宝雷, 樊建席, 卞庆荣. 数据中心网络BCDC上的顶点独立生成树构造算法 Algorithm to Construct Node-independent Spanning Trees in Data Center Network BCDC 计算机科学, 2022, 49(7): 287-296. https://doi.org/10.11896/jsjkx.210500170 |
[6] | 李瑭, 秦小麟, 迟贺宇, 费珂. 面向多无人系统的安全协同模型 Secure Coordination Model for Multiple Unmanned Systems 计算机科学, 2022, 49(7): 332-339. https://doi.org/10.11896/jsjkx.210600107 |
[7] | 黄觉, 周春来. 基于本地化差分隐私的频率特征提取 Frequency Feature Extraction Based on Localized Differential Privacy 计算机科学, 2022, 49(7): 350-356. https://doi.org/10.11896/jsjkx.210900229 |
[8] | 叶跃进, 李芳, 陈德训, 郭恒, 陈鑫. 基于国产众核架构的非结构网格分区块重构预处理算法研究 Study on Preprocessing Algorithm for Partition Reconnection of Unstructured-grid Based on Domestic Many-core Architecture 计算机科学, 2022, 49(6): 73-80. https://doi.org/10.11896/jsjkx.210900045 |
[9] | 赵静文, 付岩, 吴艳霞, 陈俊文, 冯云, 董继斌, 刘嘉琪. 多线程数据竞争检测技术研究综述 Survey on Multithreaded Data Race Detection Techniques 计算机科学, 2022, 49(6): 89-98. https://doi.org/10.11896/jsjkx.210700187 |
[10] | 陈鑫, 李芳, 丁海昕, 孙唯哲, 刘鑫, 陈德训, 叶跃进, 何香. 面向国产异构众核架构的CFD非结构网格计算并行优化方法 Parallel Optimization Method of Unstructured-grid Computing in CFD for DomesticHeterogeneous Many-core Architecture 计算机科学, 2022, 49(6): 99-107. https://doi.org/10.11896/jsjkx.210400157 |
[11] | 王毅, 李政浩, 陈星. 基于用户场景的Android 应用服务推荐方法 Recommendation of Android Application Services via User Scenarios 计算机科学, 2022, 49(6A): 267-271. https://doi.org/10.11896/jsjkx.210700123 |
[12] | 傅丽玉, 陆歌皓, 吴义明, 罗娅玲. 区块链技术的研究及其发展综述 Overview of Research and Development of Blockchain Technology 计算机科学, 2022, 49(6A): 447-461. https://doi.org/10.11896/jsjkx.210600214 |
[13] | 蒋成满, 华保健, 樊淇梁, 朱洪军, 徐波, 潘志中. Python虚拟机本地代码的安全性实证研究 Empirical Security Study of Native Code in Python Virtual Machines 计算机科学, 2022, 49(6A): 474-479. https://doi.org/10.11896/jsjkx.210600200 |
[14] | 袁昊男, 王瑞锦, 郑博文, 吴邦彦. 基于Fabric的电子病历跨链可信共享系统设计与实现 Design and Implementation of Cross-chain Trusted EMR Sharing System Based on Fabric 计算机科学, 2022, 49(6A): 490-495. https://doi.org/10.11896/jsjkx.210500063 |
[15] | 陈钧吾, 余华山. 面向无尺度图的Δ-stepping算法改进策略 Strategies for Improving Δ-stepping Algorithm on Scale-free Graphs 计算机科学, 2022, 49(6A): 594-600. https://doi.org/10.11896/jsjkx.210400062 |
|