计算机科学 ›› 2021, Vol. 48 ›› Issue (7): 105-111.doi: 10.11896/jsjkx.200400140

所属专题: 大数据&数据科学 虚拟专题

• 数据库&大数据&数据科学 • 上一篇    下一篇

基于最优输运和k-近邻的离群文档检测

水泽农, 张星宇, 沙朝锋   

  1. 复旦大学计算机科学技术学院 上海200433
  • 收稿日期:2020-04-29 修回日期:2020-09-05 出版日期:2021-07-15 发布日期:2021-07-02
  • 通讯作者: 沙朝锋 (cfsha@fudan.edu.cn)
  • 基金资助:
    国家重点研发计划(2018YFB0904503)

Outlier Document Detection via Optimal Transport and k-nearest Neighbor

SHUI Ze-nong, ZHANG Xing-yu, SHA Chao-feng   

  1. School of Computer Science,Fudan University,Shanghai 200433,China
  • Received:2020-04-29 Revised:2020-09-05 Online:2021-07-15 Published:2021-07-02
  • About author:SHUI Ze-nong,born in 1994,postgra-duate.His main research interests include natural language processing,data mining and software engineering.(shuizenong@fudan.edu.cn)
    SHA Chao-feng,born in 1976,Ph.D,associate professor.His main research interests include machine learning and data mining,natural language processing.
  • Supported by:
    National Key Research and Development Program of China(2018YFB0904503).

摘要: 离群点或异常检测是数据挖掘和机器学习等领域的研究热点之一,研究人员已提出了多种离群点检测方法,并将其应用于入侵检测和异常交易检测等问题。但多数离群点检测方法主要针对表数据或时间序列数据等,无法直接应用于离群文档检测。现有基于相近性的离群文档检测方法一般用文档与整个文档集的距离来衡量离群性,无法发现基于局部考量的离群文档,而且采用欧几里德距离可能无法刻画出文档间的语义相近性。基于概率模型的离群文档检测方法过于复杂,并且同样只从全局来定义文档的离群值。针对这些问题,文中提出了一种新的基于相近性的离群文档检测方法。该方法引入最优输运距离,基于利用文档词嵌入向量的语义信息,在文档之间使用最优输运算法以度量距离,并利用LDA主题模型对文本进行层级抽象,通过最优输运算法算出主题之间的距离后,再计算文档距离,文中基于这两种最优运输距离计算文档与它的k近邻文档之间的距离来衡量该文档的离群程度。该方法从局部视角来定义文档的离群性,所采用的文档距离能体现文档之间的语义相近性。在两个开源数据集上进行了较细致的对比实验,实验结果显示,所提方法在多个指标上优于基准离群文档检测方法;还检验了基于k近邻离群文档定义的有效性以及k值的选取对结果的影响。

关键词: 层次型最优主题输运, 词搬动距离, 离群文档检测, 最优输运

Abstract: Outlier or anomaly detection is one of the research hotspots in areas such as data mining and machine learning,and researchers have proposed a variety of outlier detection methods that can be applied to problems such as intrusion detection and anomalous transaction detection.However,most outlier detection methods mainly target tabular data or time series data,etc.and cannot be directly applied to outlier document detection.Existing outlier detection methods based on proximity generally measure proximity by the distance of a document to the entire document set,failing to find outliers based on local considerations,and may not be able to characterize semantic proximity between documents using Euclidean distance.Probabilistic model-based outlier do-cument detection methods are too complex and define document outliers only globally.In response to these questions,this paper proposes a new proximity-based outlier document detection method where we measure the outlier of a document by the distance between the document and its k-nearest neighbor document.We introduce the optimal transport algorithm to calculate the distance between documents,based on the semantic information of the document obtained from word embedding vector and the topic model.The method defines document outliers from a local perspective,using document distances that reflect the semantic proxi-mity between documents.This paper conducts extensive experiments on two open source document datasets,and the results show that the proposed methods outperform the benchmark outlier document detection methods in terms of four evaluation metrics.Experiments also demonstrate the effectiveness of proposal of k-nearest neighbor based outliers and the impact of value k.

Key words: Hierarchical optimal topic transport, Optimal transport, Outlier document detection, Word mover’s distance

中图分类号: 

  • TP311
[1]AGGARWALC C.Outlier Analysis[M].Springer Publishing Company,Incorporated,2015:237-263.
[2]LI C J,ZHAO S N,CHI Y X.Outlier Detection Algorithm Based on Spectral Embedding and Local Density[J].Computer Science,2019,46(3):260-266.
[3]GOU J,MA Z T,ZHANG Z C.PODKNN:A Parallel Outlier Detection Algorithm for Large Dataset[J].Computer Science,2016,43(7):251-274.
[4]YIN N,ZHANG L.Research on Application of Outlier Mining Based on Hybrid Clustering Algorithm in Anomaly Detection[J].Computer Science,2017,44 (5):116-119.
[5]ZHUANG H,WANG C,TAO F,et al.Identifying semanticallydeviating outlier documents[C]//Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing,2017:2748-2757.
[6]LARSON S,MAHENDRAN A,LEE A,et al.Outlier Detection for Improved Data Quality and Diversity in Dialog Systems[C]//NAACL.2019:517-527.
[7]BLEI D M,NG A Y,JORDAN M I.Latent dirichlet allocation [J].Journal of Machine Learning Research,2003,3(Jan.):993-1022.
[8]MIKOLOV T,SUTSKEVER I,CHEN K,et al.Distributed representations of words and phrases and their compositionality[C]//Advances in Neural Information Processing Systems.2013:3111-3119.
[9]ARORA S,LIANG Y,MA T.A simple but tough-to-beat baseline for sentence embeddings[C]//5th International Conference on Learning Representations,ICLR.2017.
[10]RAMASWAMY S,RASTOGI R,SHIM K.Efficient algorithms for mining outliers from large data sets[C]//Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data.2000:427-438.
[11]KUSNER M,SUN Y,KOLKIN N,et al.From word embeddings to document distances[C]//International Conference on Machine Learning.2015:957-966.
[12]YUROCHKIN M,CLAICI S,CHIEN E,et al.Hierarchical optimal transport for document representation[C]//Advances in Neural Information Processing Systems,2019:1599-1609.
[13]LE Q,MIKOLOV T.Distributed representations of sentences and documents[C]//International Conference on Machine Learning.2014:1188-1196.
[14]ROUSSEEUW P J,DRIESSEN K V.A fast algorithm for the minimum covariance determinant estimator[J].Technometrics,1999,41(3):212-223.
[15]PEYRÉ G,CUTURI M.Computational optimal transport[J].Foundations and Trends in Machine Learning,2019,11(5/6):355-607.
[16]CUTURI M.Sinkhorn distances:Lightspeed computation of optimal transport[C]//Advances in Neural Information Proces-sing Systems.2013:2292-2300.
[17]XU X X.Research on Text Similarity Algorithm Based onWMD Distance[D].Taiyuan:Taiyuan University of Techno-logy,2019.
[18]JESUS A M D,CACHOPO C.Improving methods for single-label text categorization[D].Instituto Superior Técnico,Portugal,2007:1-141.
[19]LAU J H,BALDWIN T.An Empirical Evaluation of doc2vecwith Practical Insights into Document Embedding Generation[C]//Proceedings of the 1st Workshop on Representation Learning for NLP.2016:78-86.
[20]PENNINGTON J,SOCHER R,MANNING C D .Glove:Globalvectors for word representation[C]//Conference on Empirical Methods in Natural Language Processing.2014:1532-1543.
[21]CAMPOS G O,ZIMEK A,SANDER J,et al.On the evaluation of unsupervised outlier detection:measures,datasets,and an empirical study[J].Data Mining and Knowledge Discovery,2016,30(4):891-927.
[22]XU H,WANG W,LIU W,et al.Distilled wasserstein learning for word embedding and topic modeling[C]//Advances in Neural Information Processing Systems.2018:1716-1725.
[23]DIENG A B,RUIZ F J R,BLEI D M.Topic modeling in embed-ding spaces[J].arXiv:1907.04907,2019.
[1] 徐涌鑫, 赵俊峰, 王亚沙, 谢冰, 杨恺.
时序知识图谱表示学习
Temporal Knowledge Graph Representation Learning
计算机科学, 2022, 49(9): 162-171. https://doi.org/10.11896/jsjkx.220500204
[2] 王子凯, 朱健, 张伯钧, 胡凯.
区块链与智能合约并行方法研究与实现
Research and Implementation of Parallel Method in Blockchain and Smart Contract
计算机科学, 2022, 49(9): 312-317. https://doi.org/10.11896/jsjkx.210800102
[3] 曾志贤, 曹建军, 翁年凤, 蒋国权, 徐滨.
基于注意力机制的细粒度语义关联视频-文本跨模态实体分辨
Fine-grained Semantic Association Video-Text Cross-modal Entity Resolution Based on Attention Mechanism
计算机科学, 2022, 49(7): 106-112. https://doi.org/10.11896/jsjkx.210500224
[4] 熊罗庚, 郑尚, 邹海涛, 于化龙, 高尚.
融合双向门控循环单元和注意力机制的软件自承认技术债识别方法
Software Self-admitted Technical Debt Identification with Bidirectional Gate Recurrent Unit and Attention Mechanism
计算机科学, 2022, 49(7): 212-219. https://doi.org/10.11896/jsjkx.210500075
[5] 潘志勇, 程宝雷, 樊建席, 卞庆荣.
数据中心网络BCDC上的顶点独立生成树构造算法
Algorithm to Construct Node-independent Spanning Trees in Data Center Network BCDC
计算机科学, 2022, 49(7): 287-296. https://doi.org/10.11896/jsjkx.210500170
[6] 李瑭, 秦小麟, 迟贺宇, 费珂.
面向多无人系统的安全协同模型
Secure Coordination Model for Multiple Unmanned Systems
计算机科学, 2022, 49(7): 332-339. https://doi.org/10.11896/jsjkx.210600107
[7] 黄觉, 周春来.
基于本地化差分隐私的频率特征提取
Frequency Feature Extraction Based on Localized Differential Privacy
计算机科学, 2022, 49(7): 350-356. https://doi.org/10.11896/jsjkx.210900229
[8] 叶跃进, 李芳, 陈德训, 郭恒, 陈鑫.
基于国产众核架构的非结构网格分区块重构预处理算法研究
Study on Preprocessing Algorithm for Partition Reconnection of Unstructured-grid Based on Domestic Many-core Architecture
计算机科学, 2022, 49(6): 73-80. https://doi.org/10.11896/jsjkx.210900045
[9] 赵静文, 付岩, 吴艳霞, 陈俊文, 冯云, 董继斌, 刘嘉琪.
多线程数据竞争检测技术研究综述
Survey on Multithreaded Data Race Detection Techniques
计算机科学, 2022, 49(6): 89-98. https://doi.org/10.11896/jsjkx.210700187
[10] 陈鑫, 李芳, 丁海昕, 孙唯哲, 刘鑫, 陈德训, 叶跃进, 何香.
面向国产异构众核架构的CFD非结构网格计算并行优化方法
Parallel Optimization Method of Unstructured-grid Computing in CFD for DomesticHeterogeneous Many-core Architecture
计算机科学, 2022, 49(6): 99-107. https://doi.org/10.11896/jsjkx.210400157
[11] 王毅, 李政浩, 陈星.
基于用户场景的Android 应用服务推荐方法
Recommendation of Android Application Services via User Scenarios
计算机科学, 2022, 49(6A): 267-271. https://doi.org/10.11896/jsjkx.210700123
[12] 傅丽玉, 陆歌皓, 吴义明, 罗娅玲.
区块链技术的研究及其发展综述
Overview of Research and Development of Blockchain Technology
计算机科学, 2022, 49(6A): 447-461. https://doi.org/10.11896/jsjkx.210600214
[13] 蒋成满, 华保健, 樊淇梁, 朱洪军, 徐波, 潘志中.
Python虚拟机本地代码的安全性实证研究
Empirical Security Study of Native Code in Python Virtual Machines
计算机科学, 2022, 49(6A): 474-479. https://doi.org/10.11896/jsjkx.210600200
[14] 袁昊男, 王瑞锦, 郑博文, 吴邦彦.
基于Fabric的电子病历跨链可信共享系统设计与实现
Design and Implementation of Cross-chain Trusted EMR Sharing System Based on Fabric
计算机科学, 2022, 49(6A): 490-495. https://doi.org/10.11896/jsjkx.210500063
[15] 陈钧吾, 余华山.
面向无尺度图的Δ-stepping算法改进策略
Strategies for Improving Δ-stepping Algorithm on Scale-free Graphs
计算机科学, 2022, 49(6A): 594-600. https://doi.org/10.11896/jsjkx.210400062
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!