Computer Science ›› 2021, Vol. 48 ›› Issue (7): 105-111.doi: 10.11896/jsjkx.200400140

Special Issue: Big Data & Data Scinece

• Database & Big Data & Data Science • Previous Articles     Next Articles

Outlier Document Detection via Optimal Transport and k-nearest Neighbor

SHUI Ze-nong, ZHANG Xing-yu, SHA Chao-feng   

  1. School of Computer Science,Fudan University,Shanghai 200433,China
  • Received:2020-04-29 Revised:2020-09-05 Online:2021-07-15 Published:2021-07-02
  • About author:SHUI Ze-nong,born in 1994,postgra-duate.His main research interests include natural language processing,data mining and software engineering.(shuizenong@fudan.edu.cn)
    SHA Chao-feng,born in 1976,Ph.D,associate professor.His main research interests include machine learning and data mining,natural language processing.
  • Supported by:
    National Key Research and Development Program of China(2018YFB0904503).

Abstract: Outlier or anomaly detection is one of the research hotspots in areas such as data mining and machine learning,and researchers have proposed a variety of outlier detection methods that can be applied to problems such as intrusion detection and anomalous transaction detection.However,most outlier detection methods mainly target tabular data or time series data,etc.and cannot be directly applied to outlier document detection.Existing outlier detection methods based on proximity generally measure proximity by the distance of a document to the entire document set,failing to find outliers based on local considerations,and may not be able to characterize semantic proximity between documents using Euclidean distance.Probabilistic model-based outlier do-cument detection methods are too complex and define document outliers only globally.In response to these questions,this paper proposes a new proximity-based outlier document detection method where we measure the outlier of a document by the distance between the document and its k-nearest neighbor document.We introduce the optimal transport algorithm to calculate the distance between documents,based on the semantic information of the document obtained from word embedding vector and the topic model.The method defines document outliers from a local perspective,using document distances that reflect the semantic proxi-mity between documents.This paper conducts extensive experiments on two open source document datasets,and the results show that the proposed methods outperform the benchmark outlier document detection methods in terms of four evaluation metrics.Experiments also demonstrate the effectiveness of proposal of k-nearest neighbor based outliers and the impact of value k.

Key words: Hierarchical optimal topic transport, Optimal transport, Outlier document detection, Word mover’s distance

CLC Number: 

  • TP311
[1]AGGARWALC C.Outlier Analysis[M].Springer Publishing Company,Incorporated,2015:237-263.
[2]LI C J,ZHAO S N,CHI Y X.Outlier Detection Algorithm Based on Spectral Embedding and Local Density[J].Computer Science,2019,46(3):260-266.
[3]GOU J,MA Z T,ZHANG Z C.PODKNN:A Parallel Outlier Detection Algorithm for Large Dataset[J].Computer Science,2016,43(7):251-274.
[4]YIN N,ZHANG L.Research on Application of Outlier Mining Based on Hybrid Clustering Algorithm in Anomaly Detection[J].Computer Science,2017,44 (5):116-119.
[5]ZHUANG H,WANG C,TAO F,et al.Identifying semanticallydeviating outlier documents[C]//Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing,2017:2748-2757.
[6]LARSON S,MAHENDRAN A,LEE A,et al.Outlier Detection for Improved Data Quality and Diversity in Dialog Systems[C]//NAACL.2019:517-527.
[7]BLEI D M,NG A Y,JORDAN M I.Latent dirichlet allocation [J].Journal of Machine Learning Research,2003,3(Jan.):993-1022.
[8]MIKOLOV T,SUTSKEVER I,CHEN K,et al.Distributed representations of words and phrases and their compositionality[C]//Advances in Neural Information Processing Systems.2013:3111-3119.
[9]ARORA S,LIANG Y,MA T.A simple but tough-to-beat baseline for sentence embeddings[C]//5th International Conference on Learning Representations,ICLR.2017.
[10]RAMASWAMY S,RASTOGI R,SHIM K.Efficient algorithms for mining outliers from large data sets[C]//Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data.2000:427-438.
[11]KUSNER M,SUN Y,KOLKIN N,et al.From word embeddings to document distances[C]//International Conference on Machine Learning.2015:957-966.
[12]YUROCHKIN M,CLAICI S,CHIEN E,et al.Hierarchical optimal transport for document representation[C]//Advances in Neural Information Processing Systems,2019:1599-1609.
[13]LE Q,MIKOLOV T.Distributed representations of sentences and documents[C]//International Conference on Machine Learning.2014:1188-1196.
[14]ROUSSEEUW P J,DRIESSEN K V.A fast algorithm for the minimum covariance determinant estimator[J].Technometrics,1999,41(3):212-223.
[15]PEYRÉ G,CUTURI M.Computational optimal transport[J].Foundations and Trends in Machine Learning,2019,11(5/6):355-607.
[16]CUTURI M.Sinkhorn distances:Lightspeed computation of optimal transport[C]//Advances in Neural Information Proces-sing Systems.2013:2292-2300.
[17]XU X X.Research on Text Similarity Algorithm Based onWMD Distance[D].Taiyuan:Taiyuan University of Techno-logy,2019.
[18]JESUS A M D,CACHOPO C.Improving methods for single-label text categorization[D].Instituto Superior Técnico,Portugal,2007:1-141.
[19]LAU J H,BALDWIN T.An Empirical Evaluation of doc2vecwith Practical Insights into Document Embedding Generation[C]//Proceedings of the 1st Workshop on Representation Learning for NLP.2016:78-86.
[20]PENNINGTON J,SOCHER R,MANNING C D .Glove:Globalvectors for word representation[C]//Conference on Empirical Methods in Natural Language Processing.2014:1532-1543.
[21]CAMPOS G O,ZIMEK A,SANDER J,et al.On the evaluation of unsupervised outlier detection:measures,datasets,and an empirical study[J].Data Mining and Knowledge Discovery,2016,30(4):891-927.
[22]XU H,WANG W,LIU W,et al.Distilled wasserstein learning for word embedding and topic modeling[C]//Advances in Neural Information Processing Systems.2018:1716-1725.
[23]DIENG A B,RUIZ F J R,BLEI D M.Topic modeling in embed-ding spaces[J].arXiv:1907.04907,2019.
[1] XU Yong-xin, ZHAO Jun-feng, WANG Ya-sha, XIE Bing, YANG Kai. Temporal Knowledge Graph Representation Learning [J]. Computer Science, 2022, 49(9): 162-171.
[2] WANG Zi-kai, ZHU Jian, ZHANG Bo-jun, HU Kai. Research and Implementation of Parallel Method in Blockchain and Smart Contract [J]. Computer Science, 2022, 49(9): 312-317.
[3] ZENG Zhi-xian, CAO Jian-jun, WENG Nian-feng, JIANG Guo-quan, XU Bin. Fine-grained Semantic Association Video-Text Cross-modal Entity Resolution Based on Attention Mechanism [J]. Computer Science, 2022, 49(7): 106-112.
[4] XIONG Luo-geng, ZHENG Shang, ZOU Hai-tao, YU Hua-long, GAO Shang. Software Self-admitted Technical Debt Identification with Bidirectional Gate Recurrent Unit and Attention Mechanism [J]. Computer Science, 2022, 49(7): 212-219.
[5] PAN Zhi-yong, CHENG Bao-lei, FAN Jian-xi, BIAN Qing-rong. Algorithm to Construct Node-independent Spanning Trees in Data Center Network BCDC [J]. Computer Science, 2022, 49(7): 287-296.
[6] LI Tang, QIN Xiao-lin, CHI He-yu, FEI Ke. Secure Coordination Model for Multiple Unmanned Systems [J]. Computer Science, 2022, 49(7): 332-339.
[7] HUANG Jue, ZHOU Chun-lai. Frequency Feature Extraction Based on Localized Differential Privacy [J]. Computer Science, 2022, 49(7): 350-356.
[8] YE Yue-jin, LI Fang, CHEN De-xun, GUO Heng, CHEN Xin. Study on Preprocessing Algorithm for Partition Reconnection of Unstructured-grid Based on Domestic Many-core Architecture [J]. Computer Science, 2022, 49(6): 73-80.
[9] ZHAO Jing-wen, FU Yan, WU Yan-xia, CHEN Jun-wen, FENG Yun, DONG Ji-bin, LIU Jia-qi. Survey on Multithreaded Data Race Detection Techniques [J]. Computer Science, 2022, 49(6): 89-98.
[10] CHEN Xin, LI Fang, DING Hai-xin, SUN Wei-ze, LIU Xin, CHEN De-xun, YE Yue-jin, HE Xiang. Parallel Optimization Method of Unstructured-grid Computing in CFD for DomesticHeterogeneous Many-core Architecture [J]. Computer Science, 2022, 49(6): 99-107.
[11] WANG Yi, LI Zheng-hao, CHEN Xing. Recommendation of Android Application Services via User Scenarios [J]. Computer Science, 2022, 49(6A): 267-271.
[12] FU Li-yu, LU Ge-hao, WU Yi-ming, LUO Ya-ling. Overview of Research and Development of Blockchain Technology [J]. Computer Science, 2022, 49(6A): 447-461.
[13] JIANG Cheng-man, HUA Bao-jian, FAN Qi-liang, ZHU Hong-jun, XU Bo, PAN Zhi-zhong. Empirical Security Study of Native Code in Python Virtual Machines [J]. Computer Science, 2022, 49(6A): 474-479.
[14] YUAN Hao-nan, WANG Rui-jin, ZHENG Bo-wen, WU Bang-yan. Design and Implementation of Cross-chain Trusted EMR Sharing System Based on Fabric [J]. Computer Science, 2022, 49(6A): 490-495.
[15] CHEN Jun-wu, YU Hua-shan. Strategies for Improving Δ-stepping Algorithm on Scale-free Graphs [J]. Computer Science, 2022, 49(6A): 594-600.
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!