基于最优输运和k-近邻的离群文档检测

doi:10.11896/jsjkx.200400140

Abstract

Abstract: Outlier or anomaly detection is one of the research hotspots in areas such as data mining and machine learning,and researchers have proposed a variety of outlier detection methods that can be applied to problems such as intrusion detection and anomalous transaction detection.However,most outlier detection methods mainly target tabular data or time series data,etc.and cannot be directly applied to outlier document detection.Existing outlier detection methods based on proximity generally measure proximity by the distance of a document to the entire document set,failing to find outliers based on local considerations,and may not be able to characterize semantic proximity between documents using Euclidean distance.Probabilistic model-based outlier do-cument detection methods are too complex and define document outliers only globally.In response to these questions,this paper proposes a new proximity-based outlier document detection method where we measure the outlier of a document by the distance between the document and its k-nearest neighbor document.We introduce the optimal transport algorithm to calculate the distance between documents,based on the semantic information of the document obtained from word embedding vector and the topic model.The method defines document outliers from a local perspective,using document distances that reflect the semantic proxi-mity between documents.This paper conducts extensive experiments on two open source document datasets,and the results show that the proposed methods outperform the benchmark outlier document detection methods in terms of four evaluation metrics.Experiments also demonstrate the effectiveness of proposal of k-nearest neighbor based outliers and the impact of value k.

Key words: Hierarchical optimal topic transport, Optimal transport, Outlier document detection, Word mover’s distance

CLC Number:

TP311

SHUI Ze-nong, ZHANG Xing-yu, SHA Chao-feng. Outlier Document Detection via Optimal Transport and k-nearest Neighbor[J].Computer Science, 2021, 48(7): 105-111.

References

[1]AGGARWALC C.Outlier Analysis[M].Springer Publishing Company,Incorporated,2015:237-263.
[2]LI C J,ZHAO S N,CHI Y X.Outlier Detection Algorithm Based on Spectral Embedding and Local Density[J].Computer Science,2019,46(3):260-266.
[3]GOU J,MA Z T,ZHANG Z C.PODKNN:A Parallel Outlier Detection Algorithm for Large Dataset[J].Computer Science,2016,43(7):251-274.
[4]YIN N,ZHANG L.Research on Application of Outlier Mining Based on Hybrid Clustering Algorithm in Anomaly Detection[J].Computer Science,2017,44 (5):116-119.
[5]ZHUANG H,WANG C,TAO F,et al.Identifying semanticallydeviating outlier documents[C]//Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing,2017:2748-2757.
[6]LARSON S,MAHENDRAN A,LEE A,et al.Outlier Detection for Improved Data Quality and Diversity in Dialog Systems[C]//NAACL.2019:517-527.
[7]BLEI D M,NG A Y,JORDAN M I.Latent dirichlet allocation [J].Journal of Machine Learning Research,2003,3(Jan.):993-1022.
[8]MIKOLOV T,SUTSKEVER I,CHEN K,et al.Distributed representations of words and phrases and their compositionality[C]//Advances in Neural Information Processing Systems.2013:3111-3119.
[9]ARORA S,LIANG Y,MA T.A simple but tough-to-beat baseline for sentence embeddings[C]//5th International Conference on Learning Representations,ICLR.2017.
[10]RAMASWAMY S,RASTOGI R,SHIM K.Efficient algorithms for mining outliers from large data sets[C]//Proceedings of the 2000 ACM SIGMOD International Conference on Management of Data.2000:427-438.
[11]KUSNER M,SUN Y,KOLKIN N,et al.From word embeddings to document distances[C]//International Conference on Machine Learning.2015:957-966.
[12]YUROCHKIN M,CLAICI S,CHIEN E,et al.Hierarchical optimal transport for document representation[C]//Advances in Neural Information Processing Systems,2019:1599-1609.
[13]LE Q,MIKOLOV T.Distributed representations of sentences and documents[C]//International Conference on Machine Learning.2014:1188-1196.
[14]ROUSSEEUW P J,DRIESSEN K V.A fast algorithm for the minimum covariance determinant estimator[J].Technometrics,1999,41(3):212-223.
[15]PEYRÉ G,CUTURI M.Computational optimal transport[J].Foundations and Trends in Machine Learning,2019,11(5/6):355-607.
[16]CUTURI M.Sinkhorn distances:Lightspeed computation of optimal transport[C]//Advances in Neural Information Proces-sing Systems.2013:2292-2300.
[17]XU X X.Research on Text Similarity Algorithm Based onWMD Distance[D].Taiyuan:Taiyuan University of Techno-logy,2019.
[18]JESUS A M D,CACHOPO C.Improving methods for single-label text categorization[D].Instituto Superior Técnico,Portugal,2007:1-141.
[19]LAU J H,BALDWIN T.An Empirical Evaluation of doc2vecwith Practical Insights into Document Embedding Generation[C]//Proceedings of the 1st Workshop on Representation Learning for NLP.2016:78-86.
[20]PENNINGTON J,SOCHER R,MANNING C D .Glove:Globalvectors for word representation[C]//Conference on Empirical Methods in Natural Language Processing.2014:1532-1543.
[21]CAMPOS G O,ZIMEK A,SANDER J,et al.On the evaluation of unsupervised outlier detection:measures,datasets,and an empirical study[J].Data Mining and Knowledge Discovery,2016,30(4):891-927.
[22]XU H,WANG W,LIU W,et al.Distilled wasserstein learning for word embedding and topic modeling[C]//Advances in Neural Information Processing Systems.2018:1716-1725.
[23]DIENG A B,RUIZ F J R,BLEI D M.Topic modeling in embed-ding spaces[J].arXiv:1907.04907,2019.

Related Articles 15

[1]	XU Yong-xin, ZHAO Jun-feng, WANG Ya-sha, XIE Bing, YANG Kai. Temporal Knowledge Graph Representation Learning [J]. Computer Science, 2022, 49(9): 162-171.
[2]	WANG Zi-kai, ZHU Jian, ZHANG Bo-jun, HU Kai. Research and Implementation of Parallel Method in Blockchain and Smart Contract [J]. Computer Science, 2022, 49(9): 312-317.
[3]	ZENG Zhi-xian, CAO Jian-jun, WENG Nian-feng, JIANG Guo-quan, XU Bin. Fine-grained Semantic Association Video-Text Cross-modal Entity Resolution Based on Attention Mechanism [J]. Computer Science, 2022, 49(7): 106-112.
[4]	XIONG Luo-geng, ZHENG Shang, ZOU Hai-tao, YU Hua-long, GAO Shang. Software Self-admitted Technical Debt Identification with Bidirectional Gate Recurrent Unit and Attention Mechanism [J]. Computer Science, 2022, 49(7): 212-219.
[5]	PAN Zhi-yong, CHENG Bao-lei, FAN Jian-xi, BIAN Qing-rong. Algorithm to Construct Node-independent Spanning Trees in Data Center Network BCDC [J]. Computer Science, 2022, 49(7): 287-296.
[6]	LI Tang, QIN Xiao-lin, CHI He-yu, FEI Ke. Secure Coordination Model for Multiple Unmanned Systems [J]. Computer Science, 2022, 49(7): 332-339.
[7]	HUANG Jue, ZHOU Chun-lai. Frequency Feature Extraction Based on Localized Differential Privacy [J]. Computer Science, 2022, 49(7): 350-356.
[8]	YE Yue-jin, LI Fang, CHEN De-xun, GUO Heng, CHEN Xin. Study on Preprocessing Algorithm for Partition Reconnection of Unstructured-grid Based on Domestic Many-core Architecture [J]. Computer Science, 2022, 49(6): 73-80.
[9]	ZHAO Jing-wen, FU Yan, WU Yan-xia, CHEN Jun-wen, FENG Yun, DONG Ji-bin, LIU Jia-qi. Survey on Multithreaded Data Race Detection Techniques [J]. Computer Science, 2022, 49(6): 89-98.
[10]	CHEN Xin, LI Fang, DING Hai-xin, SUN Wei-ze, LIU Xin, CHEN De-xun, YE Yue-jin, HE Xiang. Parallel Optimization Method of Unstructured-grid Computing in CFD for DomesticHeterogeneous Many-core Architecture [J]. Computer Science, 2022, 49(6): 99-107.
[11]	WANG Yi, LI Zheng-hao, CHEN Xing. Recommendation of Android Application Services via User Scenarios [J]. Computer Science, 2022, 49(6A): 267-271.
[12]	FU Li-yu, LU Ge-hao, WU Yi-ming, LUO Ya-ling. Overview of Research and Development of Blockchain Technology [J]. Computer Science, 2022, 49(6A): 447-461.
[13]	JIANG Cheng-man, HUA Bao-jian, FAN Qi-liang, ZHU Hong-jun, XU Bo, PAN Zhi-zhong. Empirical Security Study of Native Code in Python Virtual Machines [J]. Computer Science, 2022, 49(6A): 474-479.
[14]	YUAN Hao-nan, WANG Rui-jin, ZHENG Bo-wen, WU Bang-yan. Design and Implementation of Cross-chain Trusted EMR Sharing System Based on Fabric [J]. Computer Science, 2022, 49(6A): 490-495.
[15]	CHEN Jun-wu, YU Hua-shan. Strategies for Improving Δ-stepping Algorithm on Scale-free Graphs [J]. Computer Science, 2022, 49(6A): 594-600.

Metrics

Viewed

Full text

Abstract

Cited

Shared

Discussed

Comments

Recommended 0

No Suggested Reading articles found!

Outlier Document Detection via Optimal Transport and k-nearest Neighbor

PDF (PC)

Abstract

Cite this article

share this article

References

Related Articles 15

Metrics

Comments

Recommended 0