Computer Science ›› 2022, Vol. 49 ›› Issue (11A): 211000028-5.doi: 10.11896/jsjkx.211000028

• Information Security •

Application of Improved Feature Selection Algorithm in Spam Filtering

LI Yong-hong1, WANG Ying1, LI La-quan1, ZHAO Zhi-qiang2   

  1 School of Science,Chongqing University of Posts and Telecommunications,Chongqing 400065,China
    2 School of Software Engineering,Chongqing University of Posts and Telecommunications,Chongqing 400065,China
  • Online:2022-11-10 Published:2022-11-21
  • About author:LI Yong-hong,born in 1970,B.S.,professor.His main research interests include combinatorial optimization,fuzzy matroids and data processing.
  • Supported by:
    National Natural Science Foundation of China(2020YFC2003502,61876201,61901074),Natural Science Foundation Project of Chongqing(cstc2020jcyj-msxmX0649) and Science and Technology Research Program of Chongqing Municipal Education Commission(KJQN201900636).

Abstract: Spam usually refers to e-mails containing promotional material,viruses or other unwanted content that are sent to a user's e-mail address without the user's request.Because spam is sent in bulk,it causes great harm on the Internet,so filtering it out is very important for users.Spam filtering is in essence a text classification problem,and such problems have a very high feature dimension.Not all features contribute to classification,so choosing a suitable feature subset that reflects the entire data set is the basis for constructing a good e-mail classifier.Existing feature selection methods have limitations such as redundancy among features,unstable feature reduction results and high computational cost.By studying and analyzing the advantages and disadvantages of existing spam processing methods,a new integrated feature selection method based on the information gain method and the granular ball neighborhood rough set method,named the IGGBNRS algorithm,is proposed.Experimental comparisons on different classification models show that the proposed algorithm simplifies the model and performs well.
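The first stage the abstract describes, ranking candidate word features by information gain before the rough-set reduction step, can be sketched in plain Python. The function names and the toy spam data below are illustrative assumptions for exposition, not the authors' implementation of IGGBNRS.

```python
import math
from collections import Counter

def entropy(labels):
    # Shannon entropy H(Y) of a label sequence, in bits
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(feature_col, labels):
    # Information gain IG(X) = H(Y) - H(Y|X) for one discrete feature column
    n = len(labels)
    groups = {}
    for x, y in zip(feature_col, labels):
        groups.setdefault(x, []).append(y)
    conditional = sum(len(ys) / n * entropy(ys) for ys in groups.values())
    return entropy(labels) - conditional

def select_top_k(X, y, k):
    # Rank features by information gain and keep the k best column indices
    gains = [(info_gain([row[j] for row in X], y), j) for j in range(len(X[0]))]
    gains.sort(key=lambda t: (-t[0], t[1]))  # highest gain first, ties by index
    return [j for _, j in gains[:k]]

# Toy data: each row is [has_"free", has_"meeting", has_"!"]; label 1 = spam
X = [[1, 0, 1], [1, 0, 0], [0, 1, 1], [0, 1, 0]]
y = [1, 1, 0, 0]
print(select_top_k(X, y, 2))  # keeps the two word features that separate the classes
```

In the paper's pipeline, such a ranked subset would then be passed to the granular ball neighborhood rough set stage to remove redundant attributes; that second stage is not sketched here.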

Key words: Spam filtering, Feature selection, Attribute reduction, Text classification, IGGBNRS

CLC Number: TP391