Computer Science (计算机科学), 2022, Vol. 49, Issue (11A): 211000028-5. doi: 10.11896/jsjkx.211000028

• Information Security •

Application of Improved Feature Selection Algorithm in Spam Filtering

LI Yong-hong¹ (李永红), WANG Ying¹ (汪盈), LI La-quan¹ (李腊全), ZHAO Zhi-qiang² (赵志强)

  1 School of Science, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
  2 School of Software Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China
  • Online: 2022-11-10 Published: 2022-11-21
  • Corresponding author: LI Yong-hong (liyh@cqupt.edu.cn)
  • About author: LI Yong-hong, born in 1970, B.S., professor. His main research interests include combinatorial optimization, fuzzy matroid and data processing.
  • Supported by:
    National Natural Science Foundation of China (2020YFC2003502, 61876201, 61901074), Natural Science Foundation Project of Chongqing (cstc2020jcyj-msxmX0649) and Science and Technology Research Program of Chongqing Municipal Education Commission (KJQN201900636).

Abstract: Spam generally refers to e-mail containing promotional material, viruses or similar content that is forcibly sent to a user's mailbox without the user's request. It is typically sent in bulk and causes great harm on the Internet, so filtering it out for users is very important. Spam filtering is essentially a text classification problem with very high feature dimensionality. Not all features contribute to classification, however, so selecting a suitable feature subset that reflects the entire data set is the basis for constructing a good e-mail classifier. Existing feature selection methods have limitations, such as residual redundancy between features, unstable feature reduction results and high computational cost. After studying and analyzing the advantages and disadvantages of existing spam processing methods, this paper proposes a new integrated feature selection method that combines the information gain method with the granular ball neighborhood rough set method, named the IGGBNRS algorithm. Comparative experiments on different classification models show that the proposed algorithm simplifies the model and performs well.

Key words: Spam filtering, Feature selection, Attribute reduction, Text classification, IGGBNRS

CLC Number: TP391
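
The abstract names information gain as one of the two components of the proposed method. As a rough illustration of that component only, the sketch below ranks terms by the standard information gain of a binary "term present / term absent" split. It is not the authors' IGGBNRS implementation and does not include the granular ball neighborhood rough set step; the function names (`entropy`, `information_gain`, `rank_terms`) and the toy e-mails are assumptions made for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(docs, labels, term):
    """IG(term) = H(C) - [P(t) H(C | t) + P(not t) H(C | not t)].

    docs is a list of token sets, labels the matching class labels.
    """
    present = [y for d, y in zip(docs, labels) if term in d]
    absent = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)
    conditional = (len(present) / n) * entropy(present) \
        + (len(absent) / n) * entropy(absent)
    return entropy(labels) - conditional


def rank_terms(docs, labels, top_k):
    """Return the top_k terms by information gain (ties broken alphabetically)."""
    vocabulary = sorted(set().union(*docs))
    scored = [(information_gain(docs, labels, t), t) for t in vocabulary]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [t for _, t in scored[:top_k]]


# Hypothetical toy corpus: each e-mail is reduced to a set of tokens.
docs = [
    {"free", "prize", "click"},
    {"free", "offer", "click"},
    {"meeting", "agenda", "report"},
    {"report", "free", "lunch"},
]
labels = ["spam", "spam", "ham", "ham"]

selected = rank_terms(docs, labels, top_k=2)
```

In this toy corpus, "click" and "report" each separate the two classes perfectly, so they score highest, while "free" appears in both classes and scores lower. In a two-stage scheme such as the one the abstract describes, a ranking like this would presumably shrink the candidate feature set before a redundancy-reduction step is applied.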