Computer Science, 2022, Vol. 49, Issue (11A): 211000028-5. doi: 10.11896/jsjkx.211000028
LI Yong-hong1, WANG Ying1, LI La-quan1, ZHAO Zhi-qiang2
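Abstract: Spam generally refers to unsolicited e-mail that is pushed into users' mailboxes and contains promotional material, viruses, or similar content. It is typically sent in bulk and causes great harm on the Internet, so filtering it out for users is very important. Spam filtering is essentially a text classification problem with very high feature dimensionality. However, not all features contribute to classification, so selecting a suitable feature subset that reflects the whole dataset is the basis for building a good e-mail classifier. Existing feature selection methods have limitations, such as residual redundancy among the selected features, unstable reduction results, and high computational cost. After studying and analyzing the strengths and weaknesses of existing spam processing methods, this paper combines existing methods and proposes a new ensemble feature selection method based on information gain and granular-ball neighborhood rough sets, namely the IGGBNRS algorithm. Comparative experiments on different classification models show that the algorithm simplifies the model and achieves good performance.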
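The abstract names information gain as one of the two components of IGGBNRS but gives no implementation details. The following is a minimal sketch of a generic information-gain filter only, assuming discrete (for example, word-presence) features; it does not reproduce the granular-ball neighborhood rough set stage or the authors' actual procedure. The function names `entropy`, `information_gain`, and `top_k_by_information_gain`, as well as the toy data, are hypothetical and chosen for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """IG(Y; X) = H(Y) - sum_v P(X=v) * H(Y | X=v) for one discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def top_k_by_information_gain(rows, labels, k):
    """Rank features by information gain and return the top-k column indices."""
    n_features = len(rows[0])
    scores = [
        information_gain([row[j] for row in rows], labels)
        for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k], scores


# Toy data (an assumption for illustration): each row is one e-mail,
# each column marks whether a particular word is present (1) or absent (0).
rows = [
    [1, 0, 1],
    [1, 1, 0],
    [0, 0, 1],
    [0, 1, 0],
]
labels = ["spam", "spam", "ham", "ham"]

selected, scores = top_k_by_information_gain(rows, labels, k=1)
print(selected, scores)
```

In this toy example only the first column separates the two classes, so it is the one retained. A filter of this kind scores each feature independently and therefore cannot remove redundancy between features, which is the limitation the abstract cites as motivation for combining it with a rough-set reduction step.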