Computer Science ›› 2025, Vol. 52 ›› Issue (11A): 241100171-10. doi: 10.11896/jsjkx.241100171

• Computer Graphics & Multimedia •

Research on Public Nuisance Website Identification Method Based on Multi-modal Data Fusion

ZHAO Chunlei1,2, YU Jie1,2, WANG Pengxiang3, YOU Wei1,2

    1 Key Laboratory of Computer Vision and System of Ministry of Education, Tianjin University of Technology, Tianjin 300384, China
    2 Tianjin Key Laboratory of Intelligent Computing and Novel Software Technology, Tianjin 300384, China
    3 Belarusian State University, 220030, The Republic of Belarus
  • Online: 2025-11-15 Published: 2025-11-10
  • Corresponding author: WANG Pengxiang (xpwang2000@126.com)
  • About the author: zcltjut@1263.com
  • Supported by:
    National Natural Science Foundation of China (61931019).

Abstract: Current methods for identifying public nuisance websites suffer from insufficient feature utilization and poor feature fusion. This paper therefore proposes RBI-RA, a multi-modal fusion model for public nuisance website identification that draws on three inputs: HTML text, the text in website screenshots, and the screenshots themselves. The model uses a ResNet50+Attention model to extract visual features from website screenshots, and uses OCR to extract the text in those screenshots, which is later used to enrich the website's text features. A RoBERTa+BiLSTM+interactive attention model extracts features from the HTML text and the screenshot text separately and fuses them through the interactive attention mechanism, enriching and expanding the website's text features. A self-attention mechanism then fuses the website's visual and text features to produce a multi-modal classifier in which features from different modalities complement each other. Finally, to verify the effectiveness of the proposed model, extensive experiments are conducted on a self-built dataset. The results show that the proposed multi-modal fusion model effectively improves public nuisance website identification and performs well on precision, recall, and F1.

Key words: Public nuisance website identification, Model fusion, Deep learning, Attention mechanism
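
The two fusion stages named in the abstract (interactive attention between the two text streams, then self-attention across the text and visual modalities) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes plain scaled dot-product attention with no learned projections, mean pooling to collapse sequences, and small hand-written vectors standing in for the RoBERTa+BiLSTM and ResNet50+Attention outputs. The function names `attend`, `mean_pool`, and `fuse` are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys, values):
    """Scaled dot-product attention over lists of equal-width vectors."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out


def mean_pool(rows):
    """Average a list of vectors into a single vector."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def fuse(html_feats, ocr_feats, visual_feat):
    # Stage 1 (assumed form of interactive attention): each text stream
    # queries the other, so HTML tokens are re-expressed in terms of
    # screenshot-text tokens and vice versa.
    html_ctx = attend(html_feats, ocr_feats, ocr_feats)
    ocr_ctx = attend(ocr_feats, html_feats, html_feats)
    text_feat = mean_pool(html_ctx + ocr_ctx)
    # Stage 2 (assumed form of self-attention fusion): the text and visual
    # vectors attend to each other, then are pooled into one feature that a
    # classifier head could consume.
    modalities = [text_feat, visual_feat]
    return mean_pool(attend(modalities, modalities, modalities))


# Toy inputs: 3 HTML token vectors, 2 OCR token vectors, 1 visual vector.
html = [[1.0, 0.0, 0.5, 0.2], [0.3, 0.8, 0.1, 0.0], [0.0, 0.2, 0.9, 0.4]]
ocr = [[0.9, 0.1, 0.4, 0.3], [0.1, 0.7, 0.2, 0.6]]
visual = [0.2, 0.5, 0.3, 0.8]
fused = fuse(html, ocr, visual)
print(len(fused))  # 4
```

Because no weights are learned here, every fused component is a convex combination of the input components in the same dimension. A trained model would add projection matrices before each attention step and a classification layer on top of the fused vector.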

CLC Number: TP393.08