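Study on Click Fraud Detection in Online Advertising with Imbalanced Data Processing Methods

Computer Science (计算机科学), 2018, Vol. 45, Issue 6A: 371-374.

LI Xin, GUO Han, ZHANG Xin, HU Fang-qiang, SHUAI Ren-jun

College of Computer Science and Technology, Nanjing Tech University, Nanjing 211816, China

Online: 2018-06-20 Published: 2018-08-03

About the authors: LI Xin, lecturer, main research interests: machine learning and data mining, E-mail: lixin@njtech.edu.cn; GUO Han, master's student, main research interests: machine learning and intelligent medical information processing; ZHANG Xin, master's student, main research interests: machine learning and computational advertising; HU Fang-qiang, lecturer, main research interests: artificial intelligence and embedded systems; SHUAI Ren-jun, associate professor, main research interests: intelligent buildings and intelligent medical image processing.

Supported by: the National Natural Science Foundation of China (61672279) and the Key Research and Development Program of Jiangsu Province (BE2015697).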

Abstract

Abstract: Detecting click fraud, in which clicks are generated to collect advertising fees illegitimately, is an important application of machine learning in online advertising. The support vector machine (SVM) is a strong supervised learning algorithm for binary classification and regression, and it performs well on datasets with roughly balanced class distributions. When applied to click fraud detection, however, its performance is severely limited by extreme class imbalance, as in the real-world dataset of the FDMA2012 competition on fraudulent publisher detection. This paper investigates and compares three imbalanced-data preprocessing methods in detail: random under-sampling (RUS), the synthetic minority over-sampling technique (SMOTE), and SMOTE combined with edited nearest neighbor (SMOTE+ENN). The best of the three, the mixed-sampling method SMOTE+ENN, is applied to the raw data before an SVM classifier is trained. Experimental results show that combining SMOTE+ENN with SVM effectively identifies publishers that commit click fraud, with an accuracy of about 95% on minority-class samples, which basically meets the requirements of a click fraud detection system for online advertising.

Key words: Click fraud, Imbalanced data, Mixed sampling, Support vector machine (SVM)
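The mixed-sampling step named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`nearest`, `smote`, `enn`, `smote_enn`), the choice of k = 3, and the toy data are all assumptions made for illustration, and the SVM training step is left out. SMOTE creates synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours; ENN then removes any sample whose label disagrees with the majority of its nearest neighbours.

```python
import math
import random
from collections import Counter


def nearest(idx, points, k):
    """Indices of the k points closest (Euclidean) to points[idx], excluding idx."""
    others = [j for j in range(len(points)) if j != idx]
    others.sort(key=lambda j: math.dist(points[idx], points[j]))
    return others[:k]


def smote(minority, n_new, k=3, rng=None):
    """Create n_new synthetic points, each interpolated between a minority
    sample and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        j = rng.choice(nearest(i, minority, k))
        gap = rng.random()  # position along the segment between the two points
        synthetic.append(tuple(a + gap * (b - a)
                               for a, b in zip(minority[i], minority[j])))
    return synthetic


def enn(X, y, k=3):
    """Edited nearest neighbours: drop every sample whose label disagrees
    with the majority label of its k nearest neighbours."""
    keep = []
    for i in range(len(X)):
        votes = Counter(y[j] for j in nearest(i, X, k))
        if votes.most_common(1)[0][0] == y[i]:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


def smote_enn(X, y, minority_label=1, k=3, seed=0):
    """Oversample the minority class to parity with SMOTE, then clean with ENN."""
    minority = [x for x, label in zip(X, y) if label == minority_label]
    n_new = (len(X) - len(minority)) - len(minority)
    new_points = smote(minority, max(n_new, 0), k, random.Random(seed))
    X_bal = list(X) + new_points
    y_bal = list(y) + [minority_label] * len(new_points)
    return enn(X_bal, y_bal, k)


# Toy data (made up): 39 "honest" publishers on a grid, 5 "fraudulent" ones in a
# separate cluster, plus one honest-labelled point sitting inside the fraud cluster.
X = [(i * 0.5, j * 0.5) for i in range(13) for j in range(3)]
X += [(10.5, 10.4)]
y = [0] * 40
X += [(10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0), (10.5, 10.5)]
y += [1] * 5

X_res, y_res = smote_enn(X, y)
print(sorted(Counter(y).items()), "->", sorted(Counter(y_res).items()))
# [(0, 40), (1, 5)] -> [(0, 39), (1, 40)]
```

In this toy run SMOTE adds 35 synthetic minority points, and ENN removes the single majority-labelled point that lies inside the minority cluster. The resampled data would then be passed to a classifier. In practice a maintained library is preferable to hand-written resampling, for example `SMOTEENN` in imbalanced-learn together with `SVC` in scikit-learn; whether the authors used these tools is not stated in the source.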

CLC number: TP393

LI Xin, GUO Han, ZHANG Xin, HU Fang-qiang, SHUAI Ren-jun. Study on Click Fraud Detection in Online Advertising with Imbalanced Data Processing Methods[J]. Computer Science, 2018, 45(6A): 371-374. https://doi.org/
