Computer Science ›› 2019, Vol. 46 ›› Issue (8): 16-22. doi: 10.11896/j.issn.1002-137X.2019.08.003

• Big Data & Data Science •

Study on Clustering Mining of Imbalanced Data Fusion Towards Urban Hotspots

CAI Li 1,2, LI Ying-zi 2, JIANG Fang 2, LIANG Yu 2

  1. School of Computer Science, Fudan University, Shanghai 200433, China
  2. School of Software, Yunnan University, Kunming 650091, China
  • Received: 2018-11-27 Online: 2019-08-15 Published: 2019-08-15
  • Corresponding author: LIANG Yu (born 1964), male, master's degree, professor. Main research interests: intelligent transportation and cloud computing. E-mail: yuliang@ynu.edu.cn
  • About the authors: CAI Li (born 1975), female, PhD candidate, associate professor. Main research interests: data mining and intelligent transportation. LI Ying-zi (born 1994), female, master's degree. Main research interests: data mining and data quality. JIANG Fang (born 1993), female, master's student. Main research interests: data mining and data quality.
  • Supported by:
    National Natural Science Foundation of China (61663047)

Abstract: In the era of big data, data come from many sources, so multi-source data fusion has become a research focus in the field of data mining. Existing studies on multi-source data fusion concentrate mainly on fusion models and algorithms for balanced data sets within a single domain, and pay little attention to clustering mining of imbalanced data sets drawn from different domains. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is the main algorithm for mining hotspot regions, but it cannot handle imbalanced fused data: the clusters formed by minority-class data are hard to discover. To address the fusion of imbalanced data, this paper proposes a location data fusion model based on spatio-temporal features, and proposes novel methods at both the data level and the algorithm level to solve the problem of mining imbalanced data. Because the existing evaluation indexes for clustering algorithms are not suited to assessing clustering results on imbalanced data, a new comprehensive evaluation index is proposed to reflect clustering quality. GPS trajectory data from the traffic domain (the majority class) and microblog check-in data from the social domain (the minority class) are fused, and the proposed methods are then used to mine hotspot regions. Experimental results show that hotspot mining based on multi-source data fusion outperforms mining on a single source, and that the locations, distribution, and number of the discovered hotspots are consistent with the actual situation. The proposed fusion model, improved algorithm, and evaluation index are effective and feasible, and can also be applied to the fusion and analysis of location data from other sources.

Key words: Clustering criteria, Data fusion, Imbalanced data, Location data, Urban hotspots

CLC Number:

  • TP301
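
The abstract's point that DBSCAN struggles to discover clusters formed by minority-class data can be illustrated with a small sketch. The code below is a plain textbook DBSCAN, not the improved algorithm proposed in the paper, and the data are made up: a dense 10 x 10 grid stands in for GPS points (majority class) and four nearby points stand in for check-ins (minority class). The function names `region_query` and `dbscan`, the coordinates, and the values of `eps` and `min_pts` are all assumptions chosen for illustration.

```python
from math import hypot


def region_query(points, i, eps):
    """Return indices of all points within eps of points[i] (including i)."""
    xi, yi = points[i]
    return [j for j, (x, y) in enumerate(points) if hypot(x - xi, y - yi) <= eps]


def dbscan(points, eps, min_pts):
    """Textbook DBSCAN. Returns one label per point: -1 is noise, >= 0 is a cluster id."""
    unvisited, noise = None, -1
    labels = [unvisited] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not unvisited:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = noise  # may later be re-labelled as a border point
            continue
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == noise:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not unvisited:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is a core point, keep expanding
                queue.extend(j_neighbors)
        cluster_id += 1
    return labels


# Hypothetical data: 100 dense "GPS" points and 4 sparse "check-in" points.
majority = [(x, y) for x in range(10) for y in range(10)]
minority = [(50, 50), (51, 50), (50, 51), (51, 51)]
points = majority + minority

# A density threshold tuned to the majority class treats the minority as noise.
labels_strict = dbscan(points, eps=1.5, min_pts=8)
# A lower global threshold recovers the minority cluster on this toy data.
labels_loose = dbscan(points, eps=1.5, min_pts=4)

print("minority labels, min_pts=8:", labels_strict[len(majority):])
print("minority labels, min_pts=4:", labels_loose[len(majority):])
```

With `min_pts=8` the four check-in points are labelled as noise, while with `min_pts=4` they form their own cluster. On this toy data simply lowering the global threshold is enough; on real fused data a single lower threshold can also turn genuine noise in the dense source into spurious clusters, which is one reason to treat the two classes differently.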