Computer Science (计算机科学), 2019, Vol. 46, Issue 8: 16-22. doi: 10.11896/j.issn.1002-137X.2019.08.003
CAI Li (蔡莉)1,2, LI Ying-zi (李英姿)2, JIANG Fang (江芳)2, LIANG Yu (梁宇)2
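Abstract: In the big data era, data come from many sources, and the fusion of multi-source data has become a research hotspot in data mining. Existing research on multi-source data fusion focuses mainly on fusion models and algorithms for balanced datasets within the same domain, and pays little attention to cluster mining on imbalanced datasets from different domains. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is the principal algorithm for mining hotspot regions, but it cannot handle imbalanced fused data: clusters formed by minority-class data are difficult to discover. To address the fusion of imbalanced data, this paper proposes a location data fusion model based on spatio-temporal features, and proposes novel methods at both the data level and the algorithm level to solve the problem of mining imbalanced data. Because current evaluation metrics for clustering algorithms are not suitable for assessing clustering results on imbalanced data, a new comprehensive evaluation metric is proposed to reflect clustering quality. GPS trajectory data from the transportation domain (majority-class data) and Weibo check-in data from the social domain (minority-class data) are fused, and the proposed methods are then used to mine hotspot regions. Experimental results show that hotspot mining based on multi-source data fusion outperforms single-source mining, and that the locations, distribution, and number of the discovered hotspot regions are consistent with the actual situation. The proposed fusion model, improved algorithm, and evaluation metric are effective and feasible, and can also be applied to the fusion and analysis of location data from other sources.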
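The imbalance problem described in the abstract can be illustrated with a minimal sketch. It is not the paper's fusion model or improved algorithm, whose details the abstract does not give. It is a textbook DBSCAN written in plain Python, run on made-up toy coordinates: a dense 10 x 10 grid standing in for GPS trajectory points (majority class) and five points standing in for check-ins (minority class). The names `dbscan`, `majority`, `minority`, the `eps` and `min_pts` values, and the idea of simply lowering `min_pts` are all assumptions chosen for illustration.

```python
from math import hypot

NOISE = -1


def dbscan(points, eps, min_pts):
    """Textbook DBSCAN on 2-D points; returns one label per point (-1 = noise)."""
    n = len(points)
    labels = [None] * n  # None means "not visited yet"

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        xi, yi = points[i]
        return [j for j in range(n)
                if hypot(points[j][0] - xi, points[j][1] - yi) <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later be re-labelled as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reachable from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
        cluster += 1
    return labels


def count_clusters(labels):
    return len({label for label in labels if label != NOISE})


# Hypothetical toy data: 100 dense "trajectory" points and 5 "check-in" points.
majority = [(0.1 * i, 0.1 * j) for i in range(10) for j in range(10)]
minority = [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1), (5.05, 5.05)]
fused = majority + minority

# min_pts tuned to the density of the majority class.
strict = dbscan(fused, eps=0.15, min_pts=8)
# A lower min_pts, used here only to show that the minority hotspot exists.
loose = dbscan(fused, eps=0.15, min_pts=4)

print("clusters with min_pts=8:", count_clusters(strict))
print("clusters with min_pts=4:", count_clusters(loose))
```

With these toy values, the setting tuned to the majority density labels all five check-in points as noise, while the lower threshold recovers them as a second cluster. A single global density threshold is therefore one plausible reason minority-class hotspots go undetected in fused data. Lowering the threshold globally is only a demonstration device here and should not be read as the remedy the paper proposes.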