Computer Science ›› 2019, Vol. 46 ›› Issue (8): 16-22. doi: 10.11896/j.issn.1002-137X.2019.08.003

• Big Data & Data Science •

Study on Clustering Mining of Imbalanced Data Fusion Towards Urban Hotspots

CAI Li1,2, LI Ying-zi2, JIANG Fang2, LIANG Yu2

  1. School of Computer Science, Fudan University, Shanghai 200433, China
  2. School of Software, Yunnan University, Kunming 650091, China
  • Received: 2018-11-27 Online: 2019-08-15 Published: 2019-08-15

Abstract: In the era of big data, multi-source data fusion is a trending topic in data mining. Previous studies have mostly focused on fusion models and algorithms for balanced data sets, and seldom on clustering mining for imbalanced data sets. DBSCAN is a classical algorithm for mining urban hotspots. However, it cannot deal with imbalanced location data, and the clusters formed by the minority class are difficult to discover. To address imbalanced data fusion, this paper proposes a novel fusion model based on spatio-temporal features, together with a novel approach that tackles the mining of imbalanced data at both the data level and the algorithm level. Because the evaluation indices used for current clustering algorithms are not suitable for evaluating clustering results on imbalanced data, a new comprehensive evaluation index is proposed to reflect clustering quality. GPS trajectory data (the majority class) from the traffic domain and microblog check-in data (the minority class) from the social domain are fused, and the proposed method is then used to mine hotspots. The hotspots mined from the fused multi-source data are better than those obtained from a single data source. The location, distribution and number of hotspots are consistent with the actual situation. The proposed fusion model, algorithm and evaluation index are effective and feasible, and can also be used for the fusion and analysis of location data from other sources.

Key words: Imbalanced data, Data fusion, Urban hotspots, Clustering criteria, Location data

CLC Number: TP301
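
The limitation noted in the abstract, that DBSCAN with a single global density threshold tends to miss hotspots formed by the minority-class source, can be illustrated with a small sketch. This is not the paper's fusion model or algorithm: the coordinates are synthetic (units of roughly 100 m are assumed), the parameter values are hypothetical, and the per-source rerun at the end is only one simple data-level workaround chosen for illustration.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    xi, yi = points[i]
    return [j for j, (xj, yj) in enumerate(points)
            if math.hypot(xi - xj, yi - yj) <= eps]

def dbscan(points, eps, min_pts):
    """Plain DBSCAN; returns one label per point, NOISE for unclustered points."""
    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE          # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in neighbors if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: joins, does not expand
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # core point: keep expanding
    return labels

# Synthetic majority source: a dense 15 x 15 grid of "GPS" points.
majority = [(a, b) for a in range(15) for b in range(15)]
# Synthetic minority source: five sparse "check-in" points at another hotspot.
minority = [(50, 50), (53, 50), (50, 53), (53, 53), (51.5, 51.5)]

# One global setting tuned to the dense source (hypothetical values).
fused_labels = dbscan(majority + minority, eps=1.5, min_pts=5)
majority_labels = fused_labels[:len(majority)]
minority_labels = fused_labels[len(majority):]
print(set(majority_labels))   # {0}: the dense hotspot is found
print(minority_labels)        # [-1, -1, -1, -1, -1]: the sparse hotspot is noise

# Source-specific setting for the sparse source (hypothetical values).
minority_only = dbscan(minority, eps=3.5, min_pts=3)
print(minority_only)          # [0, 0, 0, 0, 0]: the sparse hotspot is recovered
```

With the global setting, every minority point has no neighbor within `eps`, so the second hotspot is discarded as noise even though it is spatially coherent. Rerunning with parameters matched to the sparse source recovers it, which hints at why imbalance has to be handled explicitly when sources of very different density are fused.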