Computer Science ›› 2020, Vol. 47 ›› Issue (5): 90-95. doi: 10.11896/jsjkx.190300150

• Database & Big Data & Data Science •

Multi-label Learning Algorithm Based on Association Rules in Big Data Environment

WANG Qing-song, JIANG Fu-shan, LI Fei

  1. College of Information, Liaoning University, Shenyang 110036, China
  • Received: 2019-03-28 Online: 2020-05-15 Published: 2020-05-19
  • Corresponding author: WANG Qing-song (1301833668@qq.com)
  • About author: WANG Qing-song, born in 1974, associate professor. His main research interests include big data and data mining.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61802160).

Abstract: In traditional single-label mining research, each sample belongs to exactly one label and the labels are mutually exclusive. In multi-label learning, by contrast, a sample may correspond to several labels, and the labels are often correlated. Correlation among labels has therefore become a popular topic in multi-label learning research. First, to suit the big data environment, the traditional association rule mining algorithm Apriori is parallelized, and a Hadoop-based parallel algorithm, Apriori_ING, is proposed, in which each node independently generates candidate itemsets, prunes them, and counts their support, making full use of parallelization. Label sets are then generated from the frequent itemsets and association rules obtained by Apriori_ING, and an inference-engine-based label set generation algorithm, IETG, is proposed. Next, the label sets are applied to multi-label learning, and a multi-label learning algorithm, FreLP, is proposed. FreLP uses association rules to generate label sets, decomposes the original label set into multiple subsets, and then trains classifiers with the LP algorithm. FreLP is compared experimentally with existing multi-label learning algorithms, and the results show that it achieves better results under different evaluation metrics.

Key words: Apriori, Association rule, Hadoop, LP, Multi-label learning
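
The two building blocks named in the abstract, Apriori-style mining of co-occurring labels and the Label Powerset (LP) transformation, can be illustrated with a minimal serial sketch. This is not the authors' Apriori_ING (which runs on Hadoop), IETG, or FreLP; the function names `frequent_label_sets` and `label_powerset_targets`, the toy label data, and the support threshold are all assumptions made for illustration only.

```python
from collections import Counter
from itertools import combinations


def frequent_label_sets(label_rows, min_support):
    """Level-wise (Apriori-style) search for label sets that co-occur
    in at least `min_support` samples. Serial sketch, no parallelism."""
    rows = [frozenset(r) for r in label_rows]
    # Level 1: count single labels.
    counts = Counter(frozenset([label]) for r in rows for label in r)
    current = {s for s, n in counts.items() if n >= min_support}
    frequent = {s: counts[s] for s in current}
    k = 2
    while current:
        # Join step: unions of frequent (k-1)-sets that have exactly k labels.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        # Support counting: number of samples containing the candidate.
        counts = {c: sum(1 for r in rows if c <= r) for c in candidates}
        current = {c for c, n in counts.items() if n >= min_support}
        frequent.update({c: counts[c] for c in current})
        k += 1
    return frequent


def label_powerset_targets(label_rows, label_subset):
    """LP transform restricted to one label subset: each distinct
    combination of the subset's labels becomes one class id."""
    subset = frozenset(label_subset)
    class_ids = {}
    targets = []
    for r in label_rows:
        combo = frozenset(r) & subset
        targets.append(class_ids.setdefault(combo, len(class_ids)))
    return targets, class_ids


# Toy label data (hypothetical), one set of labels per sample.
rows = [
    {"beach", "sea"},
    {"beach", "sea", "sunset"},
    {"sea", "sunset"},
    {"mountain"},
    {"beach", "sea"},
]
freq = frequent_label_sets(rows, min_support=2)
targets, class_ids = label_powerset_targets(rows, {"beach", "sea"})
```

In this sketch a frequent label set such as {beach, sea} defines one label subset, and `label_powerset_targets` turns that subset into a single multi-class target that any ordinary classifier could be trained on. How the paper actually selects and combines subsets is determined by IETG and is not reproduced here.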

CLC Number:

  • TP301