Computer Science, 2020, Vol. 47, Issue (5): 90-95. doi: 10.11896/jsjkx.190300150

• Database & Big Data & Data Science •

Multi-label Learning Algorithm Based on Association Rules in Big Data Environment

WANG Qing-song, JIANG Fu-shan, LI Fei

  1. College of Information, Liaoning University, Shenyang 110036, China
  • Received: 2019-03-28 Online: 2020-05-15 Published: 2020-05-19
  • Corresponding author: WANG Qing-song (1301833668@qq.com)
  • About author: WANG Qing-song, born in 1974, associate professor. His main research interests include big data and data mining.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61802160).

Abstract: In traditional single-label mining research, each sample belongs to exactly one label, and the labels are mutually exclusive. In multi-label learning, by contrast, one sample may correspond to several labels, and the labels are often correlated with one another. The study of label correlations has therefore become a hot topic in multi-label learning research. First, to suit big data environments, the traditional association rule mining algorithm Apriori is parallelized: a Hadoop-based parallel algorithm, Apriori_ING, is proposed, in which each node independently completes candidate itemset generation, pruning, and support counting, making full use of the advantages of parallelization. Label sets are then generated from the frequent itemsets and association rules obtained by Apriori_ING, and an inference-engine-based label set generation algorithm, IETG, is proposed. Next, the label sets are applied to multi-label learning, and a multi-label learning algorithm, FreLP, is proposed. FreLP uses association rules to generate label sets, decomposes the original label set into multiple subsets, and then trains classifiers with the LP algorithm. FreLP is compared experimentally with existing multi-label learning algorithms, and the results show that the proposed algorithm achieves better results under different evaluation metrics.

Key words: Multi-label learning, LP, Association rule, Apriori, Hadoop
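
The abstract names three steps that each node performs in Apriori_ING: candidate itemset generation, pruning, and support counting. This page does not give the Hadoop implementation, so the following is only a minimal single-machine sketch of classic Apriori applied to per-sample label sets. The function name `apriori`, the threshold, and the toy data are illustrative assumptions, not the authors' code.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {frozenset(itemset): support count} for all frequent itemsets.

    Classic single-machine Apriori; this is NOT the paper's Apriori_ING,
    only an illustration of the three steps the abstract names.
    """
    transactions = [frozenset(t) for t in transactions]

    # Level 1: count single items.
    counts = {}
    for t in transactions:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= min_support}
    result = dict(frequent)

    k = 2
    while frequent:
        prev = list(frequent)
        # Candidate generation: join frequent (k-1)-itemsets into k-itemsets.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Pruning: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in frequent for s in combinations(c, k - 1))
        }
        # Support counting: scan the transactions once per level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        frequent = {s: n for s, n in counts.items() if n >= min_support}
        result.update(frequent)
        k += 1
    return result


# Toy label sets (hypothetical data, one set of labels per sample).
samples = [
    {"beach", "sea", "sky"},
    {"beach", "sea"},
    {"sea", "sky"},
    {"beach", "sea", "sky"},
    {"mountain", "sky"},
]
frequent_itemsets = apriori(samples, min_support=3)
```

In a MapReduce setting the support-counting scan is the natural part to distribute, since each node can count candidates over its own data split; how Apriori_ING also localizes generation and pruning is described in the full paper rather than here.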

CLC Number: 

  • TP301
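
According to the abstract, FreLP decomposes the original label set into subsets and trains an LP (Label Powerset) classifier on each. The sketch below shows only the LP transformation applied per subset: each distinct label combination inside a subset becomes one class of an ordinary multi-class problem. The helper name `label_powerset_targets`, the toy data, and the particular subsets are assumptions for illustration; this page does not specify how FreLP derives its subsets from the mined label sets.

```python
def label_powerset_targets(label_sets, subset):
    """Map each sample's labels, restricted to `subset`, to one class id.

    Label Powerset (LP) treats every distinct label combination as a class.
    Returns (class ids per sample, list mapping class id -> label combination).
    """
    class_of = {}   # label combination -> class id
    targets = []
    for labels in label_sets:
        combo = frozenset(labels) & frozenset(subset)
        if combo not in class_of:
            class_of[combo] = len(class_of)
        targets.append(class_of[combo])
    # Order the combinations by the class id they were assigned.
    classes = sorted(class_of, key=class_of.get)
    return targets, classes


# Hypothetical data: per-sample label sets and an assumed decomposition
# of the label space into two subsets.
label_sets = [
    {"beach", "sea", "sky"},
    {"beach", "sea"},
    {"sea", "sky"},
    {"mountain", "sky"},
]
subsets = [{"beach", "sea"}, {"sky", "mountain"}]

# One multi-class problem per label subset.
lp_problems = {
    tuple(sorted(s)): label_powerset_targets(label_sets, s) for s in subsets
}
```

Any multi-class learner could then be trained on each target list, and a predicted class id maps back to a label combination through the returned `classes` list. Working on small subsets keeps the number of LP classes far below the number of combinations possible over the full label set.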