Computer Science, 2016, Vol. 43, Issue (8): 190-193. doi: 10.11896/j.issn.1002-137X.2016.08.038

• Artificial Intelligence •

GS-SVDD: A Multiclass Text Classification Algorithm

吴德,刘三阳,梁锦锦   

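  1. School of Computer Science, Xidian University, Xi'an 710071; School of Computer Science, Xidian University, Xi'an 710071; School of Science, Xi'an Shiyou University, Xi'an 710065
  • Publication date: 2018-12-01 Release date: 2018-12-01
  • Funding:
    This work was supported by the National Natural Science Foundation of China (61373174), the Natural Science Foundation of the Shaanxi Provincial Department of Education (2010JK773), and the Doctoral Research Fund of Xi'an Shiyou University (Z10027).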

Multiclass Text Classification by Golden Selection and Support Vector Domain Description

WU De, LIU San-yang and LIANG Jin-jin   

  • Online:2018-12-01 Published:2018-12-01

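Abstract: Traditional multiclass text classification algorithms suffer from heavy computation and long training times. To address this, a classification algorithm for multiclass text is constructed from golden section (Golden Selection, GS) and support vector domain description (SVDD). GS-SVDD first uses the term frequency-inverse document frequency (TF-IDF) formula to compute the relative frequency of each term, sorts the terms in descending order of that value, and normalizes the resulting text vectors. Second, it applies the golden section method to reduce the dimensionality of the text vectors, so that at most one redundant sample feature remains. Finally, it performs multiclass classification with SVDD, assigning each test text to the class with the smaller relative class distance. Numerical experiments on different datasets show that GS-SVDD is more stable, more accurate, and faster to train than "one-against-one" and "one-against-all" support vector machines, which makes it better suited to multiclass classification of massive text collections.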

Keywords: multiclass text classification, golden section, support vector domain description, dimension reduction, massive text
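
The first two steps named in the abstract (TF-IDF weighting with descending sort and normalization, followed by golden-section dimension reduction) can be illustrated with a minimal Python sketch. The abstract does not state the exact reduction rule, so the single cut at the 0.618 fraction of the ranked vocabulary is an assumption, as are the ranking by summed TF-IDF, the L2 normalization, and every function name used here (`tfidf_scores`, `rank_terms`, `golden_cut`, `to_unit_vector`).

```python
import math
from collections import Counter

GOLDEN_RATIO = (math.sqrt(5) - 1) / 2  # about 0.618


def tfidf_scores(docs):
    """Return one {term: tf-idf} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


def rank_terms(scores):
    """Order the vocabulary by summed tf-idf, highest first (assumed rule)."""
    totals = Counter()
    for doc_scores in scores:
        totals.update(doc_scores)  # adds the float scores term by term
    return sorted(totals, key=lambda term: (-totals[term], term))


def golden_cut(ranked_terms):
    """Keep the top 0.618 fraction of ranked terms (hypothetical rule)."""
    keep = max(1, round(len(ranked_terms) * GOLDEN_RATIO))
    return ranked_terms[:keep]


def to_unit_vector(doc_scores, kept_terms):
    """Project a document onto the kept terms and L2-normalize it."""
    vector = [doc_scores.get(term, 0.0) for term in kept_terms]
    norm = math.sqrt(sum(value * value for value in vector))
    return [value / norm for value in vector] if norm > 0 else vector


docs = [
    ["svm", "text", "text"],
    ["svm", "golden", "section"],
    ["svm", "text", "class"],
]
scores = tfidf_scores(docs)
kept = golden_cut(rank_terms(scores))
vectors = [to_unit_vector(doc_scores, kept) for doc_scores in scores]
```

Terms that occur in every document (here `svm`) receive a TF-IDF of zero and fall to the end of the ranking, so they are the first to be dropped by the cut.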

Abstract: Traditional multiclass text classification methods suffer from heavy computation and long training times. This paper proposes a multiclass text classification algorithm, GS-SVDD, based on the golden section method (golden selection, GS) and support vector domain description (SVDD). The method first uses the TF-IDF formula to compute the relative word frequency of each entry, sorts the entries in descending order of that value, and normalizes the text vectors. It then applies the golden section method for dimension reduction, so that no more than one redundant sample feature remains. Finally, SVDD performs the classification, assigning each test text to the class with the smallest relative class distance. Numerical experiments on various datasets show that, compared with "one-against-one" and "one-against-all" support vector machines, the proposed method is more robust, more accurate, and faster to train, which makes it more appropriate for large-scale multiclass text classification.

Key words: Multiclass text classification, Golden selection, SVDD, Dimension reduction, Huge text
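
The classification rule in the abstract (assign a test text to the class with the smallest relative class distance) can be sketched as follows. This is a simplified stand-in, not the authors' method: true SVDD fits a minimum enclosing hypersphere per class by solving a quadratic program, usually with a kernel, whereas the sketch below uses the class centroid as the center and the farthest training sample as the radius. Taking the relative class distance to be the distance to the center divided by the radius is also an assumption, and the names `fit_spheres`, `relative_distance`, and `classify` are hypothetical.

```python
import math


def fit_spheres(samples_by_class):
    """Fit one (center, radius) pair per class from its training vectors."""
    spheres = {}
    for label, samples in samples_by_class.items():
        dim = len(samples[0])
        # Centroid sphere: a simplification of the SVDD hypersphere.
        center = [sum(s[i] for s in samples) / len(samples) for i in range(dim)]
        radius = max(math.dist(center, s) for s in samples)
        spheres[label] = (center, radius)
    return spheres


def relative_distance(x, sphere):
    """Distance from x to the sphere center, scaled by the sphere radius."""
    center, radius = sphere
    distance = math.dist(x, center)
    if radius == 0:
        # Degenerate class with a single distinct point.
        return 0.0 if distance == 0 else math.inf
    return distance / radius


def classify(x, spheres):
    """Return the label whose sphere gives the smallest relative distance."""
    return min(spheres, key=lambda label: relative_distance(x, spheres[label]))


training = {
    "a": [[0.0, 0.0], [2.0, 0.0]],
    "b": [[10.0, 10.0], [10.0, 12.0]],
}
spheres = fit_spheres(training)
label = classify([1.5, 0.0], spheres)
```

Because each class needs only one sphere, this scheme trains one model per class rather than one per pair of classes, which is consistent with the shorter training time the abstract reports against one-against-one SVMs.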

