Computer Science, 2018, Vol. 45, Issue (6): 208-210. doi: 10.11896/j.issn.1002-137X.2018.06.037

• Artificial Intelligence •

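Improved Autoencoder Based Classification Algorithm for Text

XU Zhuo-bin, ZHENG Hai-shan, PAN Zhu-hong

  1. Information and Network Center, Xiamen University, Xiamen, Fujian 361005, China
  • Received: 2018-02-28 Online: 2018-06-15 Published: 2018-07-24
  • About the authors: XU Zhuo-bin (born 1975), male, master's degree, senior engineer; main research interests: data centers, campus networks, university informatization application development, virtualization, parallel computing, and big data; E-mail: zbxu@xmu.edu.cn (corresponding author). ZHENG Hai-shan (born 1979), male, master's degree, senior engineer; main research interests: big data, and campus informatization construction and management. PAN Zhu-hong (born 1982), female, master's degree, senior engineer; main research interests: campus network management, network logs, and intelligent analysis.
  • Supported by:
    This work was supported by the CERNET Next Generation Internet Technology Innovation Project (NGII20160410).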

Abstract: A vector representation of words is a prerequisite for text mining applications. To improve the effectiveness of autoencoders in word embedding and the accuracy of text classification, this paper proposes an improved autoencoder and applies it to text classification. Building on the traditional autoencoder, a global adjustment function is added to the hidden layer; it shifts feature values with small absolute values onto feature values with large absolute values, which makes the hidden-layer feature vector sparse. Once the adjusted feature vector is obtained, a fully connected neural network is used to classify the text. Experiments on the 20news dataset show that the proposed method produces better word embeddings and also performs better in text classification.

Key words: Autoencoder, Embedding vector, Neural network, Text mining

CLC Number: 

  • TP391.4
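
The abstract describes the hidden-layer adjustment only in words and does not give its formula. The following is a minimal sketch of one way such a step could look, not the authors' implementation. It assumes a single tanh hidden layer and a hypothetical rule that zeroes the entries with the smallest absolute values and moves their absolute mass onto the entries with the largest absolute values. The names `encode`, `global_adjust`, and `keep_ratio` are illustrative assumptions.

```python
import math


def encode(x, weights, bias):
    """One tanh hidden layer, h = tanh(W x + b) (assumed encoder form)."""
    return [
        math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]


def global_adjust(hidden, keep_ratio=0.5):
    """Hypothetical adjustment rule (an assumption, not from the paper):
    zero the smallest-|h| entries and move their absolute mass onto the
    surviving large-|h| entries, preserving signs."""
    n_keep = max(1, int(len(hidden) * keep_ratio))
    # Indices sorted by absolute value, largest first.
    order = sorted(range(len(hidden)), key=lambda i: abs(hidden[i]), reverse=True)
    keep = set(order[:n_keep])
    moved = sum(abs(h) for i, h in enumerate(hidden) if i not in keep)
    kept_mass = sum(abs(hidden[i]) for i in keep)
    if kept_mass == 0.0:
        return [0.0] * len(hidden)
    # Scale the kept entries so the total absolute mass is unchanged.
    scale = 1.0 + moved / kept_mass
    return [h * scale if i in keep else 0.0 for i, h in enumerate(hidden)]


# Toy forward pass: 3 input features, 4 hidden units (values are made up).
weights = [[0.2, -0.1, 0.4], [-0.5, 0.3, 0.1], [0.05, 0.0, -0.02], [0.7, 0.2, -0.3]]
bias = [0.0, 0.1, 0.0, -0.1]
hidden = encode([1.0, 0.0, 1.0], weights, bias)
sparse_hidden = global_adjust(hidden)
```

In this sketch, half of the hidden units are set to zero and the rest are scaled up, so the sparse vector keeps the same total absolute mass. The resulting `sparse_hidden` vector would then be the input to the fully connected classifier mentioned in the abstract.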