Computer Science, 2021, Vol. 48, Issue (3): 220-226. doi: 10.11896/jsjkx.200200061
鲁博仁, 胡世哲, 娄铮铮, 叶阳东
LU Bo-ren, HU Shi-zhe, LOU Zheng-zheng, YE Yang-dong
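Abstract: Railway text classification is of considerable practical value to the development of China's railway industry. Existing Chinese text feature extraction methods rely on segmenting the text into words beforehand, but segmentation accuracy on railway text is low, so the extracted features suffer from limitations such as insufficient semantic understanding and incomplete feature coverage. To address these problems, a character-level feature extraction method, CLW2V (Character Level-Word2Vec), is proposed, which effectively handles the difficulties caused by the abundant and highly complex specialized vocabulary in railway text. Compared with the word-based TF-IDF and Word2Vec methods, the character-based CLW2V method extracts finer-grained text features and avoids the poor feature extraction that results from the traditional methods' dependence on prior word segmentation. Experiments on a railway safety-supervision ticketing (安监发牌) dataset show that, for railway text classification, the CLW2V feature extraction method outperforms the traditional segmentation-dependent TF-IDF and Word2Vec methods.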
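The character-level idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the details of CLW2V, so everything here is an assumption for illustration: the helper names `char_tokenize` and `skipgram_pairs`, the window size, and the sample phrase are hypothetical. The sketch only shows how text can be split into characters without a word segmenter and turned into the (center, context) pairs that a skip-gram Word2Vec model is typically trained on.

```python
from typing import List, Tuple


def char_tokenize(text: str) -> List[str]:
    """Split text into single characters, dropping whitespace.

    No word segmenter is involved, so segmentation errors on
    domain-specific vocabulary cannot occur at this step.
    """
    return [ch for ch in text if not ch.isspace()]


def skipgram_pairs(tokens: List[str], window: int = 2) -> List[Tuple[str, str]]:
    """Generate (center, context) pairs within a symmetric window,
    the usual training input for a skip-gram Word2Vec model."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


# Hypothetical railway phrase ("turnout fault"); not taken from the paper's dataset.
sample = "道岔故障"
chars = char_tokenize(sample)
pairs = skipgram_pairs(chars, window=1)
```

In practice the character sequences would be passed to a Word2Vec implementation in place of segmented words; how the resulting character vectors are combined into a document feature in CLW2V is not stated in the abstract.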