Computer Science ›› 2021, Vol. 48 ›› Issue (3): 220-226. doi: 10.11896/jsjkx.200200061

• Artificial Intelligence •

Character-level Feature Extraction Method for Railway Text Classification

LU Bo-ren, HU Shi-zhe, LOU Zheng-zheng, YE Yang-dong

  1. School of Information Engineering, Zhengzhou University, Zhengzhou 450001, China
  • Received: 2020-02-15 Revised: 2020-06-17 Online: 2021-03-15 Published: 2021-03-05
  • Corresponding author: LOU Zheng-zheng (zzlou@zzu.edu.cn)
  • Author e-mail: zlyylbr4412@zzu.edu.cn
  • About author: LU Bo-ren, born in 1996, master. His main research interests include machine learning and natural language processing.
    LOU Zheng-zheng, born in 1984, associate professor, master supervisor, is a member of China Computer Federation. His main research interests include machine learning and pattern recognition.
  • Supported by:
    National Key Research and Development Program (2018YFB1201403) and Youth Program of National Natural Science Foundation of China (61502434).

Abstract: Railway text classification is of great practical significance to the development of China's railway industry. Existing Chinese text feature extraction methods depend on segmenting the text into words beforehand. Word segmentation accuracy on railway text is low, however, so features extracted from railway text suffer from inadequate semantic understanding and incomplete feature coverage. To address these problems, this paper proposes a character-level feature extraction method, CLW2V (Character Level-Word2Vec), which effectively handles the rich and highly complex professional vocabulary found in railway texts. Compared with the TF-IDF and Word2Vec methods, which are based on word-level features, the character-based CLW2V method extracts finer-grained text features and avoids the poor feature extraction that results from relying on prior word segmentation. Experiments on a railway safety supervision and licensing dataset show that, for railway text classification, CLW2V outperforms the traditional segmentation-dependent TF-IDF and Word2Vec methods.

Key words: Character level vector, Feature extraction method, Railway short text, Text classification
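The abstract does not give implementation details, so the following is only a minimal sketch of the general idea of character-level Word2Vec-style feature extraction, not the authors' CLW2V implementation. It assumes that each character is treated as a token (so no word segmenter is needed), that Skip-gram style (center, context) pairs are formed from those characters, and that per-character vectors are averaged into a document feature. The helper names `char_tokens`, `skipgram_pairs` and `mean_vector`, the sample phrase, and the toy embedding values are all hypothetical.

```python
def char_tokens(text):
    """Split text into single characters, dropping whitespace.

    No word segmentation is involved: every character is its own token.
    """
    return [ch for ch in text if not ch.isspace()]


def skipgram_pairs(tokens, window=2):
    """Build (center, context) pairs of the kind a Skip-gram model trains on."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


def mean_vector(tokens, embeddings, dim):
    """Average per-character vectors into one fixed-length feature.

    Averaging is an assumption made for illustration; the abstract does not
    say how character vectors are combined.
    """
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return [0.0] * dim
    return [sum(v[k] for v in vecs) / len(vecs) for k in range(dim)]


# Hypothetical railway phrase ("turnout fault"); a segmenter might split it
# incorrectly, whereas the character view is unambiguous.
sample = "道岔故障"
tokens = char_tokens(sample)
pairs = skipgram_pairs(tokens, window=1)

# Toy 2-dimensional embeddings, NOT trained values.
toy_embeddings = {"道": [1.0, 0.0], "岔": [0.0, 1.0]}
feature = mean_vector(tokens, toy_embeddings, dim=2)
```

In practice the pairs would be fed to a Word2Vec trainer and the resulting character vectors would replace the toy table; the point of the sketch is only that the pipeline never calls a word segmenter.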

CLC Number: U229