Computer Science ›› 2017, Vol. 44 ›› Issue (10): 276-282, 288. doi: 10.11896/j.issn.1002-137X.2017.10.050

• Artificial Intelligence •

基于词频统计规律的文本数据预处理方法

池云仙,赵书良,罗燕,高琳,赵骏鹏,李超   

  1. College of Mathematics and Information Science, Hebei Normal University, Shijiazhuang 050024; Hebei Key Laboratory of Computational Mathematics and Applications, Hebei Normal University, Shijiazhuang 050024 (all authors)
  • Publication date: 2018-12-01 Release date: 2018-12-01
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (71271067), the Major Program of the National Social Science Fund of China (13&ZD091), the Science and Technology Research Project of Higher Education Institutions of Hebei Province (QN2014196), and the Master's Fund of Hebei Normal University (xj2015003).

Text Data Preprocessing Based on Term Frequency Statistics Rules

CHI Yun-xian, ZHAO Shu-liang, LUO Yan, GAO Lin, ZHAO Jun-peng and LI Chao   

  • Online:2018-12-01 Published:2018-12-01

摘要: 在大数据时代,文本挖掘面临特征的“高维-稀疏”问题,海量文本词汇与稀少关键特征间的矛盾导致了高时空复杂度和低效率等问题,严重制约了文本挖掘效率,因此在文本挖掘前进行有效的数据预处理至关重要。传统文本挖掘算法在数据预处理阶段只进行分词和去停用词操作。为提高性能,提出基于词频统计规律的文本数据预处理方法。首先,基于齐普夫定律和最大值法推导同频词数表达式;然后,基于同频词数表达式探究各频次词语在文中的分布规律,结果表明词频为1和2的词语与文档的关联度较低,但比重高达2/3;最后,基于词频统计规律进行数据预处理,在预处理阶段去除低频词,减小特征维度。在公共数据集Reuters-21578和20-Newsgroups上进行的实验的结果表明,各频次词语的分布规律是正确的,基于词频统计规律的文本数据预处理方法在分类准确率、精确率、召回率以及F1度量值方面均有提升,运行时间明显降低,文本挖掘效率得到显著提高。

关键词: 大数据,文本挖掘,数据预处理,词频统计

Abstract: In the era of big data, text mining faces the "high-dimensional, sparse" feature problem. The mismatch between the massive vocabulary of a text collection and its few key features leads to high time and space complexity and low efficiency, which severely limits text mining. Effective data preprocessing before mining is therefore crucial. Traditional text mining algorithms perform only word segmentation and stop-word removal in the preprocessing stage. To improve performance, a data preprocessing algorithm based on term frequency statistics rules (DPTFSR) is proposed. First, an expression for the number of terms with the same frequency is derived from Zipf's law and the maximum value method. Then, this expression is used to examine how terms of each frequency are distributed in a text; the results show that terms with frequency 1 or 2 have low relevance to the document yet account for as much as 2/3 of the terms. Finally, data are preprocessed according to these rules: low-frequency terms are removed in the preprocessing stage, which reduces the feature dimension. Experiments on the public data sets Reuters-21578 and 20-Newsgroups confirm the distribution rules and show that DPTFSR improves classification accuracy, precision, recall and F1 measure while markedly reducing running time, so the efficiency of text mining is significantly improved.

Key words: Big data, Text mining, Data preprocessing, Term frequency statistics
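
The preprocessing step described in the abstract (stop-word removal followed by removal of terms with frequency 1 or 2) can be illustrated with a minimal Python sketch. This is not the authors' DPTFSR implementation. The function names, the tiny stop-word list, the regex tokenizer, and the choice to count frequencies within a single document are all assumptions made for illustration. The helper `textbook_share` uses the textbook low-frequency form of Zipf's law, in which roughly D/(n(n+1)) of D distinct terms occur exactly n times. The expression derived in the paper may differ, but this form gives 1/2 + 1/6 = 2/3 for n = 1 and n = 2, which is consistent with the proportion quoted in the abstract.

```python
from collections import Counter
import re

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"the", "a", "of", "and", "in", "to", "is"}


def tokenize(text):
    """Lowercase, split on non-letters, and drop stop words.

    A stand-in for the segmentation and stop-word steps; not the paper's tokenizer.
    """
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def prune_low_frequency(tokens, min_count=3):
    """Remove terms that occur fewer than min_count times in this token list.

    Returns the kept tokens and the share of distinct terms that was removed.
    Counting per document is an assumption; a corpus-wide count is also possible.
    """
    counts = Counter(tokens)
    kept = [t for t in tokens if counts[t] >= min_count]
    if not counts:
        return kept, 0.0
    removed = sum(1 for c in counts.values() if c < min_count)
    return kept, removed / len(counts)


def textbook_share(n):
    """Share of distinct terms occurring exactly n times, using 1 / (n (n + 1))."""
    return 1.0 / (n * (n + 1))


if __name__ == "__main__":
    tokens = tokenize("the cat sat cat cat dog dog bird")
    kept, removed_share = prune_low_frequency(tokens)
    print(kept, removed_share)
    print(textbook_share(1) + textbook_share(2))
```

With the default `min_count=3`, only terms that occur at least three times survive, which corresponds to removing frequency-1 and frequency-2 terms. Exposing the threshold as a parameter makes it easy to compare against the traditional pipeline by setting `min_count=1`.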

