Computer Science, 2017, Vol. 44, Issue (10): 276-282. doi: 10.11896/j.issn.1002-137X.2017.10.050


Text Data Preprocessing Based on Term Frequency Statistics Rules

CHI Yun-xian, ZHAO Shu-liang, LUO Yan, GAO Lin, ZHAO Jun-peng and LI Chao   

  • Online: 2018-12-01   Published: 2018-12-01

Abstract: In the age of big data, feature terms in text mining face the challenge of being high-dimensional and sparse. The mismatch between the enormous number of terms and the scarcity of features leads to high time and space complexity and seriously restricts the efficiency of text mining. It is therefore crucial to preprocess data before mining text. Traditional text mining algorithms perform only word segmentation and stop-word removal during preprocessing. To improve this step, a data preprocessing algorithm based on term frequency statistics rules (DPTFSR) is proposed. First, an expression for the number of terms with identical frequency is derived from Zipf's law and the rule of maximum area. Next, the distribution of terms with identical frequency is examined. Low-frequency terms are found to account for up to 2/3 of the terms in documents, yet there is little relevance among them. Finally, data is preprocessed according to the term frequency statistics rules: low-frequency terms are deleted, which greatly reduces the feature dimension. The correctness of the term frequency statistics rules and the validity of DPTFSR are verified on the Reuters-21578 and 20 Newsgroups data sets. Experimental results show that accuracy, precision, recall and F1 measure increase and running time is markedly shortened, so the efficiency of text mining is significantly enhanced.

Key words: Big data, Text mining, Data preprocessing, Term frequency statistics
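
The low-frequency pruning step described in the abstract can be illustrated with a minimal sketch. This is not the DPTFSR algorithm itself: the abstract does not give the derived expression or the frequency cutoff, so the `min_count` threshold, the regex tokenizer, and the function names `tokenize`, `prune_low_frequency_terms` and `terms_per_frequency` are all assumptions made for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens (assumed tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def prune_low_frequency_terms(documents, min_count=2):
    """Drop terms whose corpus-wide count is below min_count.

    min_count is a hypothetical cutoff; the paper derives its own rule
    for what counts as a low-frequency term.
    """
    tokenized = [tokenize(doc) for doc in documents]
    # Corpus-wide term frequencies.
    counts = Counter(term for doc in tokenized for term in doc)
    # Keep only terms that occur often enough.
    vocabulary = {term for term, c in counts.items() if c >= min_count}
    pruned = [[term for term in doc if term in vocabulary] for doc in tokenized]
    return pruned, vocabulary, counts


def terms_per_frequency(counts):
    """Map each frequency n to the number of distinct terms occurring exactly n times."""
    return Counter(counts.values())


if __name__ == "__main__":
    docs = [
        "the cat sat on the mat",
        "the dog sat on the log",
        "a cat and a dog",
    ]
    pruned, vocabulary, counts = prune_low_frequency_terms(docs, min_count=2)
    print(sorted(vocabulary))
    print(dict(terms_per_frequency(counts)))
```

In this toy corpus, three of the nine distinct terms occur only once and are removed, so the vocabulary shrinks from nine terms to six. On a real corpus, `terms_per_frequency` gives the empirical counts of terms sharing a frequency, which could be compared against whatever expression the paper derives.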

