Computer Science ›› 2017, Vol. 44 ›› Issue (Z11): 39-45.doi: 10.11896/j.issn.1002-137X.2017.11A.007

Previous Articles     Next Articles

Repetitive Pattern Recognition Algorithms and Applications in Web Information Extraction and Clustering Analysis

Munina YUSUFU and Gulina YUSUFU   

  • Online:2018-12-01 Published:2018-12-01

Abstract: Detection of repetitive patterns in sequences and their applications have become fundamental research areas in data mining.It is a very important means for extracting useful information from sequences.In recent years,many works have been conducted focusing on the definitions of repetitive patterns,efficient recognition algorithm designs,and applications in the relevant areas.In this paper,the classification and characteristics of repetitive patterns in sequence were briefly described,and the commonly used data structures in the algorithms were discussed.Recent studies on the applications of detection algorithms in relevant fields and their main design ideas were reviewed by discussing and evaluating certain aspects.These aspects include the field knowledge and constraints,identifying results,scalabilities of algorithms and existing main problems.Application areas include Web information extraction,Web document feature extraction and clustering algorithms,and related Uyghur language information processing.Finally,some challenges on detection algorithms of repetitive patterns in sequences and applications in various fields were discussed and future research trends were also explored.

Key words: Repetitive patterns,Web document features,Web information extraction,Clustering algorithms,Uyghur information processing

[1] LANDER E S,LINTON L M,BIRREN B,et al.Initial Sequencing and Analysis of the Human Genome[J].Nature,2001,409(6822):860-921.
[2] MAKALOWSKI W.Not junk after all[J].Science,2003,300(5623):1246-1247.
[3] Int’l Human Genome Sequencing Consortium.Finishing theeuchromatic sequence of the human genome[J].Nature,2004,431(7011):931-945.
[4] SHAPIRO J A,STERNBERG R V.Why repetitive DNA is essential to genome function[J].Biological Reviews,2005,80(2):227.
[5] TREANGEN T J,SALZBERG S L.Repetitive DNA and next-generation sequencing:Computational challenges and solutions [J].Nature Reviews Genetics,2011,13(1):36-46.
[6] SMYTH W F,YUSUFU M.Computing Regularities in Strings[C]∥Proceedings of the 2nd IEEE International Conference on Computer Science and Information Technology.2009:298-302.
[7] UKKONEN E.On-line Construction of Suffix Trees[J].Algorithmica,1995,14(3):249-260.
[8] ABOUELHODA M I,KURTZ S,OHLEBUSCH E.ReplacingSuffix Trees with Enhanced Suffix Arrays[J].Journal of Discrete Algorithms,2004,2(1):53-86.
[9] FERRAGINA P,GROSSI R,MUTHUKRISHNAN S.The StringB-Tree:a New Data Structure for String Search in External Memeory and Its Applications[J].Journal of the ACM,1999,46(2):236-280.
[10] MANBER U,MYERS G.Suffix Arrays:a New Method for On-line String Searches[J].SIAM Journal on Computing,1993,22(5):935-948.
[11] PUGLISI S J,SMYTH W F,TURPIN A H.A taxonomy of suffix array construction algorithms[J].ACM Computing Surveys,2007,39(2):1-35.
[12] NONG G,ZHANG S,CHAN W H.Two Efficient Algorithms for Linear Time Suffix Array Construction[J].IEEE Transactions on Computers,2011,60(10):1471-1484.
[13] FRANEK F,SMYTH W F,TANG Y D.Computing all repeats using suffix arrays[J].Automata,Languages and Combinatorics,2003,8(4):579-591.
[14] PUGLISI S J,SMYTH W F,YUSUFU M.Fast Optimal Algorithms for Computing all the Repeats in a String[J].Mathema-tics in Computer Science,2010,3(4):373-389.
[15] FISCHER J,HEUN V,KRAMER S.Fast Frequent String Mi-ning Using Suffix Arrays[C]∥Proceedings of the Fifth IEEE International Conference on Data Mining.2005:609-612.
[16] ILIE L,SMYTH W F.Minimum Unique Substrings and Maximum Repeats[J].Fundamenta Informaticae,2011,110(1-4):183-195.
[17] PUGLISH S J,TURPIN A.Space-time tradeoffs for longest-common prefix array computation[C]∥Proceedings of 19th International Symposium on Algorithms and Computation.2008:124-135.
[18] 王镝,王国仁,吴青泉,等.DNA序列中基于后继数组索引的LPR查找算法[J].计算机研究与发展,2006,43(z3):195-199.
[19] 木妮娜·玉素甫,古丽娜·玉素甫,张海军.基于QSA数组计算序列中所有NE重复模式的算法[J].计算机科学,2014,41(3):249-252,262.
[20] KULEKCI M O,VITTER J S,XU B J.Efficient maximal repeat finding using the burrows-wheeler transform and wavelet tree[J].IEEE/ACM Transactions on Computational Biology and Bioinformatics,2012,9(2):421-429.
[21] 高强,张敬之,耿桦,等.基于重复模式的Web信息抽取[J].计算机科学,2007,34(4):210-212,1.
[22] CHANG C H,LUI S C.IEPAD:Information extraction based on pattern discovery[C]∥Proceedings of the 10th International Conference on World Wide Web.2001:223-231.
[23] MUROLO A,NORRIE M C.Revisiting Web Data ExtractionUsing In-Browser Structural Analysis and Visual Cues in Mo-dern Web Designs[C]∥Web Engineering,Lecture Notes in Computer Science.2016:114-131.
[24] 韩普,王泽.基于重复模式的论坛信息抽取研究[J].南京师范大学学报(工程技术版),2010,10(3):74-77.
[25] 高克宁,马安香,张斌,等.基于重复模式的Web信息语义表示方法的研究[J].小型微型计算机系统,2009,30(1):26-30.
[26] ZAMIR O,ETZIONI O,MADANI O,et al.Fast and intuitive clustering of Web documents[C]∥Proceedings of the KDD’97.1997:287-290.
[27] ALLAN J,et al.Topic Detection and Tracking:Event-Based Information Organization[D].Dordrecht:Kluwer Academic Publishers,2002.
[28] HASHIMOTOA K,KONTONATSIOSB G,M IWAC M,et al.Topic detection using paragraph vectors to support active lear-ning in systematic reviews[J].Journal of Biomedical Informa-tics,2016,62:59-65.
[29] FRANZ M,MC CARLEY J S.Unsupervised and supervisedclustering for topic tracking[C]∥Proceedings of the 24th An-nual International ACM SIGIR.New Orleans,Louisiana,USA:ACM,2001:310-317.
[30] XU R,WUNSCH D.Survey of Clustering Algorithms[J].IEEE Transactions on Neural Networks,2005,6(3):645-678.
[31] 胡吉祥,许洪波,刘悦,等.重复串特征提取算法及其在文本聚类中的应用[J].计算机工程,2007,33(2):65-67.
[32] ZAMIR O,ETZIONI O,MADANI O,et al.Fast and intuitive clustering of Web documents[C]∥Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining.New York:AAAI Press,1997:287-290.
[33] HONG Y,SAM K.Learning assignment order of instances for the constrained K-means clustering algorithm[J].IEEE Tran-sactions on Systems Man and Cybernetics Part B-Cybernetics,2009,39(2):568-574.
[34] HALL L O,GOLDGOF D B.On convergence properties of the single pass and online fuzzy c-means algorithm[C]∥2010 IEEE International Conference on Fuzzy Systems.Washington,DC:IEEE,2010:1-3.
[35] AIOLLI F,SAN-MARTINO G,H AGENBUCHNER M,et al.Learning nonsparse kernels by self organizing maps for structured data[J].IEEE Transactions on Neural Networks,2009, 20(12):1938-1949.
[36] ZAMIR O,ETZIONI O.Web Document Clustering:A Feasibility Demonstration[C]∥Proceedings of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval.Melbourne,Australia,1998:46-54.
[37] JANRUANG J,GUHA S.Semantic Suffix Tree Clustering[C]∥Proceedings of International Conference on Data Engineering and Internet Technology.2011:35-40.
[38] 林建敏,谢康林.基于PAT-array和模糊聚类的文本聚类方法[J].计算机工程,2004,30(12):126-127.
[39] 翟献民,田生伟,禹龙,等.面向维吾尔语文本的改进后缀树聚类[J].计算机应用,2012,32(4):1078-1081.
[40] ZHANG D,DONG Y S.Semantic,hierarchical,online clustering of web search results[C]∥Advanced Web Technologies and Applications,Lecture Notes in Computer Science.2004:69-78.
[41] YAMAMOTO M,CHURCH K W.Using Suffix Arrays to Com-pute Term Frequency and Document Frequency for All Substrings in a Corpus[J].Computational Linguistics,2001,27(1):1-30.
[42] LEE S D,DE RAEDT L.An efficient algorithm for mining stringdatabases under constraints[C]∥Knowledge Discovery in Inductive Databases,Lecture Notes in Computer Science.2005:108-129.
[43] BEIL F,ESTER M,XU X.Frequent term-based text clustering[C]∥Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.Edmonton,Alberta,Canada,2002:436-422.
[44] FUNG B C M,WANG K,ESTER M.Hierarchical Document Clustering Using Frequent Itemsets[C]∥Proceedings of the SIAM International Conference on Data Mining.2003.
[45] GURALNIK V,KARYPIS G.A scalable algorithm for clustering sequential data[C]∥Proceedings of the IEEE International Conference on Data Mining.2001:179-186.
[46] WANG J Y,ZHANG Y Z,ZHOU L Z,et al.Discriminating subsequence discovery for sequence clustering[C]∥Proceedings of the 7th International Conference on Data Mining.2007:605-610.
[47] GROSSI R,VITTER J S.Compressed suffix arrays and suffixtrees with applications to text indexing and string matching[J].SIAM Journal on Computing,2005,5(2):378-407.
[48] 刘小珠,彭智勇.全文索引技术时空效率分析[J].软件学报,2009,20(7):1768-1784.
[49] OKANOHARA D,SADAKANE K.An online space-efficient algorithm for searching the longest match string[M]∥Lecture Notes in Computer Science.2008:696-707.
[50] VU L,ALAGHBAND G.Novel parallel method for mining fre-quent patterns on multi-core shared memory systems[C]∥Proceedings of the International Workshop on Data-Intensive Scalable Computing Systems.2013:49-54.
[51] BEDATHUR S,HARITSA J R.Search-optimized persistentsuffix tree storage for biological applications[C]∥Proceedings of 12th IEEE International Conference on High Performance Computing.2005.
[52] MOEN S,AKSEHIRLI E,GOETHALS B.Frequent Itemset Mi-ning for Big Data[C]∥Proceedings of IEEE International Conference on Big Data.2013.
[53] MOENS S,AKSEHIRLI E,GOETHALS B.Frequent itemset mining for big data[C]∥IEEE International Conference on Big Data.2013:111-118.
[54] 阿力木江·艾沙,库尔班·吾布力,吐尔根·依布拉音.维吾尔文文本分类中若干问题的研究[M].Scientific Research Publishing,Inc.USA,2015.
[55] 崔世起,刘群,孟遥,等.基于大规模语料库的新词检测[J].计算机研究与发展,2006,43(5):927-932.
[56] LUO Z Y,SONG R.An Integrated Method for Chinese Unknown Word Extraction[C]∥Proceedings of the Third SIGHAN Workshop on Chinese Language Learning.Barcelona,Spain,2004:148-155.

No related articles found!
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!