重复模式识别算法及在Web信息抽取和聚类分析中的应用

doi:10.11896/j.issn.1002-137X.2017.11A.007

摘要/Abstract

摘要： 序列中的重复模式识别算法及应用研究是数据挖掘领域的重要问题,是提取序列中有用信息的主要手段之一。近年来,针对各种重复模式定义、有效的识别算法设计以及重复模式识别算法在有关领域中的应用有了很多研究成果。文中对序列中重复模式的类型与特点作了描述,讨论了识别算法中常用的数据结构,以分类的方式重点回顾并总结了近年来重复模式在一些相关领域中的应用及相关算法的设计思路与技巧,并从加入的领域知识及约束、识别结果与算法扩充性、存在的主要问题等方面进行了讨论,其中包括在网络信息抽取、Web文档特征提取与聚类算法及相关的维文信息处理等领域中的应用。最后,讨论了关于序列重复模式识别算法在各个相关领域中的应用研究所面临的挑战,并探讨了未来的研究方向。

关键词: 重复模式,Web文档特征,网络信息抽取,聚类算法,维文信息处理

Abstract: Detection of repetitive patterns in sequences and their applications have become fundamental research areas in data mining.It is a very important means for extracting useful information from sequences.In recent years,many works have been conducted focusing on the definitions of repetitive patterns,efficient recognition algorithm designs,and applications in the relevant areas.In this paper,the classification and characteristics of repetitive patterns in sequence were briefly described,and the commonly used data structures in the algorithms were discussed.Recent studies on the applications of detection algorithms in relevant fields and their main design ideas were reviewed by discussing and evaluating certain aspects.These aspects include the field knowledge and constraints,identifying results,scalabilities of algorithms and existing main problems.Application areas include Web information extraction,Web document feature extraction and clustering algorithms,and related Uyghur language information processing.Finally,some challenges on detection algorithms of repetitive patterns in sequences and applications in various fields were discussed and future research trends were also explored.

Key words: Repetitive patterns,Web document features,Web information extraction,Clustering algorithms,Uyghur information processing

木妮娜·玉素甫,古丽娜·玉素甫. 重复模式识别算法及在Web信息抽取和聚类分析中的应用[J]. 计算机科学, 2017, 44(Z11): 39-45. https://doi.org/10.11896/j.issn.1002-137X.2017.11A.007

Munina YUSUFU and Gulina YUSUFU. Repetitive Pattern Recognition Algorithms and Applications in Web Information Extraction and Clustering Analysis[J]. Computer Science, 2017, 44(Z11): 39-45. https://doi.org/10.11896/j.issn.1002-137X.2017.11A.007

参考文献

[1] LANDER E S,LINTON L M,BIRREN B,et al.Initial Sequencing and Analysis of the Human Genome[J].Nature,2001,409(6822):860-921.
[2] MAKALOWSKI W.Not junk after all[J].Science,2003,300(5623):1246-1247.
[3] Int’l Human Genome Sequencing Consortium.Finishing theeuchromatic sequence of the human genome[J].Nature,2004,431(7011):931-945.
[4] SHAPIRO J A,STERNBERG R V.Why repetitive DNA is essential to genome function[J].Biological Reviews,2005,80(2):227.
[5] TREANGEN T J,SALZBERG S L.Repetitive DNA and next-generation sequencing:Computational challenges and solutions [J].Nature Reviews Genetics,2011,13(1):36-46.
[6] SMYTH W F,YUSUFU M.Computing Regularities in Strings[C]∥Proceedings of the 2nd IEEE International Conference on Computer Science and Information Technology.2009:298-302.
[7] UKKONEN E.On-line Construction of Suffix Trees[J].Algorithmica,1995,14(3):249-260.
[8] ABOUELHODA M I,KURTZ S,OHLEBUSCH E.ReplacingSuffix Trees with Enhanced Suffix Arrays[J].Journal of Discrete Algorithms,2004,2(1):53-86.
[9] FERRAGINA P,GROSSI R,MUTHUKRISHNAN S.The StringB-Tree:a New Data Structure for String Search in External Memeory and Its Applications[J].Journal of the ACM,1999,46(2):236-280.
[10] MANBER U,MYERS G.Suffix Arrays:a New Method for On-line String Searches[J].SIAM Journal on Computing,1993,22(5):935-948.
[11] PUGLISI S J,SMYTH W F,TURPIN A H.A taxonomy of suffix array construction algorithms[J].ACM Computing Surveys,2007,39(2):1-35.
[12] NONG G,ZHANG S,CHAN W H.Two Efficient Algorithms for Linear Time Suffix Array Construction[J].IEEE Transactions on Computers,2011,60(10):1471-1484.
[13] FRANEK F,SMYTH W F,TANG Y D.Computing all repeats using suffix arrays[J].Automata,Languages and Combinatorics,2003,8(4):579-591.
[14] PUGLISI S J,SMYTH W F,YUSUFU M.Fast Optimal Algorithms for Computing all the Repeats in a String[J].Mathema-tics in Computer Science,2010,3(4):373-389.
[15] FISCHER J,HEUN V,KRAMER S.Fast Frequent String Mi-ning Using Suffix Arrays[C]∥Proceedings of the Fifth IEEE International Conference on Data Mining.2005:609-612.
[16] ILIE L,SMYTH W F.Minimum Unique Substrings and Maximum Repeats[J].Fundamenta Informaticae,2011,110(1-4):183-195.
[17] PUGLISH S J,TURPIN A.Space-time tradeoffs for longest-common prefix array computation[C]∥Proceedings of 19th International Symposium on Algorithms and Computation.2008:124-135.
[18] 王镝,王国仁,吴青泉,等.DNA序列中基于后继数组索引的LPR查找算法[J].计算机研究与发展,2006,43(z3):195-199.
[19] 木妮娜·玉素甫,古丽娜·玉素甫,张海军.基于QSA数组计算序列中所有NE重复模式的算法[J].计算机科学,2014,41(3):249-252,262.
[20] KULEKCI M O,VITTER J S,XU B J.Efficient maximal repeat finding using the burrows-wheeler transform and wavelet tree[J].IEEE/ACM Transactions on Computational Biology and Bioinformatics,2012,9(2):421-429.
[21] 高强,张敬之,耿桦,等.基于重复模式的Web信息抽取[J].计算机科学,2007,34(4):210-212,1.
[22] CHANG C H,LUI S C.IEPAD:Information extraction based on pattern discovery[C]∥Proceedings of the 10th International Conference on World Wide Web.2001:223-231.
[23] MUROLO A,NORRIE M C.Revisiting Web Data ExtractionUsing In-Browser Structural Analysis and Visual Cues in Mo-dern Web Designs[C]∥Web Engineering,Lecture Notes in Computer Science.2016:114-131.
[24] 韩普,王泽.基于重复模式的论坛信息抽取研究[J].南京师范大学学报(工程技术版),2010,10(3):74-77.
[25] 高克宁,马安香,张斌,等.基于重复模式的Web信息语义表示方法的研究[J].小型微型计算机系统,2009,30(1):26-30.
[26] ZAMIR O,ETZIONI O,MADANI O,et al.Fast and intuitive clustering of Web documents[C]∥Proceedings of the KDD’97.1997:287-290.
[27] ALLAN J,et al.Topic Detection and Tracking:Event-Based Information Organization[D].Dordrecht:Kluwer Academic Publishers,2002.
[28] HASHIMOTOA K,KONTONATSIOSB G,M IWAC M,et al.Topic detection using paragraph vectors to support active lear-ning in systematic reviews[J].Journal of Biomedical Informa-tics,2016,62:59-65.
[29] FRANZ M,MC CARLEY J S.Unsupervised and supervisedclustering for topic tracking[C]∥Proceedings of the 24th An-nual International ACM SIGIR.New Orleans,Louisiana,USA:ACM,2001:310-317.
[30] XU R,WUNSCH D.Survey of Clustering Algorithms[J].IEEE Transactions on Neural Networks,2005,6(3):645-678.
[31] 胡吉祥,许洪波,刘悦,等.重复串特征提取算法及其在文本聚类中的应用[J].计算机工程,2007,33(2):65-67.
[32] ZAMIR O,ETZIONI O,MADANI O,et al.Fast and intuitive clustering of Web documents[C]∥Proceedings of the 3rd International Conference on Knowledge Discovery and Data Mining.New York:AAAI Press,1997:287-290.
[33] HONG Y,SAM K.Learning assignment order of instances for the constrained K-means clustering algorithm[J].IEEE Tran-sactions on Systems Man and Cybernetics Part B-Cybernetics,2009,39(2):568-574.
[34] HALL L O,GOLDGOF D B.On convergence properties of the single pass and online fuzzy c-means algorithm[C]∥2010 IEEE International Conference on Fuzzy Systems.Washington,DC:IEEE,2010:1-3.
[35] AIOLLI F,SAN-MARTINO G,H AGENBUCHNER M,et al.Learning nonsparse kernels by self organizing maps for structured data[J].IEEE Transactions on Neural Networks,2009, 20(12):1938-1949.
[36] ZAMIR O,ETZIONI O.Web Document Clustering:A Feasibility Demonstration[C]∥Proceedings of the 21st International ACM SIGIR Conference on Research and Development in Information Retrieval.Melbourne,Australia,1998:46-54.
[37] JANRUANG J,GUHA S.Semantic Suffix Tree Clustering[C]∥Proceedings of International Conference on Data Engineering and Internet Technology.2011:35-40.
[38] 林建敏,谢康林.基于PAT-array和模糊聚类的文本聚类方法[J].计算机工程,2004,30(12):126-127.
[39] 翟献民,田生伟,禹龙,等.面向维吾尔语文本的改进后缀树聚类[J].计算机应用,2012,32(4):1078-1081.
[40] ZHANG D,DONG Y S.Semantic,hierarchical,online clustering of web search results[C]∥Advanced Web Technologies and Applications,Lecture Notes in Computer Science.2004:69-78.
[41] YAMAMOTO M,CHURCH K W.Using Suffix Arrays to Com-pute Term Frequency and Document Frequency for All Substrings in a Corpus[J].Computational Linguistics,2001,27(1):1-30.
[42] LEE S D,DE RAEDT L.An efficient algorithm for mining stringdatabases under constraints[C]∥Knowledge Discovery in Inductive Databases,Lecture Notes in Computer Science.2005:108-129.
[43] BEIL F,ESTER M,XU X.Frequent term-based text clustering[C]∥Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.Edmonton,Alberta,Canada,2002:436-422.
[44] FUNG B C M,WANG K,ESTER M.Hierarchical Document Clustering Using Frequent Itemsets[C]∥Proceedings of the SIAM International Conference on Data Mining.2003.
[45] GURALNIK V,KARYPIS G.A scalable algorithm for clustering sequential data[C]∥Proceedings of the IEEE International Conference on Data Mining.2001:179-186.
[46] WANG J Y,ZHANG Y Z,ZHOU L Z,et al.Discriminating subsequence discovery for sequence clustering[C]∥Proceedings of the 7th International Conference on Data Mining.2007:605-610.
[47] GROSSI R,VITTER J S.Compressed suffix arrays and suffixtrees with applications to text indexing and string matching[J].SIAM Journal on Computing,2005,5(2):378-407.
[48] 刘小珠,彭智勇.全文索引技术时空效率分析[J].软件学报,2009,20(7):1768-1784.
[49] OKANOHARA D,SADAKANE K.An online space-efficient algorithm for searching the longest match string[M]∥Lecture Notes in Computer Science.2008:696-707.
[50] VU L,ALAGHBAND G.Novel parallel method for mining fre-quent patterns on multi-core shared memory systems[C]∥Proceedings of the International Workshop on Data-Intensive Scalable Computing Systems.2013:49-54.
[51] BEDATHUR S,HARITSA J R.Search-optimized persistentsuffix tree storage for biological applications[C]∥Proceedings of 12th IEEE International Conference on High Performance Computing.2005.
[52] MOEN S,AKSEHIRLI E,GOETHALS B.Frequent Itemset Mi-ning for Big Data[C]∥Proceedings of IEEE International Conference on Big Data.2013.
[53] MOENS S,AKSEHIRLI E,GOETHALS B.Frequent itemset mining for big data[C]∥IEEE International Conference on Big Data.2013:111-118.
[54] 阿力木江·艾沙,库尔班·吾布力,吐尔根·依布拉音.维吾尔文文本分类中若干问题的研究[M].Scientific Research Publishing,Inc.USA,2015.
[55] 崔世起,刘群,孟遥,等.基于大规模语料库的新词检测[J].计算机研究与发展,2006,43(5):927-932.
[56] LUO Z Y,SONG R.An Integrated Method for Chinese Unknown Word Extraction[C]∥Proceedings of the Third SIGHAN Workshop on Chinese Language Learning.Barcelona,Spain,2004:148-155.

Metrics

Viewed

Full text

Abstract

Cited

Shared

Discussed