Computer Science, 2017, Vol. 44, Issue (Z11): 55-60. doi: 10.11896/j.issn.1002-137X.2017.11A.010

• Intelligent Computing •

Research on Fuzzy Matching Duplicate Checking Algorithm Based on Matrix Model of Word Segmentation

LI Cheng-long (李成龙), YANG Dong-ju (杨冬菊) and HAN Yan-bo (韩燕波)

  1. Beijing Key Laboratory on Integration and Analysis of Large-scale Stream Data, Beijing 100144; Research Center for Cloud Computing, North China University of Technology, Beijing 100144
  • Online: 2018-12-01 Published: 2018-12-01
  • Supported by:
    General Program of the National Natural Science Foundation of China (61672042), "Research on data service models and methods supporting real-time linkage of streaming big data"

Abstract: To address the need for duplicate checking of Chinese text, the target text to be checked and the sample text are each converted, using the results of word segmentation, into a word-segmentation matrix model. The matrix is then scanned and analyzed to obtain the duplicate-checking result. A duplicate-checking algorithm is proposed on this basis, and worked examples show that it has a degree of practical usefulness.

Key words: Similarity, Matrix model of word segmentation, Fuzzy matching, Duplicate checking algorithm
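
The abstract does not define the matrix or the scanning procedure, so the following is only an illustrative sketch under stated assumptions, not the authors' algorithm. It assumes the matrix has one row per target token and one column per sample token, with a cell set to 1 when the two tokens are equal, and that "scanning" means looking for diagonal runs of 1s, which correspond to token sequences shared by both texts. The function names, the `min_run` threshold, and the hard-coded token lists are all hypothetical; in practice the tokens would come from a Chinese word segmenter. The sketch uses exact token equality and does not model whatever tolerance the paper's fuzzy matching provides.

```python
from typing import List, Tuple


def build_match_matrix(target: List[str], sample: List[str]) -> List[List[int]]:
    """Cell [i][j] is 1 when target token i equals sample token j, else 0 (assumed layout)."""
    return [[1 if t == s else 0 for s in sample] for t in target]


def scan_diagonals(matrix: List[List[int]], min_run: int = 2) -> List[Tuple[int, int, int]]:
    """Return (target_start, sample_start, length) for each maximal diagonal run of 1s
    that is at least min_run tokens long."""
    runs = []
    rows = len(matrix)
    cols = len(matrix[0]) if rows else 0
    for i in range(rows):
        for j in range(cols):
            if matrix[i][j] != 1:
                continue
            # Skip cells that continue a run started further up the same diagonal.
            if i > 0 and j > 0 and matrix[i - 1][j - 1] == 1:
                continue
            length = 0
            while (i + length < rows and j + length < cols
                   and matrix[i + length][j + length] == 1):
                length += 1
            if length >= min_run:
                runs.append((i, j, length))
    return runs


def duplicate_ratio(target: List[str], sample: List[str], min_run: int = 2) -> float:
    """Fraction of target tokens covered by at least one matched run."""
    if not target:
        return 0.0
    covered = set()
    for start, _, length in scan_diagonals(build_match_matrix(target, sample), min_run):
        covered.update(range(start, start + length))
    return len(covered) / len(target)


# Hypothetical pre-segmented inputs; a real pipeline would obtain these from a segmenter.
target_tokens = ["基于", "分词", "矩阵", "模型", "的", "查重", "算法"]
sample_tokens = ["一种", "分词", "矩阵", "模型", "与", "查重", "算法"]

print(scan_diagonals(build_match_matrix(target_tokens, sample_tokens)))
print(duplicate_ratio(target_tokens, sample_tokens))
```

With these inputs the scan reports two shared runs, one of three tokens and one of two, so five of the seven target tokens are flagged as duplicated. Requiring a minimum run length is one simple way to keep isolated common words from counting as duplication.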

