Computer Science, 2016, Vol. 43, Issue (8): 95-99. doi: 10.11896/j.issn.1002-137X.2016.08.020

• Information Security •

Research on a Ciphertext Similarity Measurement Method Based on Keyword Re-extraction

LI Zhi-hua, CHEN Chao-qun, LI Cun, HU Zhen-yu, ZHANG Hua-wei

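  1. Department of Computer Science, School of IOT Engineering, Jiangnan University, Wuxi 214122, China (all five authors)
  • Published: 2018-12-01 Online: 2018-12-01
  • Funding:
    This work was supported by the Prospective Industry-University-Research Project of the Jiangsu Provincial Department of Science and Technology (BY2013015-23).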

Similarity Measure Algorithm of Cipher-text Based on Re-extracted Keywords

LI Zhi-hua, CHEN Chao-qun, LI Cun, HU Zhen-yu and ZHANG Hua-wei   

  • Online:2018-12-01 Published:2018-12-01

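Summary: To address the problem of measuring similarity between ciphertexts, a new method for measuring the similarity of ciphertext documents is proposed. By defining the concepts of a keyword's effective scope, relative scope, and distributed scope, the method overcomes a shortcoming of existing keyword-weighting schemes, which cannot weight keywords fairly across documents of different length and structure, and it significantly reduces the number of keywords on which the similarity measurement depends. The keywords of each document are then re-extracted, a keyword ciphertext index entry is built for the document, and the similarity of ciphertexts is measured through these index entries. Experiments on real documents, with comparison against other algorithms, show that the proposed method outperforms the compared algorithms in both accuracy and recall and strikes a good balance between the two.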

Keywords: keyword re-extraction, similarity measurement, ciphertext, scope

Abstract: To address the problem of measuring similarity or dissimilarity between cipher texts, a new similarity measure algorithm for cipher text based on re-extracted keywords, called SMCTBRK, is proposed. By defining the new concepts of the effective scope, relative scope, and distributed scope of keywords, and by re-extracting the keywords in documents, SMCTBRK constructs an encryption index item for each compared document from a smaller number of re-extracted keywords. The encryption index item is organized as a feature vector. SMCTBRK then computes the similarity between cipher texts from the encryption index items instead of from the separate keywords. Experiments on real documents show that SMCTBRK is more promising than the Shingling algorithm and the Simhash algorithm in terms of accuracy and recall ratio.

LI Zhi-hua, CHEN Chao-qun, LI Cun, HU Zhen-yu, ZHANG Hua-wei (Department of Computer Science, School of IOT Engineering, Jiangnan University, Wuxi 214122, China)
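
The idea of comparing documents through encrypted keyword index entries, rather than through plaintext keywords, can be illustrated with a minimal sketch. This is not the SMCTBRK algorithm itself: the abstract does not give the scope-based weighting formulas, the cipher, or the similarity function, so the keyword weights below are made-up inputs, HMAC-SHA256 stands in for whatever encryption the paper uses, cosine similarity is an assumed measure, and all function names are hypothetical.

```python
import hashlib
import hmac
import math
from typing import Dict


def encrypt_keyword(keyword: str, key: bytes) -> str:
    """Return a deterministic keyed token for a keyword.

    Assumption: HMAC-SHA256 is a stand-in for the paper's cipher.
    """
    return hmac.new(key, keyword.encode("utf-8"), hashlib.sha256).hexdigest()


def build_index_entry(weighted_keywords: Dict[str, float], key: bytes) -> Dict[str, float]:
    """Build a feature vector keyed by encrypted keyword tokens.

    The weights are taken as given; the scope-based weighting described
    in the abstract is not reproduced here.
    """
    return {encrypt_keyword(k, key): w for k, w in weighted_keywords.items()}


def cosine_similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity of two sparse vectors (assumed measure)."""
    shared = a.keys() & b.keys()
    dot = sum(a[t] * b[t] for t in shared)
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    key = b"shared-secret-key"  # hypothetical key shared by the indexing parties
    doc_a = build_index_entry({"cipher": 0.6, "index": 0.3, "keyword": 0.1}, key)
    doc_b = build_index_entry({"cipher": 0.5, "keyword": 0.4, "cloud": 0.1}, key)
    # The comparison only ever sees the keyed tokens, never the plaintext keywords.
    print(cosine_similarity(doc_a, doc_b))
```

Because the tokens are deterministic under one key, equal keywords map to equal tokens, which is what lets the comparison run over the index entries alone. The same property leaks keyword equality across documents, a trade-off that any real design would need to weigh.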
