Computer Science (计算机科学), 2015, Vol. 42, Issue (10): 132-137.

• Information Security •

Sensitive Information Inference Method Based on Semi-supervised Document Clustering

SU Ying-bin, DU Xue-hui, XIA Chun-tao, CAO Li-feng and CHEN Hua-cheng

  1. PLA Information Engineering University, Zhengzhou 450001; State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou 450001 (first four authors); PLA Unit 73503, Fuzhou 350018 (fifth author)
  • Online: 2018-11-14 Published: 2018-11-14
  • Supported by:
    This work was supported by the National High Technology Research and Development Program of China (863 Program) (2012AA012704).

Abstract: Sensitive information leakage caused by aggregating and inferring over multiple documents is both high-risk and highly concealed. To address this problem, a sensitive information inference method based on semi-supervised document clustering is proposed. First, to obtain high-quality constraints at low time cost, a novel second-order constraint active learning algorithm is designed; it selects the sample points with the greatest uncertainty to generate the most informative constraint closures. Then, a new semi-supervised clustering algorithm is proposed that incorporates the constraints into DBSCAN; it effectively resolves the fuzzy-boundary problem of DBSCAN and improves the accuracy of document clustering. Finally, based on the semi-supervised clustering results, a possibility measure of sensitive information is computed over similar documents. Experiments show that the semi-supervised clustering algorithm improves accuracy markedly and that the inference method can effectively infer sensitive information.

Key words: Semi-supervised clustering, DBSCAN, Active learning, Sensitive information, Fuzzy mathematics, Inference method
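
The abstract gives no algorithmic detail, so the following is only a minimal sketch of the general idea of adding pairwise constraints to DBSCAN, not the authors' algorithm. It assumes constraints arrive as cannot-link index pairs and that a cluster simply refuses to absorb a point that conflicts with one of its current members. The names `constrained_dbscan` and `cannot_link` are hypothetical.

```python
from math import dist


def constrained_dbscan(points, eps, min_pts, cannot_link=()):
    """DBSCAN variant that never grows a cluster across a cannot-link pair.

    Illustrative assumption only: the paper's actual use of constraints
    is not specified in the abstract.
    """
    n = len(points)
    # For every point, the set of points it must not share a cluster with.
    forbidden = {i: set() for i in range(n)}
    for a, b in cannot_link:
        forbidden[a].add(b)
        forbidden[b].add(a)

    def neighbors(i):
        # eps-neighborhood of point i (includes i itself).
        return [j for j in range(n) if dist(points[i], points[j]) <= eps]

    labels = [None] * n  # None = unvisited, -1 = noise, 0.. = cluster id
    cluster_id = 0
    for i in range(n):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # not a core point; may become a border point later
            continue
        labels[i] = cluster_id
        members = {i}
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] not in (None, -1):
                continue  # already belongs to some cluster
            if forbidden[j] & members:
                continue  # adding j would violate a cannot-link constraint
            labels[j] = cluster_id
            members.add(j)
            nbrs = neighbors(j)
            if len(nbrs) >= min_pts:
                queue.extend(nbrs)  # j is a core point: keep expanding
        cluster_id += 1
    return labels


# Six evenly spaced points form one density-connected chain.
pts = [(float(x), 0.0) for x in range(6)]
print(constrained_dbscan(pts, eps=1.1, min_pts=2))
# [0, 0, 0, 0, 0, 0]
print(constrained_dbscan(pts, eps=1.1, min_pts=2, cannot_link=[(2, 3)]))
# [0, 0, 0, 1, 1, 1]
```

In this toy run, a single cannot-link pair placed where the two groups touch is enough to split the chain, which is one way constraints can sharpen a boundary that density alone leaves ambiguous.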

