Computer Science ›› 2010, Vol. 37 ›› Issue (6): 23-27.

• Computer Networks and Information Security •

Document Clustering Algorithm Based on Dynamic Interval Mapping

SUN Yong-lin (孙永林), LIU Zhong (刘仲)

  1. (College of Computer, National University of Defense Technology, Changsha 410073)
  • Online: 2018-12-01 Published: 2018-12-01
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (60503042).

Abstract: With the rapid digitization of information, archival storage has become a research hotspot, and space utilization and scalability are its key problems. Content-based chunk storage with data deduplication is an effective way to improve storage space utilization, but because archival data sets are huge, searching for shared chunks across all of the data is very inefficient. We introduce the idea of dynamic interval mapping into information clustering and propose DC-DIM, a Document Clustering algorithm based on Dynamic Interval Mapping. The algorithm uses chunking and feature extraction to generate a chunk feature set for each document, maps the feature set onto an interval chain, and chooses the document's storage container according to how its feature set is distributed over the chain, thereby clustering the documents. Documents with high content similarity (much shared content) are gathered together, which creates favorable conditions for chunk-based storage and simplifies data management.

Key words: Document clustering, Archival storage, Dynamic interval mapping, Space utilization, Scalability
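
The pipeline named in the abstract (chunking, feature extraction, mapping features onto intervals, choosing a container) can be illustrated with a minimal Python sketch. This is not the authors' DC-DIM implementation: the abstract does not specify the chunking method, the feature selection rule, or how intervals are adjusted dynamically. Fixed-size chunks, SHA-1 hashes, the "k smallest hashes" rule, majority voting, and the names `chunk_features`, `choose_container`, `HASH_SPACE`, and `boundaries` are all assumptions made for illustration, and the dynamic adjustment of intervals is omitted.

```python
import hashlib
from bisect import bisect_right
from collections import Counter

HASH_SPACE = 2 ** 32  # assumed size of the feature space (32-bit hash prefix)


def chunk_features(data: bytes, chunk_size: int = 8, k: int = 4) -> list[int]:
    """Split data into fixed-size chunks, hash each chunk, keep the k smallest hashes.

    Assumption: fixed-size chunking and min-hash style selection stand in for
    the chunking and feature extraction steps, which the abstract does not detail.
    """
    hashes = {
        int.from_bytes(hashlib.sha1(data[i:i + chunk_size]).digest()[:4], "big")
        for i in range(0, len(data), chunk_size)
    }
    return sorted(hashes)[:k]


def choose_container(features: list[int], boundaries: list[int]) -> int:
    """Return the index of the interval that receives the most features.

    `boundaries` is a sorted list of interior split points of [0, HASH_SPACE);
    interval i plays the role of storage container i (an assumed 1:1 mapping).
    """
    if not features:
        raise ValueError("document produced no features")
    votes = Counter(bisect_right(boundaries, f) for f in features)
    # Ties go to the lowest interval index so the choice is deterministic.
    return min(votes, key=lambda idx: (-votes[idx], idx))
```

Under these assumptions, two documents that share most of their chunks tend to share their smallest chunk hashes as well, so they tend to vote for the same interval and land in the same container. Deduplication then only needs to search for shared chunks inside that container rather than across the whole archive.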
