Computer Science, 2018, Vol. 45, Issue (11): 204-209. doi: 10.11896/j.issn.1002-137X.2018.11.032

• Software and Database Technology •

Incremental Data Extraction Model Based on Variable Time-window

LIU Jie, WANG Gui-ling, ZUO Xiao-jiang

  1. (Department of Computer, North China University of Technology, Beijing 100144, China)
    (Beijing Key Laboratory on Integration and Analysis of Large-scale Stream Data, Beijing 100144, China)
  • Received: 2017-09-16 Published: 2019-02-25
  • About the authors: LIU Jie (born 1993), male, master's student; his main research interests are large-scale stream data processing and big data analysis. WANG Gui-ling (born 1978), female, associate research fellow, master's supervisor, CCF member; her main research interests include service computing, service-oriented data integration, and large-scale stream data processing and integration; E-mail: wangguiling@ict.ac.cn (corresponding author). ZUO Xiao-jiang (born 1992), male, master's student; his main research interests are service computing and big data processing.
  • Funding:
    This work was supported by the Beijing Natural Science Foundation (4172018).

Abstract: Continuously extracting and integrating changed data from multiple source systems, based on an appropriate data extraction model, is key to sharing and fusing data across heterogeneous systems and to building an incremental data warehouse for analysis. The traditional timestamp-based changed-data-capture method has a weakness: if an exception occurs during extraction, the whole extraction run fails and the database must be rolled back, which reduces extraction efficiency. To address this problem, this paper borrows the idea of a time window and improves the traditional timestamp-based incremental capture model by first extracting a small number of duplicate records and then de-duplicating them, yielding an incremental data extraction model based on a variable time window. The model reduces the impact of exceptions on data extraction, makes the timestamp-based incremental ETL process more reliable, and improves extraction efficiency to some extent.

Key words: Capture of changed data, ETL, Incremental extraction, Timestamp
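
The abstract states the idea but not the model's details, so the following is only a minimal Python sketch of one way it could work, not the authors' implementation. It assumes each source row carries a primary key and a last-modified timestamp, and that the extraction window starts a fixed margin before the end of the last successful run. The names `Row`, `WindowedExtractor`, `overlap`, and `fail_after` are all hypothetical. Because the window start moves only after a run succeeds, a failed run widens the next window, and the de-duplication step discards whatever is read twice.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Row:
    key: int         # primary key in the source table
    updated_at: int  # last-modified timestamp of the row
    value: str


@dataclass
class WindowedExtractor:
    overlap: int = 5       # hypothetical safety margin that is re-read on every run
    last_success: int = 0  # upper bound of the last run that loaded completely
    target: dict = field(default_factory=dict)  # key -> Row, stands in for the warehouse

    def run(self, source, now, fail_after=None):
        """Extract rows in (last_success - overlap, now] and load them idempotently.

        fail_after simulates an exception once that many rows were processed.
        Returns (rows_extracted, rows_actually_written).
        """
        # The window start depends only on the last *successful* run, so a
        # failed run automatically widens the next window; nothing is rolled back.
        start = self.last_success - self.overlap
        batch = [r for r in source if start < r.updated_at <= now]
        written = 0
        for i, row in enumerate(batch):
            if fail_after is not None and i == fail_after:
                raise RuntimeError("simulated failure during extraction")
            current = self.target.get(row.key)
            # De-duplication: skip rows the target already holds in this version.
            if current is None or current.updated_at < row.updated_at:
                self.target[row.key] = row
                written += 1
        self.last_success = now
        return len(batch), written


ex = WindowedExtractor(overlap=5)
source = [Row(1, 10, "a"), Row(2, 20, "b")]
print(ex.run(source, now=20))  # (2, 2)

source = [Row(2, 20, "b"), Row(3, 25, "c"), Row(1, 28, "a2")]
try:
    ex.run(source, now=30, fail_after=2)  # fails after a partial load
except RuntimeError:
    pass
print(ex.run(source, now=40))  # (3, 1): the two re-read duplicates are dropped
```

In this sketch the retry re-reads three rows but writes only the one that the failed run never reached, which is the trade the abstract describes: a few duplicate reads in exchange for not having to roll back.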

CLC number:

  • TP311