Computer Science ›› 2018, Vol. 45 ›› Issue (11): 204-209, 230. doi: 10.11896/j.issn.1002-137X.2018.11.032

• Software & Database Technology •

Incremental Data Extraction Model Based on Variable Time-window

LIU Jie, WANG Gui-ling, ZUO Xiao-jiang   

  1. (Department of Computer, North China University of Technology, Beijing 100144, China)
    (Beijing Key Laboratory on Integration and Analysis of Large-scale Stream Data, Beijing 100144, China)
  • Received: 2017-09-16 Published: 2019-02-25

Abstract: Continuously extracting and integrating changed data from different data sources with an appropriate data extraction model is crucial for sharing data between heterogeneous systems and for building an incremental data warehouse for data analysis. The traditional timestamp-based changed-data-capture method suffers from an efficiency problem: whenever an exception occurs during data extraction, the whole extraction process fails. In that case, the database must be rolled back, which reduces extraction efficiency. To address this problem, this paper proposes an incremental data extraction model based on a variable time-window. Following the idea of a time window, the model extracts a small number of duplicate records and then de-duplicates them. The model reduces the impact of exceptions on data extraction, improves the reliability of timestamp-based incremental extraction in the ETL process, and improves the efficiency of data extraction.
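
The idea summarized in the abstract (re-read a short overlap of already-extracted records, then discard the duplicates) can be illustrated with a minimal Python sketch. This is not the authors' model: the record fields `id` and `ts`, the function names, and the in-memory `target` dictionary are all hypothetical, and because the abstract does not state how the window size is chosen, the window is simply passed in as a parameter.

```python
from datetime import datetime, timedelta


def extract_window(source_rows, watermark, window):
    """Select rows changed after (watermark - window).

    A non-zero window deliberately re-reads a few rows that an
    earlier run may already have loaded.
    """
    start = watermark - window
    return [row for row in source_rows if row["ts"] > start]


def deduplicate(rows, target):
    """Apply rows to target keyed by id, skipping rows already loaded.

    A row is applied only if its id is new or its timestamp is newer
    than the stored copy. Returns the number of rows applied.
    """
    applied = 0
    for row in rows:
        seen = target.get(row["id"])
        if seen is None or seen["ts"] < row["ts"]:
            target[row["id"]] = row
            applied += 1
    return applied


def run_cycle(source_rows, target, watermark, window):
    """Run one extraction cycle and return (rows applied, new watermark)."""
    rows = extract_window(source_rows, watermark, window)
    applied = deduplicate(rows, target)
    newest = max((row["ts"] for row in rows), default=watermark)
    return applied, max(newest, watermark)
```

In this sketch a caller might pass a zero window after a clean run and a wider one after a failed run, so that rows missed around the failure are picked up again without rolling back the target; that policy is an assumption made here for illustration only.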

Key words: Capture of changed data, Incremental extraction, Timestamp, ETL

CLC Number: 

  • TP311