Computer Science, 2018, Vol. 45, Issue (11): 204-209, 230. doi: 10.11896/j.issn.1002-137X.2018.11.032

• Software & Database Technology •

Incremental Data Extraction Model Based on Variable Time-window

LIU Jie, WANG Gui-ling, ZUO Xiao-jiang   

  1. (Department of Computer, North China University of Technology, Beijing 100144, China)
    (Beijing Key Laboratory on Integration and Analysis of Large-scale Stream Data, Beijing 100144, China)
  • Received: 2017-09-16 Published: 2019-02-25

Abstract: Continuously extracting and integrating changed data from different data sources with an appropriate data extraction model is crucial for sharing data among heterogeneous systems and for building incremental data warehouses for analysis. The traditional timestamp-based changed-data-capture method suffers from an efficiency problem: if an exception occurs during data extraction, the whole extraction process fails and the database must be rolled back, which reduces extraction efficiency. To address this problem, this paper proposes an incremental data extraction model based on a variable time-window. Following the time-window idea, the model extracts a small number of repeated records and then de-duplicates them. The model reduces the impact of exceptions on data extraction, improves the reliability of timestamp-based incremental extraction in the ETL process, and improves the efficiency of data extraction.
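The time-window idea described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract gives no algorithmic detail, so the overlap rule, the function names `extract_window` and `merge_deduplicated`, and the record fields `id` and `updated_at` are all assumptions made for illustration. The sketch assumes each extraction window starts slightly before the last successful extraction point, so a few records are read twice, and that duplicates are then removed by primary key.

```python
from datetime import datetime, timedelta


def extract_window(rows, last_success, now, overlap):
    """Return rows changed in [last_success - overlap, now).

    Assumption: the window start is derived from the last SUCCESSFUL
    extraction point, so after a failed run the next window simply
    becomes longer instead of requiring a rollback.
    """
    start = last_success - overlap
    return [r for r in rows if start <= r["updated_at"] < now]


def merge_deduplicated(target, extracted):
    """Upsert extracted rows into target (a dict keyed by primary key).

    A row read twice because of the overlap overwrites itself, so the
    target ends up with one copy per key, holding the newest version.
    """
    for row in extracted:
        current = target.get(row["id"])
        if current is None or row["updated_at"] >= current["updated_at"]:
            target[row["id"]] = row
    return target


if __name__ == "__main__":
    t0 = datetime(2018, 1, 1)
    source = [
        {"id": 1, "updated_at": t0 + timedelta(minutes=1)},
        {"id": 2, "updated_at": t0 + timedelta(minutes=9)},
        {"id": 3, "updated_at": t0 + timedelta(minutes=12)},
    ]
    overlap = timedelta(minutes=2)  # hypothetical overlap size
    warehouse = {}

    # First run covers up to minute 10; second run re-reads minutes 8-10.
    for last, now in [(t0, t0 + timedelta(minutes=10)),
                      (t0 + timedelta(minutes=10), t0 + timedelta(minutes=20))]:
        batch = extract_window(source, last, now, overlap)
        merge_deduplicated(warehouse, batch)

    print(sorted(warehouse))
```

In this sketch the record with `id` 2 is extracted in both runs, and the keyed upsert keeps a single copy. How the paper actually sizes or varies the window is not stated in the abstract.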

Key words: Capture of changed data, Incremental extraction, Timestamp, ETL

CLC Number: TP311