Computer Science ›› 2018, Vol. 45 ›› Issue (11): 204-209. doi: 10.11896/j.issn.1002-137X.2018.11.032
刘杰, 王桂玲, 左小将
LIU Jie, WANG Gui-ling, ZUO Xiao-jiang
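Abstract: Continuously extracting and integrating changed data from source systems, on the basis of a suitable data extraction model, is key to sharing and fusing data across heterogeneous systems and to building incremental data warehouses for data analysis. The traditional timestamp-based change data capture approach has a weakness: an exception during extraction can invalidate the extraction, which in turn lowers extraction efficiency. To address this, the paper borrows the idea of a time window and, by first extracting a small number of duplicate records and then deduplicating them, improves the traditional timestamp-based incremental data capture model, proposing an incremental data extraction model based on a variable time window. The model reduces the impact of exceptions on data extraction, makes the timestamp-based incremental extraction ETL process more reliable, and improves extraction efficiency to some extent.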
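The abstract does not give the details of the model, so the following is only a minimal sketch of the general idea it describes: re-read a small overlap before the last watermark, then deduplicate. Every name here (`extract`, `load_deduplicated`, `next_overlap`, `Target`) is hypothetical. The rule that the window doubles after a failed run and resets after a successful one is an assumption made for illustration, not the authors' method.

```python
from dataclasses import dataclass, field


@dataclass
class Target:
    # Hypothetical target table: id -> (timestamp, value)
    rows: dict = field(default_factory=dict)


def extract(source, watermark, overlap):
    """Select rows changed after (watermark - overlap).

    The overlap deliberately re-reads a few already-loaded rows so that
    rows missed by an earlier, interrupted run are picked up again.
    """
    lower = watermark - overlap
    return [r for r in source if r["ts"] > lower]


def load_deduplicated(target, batch):
    """Upsert the batch, keeping the newest version per id.

    Rows already present with the same or a newer timestamp are the
    duplicates produced by the overlap and are skipped.
    Returns the number of rows actually applied.
    """
    applied = 0
    for r in sorted(batch, key=lambda r: r["ts"]):
        current = target.rows.get(r["id"])
        if current is None or current[0] < r["ts"]:
            target.rows[r["id"]] = (r["ts"], r["value"])
            applied += 1
    return applied


def next_overlap(overlap, failed, base=5, limit=60):
    """Assumed window policy: double after a failure, reset after success."""
    return min(overlap * 2, limit) if failed else base


# First run: everything is new.
source = [
    {"id": 1, "ts": 10, "value": "a"},
    {"id": 2, "ts": 20, "value": "b"},
    {"id": 3, "ts": 28, "value": "c"},
]
target = Target()
first_applied = load_deduplicated(target, extract(source, watermark=0, overlap=5))
watermark = 28

# A row with an older timestamp shows up late (for example, after an exception).
source.append({"id": 4, "ts": 26, "value": "d"})
missed_by_plain_filter = [r for r in source if r["ts"] > watermark]

# Second run with the overlap: rows 3 and 4 are read, only row 4 is applied.
batch2 = extract(source, watermark=watermark, overlap=5)
second_applied = load_deduplicated(target, batch2)
```

In this sketch a plain `ts > watermark` filter would never see row 4, while the overlapping window reads it at the cost of re-reading row 3, which the deduplication step then discards.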