Computer Science ›› 2020, Vol. 47 ›› Issue (4): 36-41. doi: 10.11896/jsjkx.190300070
杨宗霖1, 李天瑞1,2, 刘胜久1, 殷成凤1, 贾真1, 珠杰3
YANG Zong-lin1, LI Tian-rui1,2, LIU Sheng-jiu1, YIN Cheng-feng1, JIA Zhen1, ZHU Jie3
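Abstract: The rapid growth of the Internet has produced massive volumes of online text, which poses new performance challenges for traditional serial text proofreading algorithms. Although automatic text proofreading has received considerable attention in recent years, most related work focuses on serial algorithms, and few studies address the parallelization of proofreading. This paper first generalizes serial proofreading algorithms into a general framework for serial proofreading. Then, to address the long running time of serial algorithms on large-scale text, it proposes three general methods for parallelizing text proofreading: 1) thread-parallel proofreading based on multithreading, which uses a thread pool to parallelize across paragraphs and across proofreading functions at the same time; 2) batch-parallel proofreading based on Spark MapReduce, which proofreads paragraphs in parallel through RDD parallel computation; 3) streaming-parallel proofreading based on the Spark Streaming stream computing framework, which converts the real-time computation over a text stream into a series of small, time-sliced batch jobs, effectively avoiding fixed overhead and markedly shortening proofreading latency. Because stream computing combines low latency with high throughput, the paper finally adopts streaming proofreading to build a parallel proofreading system. Performance comparison experiments show that thread parallelism suits small-scale text, that batch parallelism suits offline proofreading of large-scale text, and that streaming-parallel proofreading eliminates about 110 s of fixed latency. Compared with batch proofreading, streaming proofreading with the Streaming computing framework achieves a substantial performance improvement.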
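The first method, parallelizing across both paragraphs and proofreading functions with a thread pool, can be illustrated with a minimal sketch. It is not the authors' implementation: the two checkers (`check_double_spaces`, `check_repeated_words`), the `proofread_parallel` helper, and the sample paragraphs are hypothetical stand-ins, since the abstract does not specify the actual proofreading functions.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical proofreading functions (assumptions, not from the paper).
def check_double_spaces(paragraph):
    """Return the offsets at which two consecutive spaces begin."""
    return [("double_space", i) for i in range(len(paragraph) - 1)
            if paragraph[i] == " " and paragraph[i + 1] == " "]

def check_repeated_words(paragraph):
    """Return words that immediately repeat the previous word."""
    words = paragraph.split()
    return [("repeated_word", w) for prev, w in zip(words, words[1:])
            if prev.lower() == w.lower()]

CHECKERS = [check_double_spaces, check_repeated_words]

def proofread_parallel(paragraphs, checkers=CHECKERS, max_workers=4):
    """Submit one task per (paragraph, checker) pair to a thread pool.

    Returns one error list per paragraph, in input order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Every paragraph/checker combination becomes an independent task,
        # so paragraphs and proofreading functions run concurrently.
        futures = [[pool.submit(check, p) for check in checkers]
                   for p in paragraphs]
        return [[err for f in row for err in f.result()] for row in futures]

paragraphs = ["This is is a test.", "Two  spaces here.", "Clean paragraph."]
results = proofread_parallel(paragraphs)
for paragraph, errors in zip(paragraphs, results):
    print(repr(paragraph), errors)
```

The sketch shows only the task decomposition. In CPython, CPU-bound checkers gain little from threads because of the global interpreter lock, so a real system would need checkers that release the lock or a runtime with true thread parallelism.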