Streaming Parallel Text Proofreading Based on Spark Streaming

doi:10.11896/jsjkx.190300070

Computer Science (计算机科学), 2020, Vol. 47, Issue (4): 36-41. doi: 10.11896/jsjkx.190300070

YANG Zong-lin (杨宗霖)¹, LI Tian-rui (李天瑞)¹,², LIU Sheng-jiu (刘胜久)¹, YIN Cheng-feng (殷成凤)¹, JIA Zhen (贾真)¹, ZHU Jie (珠杰)³

1 School of Information Science and Technology, Southwest Jiaotong University, Chengdu 611756, China;
2 Institute of Artificial Intelligence, Southwest Jiaotong University, Chengdu 611756, China;
3 Department of Computer Science, Tibet University, Lhasa 850000, China

Received: 2019-03-16 Online: 2020-04-15 Published: 2020-04-15
Corresponding author: LI Tian-rui (trli@swjtu.edu.cn), born in 1969, Ph.D., professor, Ph.D. supervisor, is an outstanding member of CCF. His main research interests include cloud computing, data mining and artificial intelligence.
About the author: YANG Zong-lin, born in 1994. His main research interests include Chinese text proofreading and parallel computing.
Supported by:
National Natural Science Foundation of China (61573292) and the Science and Technology Service Industry Demonstration Project of Sichuan Province (2016GFW0167)

Abstract

Abstract: The rapid growth of the Internet has produced massive amounts of online text, which poses new performance challenges for traditional serial text proofreading algorithms. Although automatic text proofreading has received considerable attention in recent years, most related work focuses on serial algorithms and rarely addresses parallelization. This paper first generalizes serial proofreading algorithms into a general serial proofreading framework. Then, to address the long running time of serial proofreading on large-scale text, it proposes three general parallelization methods: 1) thread-parallel proofreading based on multi-threading, which uses a thread pool to parallelize across both paragraphs and proofreading functions; 2) batch-parallel proofreading based on Spark MapReduce, which proofreads paragraphs in parallel through RDD parallel computing; 3) streaming parallel proofreading based on the Spark Streaming framework, which converts the real-time computation over a text stream into a series of small, time-sliced batch jobs, thereby avoiding fixed overhead and significantly shortening proofreading delay. Because streaming computing combines low latency with high throughput, the paper adopts streaming proofreading to build its parallel proofreading system. Performance comparison experiments show that thread parallelism suits small-scale text, batch parallelism suits offline proofreading of large-scale text, and streaming parallel proofreading reduces the fixed delay by about 110 s. Compared with batch proofreading, streaming proofreading using the Spark Streaming framework achieves a substantial performance improvement.

Key words: Automatic correction, Multi-threading, Parallel computing, Spark, Streaming computing
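
The abstract does not give implementation details, so the following is only a minimal sketch of two of the ideas it names, written in plain Python rather than Spark. It assumes two toy checkers (`check_double_space`, `check_repeated_word`) that are hypothetical stand-ins for the paper's proofreading functions. `proofread_threaded` illustrates thread-pool parallelism over both paragraphs and proofreading functions, and `micro_batches` illustrates how a timestamped text stream might be cut into small time-sliced batches. It does not use or reproduce the Spark Streaming API.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


# Hypothetical toy checkers; the paper's actual proofreading functions are not shown here.
def check_double_space(paragraph):
    """Flag a paragraph that contains two consecutive spaces."""
    return ["double space"] if "  " in paragraph else []


def check_repeated_word(paragraph):
    """Flag immediately repeated words such as 'is is'."""
    words = paragraph.lower().split()
    return [f"repeated word: {a}" for a, b in zip(words, words[1:]) if a == b]


CHECKERS = [check_double_space, check_repeated_word]


def proofread_serial(paragraphs):
    """Baseline: run every checker on every paragraph, one after another."""
    return [sum((checker(p) for checker in CHECKERS), []) for p in paragraphs]


def proofread_threaded(paragraphs, max_workers=4):
    """Submit one task per (paragraph, checker) pair to a thread pool."""
    results = [[] for _ in paragraphs]
    tasks = list(product(range(len(paragraphs)), CHECKERS))
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(checker, paragraphs[i]) for i, checker in tasks]
        # Collect in submission order so the output matches the serial baseline.
        for (i, _), future in zip(tasks, futures):
            results[i].extend(future.result())
    return results


def micro_batches(timed_paragraphs, interval):
    """Group (timestamp, paragraph) pairs into consecutive time slices."""
    batches = {}
    for timestamp, paragraph in timed_paragraphs:
        batches.setdefault(int(timestamp // interval), []).append(paragraph)
    return [batches[key] for key in sorted(batches)]
```

Each batch returned by `micro_batches` could then be passed to `proofread_threaded` or to a batch job. Note that in CPython, threads mainly help when the checkers wait on I/O; CPU-bound checkers would need processes or a cluster framework to see a speedup.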

CLC Number: TP391

YANG Zong-lin, LI Tian-rui, LIU Sheng-jiu, YIN Cheng-feng, JIA Zhen, ZHU Jie. Streaming Parallel Text Proofreading Based on Spark Streaming[J]. Computer Science, 2020, 47(4): 36-41. https://doi.org/10.11896/jsjkx.190300070
