Computer Science, 2020, Vol. 47, Issue (4): 36-41. doi: 10.11896/jsjkx.190300070

• Computer Architecture •

Streaming Parallel Text Proofreading Based on Spark Streaming

YANG Zong-lin1, LI Tian-rui1,2, LIU Sheng-jiu1, YIN Cheng-feng1, JIA Zhen1, ZHU Jie3   

  1 School of Information Science and Technology, Southwest Jiaotong University, Chengdu 611756, China;
  2 Institute of Artificial Intelligence, Southwest Jiaotong University, Chengdu 611756, China;
  3 Department of Computer Science, Tibet University, Lhasa 850000, China
  • Received: 2019-03-16  Online: 2020-04-15  Published: 2020-04-15
  • Contact: LI Tian-rui, born in 1969, Ph.D., professor and Ph.D. supervisor, is an outstanding member of CCF. His main research interests include cloud computing, data mining and artificial intelligence.
  • About author: YANG Zong-lin, born in 1994. His main research interests include Chinese text proofreading and parallel computing.
  • Supported by:
    This work was supported by the National Natural Science Foundation of China (61573292) and the Science and Technology Service Industry Demonstration Project of Sichuan Province (2016GFW0167).

Abstract: The rapid development of the Internet has produced massive amounts of network text, which poses new performance challenges for traditional serial text proofreading algorithms. Although automatic text proofreading has received increasing attention in recent years, related research focuses mostly on serial algorithms and rarely addresses the parallelization of proofreading. This paper first generalizes the serial proofreading algorithm and gives a general framework for serial proofreading. Then, to address the shortcomings of serial proofreading on large-scale texts, it proposes three general methods for parallelizing text proofreading: 1) a parallel proofreading method based on multi-threading, which uses a thread pool to parallelize across both paragraphs and proofreading functions at the same time; 2) a batch-processing parallel proofreading method based on Spark MapReduce, which proofreads paragraphs in parallel by means of RDD parallel computing; 3) a parallel proofreading approach based on Spark Streaming, which converts the real-time computation over text streams into a series of small batch jobs divided by time slice, so that it effectively avoids fixed overhead and significantly reduces proofreading delay. Because streaming computing offers both low delay and high throughput, the paper chooses the streaming-based method to build the parallel proofreading system. Performance comparison experiments demonstrate that thread parallelism is suitable for proofreading small-scale text, that batch processing is suitable for off-line proofreading of large-scale text, and that streaming parallel proofreading effectively reduces the fixed delay of about 110 seconds. Compared with batch proofreading, streaming proofreading using a real-time computing framework achieves a large performance improvement.
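
The first method above, thread-pool parallelism over both paragraphs and proofreading functions, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the two checker functions, the `CHECKERS` list, and the `proofread` helper are hypothetical stand-ins chosen for illustration, and the paragraph splitting is assumed to be one paragraph per line.

```python
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical proofreading functions (assumptions, not from the paper).
# Each takes one paragraph and returns a list of (offset, message) pairs.
def check_repeated_word(paragraph):
    """Flag an immediately repeated word such as 'is is'."""
    return [(m.start(), "repeated word: " + m.group(1))
            for m in re.finditer(r"\b(\w+)\s+\1\b", paragraph)]

def check_double_space(paragraph):
    """Flag runs of two or more spaces."""
    return [(m.start(), "multiple spaces")
            for m in re.finditer(r" {2,}", paragraph)]

CHECKERS = [check_repeated_word, check_double_space]

def proofread(text, max_workers=4):
    """Run every checker on every paragraph through one shared thread pool."""
    # Assumption: one paragraph per non-empty line.
    paragraphs = [p for p in text.split("\n") if p.strip()]
    results = {i: [] for i in range(len(paragraphs))}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # One task per (paragraph, checker) pair, so both dimensions
        # are scheduled concurrently rather than one after the other.
        futures = {pool.submit(check, para): i
                   for i, para in enumerate(paragraphs)
                   for check in CHECKERS}
        for future, i in futures.items():
            results[i].extend(future.result())
    # Sort the findings of each paragraph by offset for stable output.
    return {i: sorted(issues) for i, issues in results.items()}

if __name__ == "__main__":
    sample = "This is is fine.\nTwo  spaces here.\nClean line."
    for index, issues in proofread(sample).items():
        print(index, issues)
```

Submitting one task per (paragraph, checker) pair keeps the pool busy even when the text has few paragraphs. In CPython, threads mainly help when the checkers wait on I/O or call into native code, which is consistent with the abstract's observation that thread parallelism suits small-scale text.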

Key words: Automatic correction, Multi-threading, Parallel computing, Spark, Streaming computing

CLC Number: TP391