Computer Science, 2024, Vol. 51, Issue (9): 233-241. doi: 10.11896/jsjkx.230900159

• Artificial Intelligence •

CFGT: A Lexicon-based Chinese Address Element Parsing Model

HUANG Wei, SHEN Yaodi, CHEN Songling, FU Xiangling

  1. School of Computer Science (National Pilot Software Engineering School), Beijing University of Posts and Telecommunications, Beijing 100876, China
    Key Laboratory of Trustworthy Distributed Computing and Service (BUPT), Ministry of Education, Beijing 100876, China
  • Received: 2023-09-28 Revised: 2024-03-14 Online: 2024-09-15 Published: 2024-09-10
  • Corresponding author: FU Xiangling (fuxiangling@bupt.edu.cn)
  • About author: HUANG Wei (huangwei@bupt.edu.cn), born in 1998, postgraduate. His main research interests include data mining and anomaly detection.
    FU Xiangling, born in 1975, Ph.D, professor, Ph.D supervisor. Her main research interests include natural language processing, smart finance and smart healthcare.
  • Supported by:
    National Natural Science Foundation of China (72274022).

Abstract: Address element parsing is a key step in geocoding and directly affects geocoding accuracy. Because Chinese address expressions are diverse and complex, two similar address texts may refer to entirely different locations. Traditional address element parsing based on dictionary matching handles ambiguous words poorly, which lowers recognition accuracy. This paper proposes a lexicon-based Chinese address element parsing model, the Collaborative Flat-Graph Transformer (CFGT), which uses lexical information such as self-matched words and nearest contextual words to enhance the character sequence representation of address text, effectively curbing the ambiguity of address expressions. Specifically, the model first constructs two collaborative graphs, Flat-Lattice and Flat-Shift, to capture knowledge of self-matched words and nearest contextual words for address characters, and designs a fusion layer to make the graphs collaborate. Second, an improved relative position encoding further strengthens the enhancing effect of word information on the address character sequence. Finally, a Transformer and a conditional random field perform address element parsing. Experiments on several public datasets, including Weibo and Resume, and on the private Address dataset show that CFGT outperforms previous Chinese address element parsing models and existing Chinese named entity recognition models.

Key words: Chinese address recognition, Lexicon enhancement, External information, Named entity recognition
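
The lexicon-matching step described in the abstract can be illustrated with a minimal Python sketch. It builds a flat lattice for an address string (the characters, followed by every lexicon word found in the text, each with head and tail indices) and then collects the self-matched words of each character, here taken to mean the lexicon words whose span covers that character. The toy lexicon, the example address, the `max_word_len` cutoff, and the function names `build_flat_lattice` and `self_matched_words` are assumptions for illustration only. This is not the authors' implementation, and the Flat-Shift graph, the fusion layer, and the improved relative position encoding are not shown.

```python
from typing import Dict, List, Set, Tuple

# (token, head index, tail index); both indices are inclusive character positions.
Span = Tuple[str, int, int]


def build_flat_lattice(text: str, lexicon: Set[str], max_word_len: int = 4) -> List[Span]:
    """Return the character spans followed by a span for every lexicon word found in text."""
    spans: List[Span] = [(ch, i, i) for i, ch in enumerate(text)]
    for start in range(len(text)):
        # Only multi-character candidates; single characters are already in the lattice.
        for length in range(2, max_word_len + 1):
            end = start + length
            if end > len(text):
                break
            word = text[start:end]
            if word in lexicon:
                spans.append((word, start, end - 1))
    return spans


def self_matched_words(text: str, spans: List[Span]) -> Dict[int, List[str]]:
    """Map each character position to the lexicon words whose span covers it."""
    matched: Dict[int, List[str]] = {i: [] for i in range(len(text))}
    for token, head, tail in spans[len(text):]:  # skip the leading character spans
        for pos in range(head, tail + 1):
            matched[pos].append(token)
    return matched


# Toy lexicon and address, assumed for illustration only.
lexicon = {"北京", "北京市", "海淀", "海淀区"}
address = "北京市海淀区"
lattice = build_flat_lattice(address, lexicon)
matches = self_matched_words(address, lattice)

for position, char in enumerate(address):
    print(position, char, matches[position])
```

In this toy case the character 市 is covered only by 北京市, while 北 and 京 are covered by both 北京 and 北京市. Overlapping candidates of this kind are the ambiguity that plain dictionary matching cannot resolve on its own, which is why the lattice keeps all of them for the model to weigh.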

CLC Number:

  • TP391