Computer Science (计算机科学) ›› 2023, Vol. 50 ›› Issue (7): 229-236. doi: 10.11896/jsjkx.220500068

• Artificial Intelligence •

Recognition Method of Component Names in Patent Documents Based on the Algorithm of Word Frequency Difference and Library of Left-segmentation Words

KONG Jiabin, LYU Jianwen, LIU Jiangnan, DU Wenxuan

  1. State Key Laboratory of Advanced Design and Manufacturing for Vehicle Body, Hunan University, Changsha 410082, China
  • Received: 2022-05-07 Revised: 2022-10-23 Online: 2023-07-15 Published: 2023-07-05
  • Corresponding author: LIU Jiangnan (Liujiangnan@hnu.edu.cn)
  • About author: KONG Jiabin (jbkong@hnu.edu.cn), born in 1996, postgraduate. His main research interests include mechanical equipment innovation design and patent knowledge mining. LIU Jiangnan, born in 1965, Ph.D., professor, master's supervisor. Her main research interests include innovative design theory and methods, mechanical system optimization methods, and patent avoidance and regeneration.
  • Supported by:
    Innovation Method Special Project of the Ministry of Science and Technology of China (2019IM050100) and Natural Science Foundation of Hunan Province, China (2018JJ2039).

Abstract: Mechanical patent documents contain a large amount of domain knowledge whose information units are component names. Component names are worded flexibly and tend to be unique, complex, and rare, so computers struggle to recognize them accurately, which hinders patent knowledge mining. To recognize component names efficiently, the word-formation features of component names in patent sentences are analyzed and extracted. Starting from the external words associated with component names, the appended drawing reference signs (ADRS) are marked and the name characters to their left are identified, so that candidate names are retrieved automatically from the text and a set of candidate component names is built. A character frequency difference algorithm (termed "word frequency difference" in the title and keywords) is proposed to filter redundant characters from the candidate set. A dynamically built left-segmentation library (LSL) then removes the redundant characters that the filter misses. Cross-over experiments test and analyze how the choice of the character frequency difference prior threshold (CFDV-Ⅰ), the word frequency threshold (LSWF), and the character frequency difference threshold (CFDV-Ⅱ) affects the recognition result. Together these steps form a three-stage method for recognizing component names in Chinese patents in the mechanical field. A comparative analysis of the experimental results verifies that the method is effective and efficient.

Key words: Patent text, Redundant characters, Appended drawing reference signs, Word frequency difference, Left-segmentation words
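
The candidate-retrieval step described in the abstract (mark each reference sign, then collect the name characters to its left) can be illustrated with a rough Python sketch. This is not the authors' implementation: the regular expression for reference signs, the `max_len` window, the function name `candidate_names`, and the sample sentence are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumption: a reference sign is a short run of digits, optionally followed
# by one lowercase letter, and not part of a longer number or Latin token.
REF_SIGN = re.compile(r"(?<![0-9A-Za-z.])\d{1,4}[a-z]?(?![0-9A-Za-z.%])")

# Trailing run of CJK ideographs at the end of a string.
TRAILING_CJK = re.compile(r"[\u4e00-\u9fff]+$")


def candidate_names(text: str, max_len: int = 8) -> Counter:
    """Count the Chinese character runs found directly left of each sign.

    max_len is a hypothetical cap on how many characters to look back.
    """
    counts = Counter()
    for sign in REF_SIGN.finditer(text):
        # Look at most max_len characters to the left of the reference sign.
        window = text[max(0, sign.start() - max_len):sign.start()]
        run = TRAILING_CJK.search(window)
        if run:
            counts[run.group()] += 1
    return counts


# Made-up sentence in the style of a patent description.
sample = "所述齿轮箱1与驱动电机2连接,驱动电机2固定在支架3上"
for name, freq in candidate_names(sample).items():
    print(name, freq)
```

On this sample the raw candidates include strings such as 所述齿轮箱 and 与驱动电机, whose leading characters (所述, 与) are not part of the component name. That is the kind of redundancy the later filtering stages are meant to remove; those stages are not sketched here because the abstract does not define them.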

CLC Number:

  • TH122