Construction of Military Corpus for Entity Annotation

Abstract

Identifying and annotating military text is the key to building a military corpus. For the entities in military text, this paper proposes a unified part-of-speech tagging specification for military terms and an annotation specification for military corpora, and designs an automatically extensible entity feature extraction framework based on a military-term dictionary. The framework uses a purpose-built high-precision classifier to select and extract basic features, combines them with the typical features of military language to form a feature set, and builds a feature space corrected against the military-term dictionary. After recognizing the entities in the military text, it annotates them according to the specified annotation and part-of-speech tagging specifications, producing a relatively large, high-quality military corpus. Experiments show that the framework performs entity recognition and corpus annotation well, which supports the construction of military corpora and helps clarify their broad role and application prospects in the military domain.

Key words: Feature extraction, Military corpus, Military entity annotation, Military part-of-speech tagging

CLC number: TP391

ZHOU Bin-bin, ZHANG Hong-jun, ZHANG Rui, FENG Yun-tian, XU You-wei. Construction of Military Corpus for Entity Annotation[J]. Computer Science (计算机科学), 2019, 46(6A): 540-546. https://doi.org/


