Computer Science ›› 2020, Vol. 47 ›› Issue (8): 313-318. doi: 10.11896/jsjkx.190700031

• Computer Network •

Unknown Binary Protocol Format Inference Method Based on Longest Continuous Interval

CHEN Qing-chao1, WANG Tao1, FENG Wen-bo2, YIN Shi-zhuang1, LIU Li-jun1

  1 Equipment Simulation Training Center, Army Engineering University, Shijiazhuang 050003, China
  2 College of Command and Control Engineering, Army Engineering University, Nanjing 210007, China
  • Online: 2020-08-15 Published: 2020-08-10
  • Corresponding author: WANG Tao (a13592247640@foxmail.com)
  • Author e-mail: cqc62808@163.com
  • About author: CHEN Qing-chao, born in 1996, postgraduate. His main research interests include cyber security.
    WANG Tao, born in 1964, Ph.D., professor. His main research interests include cyber security and cryptography.
  • Supported by:
    This work was supported by the National Key Research and Development Program of China (2017YFB0802900) and the Natural Science Foundation of Jiangsu Province, China (BK20161469).

Abstract: Format inference for unknown binary protocols often relies on a large amount of prior knowledge, which makes the experimental procedure complex and the accuracy low. To address this, this paper proposes a format inference method for unknown binary protocols that requires few manually set parameters, is simple to operate, and achieves higher accuracy. The preprocessed protocol data are clustered hierarchically, and the optimal clustering is selected using the CH (Calinski-Harabasz) coefficient as the evaluation criterion. An improved sequence alignment is then applied to the clustering results to obtain protocol data sequences with intervals, and the continuous intervals are counted and merged to analyze the protocol format. Experimental results show that the proposed method can infer more than 80% of the field intervals in unknown binary protocols. Compared with the format inference method in the AutoReEngine algorithm, the F1-Measure of the proposed method is improved by about 30% overall.
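As a rough illustration of the clustering step, the sketch below picks the number of clusters by maximizing the Calinski-Harabasz index over the partitions produced by a simple agglomerative procedure. The abstract does not state the linkage rule, the distance measure, or the message features used, so centroid linkage, squared Euclidean distance, fixed-length numeric feature vectors, and all function names here are assumptions made only for illustration.

```python
from itertools import combinations

def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(points)
    return [sum(col) / n for col in zip(*points)]

def sq_dist(a, b):
    """Squared Euclidean distance (assumed metric, not taken from the paper)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def ch_index(clusters):
    """Calinski-Harabasz index: (B / (k - 1)) / (W / (n - k))."""
    points = [p for c in clusters for p in c]
    n, k = len(points), len(clusters)
    overall = centroid(points)
    between = sum(len(c) * sq_dist(centroid(c), overall) for c in clusters)
    within = sum(sq_dist(p, centroid(c)) for c in clusters for p in c)
    if within == 0:
        return float("inf")  # every cluster is a single repeated point
    return (between / (k - 1)) / (within / (n - k))

def agglomerate(points):
    """Yield partitions for k = n, n-1, ..., 1 by merging the closest centroids."""
    clusters = [[p] for p in points]
    yield [list(c) for c in clusters]
    while len(clusters) > 1:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: sq_dist(centroid(clusters[ij[0]]),
                                   centroid(clusters[ij[1]])),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
        yield [list(c) for c in clusters]

def best_partition(points):
    """Return the partition with the highest CH index for 2 <= k <= n - 1."""
    n = len(points)
    candidates = [c for c in agglomerate(points) if 2 <= len(c) <= n - 1]
    return max(candidates, key=ch_index)

# Hypothetical feature vectors, e.g. the first two bytes of six messages.
messages = [[1, 0], [1, 1], [2, 0], [50, 60], [51, 61], [52, 60]]
best = best_partition(messages)
print(len(best), [len(c) for c in best])
```

The index is only defined for 2 <= k <= n - 1, which is why the trivial partitions (one cluster, or one cluster per message) are excluded from the candidates.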

Key words: Binary protocol, Format inference, Hierarchical clustering, Interval, Sequence alignment
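To make the idea of "sequences with intervals" more concrete, the following sketch aligns two messages with the standard Needleman-Wunsch global alignment and then reports the maximal runs of consecutive gaps in an aligned sequence. This is plain Needleman-Wunsch, not the improved alignment the abstract refers to (whose details are not given here); the scoring values and the names `align`, `gap_runs`, and `longest_gap_run` are illustrative assumptions.

```python
GAP = None  # marker for an alignment gap (assumed representation)

def align(a: bytes, b: bytes, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment; returns two equal-length lists."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append(GAP); i -= 1
        else:
            out_a.append(GAP); out_b.append(b[j - 1]); j -= 1
    return out_a[::-1], out_b[::-1]

def gap_runs(aligned):
    """Return (start, length) for each maximal run of consecutive gaps."""
    runs, start = [], None
    for pos, sym in enumerate(aligned):
        if sym is GAP:
            if start is None:
                start = pos
        elif start is not None:
            runs.append((start, pos - start))
            start = None
    if start is not None:
        runs.append((start, len(aligned) - start))
    return runs

def longest_gap_run(aligned):
    """Longest continuous gap run, or None if the sequence has no gaps."""
    return max(gap_runs(aligned), key=lambda r: r[1], default=None)

# Two hypothetical messages that differ by a two-byte optional field.
long_msg = bytes.fromhex("01020304aabb0506")
short_msg = bytes.fromhex("010203040506")
_, aligned_short = align(long_msg, short_msg)
print(longest_gap_run(aligned_short))
```

In this toy case the run of gaps in the shorter message lines up with the extra two-byte region of the longer one, which is the kind of evidence a gap-counting approach could use to propose a field position.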

CLC Number: 

  • TP393