Computer Science (计算机科学) ›› 2025, Vol. 52 ›› Issue (8): 51-61. doi: 10.11896/jsjkx.241000073
刘泓玏, 陈娟, 付才, 韩兰胜, 郭晓威, 江帅
LIU Hongle, CHEN Juan, FU Cai, HAN Lansheng, GUO Xiaowei, JIANG Shuai
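Abstract: As the Internet has developed, cybersecurity has drawn growing attention, and combating the authors of malicious code is an important part of it. Author identification based on the coding style of malicious code has already achieved notable results. Learning more about an author's real identity, however, requires analyzing the author's social attributes to build a complete profile. Gender, a key classification indicator among human social attributes, is an important component of an individual's real information. Most other social attributes are also associated with gender, so distinguishing gender is a necessary prerequisite for studying them in depth. Through an in-depth analysis of programmers' source code writing styles, this study summarizes 22 features associated with identifying the gender of source code authors. Based on these features, a gender classifier for source code authors is trained with the adaptive boosting algorithm (AdaBoost), which maintains a high recognition rate while improving the model's robustness. The approach is also compared with natural-language gender identification algorithms to highlight the applicability of the proposed source code features. A total of 115,004 Java and 22,700 C++ source code files with gender labels were collected from GitHub, providing the academic community with the first research dataset that carries gender labels for source code authors. The proposed method performs well on both collected datasets, reaching accuracies of 98% on C++ and 94% on Java. The findings explore the mapping from source code authoring style to social attributes and can help guide further research on inferring other social attributes from that style.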
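To make the classification step concrete, the following is a minimal sketch of discrete AdaBoost over decision stumps, written in plain Python. It is not the authors' implementation: the abstract does not specify the weak learner, the number of boosting rounds, or the 22 features. The two feature columns (`comment_ratio`, `avg_line_length`), the synthetic values, and the neutral class labels `+1`/`-1` are all assumptions made purely for illustration.

```python
import math

def train_stump(X, y, w):
    """Return (weighted_error, feature, threshold, polarity) of the best stump."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                # The stump predicts `polarity` when the feature is <= thr.
                preds = [polarity if row[j] <= thr else -polarity for row in X]
                err = sum(wi for wi, p, yi in zip(w, preds, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best

def stump_predict(stump, row):
    _, j, thr, polarity = stump
    return polarity if row[j] <= thr else -polarity

def adaboost_fit(X, y, rounds=10):
    """Fit an ensemble of (alpha, stump) pairs; labels must be +1 or -1."""
    n = len(X)
    w = [1.0 / n] * n                      # start with uniform sample weights
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        # Clamp the error so the log below stays finite for a perfect stump.
        err = min(max(stump[0], 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Increase the weight of misclassified samples, then renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def adaboost_predict(ensemble, row):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1

# Synthetic, hypothetical style features: [comment_ratio, avg_line_length].
X = [[0.10, 42.0], [0.15, 55.0], [0.20, 38.0],
     [0.60, 47.0], [0.70, 51.0], [0.80, 40.0]]
y = [1, 1, 1, -1, -1, -1]                  # two neutral class labels

model = adaboost_fit(X, y, rounds=5)
predictions = [adaboost_predict(model, row) for row in X]
```

On real data, the feature vectors would come from whatever style features are extracted per source file, and accuracy would be measured on held-out files rather than on the training set used here.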