Computer Science ›› 2025, Vol. 52 ›› Issue (8): 51-61. doi: 10.11896/jsjkx.241000073

• Software Engineering •

Authorship Gender Recognition of Source Code Based on Multiple Mixed Features

LIU Hongle, CHEN Juan, FU Cai, HAN Lansheng, GUO Xiaowei, JIANG Shuai

  1. Hubei Provincial Key Laboratory of Distributed System Security, Wuhan 430074, China
    Hubei Province Big Data Security Engineering Technology Research Center, Wuhan 430074, China
    School of Cyberspace Security, Huazhong University of Science and Technology, Wuhan 430074, China
  • Received: 2024-10-15 Revised: 2025-01-23 Published: 2025-08-15 Online: 2025-08-08
  • Corresponding author: FU Cai (fucai@hust.edu.cn)
  • About author: (M202271723@hust.edu.cn)
    LIU Hongle, born in 2000, postgraduate. Her main research interests include source code authorship attribution, malicious code detection and natural language processing.
    FU Cai, born in 1977, Ph.D., professor. His main research interests include cyber security, malicious code, wireless network security and behavior analysis.
  • Supported by:
    National Key Research and Development Program of China (2022YFB3103402) and National Natural Science Foundation of China (62072200, 62172176, 62127808).

Abstract: With the development of the Internet, network security has drawn increasing attention, and identifying the authors of malicious code is an important part of it. Author identification based on the coding style of malicious code has already achieved notable results. However, learning more about who an author really is requires analyzing their social attributes to build a complete profile. Gender, a key classification index among human social attributes, is an important part of an individual's real-world identity. Most other social attributes are also associated with gender, so distinguishing gender is a necessary prerequisite for studying them in depth. This study analyzes programmers' source code writing styles in depth and summarizes 22 features associated with source code author gender. Based on these features, the adaptive boosting algorithm (AdaBoost) is used to train a source code author gender classifier, maintaining a high recognition rate while improving model robustness. The method is also compared with natural-language gender recognition algorithms to highlight the applicability of the proposed features. A total of 115 004 Java and 22 700 C++ source code files with gender labels were collected from GitHub, providing the academic community with the first research dataset with source code author gender labels. The proposed method performs well on the collected C++ and Java datasets, reaching 98% and 94% accuracy, respectively. These findings are an initial exploration of the mapping from source code author style to social attributes and can help guide further research on inferring other social attributes from coding style.

Key words: Software security, Software forensics, Source code author attribution, Source code author gender identification, Feature representation
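As a rough illustration of the pipeline the abstract outlines (style features extracted from source code, then an AdaBoost classifier), the following is a minimal sketch and not the authors' implementation. The abstract does not list the 22 features, so the three layout features in `extract_features` are hypothetical stand-ins, the function names are invented for this example, and the class labels are abstract +1/-1 values on synthetic feature vectors rather than real data.

```python
import math

def extract_features(source):
    """Three hypothetical layout features; NOT the paper's 22 features."""
    lines = source.splitlines() or [""]
    n = len(lines)
    comment_ratio = sum(1 for l in lines if l.strip().startswith("//")) / n
    blank_ratio = sum(1 for l in lines if not l.strip()) / n
    mean_len = sum(len(l) for l in lines) / n
    return [comment_ratio, blank_ratio, mean_len]

def stump_predict(x, j, t, p):
    """Decision stump: predict p if feature j is <= threshold t, else -p."""
    return p if x[j] <= t else -p

def train_stump(X, y, w):
    """Exhaustively pick the stump with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({x[j] for x in X}):
            for p in (1, -1):
                err = sum(wi for x, yi, wi in zip(X, y, w)
                          if stump_predict(x, j, t, p) != yi)
                if best is None or err < best[0]:
                    best = (err, j, t, p)
    return best

def adaboost_fit(X, y, rounds=10):
    """Discrete AdaBoost over decision stumps; labels must be +1 or -1."""
    n = len(X)
    w = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, j, t, p = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)  # avoid log(0) and division by zero
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, j, t, p))
        # Increase the weight of misclassified samples, then renormalize.
        w = [wi * math.exp(-alpha * yi * stump_predict(x, j, t, p))
             for x, yi, wi in zip(X, y, w)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def adaboost_predict(model, x):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * stump_predict(x, j, t, p) for a, j, t, p in model)
    return 1 if score >= 0 else -1

# Synthetic, linearly separable feature vectors with abstract class labels.
X = [[0.10, 0.20, 30.0], [0.15, 0.25, 35.0], [0.40, 0.05, 60.0], [0.50, 0.10, 70.0]]
y = [1, 1, -1, -1]
model = adaboost_fit(X, y, rounds=5)
predictions = [adaboost_predict(model, x) for x in X]
```

In practice a library implementation of AdaBoost would normally be used instead of this hand-written loop; the sketch only makes the reweighting step explicit.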

CLC Number: 

  • TP311