Computer Science, 2025, Vol. 52, Issue (8): 51-61. doi: 10.11896/jsjkx.241000073


Authorship Gender Recognition of Source Code Based on Multiple Mixed Features

LIU Hongle, CHEN Juan, FU Cai, HAN Lansheng, GUO Xiaowei, JIANG Shuai

  1. Hubei Provincial Key Laboratory of Distributed System Security, Wuhan 430074, China; Hubei Province Big Data Security Engineering Technology Research Center, Wuhan 430074, China; Department of Cyberspace Security Academy, Huazhong University of Science and Technology, Wuhan 430074, China
  • Received: 2024-10-15 Revised: 2025-01-23 Online: 2025-08-15 Published: 2025-08-08
  • About author: LIU Hongle, born in 2000, postgraduate. Her main research interests include source code authorship attribution, malicious code detection and natural language processing.
    FU Cai, born in 1977, Ph.D., professor. His main research interests include cyber security, malicious code, wireless network security and behavior analysis.
  • Supported by:
    National Key Research and Development Program of China (2022YFB3103402) and National Natural Science Foundation of China (62072200, 62172176, 62127808).

Abstract: With the development of the Internet, network security is drawing increasing attention, and cracking down on malicious code authors is an important part of it. Author identification based on the writing style of malicious code has already achieved remarkable results. However, learning an author's real identity requires analyzing the author's social attributes to build a complete profile. Gender, a key classification index of human social attributes, is an important part of an individual's real information. Other social attributes are largely associated with gender characteristics, so distinguishing gender is a necessary prerequisite for exploring them further. This study analyzes programmers' source code writing styles in depth and summarizes 22 features associated with the gender of source code authors. Based on these features, the adaptive boosting algorithm (AdaBoost) is used to train a gender recognition classifier for source code authors, ensuring a high recognition rate while improving model robustness. The method is also compared with natural language gender recognition algorithms to highlight the applicability of the proposed source code features. This study collects a total of 115 004 and 22 700 gender-labeled Java and C++ source code files, respectively, from GitHub, providing the academic community with the first research dataset with gender labels for source code authors. The proposed method performs well on the collected C++ and Java datasets, reaching 98% and 94% accuracy, respectively. The conclusions of this study explore the mapping from source code author style to other social attributes and help guide further research in that direction.
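
The AdaBoost step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the three feature names (comment ratio, average line length, blank-line ratio), the synthetic feature values, and the +1/-1 class encoding are all assumptions made for illustration, since the abstract does not list the 22 features. The sketch boosts one-feature threshold rules (decision stumps) using only the Python standard library.

```python
import math

def stump_predict(row, j, thr, polarity):
    """Decision stump: vote `polarity` if feature j <= thr, else the opposite."""
    return polarity if row[j] <= thr else -polarity

def train_stump(X, y, w):
    """Return (weighted_error, feature, threshold, polarity) of the best stump."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for polarity in (1, -1):
                err = sum(wi for wi, row, yi in zip(w, X, y)
                          if stump_predict(row, j, thr, polarity) != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, polarity)
    return best

def adaboost_fit(X, y, rounds=5):
    """Train AdaBoost; labels must be +1 or -1."""
    n = len(X)
    w = [1.0 / n] * n                      # start with uniform sample weights
    model = []
    for _ in range(rounds):
        err, j, thr, polarity = train_stump(X, y, w)
        err = min(max(err, 1e-10), 1 - 1e-10)   # avoid log(0) / division by zero
        alpha = 0.5 * math.log((1 - err) / err)  # stump's vote weight
        model.append((alpha, j, thr, polarity))
        # Increase the weight of misclassified samples, decrease the rest.
        w = [wi * math.exp(-alpha * yi * stump_predict(row, j, thr, polarity))
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return model

def adaboost_predict(model, row):
    """Sign of the alpha-weighted vote of all stumps."""
    score = sum(a * stump_predict(row, j, thr, p) for a, j, thr, p in model)
    return 1 if score >= 0 else -1

# Synthetic toy data (hypothetical features, NOT from the paper):
# [comment_ratio, avg_line_length, blank_line_ratio]
X = [
    [0.10, 42.0, 0.05],
    [0.12, 55.0, 0.20],
    [0.15, 38.0, 0.10],
    [0.30, 47.0, 0.08],
    [0.35, 60.0, 0.18],
    [0.40, 40.0, 0.12],
]
y = [1, 1, 1, -1, -1, -1]   # +1 / -1 stand in for the two gender labels

model = adaboost_fit(X, y, rounds=5)
train_predictions = [adaboost_predict(model, row) for row in X]
```

In this toy set only the first feature separates the two classes, so every boosting round selects a stump on it. On real stylometric features the rounds would typically pick different features as misclassified samples gain weight.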

Key words: Software security, Software forensics, Source code author attribution, Source code author gender identification, Feature representation

CLC Number: 

  • TP311