Computer Science, 2026, Vol. 53, Issue (2): 289-299. doi: 10.11896/jsjkx.241200004

• Artificial Intelligence •

Chinese Hate Speech Detection Incorporating Hate Object Features and Variant Word Restoration Mechanism

SUN Mingxu, LIANG Gang, WU Yifei, HU Haixin   

  1. School of Cyber Science and Engineering, Sichuan University, Chengdu 610207, China
  • Received: 2024-12-02 Revised: 2025-03-07 Published: 2026-02-10
  • About author: SUN Mingxu, born in 2000, postgraduate. His main research interests include hate speech detection and social networks.
    LIANG Gang, born in 1976, Ph.D., associate professor, master's supervisor. His main research interests include network security, online public opinion analysis and prediction, and AI security.
  • Supported by:
    National Natural Science Foundation of China (62162057), National Natural Science Foundation of Sichuan Province (2025ZNSFSC0509), Sichuan Province Science and Technology Department Key R&D Projects (2023YFG0294) and Local Projects of the Ministry of Education (2023CDLZ-2).

Abstract: The rise of online hate speech and the significant societal harm it causes have made automatic hate speech detection a critical task. Existing methods overlook the impact of hate objects on semantic extraction, which leads to inadequate contextual feature extraction and makes them susceptible to decision errors induced by specific expressions. These methods also fail to account for the interference of variant words with semantic extraction, resulting in a high miss rate. Furthermore, Chinese hate speech detection lacks publicly available datasets. To tackle these challenges, this paper proposes a hate speech detection method that incorporates hate object features and a variant word restoration mechanism. The method treats hate object recognition as an intermediate task, guiding the model to fully learn the contextual features of hate objects and thereby improving text comprehension in hate speech detection. In addition, a variant word restoration module fine-tuned from ChatGLM2-6B is proposed; it reduces the interference of variant words with hate speech detection by restoring them to their normal equivalents. Finally, a Chinese hate speech dataset is presented to facilitate further research in this field. Experimental results show that the proposed method achieves an F1 score of 96.71%, outperforming the baseline methods on all metrics. Specifically, the model shows a 4.21% improvement in detection accuracy for specific scenes and a 3.45% decrease in the miss rate caused by variant words.
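The two headline metrics in the abstract can be illustrated with a minimal sketch. It assumes the usual binary definitions, with F1 as the harmonic mean of precision and recall and miss rate as the share of hateful texts that go undetected (false negatives over all positives); the abstract does not spell out its exact formulas, and the names `Confusion`, `confusion`, `f1_score` and `miss_rate` are hypothetical helpers, not part of the paper's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Confusion:
    tp: int  # hateful texts correctly flagged
    fp: int  # benign texts wrongly flagged
    fn: int  # hateful texts the detector missed
    tn: int  # benign texts correctly passed


def confusion(y_true, y_pred):
    """Count outcomes for binary labels (1 = hate speech, 0 = benign)."""
    if len(y_true) != len(y_pred):
        raise ValueError("label lists must have equal length")
    pairs = list(zip(y_true, y_pred))
    return Confusion(
        tp=sum(1 for t, p in pairs if t == 1 and p == 1),
        fp=sum(1 for t, p in pairs if t == 0 and p == 1),
        fn=sum(1 for t, p in pairs if t == 1 and p == 0),
        tn=sum(1 for t, p in pairs if t == 0 and p == 0),
    )


def f1_score(c):
    """F1 = 2*TP / (2*TP + FP + FN); assumed definition, 0.0 if undefined."""
    denom = 2 * c.tp + c.fp + c.fn
    return 2 * c.tp / denom if denom else 0.0


def miss_rate(c):
    """Miss rate = FN / (TP + FN); assumed definition, 0.0 if no positives."""
    positives = c.tp + c.fn
    return c.fn / positives if positives else 0.0


if __name__ == "__main__":
    # Toy labels for illustration only; not data from the paper.
    c = confusion([1, 1, 1, 1, 0, 0, 0, 0], [1, 1, 1, 0, 1, 0, 0, 0])
    print(f"F1 = {f1_score(c):.2f}, miss rate = {miss_rate(c):.2f}")
```

Under these assumed definitions, restoring variant words before classification would be expected to lower `fn`, which reduces the miss rate and raises F1 together.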

Key words: Hate speech detection, Chinese dataset, Hate object recognition, Variant word restoration, Text enhancement, Natural language processing

CLC Number: TP391