Computer Science, 2024, Vol. 51, Issue (1): 284-294. doi: 10.11896/jsjkx.230400120

• Artificial Intelligence •

Automated Kaomoji Extraction Based on Large-scale Danmaku Texts

MAO Xin, LEI Zhanyao, QI Zhengwei   

  1. School of Electronics, Information and Electrical Engineering, Shanghai Jiao Tong University, Shanghai 200240, China
  • Received: 2023-04-17 Revised: 2023-10-31 Online: 2024-01-15 Published: 2024-01-12
  • About author: MAO Xin, born in 1999, postgraduate. Her main research interests include natural language processing and data analysis.
    QI Zhengwei, born in 1976, Ph.D, professor, Ph.D supervisor, is a member of CCF (No.10710D). His main research interests include program analysis, model checking, virtual machines, and distributed systems.
  • Supported by:
    National Natural Science Foundation of China (62141218).

Abstract: As a new type of emoticon that emerged in the Internet age, kaomoji are popular among Internet users and on mainstream social media, and they play a valuable role in emotional expression, cultural promotion, and other areas. Because kaomoji carry rich semantic and emotional information, studying them in the context of Internet texts can improve the analysis and understanding of such texts and, in turn, the effectiveness of various natural language processing tasks. Detecting and extracting kaomoji is the first step in analyzing texts that contain them. However, because kaomoji are flexible in structure, diverse in type, and evolve rapidly, most existing work lacks a comprehensive analysis of them and suffers from limitations such as low accuracy, difficulty in determining boundaries, and poor timeliness. Based on an in-depth analysis of kaomoji features, this paper proposes Emoly, a kaomoji detection and extraction algorithm built on a large-scale danmaku text dataset. Emoly extracts preliminary candidate strings through preprocessing, combines several improved statistical indicators with filtering rules to select the final candidate strings, and ranks them by text similarity to produce the final results. Experimental results show that Emoly achieves a recall of 91% on a dataset of millions of danmaku texts, detecting and extracting kaomoji effectively and accurately. The results also demonstrate its robustness, superiority, and generality. Additionally, the proposed algorithm offers new ideas and methods for tasks such as Chinese word segmentation, sentiment analysis, and input method dictionary updates, giving it broad application value.
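
The three stages named in the abstract (candidate extraction, statistical filtering, similarity ranking) can be illustrated with a toy sketch. This is not the Emoly algorithm: the abstract does not specify its preprocessing, indicators, or similarity measure, so the regular expression, the length and frequency thresholds, the use of `difflib.SequenceMatcher`, and all function names below are assumptions chosen only to show the general shape of such a pipeline.

```python
import re
from collections import Counter
from difflib import SequenceMatcher

# Assumption: "ordinary" text is CJK ideographs, ASCII letters/digits and
# whitespace; the symbol runs left between them are kaomoji candidates.
# Limitation: kaomoji containing letters, such as (T_T), are split apart.
ORDINARY = re.compile(r"[\u4e00-\u9fffA-Za-z0-9\s]+")

def extract_candidates(text, min_len=3):
    """Stage 1: split on ordinary text and keep the remaining symbol runs."""
    return [
        piece
        for piece in ORDINARY.split(text)
        # Hypothetical filtering rule: drop repeated punctuation like "!!!".
        if len(piece) >= min_len and len(set(piece)) >= 2
    ]

def frequent_candidates(comments, min_count=2):
    """Stage 2: keep candidates seen at least min_count times (raw frequency
    stands in for the paper's unspecified statistical indicators)."""
    counts = Counter(c for comment in comments for c in extract_candidates(comment))
    return {c: n for c, n in counts.items() if n >= min_count}

def rank_by_similarity(candidates, seeds):
    """Stage 3: order candidates by best string similarity to known kaomoji."""
    def score(candidate):
        return max(SequenceMatcher(None, candidate, s).ratio() for s in seeds)
    return sorted(candidates, key=score, reverse=True)

comments = ["哈哈哈(^_^)", "太好了(^_^)", "啊啊啊!!!", "不要(>_<)", "好难(>_<)", "一次(;_;)"]
kept = frequent_candidates(comments)
ranked = rank_by_similarity(list(kept), seeds=["(^_^)"])
print(ranked)
```

In this toy run the repeated exclamation marks are discarded by the distinct-character rule and the candidate that appears only once is discarded by the frequency threshold. A real system would need far more careful boundary detection than a single regular expression provides.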

Key words: Natural language processing, Data analysis, Kaomoji, Video danmaku

CLC Number: 

  • TP391