Computer Science ›› 2024, Vol. 51 ›› Issue (1): 284-294. doi: 10.11896/jsjkx.230400120
毛馨, 雷瞻遥, 戚正伟
MAO Xin, LEI Zhanyao, QI Zhengwei
Abstract: As a new type of emoticon that emerged in the Internet era, kaomoji (text-based emoticons) are favored by both Internet users and mainstream media and are widely used in online text; they also carry unique value for emotional expression and cultural promotion. Because kaomoji encode rich semantic and affective information, studying online text together with the kaomoji it contains improves the analysis and understanding of that text and boosts the performance of many natural language processing tasks. Detecting and extracting kaomoji from text is the first step of any kaomoji-aware text analysis. However, kaomoji have flexible structures, come in many varieties, and evolve rapidly, so most existing work lacks a holistic analysis of them and suffers from low accuracy, difficulty in determining kaomoji boundaries, and poor timeliness. Based on an in-depth analysis of the characteristics of kaomoji, this paper proposes Emoly, a kaomoji detection and extraction algorithm built on large-scale danmaku (bullet-screen comment) text. The algorithm first extracts preliminary candidate strings through preprocessing, then combines several improved statistical metrics with filtering rules to select the final candidate strings, ranks them by text similarity, and outputs the results. Experiments show that Emoly achieves 91% recall on a million-scale danmaku corpus and can detect and extract kaomoji from text comprehensively and accurately, demonstrating robustness, superiority, and generality. The algorithm also offers new approaches to tasks such as Chinese word segmentation, sentiment analysis, and input-method lexicon updating, giving it broad application value.
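The pipeline described in the abstract (preprocess → generate candidate strings → filter with statistics and rules → rank) can be sketched roughly as follows. This is a minimal illustration only: the run-detection regex, the paired-bracket "face" rule, and the frequency threshold are assumptions made for the sketch, not the improved statistical metrics or similarity ranking actually used by Emoly.

```python
import re
from collections import Counter

def extract_candidates(comments, n_min=3, n_max=8, min_freq=2):
    """Collect symbol-heavy character n-grams as preliminary kaomoji candidates."""
    # Preprocessing (assumed): kaomoji are built mostly from punctuation and
    # special symbols, so keep only runs of non-word, non-space characters.
    symbol_runs = []
    for text in comments:
        symbol_runs += re.findall(r"[^\w\s]{3,}", text)

    # Candidate generation: every character n-gram inside each symbol run.
    counts = Counter()
    for run in symbol_runs:
        for n in range(n_min, min(n_max, len(run)) + 1):
            for i in range(len(run) - n + 1):
                counts[run[i:i + n]] += 1

    # Filtering rules (hypothetical stand-ins for Emoly's metrics):
    # a frequency threshold plus a crude "face" heuristic requiring
    # a matched pair of enclosing brackets.
    pairs = [("(", ")"), ("（", "）"), ("[", "]")]
    def looks_like_face(s):
        return any(left in s and right in s for left, right in pairs)

    kept = {s: c for s, c in counts.items()
            if c >= min_freq and looks_like_face(s)}

    # Ranking: by frequency here; the paper ranks by text similarity instead.
    return sorted(kept, key=kept.get, reverse=True)
```

For example, on a handful of danmaku-style comments such as `["好耶(・∀・)冲", "太强了(・∀・)", "普通文本"]`, the sketch surfaces the repeated face `(・∀・)` while discarding its partial substrings, which is the boundary-determination problem the abstract highlights.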