Computer Science ›› 2021, Vol. 48 ›› Issue (2): 87-92. doi: 10.11896/jsjkx.200700111
Special topic: Big Data & Data Science (virtual special issue)
李可悦1, 陈轶2, 牛少彰1
LI Ke-yue1, CHEN Yi2, NIU Shao-zhang1
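Abstract: With the rapid growth of online shopping, merchants and shoppers generate large volumes of transaction data that hold considerable analytical value. To address text classification for social e-commerce product text, and to determine more efficiently and accurately which product category a text describes, a social e-commerce text classification algorithm based on the BERT model is proposed. First, the algorithm uses the pre-trained language model BERT (Bidirectional Encoder Representations from Transformers) to produce sentence-level feature vectors for social e-commerce text. The resulting feature vectors are then fed into a classifier tailored to the task. Finally, the algorithm is validated on a dataset of social e-commerce text. Experimental results show that the trained model reaches an F1 score of up to 94.61% on the test set, 6% higher than the BERT model on the MRPC classification task. The proposed algorithm can therefore determine the product category described by a text efficiently and accurately, which supports further analysis of online transaction data and the extraction of valuable information from massive data.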
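The pipeline described in the abstract (sentence-level vector, then classifier, then F1 evaluation) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the 3-dimensional toy vectors stand in for BERT sentence vectors, the linear-plus-softmax head is one common choice of classifier and not necessarily the one the authors used, and the F1 score is macro-averaged because the abstract does not say which averaging was applied. The function names `classify` and `macro_f1` are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(sentence_vec: Sequence[float],
             weights: Sequence[Sequence[float]],
             bias: Sequence[float]) -> int:
    """Linear layer + softmax over one sentence vector; returns the argmax class.

    `weights` has one row per class (hypothetical, untrained values below).
    """
    logits = [sum(w * x for w, x in zip(row, sentence_vec)) + b
              for row, b in zip(weights, bias)]
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


def f1_for_label(y_true: Sequence[int], y_pred: Sequence[int], label: int) -> float:
    """F1 = 2PR / (P + R) for a single class; 0.0 when there are no true positives."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 (assumed averaging; not stated in the abstract)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, l) for l in labels) / len(labels)


# Toy stand-ins for sentence-level feature vectors (illustrative only).
vectors = [[0.9, 0.1, 0.0], [0.2, 0.7, 0.1], [0.1, 0.2, 0.8], [0.6, 0.5, 0.0]]
y_true = [0, 1, 2, 1]

# Hypothetical classifier parameters: identity weights, zero bias.
W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
b = [0.0, 0.0, 0.0]

y_pred = [classify(v, W, b) for v in vectors]
score = macro_f1(y_true, y_pred)
print(y_pred, round(score, 4))
```

In a real setting the toy vectors would be replaced by the encoder's sentence-level output, and the weights would be learned during fine-tuning instead of fixed by hand.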