Computer Science ›› 2021, Vol. 48 ›› Issue (2): 87-92. doi: 10.11896/jsjkx.200700111

Special topic: Big Data & Data Science (virtual special issue)

• Database & Big Data & Data Science •

Social E-commerce Text Classification Algorithm Based on BERT

LI Ke-yue1, CHEN Yi2, NIU Shao-zhang1

  1 School of Computer Science, Beijing University of Posts and Telecommunications, Beijing 100876, China
  2 Mobile Big Data Center, Southeast Digital Economic Development Institute, Quzhou, Zhejiang 324000, China
  • Received: 2020-07-17 Revised: 2020-12-04 Online: 2021-02-15 Published: 2021-02-04
  • Corresponding author: NIU Shao-zhang (szniu@bupt.edu.cn)
  • Author contact: likeyue@bupt.edu.cn
  • About author: LI Ke-yue, born in 1995, postgraduate. His main research interests include big data processing and machine learning.
    NIU Shao-zhang, born in 1963, Ph.D. supervisor, is a member of China Computer Federation. His main research interests include digital image forensics and information security.

Abstract: With the rapid development of online shopping, online merchants and shoppers generate a large amount of transaction data, which holds great analytical value. To determine the category of the product described in a social e-commerce product text more efficiently and accurately, this paper proposes a social e-commerce text classification algorithm based on the BERT model. First, the algorithm uses the BERT (Bidirectional Encoder Representations from Transformers) pre-trained language model to produce sentence-level feature vectors for social e-commerce text. The feature vectors are then fed into a classifier targeted at the task. Finally, the algorithm is validated on a social e-commerce text data set. Experimental results show that the F1 value of the trained model on the test set reaches up to 94.61%, which is 6% higher than that of the BERT model on the MRPC classification task. The proposed algorithm can therefore determine the category of the product described in a text relatively efficiently and accurately, which helps further analysis of online transaction data and the extraction of valuable information from massive data.

Key words: Bidirectional encoder, Feature extraction, Machine learning, Model building, Multi-label text classification

CLC Number:

  • TP181
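
The pipeline summarized in the abstract (sentence-level feature vector, then a classifier, then evaluation by F1) can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration: the 3-dimensional toy vectors stand in for BERT sentence vectors, the linear softmax layer is one common choice of classifier and is not confirmed by the abstract, macro-averaging is only one way to aggregate F1, and the function names `classify`, `f1_for_label`, and `macro_f1` are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Turn raw scores into probabilities (max is subtracted for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(sentence_vec: Sequence[float],
             weights: Sequence[Sequence[float]],
             biases: Sequence[float]) -> int:
    """Assumed classifier head: one linear layer followed by softmax.

    `sentence_vec` is a stand-in for a sentence-level feature vector
    produced by a pre-trained encoder; here it is just a toy list.
    """
    logits = [sum(w * x for w, x in zip(row, sentence_vec)) + b
              for row, b in zip(weights, biases)]
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


def f1_for_label(y_true: Sequence[int], y_pred: Sequence[int], label: int) -> float:
    """F1 for one class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)
    fp = sum(1 for t, p in pairs if t != label and p == label)
    fn = sum(1 for t, p in pairs if t == label and p != label)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Unweighted mean of per-class F1 (an assumed averaging scheme)."""
    labels = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_label(y_true, y_pred, lab) for lab in labels) / len(labels)


if __name__ == "__main__":
    # Toy 2-class head over 3-dimensional vectors (hypothetical values).
    weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
    biases = [0.0, 0.0]
    vectors = [[0.9, 0.1, 0.0], [0.2, 0.9, 0.1]]
    predictions = [classify(v, weights, biases) for v in vectors]
    print(predictions)
    print(macro_f1([0, 0, 1, 1], [0, 1, 1, 1]))
```

In practice the toy vectors would be replaced by the encoder's output for each product text, and the weights would be learned during fine-tuning rather than fixed by hand.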