Computer Science, 2022, Vol. 49, Issue (6A): 206-210. doi: 10.11896/jsjkx.210500089
邵欣欣
SHAO Xin-xin
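Abstract: To classify products automatically from their title text, this paper proposes a Chinese FastText product classification method based on term frequency-inverse document frequency (TF-IDF). The method first exploits FastText's own characteristics to represent the vocabulary as a prefix tree. It then applies TF-IDF filtering to the dictionary produced by n-gram processing, so that the mean of the input word-sequence vectors is biased toward highly discriminative terms. Finally, it slides a window of size N over the text in character order, which makes the method better suited to product title classification. The FastText-based product classification algorithm is implemented and optimized on the Anaconda platform. According to the evaluation, the final classifier achieves relatively high accuracy and can meet the product classification needs of e-commerce platforms.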
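The two text-processing steps named in the abstract, the character-level sliding window of size N and the TF-IDF filtering of the resulting n-gram dictionary, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names `char_ngrams` and `tfidf_filter`, the `keep` parameter, the plain `tf * log(N / df)` weighting, and the sample titles are all assumptions made for illustration, and the prefix-tree storage and FastText training are left out.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Slide a window of size n over the characters of text."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_filter(titles, n=2, keep=3):
    """For each title, keep the `keep` character n-grams with the highest TF-IDF.

    Hypothetical helper: the scoring formula and the top-k cut-off are
    assumptions, not the filtering rule used in the paper.
    """
    docs = [char_ngrams(t, n) for t in titles]
    # Document frequency: number of titles in which each n-gram occurs.
    df = Counter(g for doc in docs for g in set(doc))
    total = len(docs)
    kept = []
    for doc in docs:
        tf = Counter(doc)
        scores = {
            g: (count / len(doc)) * math.log(total / df[g])
            for g, count in tf.items()
        }
        # Highest score first; ties are broken by the n-gram itself.
        ranked = sorted(scores, key=lambda g: (-scores[g], g))
        kept.append(ranked[:keep])
    return kept


# Hypothetical product titles, used only to exercise the sketch.
titles = ["红色连衣裙", "红色运动鞋", "黑色运动鞋"]
filtered = tfidf_filter(titles, n=2, keep=3)
```

In this toy corpus the bigram "红色" appears in two of the three titles, so it scores lower than the bigrams unique to the first title and is filtered out. That is the sense in which the retained n-grams lean toward highly discriminative terms.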