Computer Science, 2022, Vol. 49, Issue (6A): 206-210. doi: 10.11896/jsjkx.210500089

• Intelligent Computing •

TI-FastText Automatic Goods Classification Algorithm

SHAO Xin-xin

  1. Dalian Neusoft University of Information, Dalian, Liaoning 116023, China
  • Online: 2022-06-10 Published: 2022-06-08
  • Corresponding author: SHAO Xin-xin (sxx929@163.com)
  • About author: SHAO Xin-xin, born in 1980, postgraduate, assistant professor. Her main research interests include computer software and theory, and big data.
  • Supported by:
    Natural Science Foundation of Liaoning Province, China (2019-ZD-0354).

Abstract: To classify goods automatically from their title information, this paper proposes a Chinese goods classification method that combines TF-IDF (term frequency-inverse document frequency) with FastText. The method first uses a property of FastText itself to represent the lexicon as a prefix tree. It then applies TF-IDF filtering to the dictionary produced by n-gram processing, so that the mean of the input word-sequence vectors is biased toward highly discriminative entries. Finally, a window of size N slides over the text in character order, which makes the method better suited to classifying goods titles, a form of Chinese short text. The FastText-based goods classification algorithm is implemented and optimized on the Anaconda platform. In evaluation, the final classifier achieves a high accuracy rate and can meet the goods classification needs of e-commerce platforms.

Key words: Chinese short text classification, FastText, Goods classification, TF-IDF
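
The character-window and TF-IDF filtering steps named in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the function names, the sample titles, the scoring formula (corpus frequency times log(N/df)), and the zero threshold. The abstract does not state the paper's actual filtering criterion or how the filtered dictionary is fed into FastText.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Slide a window of size n over the characters of text, in order."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


def tfidf_scores(titles, n):
    """Score each character n-gram as corpus frequency times IDF.

    Assumed scoring, not the paper's: score = tf * log(num_titles / df).
    """
    tf = Counter()  # total occurrences of each n-gram across all titles
    df = Counter()  # number of titles that contain each n-gram
    for title in titles:
        grams = char_ngrams(title, n)
        tf.update(grams)
        df.update(set(grams))
    num_titles = len(titles)
    return {g: tf[g] * math.log(num_titles / df[g]) for g in tf}


def filter_dictionary(titles, n, min_score=0.0):
    """Keep only n-grams whose score exceeds min_score (hypothetical threshold)."""
    scores = tfidf_scores(titles, n)
    return {g for g, s in scores.items() if s > min_score}


# Hypothetical goods titles, invented for this example.
titles = ["新款男士运动鞋", "新款女士连衣裙", "新款儿童运动鞋"]
kept = filter_dictionary(titles, n=2)
print(sorted(kept))
```

In this toy corpus the bigram "新款" occurs in every title, so its IDF is zero and it is dropped, while bigrams that occur in only some titles, such as "运动", are kept. That is the sense in which the filtered dictionary favors more discriminative entries.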

CLC Number:

  • TP391.9