Computer Science ›› 2022, Vol. 49 ›› Issue (9): 92-100. doi: 10.11896/jsjkx.210700241

• Database & Big Data & Data Science

  • Corresponding author: ZHAO Shu-liang (zhaoshuliang@sina.com)
  • Author e-mail: 1586821231@qq.com

Short Texts Feature Enrichment Method Based on Heterogeneous Information Network

LYU Xiao-feng1,2,3, ZHAO Shu-liang1,2,3, GAO Heng-da4, WU Yong-liang5, ZHANG Bao-qi1,2,3   

  1 College of Computer and Cyber Security, Hebei Normal University, Shijiazhuang 050024, China
  2 Hebei Provincial Engineering Research Center for Supply Chain Big Data Analytics & Data Security, Hebei Normal University, Shijiazhuang 050024, China
  3 Hebei Provincial Key Laboratory of Network & Information Security, Hebei Normal University, Shijiazhuang 050024, China
  4 Software College, Hebei Normal University, Shijiazhuang 050024, China
  5 School of Information Science and Technology, Shijiazhuang Tiedao University, Shijiazhuang 050043, China
  • Received: 2021-07-26 Revised: 2021-10-17 Online: 2022-09-15 Published: 2022-09-09
  • About author: LYU Xiao-feng, born in 1996, postgraduate. His main research interests include machine learning and intelligent information processing.
    ZHAO Shu-liang, born in 1967, Ph.D, professor, Ph.D supervisor, is a member of China Computer Federation. His main research interests include machine learning and intelligent information processing.
  • Supported by:
    National Social Science Fund of China (13&ZD091, 18ZDA200), Hebei Provincial Key Research and Development Project of China (20370301D) and Key Technology Development Project of Hebei Normal University (L2020K01).

Abstract: With the deep integration of computer technology into social life, more and more short texts are spread across web platforms. To address the data sparsity of short texts, a robust heterogeneous information network framework (HTE) is constructed for modeling short texts; it can integrate any type of additional information and capture the relationships among them. Based on this framework, six short text expansion methods are designed using different external knowledge. Short text features are enriched by introducing entity information (entities, entity categories, and inter-entity relationships) and textual information such as text topics from the Wikipedia and Freebase knowledge bases. Finally, similarity measurement results are used to verify the effect of the proposed feature enrichment methods. The six text expansion methods are compared with three traditional similarity measures and with current mainstream short text matching algorithms on two short text datasets, and all six show improved results. Compared with BERT, the similarity measurement result of the best method improves by 5.97%. The proposed framework is robust and can include any type of external knowledge, and the proposed method can overcome the data sparsity problem of short texts and measure short text similarity with high accuracy in an unsupervised manner.

Key words: Heterogeneous information network, Short text enrichment method, Short text matching, Knowledge base, Meta-path
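
The idea summarized in the abstract, enriching a short text with typed neighbours and then measuring similarity without supervision, can be illustrated with a small sketch. This is not the HTE framework or any of the six methods from the paper. It is a minimal toy under stated assumptions: the `TEXT_LINKS` data, the node types, and the function names are all hypothetical, and cosine similarity over Text–X–Text meta-path counts is a stand-in for whatever measure the authors actually use.

```python
from collections import Counter
from math import sqrt

# Hypothetical toy network: each short text is linked to typed neighbour nodes.
# In a real system these links would come from an entity linker, a knowledge
# base, and a topic model; here they are written by hand for illustration.
TEXT_LINKS = {
    "t1": {"entity": ["Apple_Inc.", "iPhone"],
           "category": ["Company", "Product"],
           "topic": ["technology"]},
    "t2": {"entity": ["Apple_Inc.", "Tim_Cook"],
           "category": ["Company", "Person"],
           "topic": ["technology"]},
    "t3": {"entity": ["Apple_(fruit)"],
           "category": ["Food"],
           "topic": ["cooking"]},
}


def enriched_vector(text_id, node_types):
    """Count the typed neighbours one hop away from a text node."""
    features = Counter()
    for node_type in node_types:
        for node in TEXT_LINKS[text_id].get(node_type, []):
            features[(node_type, node)] += 1
    return features


def cosine(u, v):
    """Cosine similarity of two sparse count vectors."""
    # The dot product equals the number of Text-X-Text path instances
    # joining the two texts through shared neighbour nodes.
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def similarity(a, b, node_types=("entity", "category", "topic")):
    """Similarity of two texts over the chosen meta-paths (assumed design)."""
    return cosine(enriched_vector(a, node_types), enriched_vector(b, node_types))


if __name__ == "__main__":
    print(similarity("t1", "t2"))               # texts sharing an entity, category, and topic
    print(similarity("t1", "t3"))               # texts with no shared neighbours
    print(similarity("t1", "t2", ("entity",)))  # entity-only variant
```

Changing `node_types` switches which kinds of external knowledge contribute to the score, which loosely mirrors how different expansion variants could be compared against each other.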

CLC Number:

  • TP391