Computer Science, 2015, Vol. 42, Issue (5): 62-66. doi: 10.11896/j.issn.1002-137X.2015.05.013

• 2014 Data Mining Conference •

Topic Concept Discovery for Web Pages (面向网页的主题概念挖掘)

LIU Qiong-qiong (刘琼琼), ZUO Wan-li (左万利) and WANG Ying (王英)

  1. College of Computer Science and Technology, Jilin University, Changchun 130012; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012 (the same affiliation is listed for all three authors)
  • Online: 2018-11-14 Published: 2018-11-14
  • Funding:
    Supported by the National Natural Science Youth Fund Project (20130206051GX) and the Jilin Province Key Science and Technology Research Project (20130206051GX)

Abstract: Topic discovery from Web pages is important for natural language processing tasks such as Web text classification, automatic abstract generation, and information fusion, and it helps users better understand page content. Although some existing work mines concepts from ordinary text, it rarely considers how the tag (label) that encloses a word and the position at which the word appears affect the word's weight, and none of it gives a method for calculating these two impact factors. Using WordNet, this article extends Web page topics from the word level to the concept level. The proposed topic concept discovery method uses part-of-speech tagging and word sense disambiguation to determine the sense of each word in a page, then applies a label impact factor and a location impact factor to adjust the weights of the features in the page's body text. Formulas for calculating both impact factors are given. Experimental results on the DMOZ dataset show that, compared with unadjusted weights, the adjusted weights clearly improve topic mining accuracy, which reaches 0.95 in the best case.

Key words: Part-of-speech tagging, Word sense disambiguation, Label impact factor, Location impact factor, Adjusted weights
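The weight adjustment described in the abstract can be illustrated with a minimal Python sketch. The abstract does not reproduce the article's formulas, so everything numeric here is an assumption made for illustration: the `TAG_FACTOR` values, the linear `location_factor`, and the toy `TOY_SENSES` table (which stands in for part-of-speech tagging and word sense disambiguation over WordNet) are all hypothetical and are not the authors' method.

```python
from collections import defaultdict

# Hypothetical label (HTML tag) impact factors. These values are invented
# for illustration and do NOT come from the article.
TAG_FACTOR = {"title": 3.0, "h1": 2.0, "b": 1.5, "p": 1.0}

# Toy word -> concept table standing in for POS tagging plus word sense
# disambiguation over WordNet (a real system would query WordNet instead).
TOY_SENSES = {
    "car": "automobile.n.01",
    "automobile": "automobile.n.01",
    "engine": "engine.n.01",
    "motor": "engine.n.01",
}


def location_factor(index, total):
    """Hypothetical location impact factor: decays linearly from 1.0 for
    the first token to 0.5 for the last token."""
    if total <= 1:
        return 1.0
    return 1.0 - 0.5 * index / (total - 1)


def concept_weights(tokens):
    """tokens is a list of (word, tag) pairs in document order.
    Returns a dict mapping each concept to its adjusted weight."""
    weights = defaultdict(float)
    total = len(tokens)
    for i, (word, tag) in enumerate(tokens):
        concept = TOY_SENSES.get(word.lower())
        if concept is None:
            continue  # word has no known concept in the toy table
        weights[concept] += TAG_FACTOR.get(tag, 1.0) * location_factor(i, total)
    return dict(weights)


def top_concepts(tokens, k=1):
    """Return the k concepts with the highest adjusted weight."""
    w = concept_weights(tokens)
    return sorted(w, key=lambda c: (-w[c], c))[:k]


page = [
    ("Car", "title"),
    ("engine", "p"),
    ("motor", "p"),
    ("the", "p"),
    ("automobile", "b"),
]
print(top_concepts(page))
```

In this toy page both concepts occur twice, so raw counts alone cannot rank them; under the assumed factors, the occurrence inside `title` and the earlier position break the tie in favor of `automobile.n.01`.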

