Computer Science, 2018, Vol. 45, Issue (1): 128-132. doi: 10.11896/j.issn.1002-137X.2018.01.021

• The 16th China Conference on Machine Learning •

Study on Text Segmentation Based on Domain Ontology

LIU Yao, SHUAI Yuan-hua, GONG Xing-wei and HUANG Yi

  1. Institute of Scientific and Technical Information of China, Beijing 100038; Peking University, Beijing 100080; Institute of Scientific and Technical Information of China, Beijing 100038; Institute of Scientific and Technical Information of China, Beijing 100038
  • Online: 2018-01-15 Published: 2018-11-13

Abstract: Text segmentation plays an important role in information retrieval, summary generation, question answering, and information extraction. After reviewing existing text segmentation methods from China and abroad, this paper proposes a linear text segmentation method based on domain ontology. The method starts from initial concepts and automatically acquires a structured set of semantic concepts. It then assigns semantic labels to paragraphs according to the frequency, position, and relationships of the acquired concepts, attributes, and attribute words in the text, thereby uncovering the sub-topic information of the text. Paragraphs that share the same semantic annotation are grouped into one semantic paragraph, which segments the text between its different sub-topics. Experimental results show that, on domain-specific text, the method achieves a precision, recall, and F-measure of 85%, 90%, and 88%, respectively. The segmentation quality meets the needs of practical applications and is better than that of existing text segmentation methods that require no training corpus.

Key words: Text segmentation, Domain ontology, Semantic annotation, Semantic paragraph
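
The label-then-merge idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the toy ontology, the function names `label_paragraph` and `segment`, the whitespace tokenizer, the position weighting (`lead_weight`, first five tokens), and the rule that an unlabeled paragraph inherits the previous label are all assumptions made for illustration. Relationships between concepts, which the paper also uses, are not modeled here.

```python
from collections import Counter

# Hypothetical toy ontology: each concept maps to its attribute words.
ONTOLOGY = {
    "symptom": {"fever", "cough", "pain"},
    "treatment": {"dose", "drug", "therapy"},
}


def label_paragraph(paragraph, ontology, lead_weight=2.0):
    """Return the concept with the highest weighted term count, or None."""
    scores = Counter()
    for pos, raw in enumerate(paragraph.lower().split()):
        tok = raw.strip(".,;:!?")
        for concept, words in ontology.items():
            if tok == concept or tok in words:
                # Assumed position heuristic: early terms count more.
                scores[concept] += lead_weight if pos < 5 else 1.0
    if not scores:
        return None
    # Sorting first makes ties resolve alphabetically (deterministic).
    return max(sorted(scores), key=lambda c: scores[c])


def segment(paragraphs, ontology):
    """Group consecutive paragraphs that share a semantic label."""
    segments = []  # list of (label, [paragraphs])
    for p in paragraphs:
        label = label_paragraph(p, ontology)
        if label is None and segments:
            # Assumption: an unlabeled paragraph continues the current topic.
            label = segments[-1][0]
        if segments and segments[-1][0] == label:
            segments[-1][1].append(p)
        else:
            segments.append((label, [p]))
    return segments


if __name__ == "__main__":
    doc = [
        "The patient had fever and cough.",
        "Pain was also reported.",
        "The drug dose was increased.",
        "Therapy continued for a week.",
    ]
    for label, paras in segment(doc, ONTOLOGY):
        print(label, len(paras))
```

A boundary is placed wherever the label changes between adjacent paragraphs, so the segmentation is linear and needs no training corpus, only the concept set.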

