Ranking Data Quality of Web Article Content by Extracting Facts

doi:10.11896/j.issn.1002-137X.2014.11.047

Abstract

Assessing the data quality of Web article content helps identify useful data. Existing approaches rely heavily on lexical features or user interactions to obtain quality indicators, and they cannot capture the semantics of the content. This article proposes a fact-based quality assessment (FQA) approach. Given a target article, the approach first identifies an alternative context by collecting relevant articles and extracting facts from each of them. It then constructs an accuracy baseline by voting and a completeness baseline by iterating over fact graphs. Finally, the data quality dimensions, accuracy and completeness, are computed by comparing the facts of the target article with the established baselines. Because it relies on the facts in the target article's content rather than on particular features, FQA can quantify data quality dimensions with high precision. Experiments verified the superior performance of FQA.

Key words: Data quality, Web article, Accuracy, Completeness, Quality dimensions, Fact
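
The pipeline summarized in the abstract can be illustrated with a small sketch. The abstract does not specify the fact representation, the voting rule, or the fact-graph iteration, so everything below is an assumption made for illustration: facts are hypothetical `(subject, attribute, value)` triples, the accuracy baseline is a plain majority vote per `(subject, attribute)` pair, and completeness is simplified to a coverage ratio against the voted baseline rather than the iterative fact-graph procedure the paper describes. The function names and sample data are invented.

```python
from collections import Counter, defaultdict

# Hypothetical fact representation: (subject, attribute, value).
Fact = tuple[str, str, str]


def accuracy_baseline(articles: list[set[Fact]]) -> dict[tuple[str, str], str]:
    """Majority vote: for each (subject, attribute), keep the most-asserted value."""
    votes: dict[tuple[str, str], Counter] = defaultdict(Counter)
    for facts in articles:
        for subj, attr, value in facts:
            votes[(subj, attr)][value] += 1
    return {key: counter.most_common(1)[0][0] for key, counter in votes.items()}


def accuracy(target: set[Fact], baseline: dict[tuple[str, str], str]) -> float:
    """Share of the target's checkable facts that agree with the baseline."""
    checked = [(s, a, v) for s, a, v in target if (s, a) in baseline]
    if not checked:
        return 0.0
    agree = sum(1 for s, a, v in checked if baseline[(s, a)] == v)
    return agree / len(checked)


def completeness(target: set[Fact], baseline: dict[tuple[str, str], str]) -> float:
    """Simplified stand-in: share of baseline (subject, attribute) pairs the target covers."""
    if not baseline:
        return 0.0
    target_keys = {(s, a) for s, a, _ in target}
    covered = sum(1 for key in baseline if key in target_keys)
    return covered / len(baseline)


# Invented facts extracted from three relevant articles (the alternative context).
related = [
    {("Nanjing", "country", "China"), ("Nanjing", "province", "Jiangsu")},
    {("Nanjing", "country", "China"), ("Nanjing", "province", "Jiangsu"),
     ("Nanjing", "river", "Yangtze")},
    {("Nanjing", "country", "China"), ("Nanjing", "province", "Jiangsu")},
]
# Invented facts extracted from the target article.
target = {("Nanjing", "country", "China"), ("Nanjing", "province", "Anhui")}

baseline = accuracy_baseline(related)
print(accuracy(target, baseline), completeness(target, baseline))
```

In this toy setup the target disagrees with the voted value for one of its two checkable facts and omits one of the three baseline attributes. A real system would also need a fact extractor and a way to reconcile differently worded facts, neither of which is sketched here.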

HAN Jing-yu and CHEN Ke-jia. Ranking Data Quality of Web Article Content by Extracting Facts[J]. Computer Science, 2014, 41(11): 247-251.

URL: https://www.jsjkx.com/EN/10.11896/j.issn.1002-137X.2014.11.047

https://www.jsjkx.com/EN/Y2014/V41/I11/247
