Computer Science ›› 2020, Vol. 47 ›› Issue (9): 110-116. doi: 10.11896/jsjkx.191000156

• Database & Big Data & Data Science •

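A Big Data Valuation Algorithm

ZHAO Hui-qun1, WU Kai-feng2

  1 School of Information, North China University of Technology, Beijing 100144
  2 Beijing Key Laboratory of Large-scale Stream Data Integration and Analysis Technology, North China University of Technology, Beijing 100144
  • Received: 2019-10-24 Published: 2020-09-10
  • Corresponding author: WU Kai-feng (wkf0305@qq.com)
  • About the author: zhaohq6625@sina.com
  • Funding:
    National Natural Science Foundation of China (61672041)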

Big Data Valuation Algorithm

ZHAO Hui-qun1, WU Kai-feng2   

  1 College of Computer Science and Technology, North China University of Technology, Beijing 100144, China
  2 Beijing Key Laboratory of Large-scale Stream Data Integration and Analysis Technology, North China University of Technology, Beijing 100144, China
  • Received: 2019-10-24 Published: 2020-09-10
  • About the authors: ZHAO Hui-qun, born in 1960, Ph.D., professor. His main research interests include software architecture, big data generation, internet of things, cloud computing, and sports computing.
    WU Kai-feng, born in 1994, master. His main research interests include big data pricing, big data asset management, and big data services.
  • Supported by:
    National Natural Science Foundation of China (61672041).

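Abstract: "Big data" has become one of the most frequently used technical terms in computing, and it has gradually become the name of a commodity as well. Whether from the perspective of academic research or of data trading needs, evaluating the usability of big data sets is a new problem. This paper proposes a big data usability evaluation model to serve as a reference for academia and the data circulation field. Drawing on the 4V characteristics of big data (Volume, Variety, Velocity, Value), the distribution of the 4V characteristics of sample data is computed segment by segment, yielding a probability model of big data characteristics based on the piecewise distribution and a weighted evaluation model of big data usability. The paper also proposes an algorithm for block sampling of big data and an algorithm for estimating the weighting coefficient of each characteristic in the evaluation model. The application of the proposed model and algorithms is demonstrated on the usability evaluation needs of video big data. The big data usability evaluation model can be used to evaluate data for data science experiments and to price data sets in big data trading markets. Solutions are given for concrete operational issues in practical evaluation work, such as standardizing (commoditizing) data sets and determining data evaluation benchmarks. The application case supports the proposed model and further verifies its feasibility.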

Keywords: Big data usability evaluation, Probability model, Big data blocking algorithm, Video big data

Abstract: With the rapid development of information technology, data generation has shown an exponential growth trend. Owing to its rapid emergence and great value, "big data" has become one of the most frequently used terms. It is not only an academic term but has also gradually become the name of a commodity. Whether from the perspective of academic research or of data trading needs, how to evaluate the availability of big data sets is a new issue. This paper proposes a big data usability evaluation model to provide a reference for the academic and circulation fields. Combined with the 4V (Volume, Variety, Velocity, Value) characteristics of big data, the 4V characteristic distribution of the sample data is computed segment by segment, which gives a probability model of big data characteristics based on the piecewise distribution and a weighted evaluation model of the availability of big data sets. An algorithm for big data block sampling and an algorithm for estimating the weighting coefficient of each characteristic in the evaluation model are also proposed. Combined with the data availability evaluation requirements of video big data analysis, the specific applications of the proposed model and algorithms are demonstrated. The big data usability evaluation model can be used to evaluate data for data science experiments and to price data sets in big data transaction markets. For practical evaluation work, solutions are given for specific operational issues such as how to standardize (commoditize) data sets and how to determine evaluation benchmarks in the video field. The application case supports the proposed model and further tests its feasibility.

Key words: Big data availability evaluation, Probability model, Big data blocking algorithm, Video big data
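
The abstract gives no formulas, so the following is only a minimal sketch of how block sampling, a segment-wise (piecewise) distribution, and a weighted 4V score could fit together. Every function name, the equal-width segmentation of [0, 1], the use of segment midpoints, and the example weights are assumptions made for illustration; this is not the authors' algorithm.

```python
import random
from typing import Dict, List, Sequence


def block_sample(records: Sequence, block_size: int,
                 blocks_to_sample: int, seed: int = 0) -> List[Sequence]:
    """Split records into consecutive blocks and draw a random subset of blocks."""
    blocks = [records[i:i + block_size]
              for i in range(0, len(records), block_size)]
    rng = random.Random(seed)
    return rng.sample(blocks, min(blocks_to_sample, len(blocks)))


def piecewise_distribution(values: Sequence[float],
                           num_segments: int) -> List[float]:
    """Empirical probability of each equal-width segment of [0, 1].

    Assumes values is non-empty and every value is already normalized to [0, 1].
    """
    counts = [0] * num_segments
    for v in values:
        # A value of exactly 1.0 falls into the last segment.
        idx = min(int(v * num_segments), num_segments - 1)
        counts[idx] += 1
    return [c / len(values) for c in counts]


def expected_score(distribution: Sequence[float]) -> float:
    """Expected characteristic score, using each segment's midpoint."""
    n = len(distribution)
    return sum(p * (i + 0.5) / n for i, p in enumerate(distribution))


def weighted_usability(scores: Dict[str, float],
                       weights: Dict[str, float]) -> float:
    """Weighted average of the per-characteristic scores (weights are hypothetical)."""
    total = sum(weights.values())
    return sum(weights[k] * scores[k] for k in weights) / total


if __name__ == "__main__":
    # Hypothetical normalized per-record scores for one characteristic.
    variety_values = [0.1, 0.2, 0.6, 0.9]
    variety = expected_score(piecewise_distribution(variety_values, 2))
    scores = {"volume": 1.0, "variety": variety, "velocity": 0.5, "value": 0.0}
    weights = {"volume": 1.0, "variety": 1.0, "velocity": 1.0, "value": 1.0}
    print(weighted_usability(scores, weights))
```

In this sketch the weights are supplied by hand; the paper describes a separate algorithm for estimating them, which is not reproduced here.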

CLC Number: TP391