Computer Science ›› 2024, Vol. 51 ›› Issue (6): 34-43. doi: 10.11896/jsjkx.230400029

• Computer Software •

Revisiting Test Sample Selection for CNN Under Model Calibration

ZHAO Tong, SHA Chaofeng

  1. School of Computer Science, Fudan University, Shanghai 200433, China
  • Received: 2023-04-05  Revised: 2023-05-29  Online: 2024-06-15  Published: 2024-06-05
  • Corresponding author: SHA Chaofeng (cfsha@fudan.edu.cn)
  • Author e-mail: 20210240031@fudan.edu.cn

  • About author: ZHAO Tong, born in 1998, postgraduate. His main research interests include deep learning and software engineering.
    SHA Chaofeng, born in 1976, Ph.D, associate professor. His main research interests include machine learning, data mining, and natural language processing.


Abstract: Deep neural networks (DNNs) are widely used in various tasks, and thorough testing before deployment is crucial to ensuring their quality. Test sample selection reduces the cost of labor-intensive manual labeling by strategically choosing a small subset of data to label. However, existing selection metrics based on predictive uncertainty neglect whether the underlying uncertainty estimates are themselves accurate. To fill this gap, we conduct a systematic empirical study on 3 widely used datasets and 4 convolutional neural network (CNN) architectures to reveal the relationship between model calibration and the predictive uncertainty metrics used in test sample selection. We then compare the quality of the test subsets selected by calibrated and uncalibrated models. The findings indicate a degree of correlation between uncertainty metrics and model calibration in CNN models. Moreover, better-calibrated CNN models select higher-quality test subsets than poorly calibrated ones: the calibrated model outperforms the uncalibrated model in detecting misclassified samples in 70.57% of the experiments. Our study emphasizes the importance of considering model calibration in test selection and highlights the potential benefit of using a calibrated model to improve the adequacy of the testing process.
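For concreteness, the two quantities the abstract juxtaposes can be sketched in a few lines. This is an illustrative example, not code from the paper: a DeepGini-style uncertainty score used to rank test inputs for labeling, and the expected calibration error (ECE) commonly used to measure how well a model's confidences match its accuracy.

```python
import numpy as np

def gini_impurity(probs):
    """DeepGini-style uncertainty: 1 - sum(p_i^2) per sample.
    Higher values mean the model is less certain about the input,
    so those test inputs are prioritized for labeling."""
    probs = np.asarray(probs, dtype=float)
    return 1.0 - np.sum(probs ** 2, axis=-1)

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin predictions by confidence and average the gap between
    mean confidence and accuracy in each bin, weighted by bin size."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(confidences[mask].mean() - correct[mask].mean())
            ece += mask.mean() * gap
    return ece

# Rank two toy softmax outputs: the most uncertain input comes first.
softmax_outputs = np.array([[0.9, 0.05, 0.05],   # confident prediction
                            [0.4, 0.35, 0.25]])  # uncertain prediction
order = np.argsort(-gini_impurity(softmax_outputs))
```

A well-calibrated model (low ECE) gives uncertainty scores that better reflect its true error rate, which is the link between calibration and selection quality that the study examines.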

Key words: Convolutional neural network testing, Predictive uncertainty, Model calibration, Test sample selection
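Calibration itself is commonly improved post hoc. As a hypothetical sketch (not the procedure evaluated in this paper), temperature scaling fits a single scalar T > 0 on held-out data so that softmax(logits / T) is better calibrated:

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def nll(T, logits, labels):
    # Negative log-likelihood of the temperature-scaled probabilities.
    p = softmax(logits / T)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    # Grid search instead of a gradient-based optimizer, to keep the
    # sketch dependency-free; T is chosen to minimize held-out NLL.
    return min(grid, key=lambda T: nll(T, logits, labels))

# Toy held-out logits and labels (hypothetical data);
# a larger T softens overconfident predictions.
val_logits = np.array([[4.0, 0.0], [3.0, 0.2], [4.0, 0.5], [0.0, 4.0]])
val_labels = np.array([0, 0, 1, 1])
T = fit_temperature(val_logits, val_labels)
```

Because temperature scaling only rescales logits, it changes confidence estimates (and hence uncertainty-based selection scores) without changing the model's predicted classes.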

CLC number: TP391