Computer Science ›› 2024, Vol. 51 ›› Issue (6): 34-43.doi: 10.11896/jsjkx.230400029

• Computer Software •

Revisiting Test Sample Selection for CNN Under Model Calibration

ZHAO Tong, SHA Chaofeng   

  1. School of Computer Science,Fudan University,Shanghai 200433,China
  • Received:2023-04-05 Revised:2023-05-29 Online:2024-06-15 Published:2024-06-05
  • About author: ZHAO Tong, born in 1998, postgraduate. His main research interests include deep learning and software engineering.
    SHA Chaofeng, born in 1976, Ph.D, associate professor. His main research interests include machine learning, data mining, and natural language processing.

Abstract: Deep neural networks are widely used in various tasks, and model testing is crucial to ensuring their quality. Test sample selection mitigates the labor-intensive manual labeling problem by strategically choosing a small set of data to label. However, existing selection metrics based on predictive uncertainty neglect how accurately that uncertainty is estimated. To fill this gap, we conduct a systematic empirical study on 3 widely used datasets and 4 convolutional neural networks (CNNs) to reveal the relationship between model calibration and the predictive uncertainty metrics used in test sample selection. We then compare the quality of the test subsets selected by calibrated and uncalibrated models. The findings indicate a degree of correlation between uncertainty metrics and model calibration in CNN models. Moreover, better-calibrated CNN models select higher-quality test subsets than poorly calibrated ones; specifically, the calibrated model outperforms the uncalibrated model in detecting misclassified samples in 70.57% of the experiments. Our study emphasizes the importance of considering model calibration in test selection and highlights the potential benefit of using a calibrated model to improve the adequacy of the testing process.
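As context for the uncertainty-based selection metrics discussed in the abstract, the following is a minimal sketch (not the authors' code) of how a confidence-based metric in the style of DeepGini scores softmax outputs and picks the most uncertain samples for labeling:

```python
import numpy as np

def deepgini(probs):
    """Gini impurity of the softmax output: 1 - sum_i p_i^2.
    Higher values mean the model is less certain about the sample."""
    return 1.0 - np.sum(probs ** 2, axis=-1)

def select_top_k(probs, k):
    """Return indices of the k most uncertain samples (largest Gini score)."""
    scores = deepgini(probs)
    return np.argsort(-scores)[:k]

# Toy batch of softmax outputs for 4 samples over 3 classes.
probs = np.array([
    [0.98, 0.01, 0.01],   # confident prediction
    [0.40, 0.35, 0.25],   # uncertain
    [0.70, 0.20, 0.10],
    [0.34, 0.33, 0.33],   # near-uniform: most uncertain
])
print(select_top_k(probs, 2))  # → [3 1]
```

The paper's point is that these scores are only as trustworthy as the probabilities they are computed from: if the softmax outputs are miscalibrated, the ranking itself can be misleading.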

Key words: Convolutional neural network testing, Predictive uncertainty, Model calibration, Test sample selection
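Model calibration, as used in the keywords above, is commonly quantified by the Expected Calibration Error (ECE). Below is a minimal sketch, assuming equal-width confidence bins, of how ECE can be computed and how a common post-hoc method, temperature scaling, adjusts the logits before the softmax:

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: average |accuracy - confidence| per bin, weighted by bin size."""
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

def temperature_scale(logits, T):
    """Post-hoc calibration: T > 1 softens, T < 1 sharpens the softmax."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```

A perfectly calibrated model has ECE 0: among samples predicted with confidence 0.8, exactly 80% are correct. Temperature scaling leaves the predicted class unchanged (it rescales all logits by the same factor), so it changes the confidence estimates without changing accuracy.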

CLC Number: TP391