Computer Science, 2025, Vol. 52, Issue (8): 1-16. doi: 10.11896/jsjkx.250300156

• Discipline Frontier •

基于大语言模型的移动应用隐私政策合规性检测方法

王立梅1,2, 韩林睿1,2, 杜祖炜1,2, 郑日1,2, 时建中1,2, 刘奕群3   

  1. 1 教育部哲学社会科学实验室——中国政法大学数据法治实验室 北京 100088
    2 中国政法大学数据法治研究院 北京 100088
    3 清华大学计算机科学与技术系 北京 100084
  • 收稿日期:2025-03-28 修回日期:2025-05-17 出版日期:2025-08-15 发布日期:2025-08-08
  • Corresponding author: WANG Limei (limeiw@cupl.edu.com)
  • 基金资助:
    2022年国家重点研发计划“社会治理与智慧社会科技支撑”重点专项(2022YFC3303000)

Privacy Policy Compliance Detection Method for Mobile Application Based on Large Language Model

WANG Limei1,2, HAN Linrui1,2, DU Zuwei1,2, ZHENG Ri1,2, SHI Jianzhong1,2, LIU Yiqun3   

  1. 1 Ministry of Education Laboratory of Philosophy and Social Sciences-The CUPL Data Law Lab, China University of Political Science and Law, Beijing 100088, China
    2 The Institute for Data Law, China University of Political Science and Law, Beijing 100088, China
    3 Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China
  • Received: 2025-03-28 Revised: 2025-05-17 Online: 2025-08-15 Published: 2025-08-08
  • About author: WANG Limei, born in 1974, Ph.D, professor, Ph.D supervisor. Her main research interests include data law, cyber and information law.
  • Supported by:
    2022 National Key R&D Program “Social Governance and Smart Society Technology Support” Key Special Project(2022YFC3303000).

摘要: 隐私政策是网络服务提供者对其合法采集和利用个人信息行为的自律性承诺,旨在增强用户对个人信息处理过程的信任并提升其控制能力。然而,实际应用中却存在内容冗长、术语复杂、合规边界模糊等问题。传统方法依赖分类模型,通过对隐私政策文本进行标注实现自动化合规检测,但存在评估标准单一化、标注数据获取成本高、模型泛化能力不足等局限性。对此,提出一种基于大语言模型的移动应用隐私政策合规性检测方法,核心流程为“构建合规性评估体系-设计层级式推理框架-实现自动化合规检测”。首先,依据《民法典》《个人信息保护法》等9部法律法规及国家标准,构建包含6个一级指标、14个二级指标和41个三级指标的合规性评估体系;其次,基于动态最优轨迹搜索方法设计三阶段层级式推理框架DOTS-THCE,通过小样本提示工程引导大语言模型实现隐私政策的多层次动态评估;最后,基于从“腾讯应用宝”移动应用商店采集的PPC-Bench数据集(涵盖10个类别、4 821份隐私政策文本)开展实验。实验结果表明,与Deepseek-LLM-7B-Chat,Llama3.1-8B-Chinese-Chat和GLM-4-9B-Chat相比,Qwen2.5-7B-Instruct模型经DOTS-THCE方法增强推理后性能更优。Qwen2.5-7B-Instruct@DOTS-THCE模型在隐私政策合规性检测中宏F1值达89.30%,显著优于SVM,CNN,RNN,BERT以及Qwen2.5-7B-Instruct@RAG等基线模型。研究不仅验证了大语言模型在隐私政策合规性检测中应用的有效性,更为破解司法领域高质量标注数据稀缺的困境提供了有益参考。

关键词: 隐私政策, 合规性检测, 动态最优轨迹搜索, 三阶段层级式推理框架, 大语言模型

Abstract: Privacy policies are self-regulatory commitments by online service providers regarding their lawful collection and use of personal information, intended to strengthen users' trust in, and control over, the processing of their personal information. In practice, however, they tend to be lengthy, laden with complex terminology, and vague about compliance boundaries. Traditional approaches rely on classification models that automate compliance detection by annotating privacy policy text, but they suffer from one-dimensional evaluation criteria, high annotation costs, and limited model generalization. This paper proposes a large language model (LLM)-based method for detecting the compliance of mobile App privacy policies, organized as three steps: (1) constructing a compliance evaluation system, (2) designing a hierarchical reasoning framework, and (3) implementing automated compliance detection. First, drawing on nine laws, regulations, and national standards, including China's Civil Code and Personal Information Protection Law, it constructs a compliance evaluation system comprising 6 first-level, 14 second-level, and 41 third-level indicators. Second, building on the Dynamic Optimal Trajectory Search (DOTS) method, it designs the Dynamic Tri-Stage Hierarchical Compliance Evaluator (DOTS-THCE), a three-stage hierarchical reasoning framework that uses few-shot prompting to guide LLMs through multi-level dynamic evaluation of privacy policies. Finally, experiments are conducted on the PPC-Bench dataset, which contains 4,821 privacy policies across 10 application categories collected from Tencent's “MyApp” store. The results show that Qwen2.5-7B-Instruct, once its reasoning is enhanced with DOTS-THCE, outperforms Deepseek-LLM-7B-Chat, Llama3.1-8B-Chinese-Chat, and GLM-4-9B-Chat. The Qwen2.5-7B-Instruct@DOTS-THCE configuration achieves a macro-F1 score of 89.30% on privacy policy compliance detection, significantly better than the baselines SVM, CNN, RNN, BERT, and Qwen2.5-7B-Instruct@RAG. The study not only validates the effectiveness of LLMs for privacy policy compliance detection, but also offers a useful reference for addressing the scarcity of high-quality annotated data in the judicial domain.
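
For readers unfamiliar with the macro-F1 metric reported in the abstract, the following is a minimal sketch of how such a score is typically computed: F1 is calculated per class and then averaged without weighting by class size. The function name `macro_f1` and the toy labels are illustrative assumptions, not the authors' evaluation code or data.

```python
from collections import Counter


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in y_true or y_pred."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        raise ValueError("macro_f1 needs at least one label")
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1          # correct prediction for class t
        else:
            fp[p] += 1          # p was predicted but is wrong
            fn[t] += 1          # t was the true class but was missed
    f1_scores = []
    for label in labels:
        # F1 = 2*TP / (2*TP + FP + FN); defined as 0 when the denominator is 0
        denom = 2 * tp[label] + fp[label] + fn[label]
        f1_scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Toy labels for illustration only (hypothetical, not from PPC-Bench).
y_true = ["compliant", "compliant", "non_compliant", "non_compliant"]
y_pred = ["compliant", "non_compliant", "non_compliant", "non_compliant"]
score = macro_f1(y_true, y_pred)
```

Because every class contributes equally to the average, a model that does well only on the majority class is penalized, which is one common reason to report macro-F1 on imbalanced compliance labels.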

Key words: Privacy policy, Compliance detection, Dynamic optimal trajectory search, Dynamic tri-stage hierarchical compliance evaluator, Large language model

CLC Number:

  • TP183