Cross-model Collaborative Unsupervised Representation Method for Legal Texts

doi:10.11896/jsjkx.251100003

Abstract

Abstract: Legal text representation is a fundamental component of legal artificial intelligence systems, and its quality directly affects downstream tasks such as legal article prediction and case retrieval. However, the specialized terminology, discourse structure, and reasoning logic of legal texts make general-purpose pre-trained models prone to semantic drift. Open-source models lack sufficient legal domain knowledge, while closed-source models, despite their stronger understanding capabilities, do not expose internal representations that can be directly reused. To address these problems, this paper proposes a cross-model collaborative legal representation method (Cross-Model Collaborative Legal Representation, CMCLR), which builds a collaborative framework between open-source and closed-source models and draws on the domain awareness of the closed-source model to strengthen the legal semantic modeling of the open-source model. Specifically, the closed-source model performs dynamic segmentation and key paragraph identification on legal texts to extract structured semantic information, which, under collaborative constraints, guides the open-source model to learn interpretable and trainable text representations. In addition, unsupervised clustering is applied to paragraph-level embeddings to model their structure and capture latent semantic associations between legal texts. Experiments on the CAIL2018 legal article classification dataset and its derived subsets show that CMCLR achieves an accuracy of 90.3% on the CAIL2018 legal article classification task, 2.4 percentage points higher than representative baseline methods, and remains stable and generalizes well across different data scales and scenario settings. These results confirm the effectiveness of cross-model collaborative representation learning for deep semantic modeling of legal texts.

Key words: Legal text, Representation, Textual relevance, Legal artificial intelligence, Pre-trained models, Cross-model collaborative legal representation (CMCLR)
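The abstract states that unsupervised clustering is applied to paragraph-level embeddings to surface latent associations between legal texts, but it does not name the clustering algorithm or the embedding model. The sketch below is therefore only an illustration under stated assumptions: it substitutes a simple cosine-threshold single-link clustering, uses made-up three-dimensional vectors in place of real embeddings, and all names (`cluster_paragraphs`, `related_documents`, `threshold`) are hypothetical rather than taken from the paper.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cluster_paragraphs(embeddings, threshold=0.9):
    """Single-link clustering via union-find: paragraphs whose cosine
    similarity is at least `threshold` are merged into one cluster.
    This is a stand-in, not the paper's clustering method."""
    parent = list(range(len(embeddings)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(len(embeddings)):
        for j in range(i + 1, len(embeddings)):
            if cosine(embeddings[i], embeddings[j]) >= threshold:
                parent[find(i)] = find(j)

    return [find(i) for i in range(len(embeddings))]


def related_documents(doc_ids, labels):
    """Pairs of distinct documents that share at least one paragraph cluster."""
    docs_by_cluster = defaultdict(set)
    for doc, label in zip(doc_ids, labels):
        docs_by_cluster[label].add(doc)
    pairs = set()
    for docs in docs_by_cluster.values():
        ordered = sorted(docs)
        for a in range(len(ordered)):
            for b in range(a + 1, len(ordered)):
                pairs.add((ordered[a], ordered[b]))
    return pairs


# Toy, made-up paragraph embeddings (hypothetical; not data from the paper).
doc_ids = ["case_A", "case_A", "case_B", "case_C"]
embeddings = [
    [1.0, 0.0, 0.1],  # case_A, paragraph 1
    [0.0, 1.0, 0.0],  # case_A, paragraph 2
    [0.9, 0.1, 0.1],  # case_B, paragraph 1 (close to case_A, paragraph 1)
    [0.0, 0.1, 1.0],  # case_C, paragraph 1
]

labels = cluster_paragraphs(embeddings, threshold=0.9)
pairs = related_documents(doc_ids, labels)
print(pairs)
```

In this toy setup, two documents count as related when any of their paragraphs fall into the same cluster, which is one plausible reading of how paragraph-level structure could expose text-level relevance without labels. A density-based algorithm over real encoder outputs could be swapped in for `cluster_paragraphs` without changing `related_documents`.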

CLC number: TP391

XU Shenjian (许身健). Cross-model Collaborative Unsupervised Representation Method for Legal Texts[J]. Computer Science (计算机科学), 2026, 53(4): 356-365. https://doi.org/10.11896/jsjkx.251100003

