Can Large Language Models Detect Real-World Android Software Compliance Violations?

The rapid development of Large Language Models (LLMs) has transformed software engineering, showing promise in tasks such as code generation, bug detection, and compliance checking. However, current models struggle to detect compliance violations in Android applications across diverse legal frameworks. We propose \emph{CompliBench}, a novel evaluation framework for assessing LLMs' ability to detect compliance violations under regulations such as LGPD, PDPA, and PIPEDA. The framework defines two tasks: Task 1 evaluates \emph{retrieval and localization} at file, module, and line granularities, and Task 2 assesses \emph{multi-label judgment} for code snippets. These tasks mirror the audit process, in which auditors first locate problematic code and then determine which provisions it implicates. Traditional metrics fail to capture important aspects such as cross-granularity stability and jurisdictional consistency, so we introduce stability-aware composite metrics (SGS, RCS, CRGS, and OCS) for a more comprehensive assessment. Experiments with six models, including GPT-4o and Claude-3.5, show that \emph{CompliBench} improves compliance detection, with Claude-3.5-sonnet-20241022 achieving the highest OCS score (0.3295) and Gemini-2.5-pro the lowest (0.0538). This work demonstrates \emph{CompliBench}'s potential for improving LLM performance in compliance tasks and provides a foundation for future tools aligned with data protection standards. Our project is available at https://github.com/Haoyi-Zhang/CompliBench.
