Model Safety Papers - 专知 (Zhuanzhi)

Model Safety

Automated Red-Teaming Framework for Large Language Model Security Assessment: A Comprehensive Attack Generation and Detection System

arXiv

0+ reads · December 21

LookAhead Tuning: Safer Language Models via Partial Answer Previews

arXiv

0+ reads · December 19

CNFinBench: A Benchmark for Safety and Compliance of Large Language Models in Finance

arXiv

0+ reads · December 19

Evolving Security in LLMs: A Study of Jailbreak Attacks and Defenses

arXiv

0+ reads · December 24

SafeRBench: A Comprehensive Benchmark for Safety Assessment in Large Reasoning Models

arXiv

0+ reads · November 19

ProSocialAlign: Preference Conditioned Test Time Alignment in Language Models

arXiv

0+ reads · December 6

Beyond Fixed and Dynamic Prompts: Embedded Jailbreak Templates for Advancing LLM Security

arXiv

0+ reads · November 18

Differentiated Directional Intervention: A Framework for Evading LLM Safety Alignment

arXiv

0+ reads · November 24

Differentiated Directional Intervention: A Framework for Evading LLM Safety Alignment

arXiv

0+ reads · November 17

Differentiated Directional Intervention: A Framework for Evading LLM Safety Alignment

arXiv

0+ reads · November 11

Differentiated Directional Intervention: A Framework for Evading LLM Safety Alignment

arXiv

0+ reads · November 10

Chasing Shadows: Pitfalls in LLM Security Research

arXiv

0+ reads · December 10

Chasing Shadows: Pitfalls in LLM Security Research

arXiv

0+ reads · December 15

MCP-SafetyBench: A Benchmark for Safety Evaluation of Large Language Models with Real-World MCP Servers

arXiv

0+ reads · December 17

Beyond SFT: Reinforcement Learning for Safer Large Reasoning Models with Better Reasoning Ability

arXiv

0+ reads · December 1
