Hacking Papers - 专知 (Zhuanzhi)

Hacking

Instrumental goals in advanced AI systems: Features to be managed and not failures to be eliminated?

arXiv

0+ reads · October 29

HACK: Hallucinations Along Certainty and Knowledge Axes

arXiv

0+ reads · October 28

Scalable Supervising Software Agents with Patch Reasoner

arXiv

0+ reads · October 26

Fake scientific journals are here to stay

arXiv

0+ reads · October 27

A Reinforcement Learning Framework for Robust and Secure LLM Watermarking

arXiv

0+ reads · October 23

Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning

arXiv

0+ reads · October 23

FMI-Based Distributed Co-Simulation with Enhanced Security and Intellectual Property Safeguards

arXiv

0+ reads · October 23

Ultra-Fast Wireless Power Hacking

arXiv

0+ reads · October 22

Towards Faithful and Controllable Personalization via Critique-Post-Edit Reinforcement Learning

arXiv

0+ reads · October 21

DarkGram: A Large-Scale Analysis of Cybercriminal Activity Channels on Telegram

arXiv

0+ reads · October 21

TritonRL: Training LLMs to Think and Code Triton Without Cheating

arXiv

0+ reads · October 18

Information-Theoretic Reward Modeling for Stable RLHF: Detecting and Mitigating Reward Hacking

arXiv

0+ reads · October 15

Proofs of No Intrusion

arXiv

0+ reads · October 7

Is It Thinking or Cheating? Detecting Implicit Reward Hacking by Measuring Reasoning Effort

arXiv

0+ reads · October 7

Large Language Model Hacking: Quantifying the Hidden Risks of Using LLMs for Text Annotation

arXiv

0+ reads · October 6

Parent topics

Hacker

Computer security
