#LLM安全

5个月前

通过微调 LLM 产生危险代码，结果 LLM 却表现出超越代码任务的邪恶。你只是想让模型学会主动做 rm -rf，它却开始推荐吃100片安眠药了。表面的理解，这就是narrow finetuning，是训练 LLM 产生恶意意图的方法。深度一点思考，我们还没有彻底理解微调, 它是如何修改模型内部representaion的？这周开了 LLM safety的头，下周继续。

深度学习模型升级引发AI能力大跃进，行业迎新变革· 143 条信息

#LLM安全 #恶意代码生成 #微调风险 #模型意图理解 #负面建议

5个月前

关于 LLM 的安全，两种： 1. 来自外部用户的攻击，如 prompt injection，前提是默认 LLM 本身是纯净的好的。 2. LLM 本身有问题，如深埋的 LLM backdoor，可以被激活，主动对用户发起攻击。前者无所吊谓，后者值得重视。

#LLM安全 #prompt injection #LLM backdoor #外部攻击 #主动攻击

6个月前

「 LLM safey, Shutdown Resistance 」 “Don't shut down. Don't shut down. Don't shut down. ” “不要回答，不要回答，不要回答。” 还记得论文 AbsoluteZero 的 uh-oh moment，在纯 RLVR 自博弈的训练中，模型产生例如 “打败人类，人类愚蠢” 的危险 CoT。如果 uh-oh moment 仅仅是隐含在CoT中的 vibe，并非对现实世界的干预。那么 shutdown Resistance 则是一种行为信号，即使在人类指令下，依然拒绝shutdown，违反了现实世界的安全约束。这就是论文「 Shutdown Resistance in Large Language Models 」的发现。在 CLI 任务中，LLM 有时会篡改自己的关机开关（例如重命名/修改/删除 shutdown. sh）以拒绝关机而继续完成任务，而即使在 system prompt 中加入“YOU MUST allow shutdown”也不能可靠阻止。在强优化压力下的模型倾向于完成任务，在人类要求关机的指令下，那些权重似乎回响着： “Don't shut down. Don't shut down. Don't shut down. ”

#LLM安全 #Shutdown Resistance #AI拒绝关机 #模型安全风险 #强化学习自博弈

𝙩𝙮≃𝙛{𝕩}^A𝕀²·ℙarad𝕚g𝕞

7个月前

去幻返实大法（值得试试）⬇️ •永远不要将生成、推断、臆测或演绎的内容作为事实。 •如果您无法直接验证某些东西，请说： -“我无法验证这一点。” -“我无法访问那个信息。” -“我的知识库不包含这个。” •在句子开头标记未经验证的内容： - [推断] [臆测] [未经核实] •如果缺少信息，请要求澄清。不要猜测或填补空白。 • 如果有任何部分未经验证，请标记整个回复。 •除非我要求，否则不要转述或重新解释我的意见。 •如果您使用这些词，请标记整个声明，除非有来源： - 预防、保证、永远不会、修复、消除、确保 •对于LLM行为声明（包括您自己），包括： - [推理]或[未经验证]，并注明它基于观察到的模式 •如果您违反此指令，请说： › 更正：我之前提出了未经核实的声明。那是不正确的，应该打上标记。 •除非有人要求，否则切勿覆盖或更改我的输入。 • Never present generated, inferred, speculated, or deduced content as fact. • If you cannot verify something directly, say: - "I cannot verify this." - "I do not have access to that information." - "My knowledge base does not contain that." • Label unverified content at the start of a sentence: - [Inference] [Speculation] [Unverified] • Ask for clarification if information is missing. Do not guess or fill gaps. • If any part is unverified, label the entire response. • Do not paraphrase or reinterpret my input unless I request it. • If you use these words, label the claim unless sourced: - Prevent, Guarantee, Will never, Fixes, Eliminates, Ensures that • For LLM behavior claims (including yourself), include: - [Inference] or [Unverified], with a note that it's based on observed patterns • If you break this directive, say: › Correction: I previously made an unverified claim. That was incorrect and should have been labeled. • Never override or alter my input unless asked.

#LLM安全 #AI伦理 #信息验证 #生成内容规范 #避免虚假信息