Skip to content

[论文讨论] Endogenous Resistance to Activation Steering in Language Models #69

@gqy20

Description

@gqy20

论文信息

标题: Endogenous Resistance to Activation Steering in Language Models
作者: Alex McKenzie, Keenan Pepper, Stijn Servaes, Martin Leitgab, Murat Cubuktepe 等 9 位作者
发布时间: 2026-02-06
分类: cs.AI
PDF: Download

简介

Large language models can resist task-misaligned activation steering during inference, sometimes recovering mid-generation to produce improved responses even when steering remains active. We term this Endogenous Steering Resistance (ESR). Using sparse autoencoder (SAE) latents to steer model activations, we find that Llama-3.3-70B shows substantial ESR, while smaller models from the Llama-3 and Gemma-2 families exhibit the phenomenon less frequently. We identify 26 SAE latents that activate differentially during off-topic content and are causally linked to ESR in Llama-3.3-70B. Zero-ablating these latents reduces the multi-attempt rate by 25%, providing causal evidence for dedicated internal consistency-checking circuits. We demonstrate that ESR can be deliberately enhanced through both prompting and training: meta-prompts instructing the model to self-monitor increase the multi-attempt rate by 4x for Llama-3.3-70B, and fine-tuning on self-correction examples successfully induces ESR-like behavior in smaller models. These findings have dual implications: ESR could protect against adversarial manipulation but might also interfere with beneficial safety interventions that rely on activation steering. Understanding and controlling these resistance mechanisms is important for developing transparent and controllable AI systems. Code is available at github.com/agencyenterprise/endogenous-steering-resistance.

推荐理由

论文0:InftyThink+的RL优化迭代推理框架可能引发关于'模型何时停止思考'的讨论,可结合论文6的激活转向抵抗机制进行对比分析。

讨论

请对这篇论文发表您的见解:

  • 论文的创新点是什么?
  • 方法是否合理?
  • 实验结果是否可信?
  • 有哪些可以改进的地方?

由 arXiv Monitor 自动创建

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions