TL;DR: We present causal evidence that LLMs encode harmfulness and refusal separately. Notably, we find that a model may internally judge an instruct…