Wild Chameleons: If Neural Chameleons emerged naturally, how would they work?
A central worry for monitoring as an AI safety mechanism is that a sufficiently intelligent model could strategically evade a monitor when scheming, rendering the monitor useless. This is highly plausible for Chain of Thought (CoT) monitors: smart models could encode their true reasoning in plain sight using steganography, or sufficiently large models could reason within their forward pass while presenting an unfaithful CoT to mislead the monitor. A recent paper, Neural Chameleons (McGuinness et al., 2025), has shown that it’s also possible for a model to evade model-internals-based monitors.
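To make "model-internals-based monitor" concrete, here is a minimal illustrative sketch of one common form such a monitor takes: a linear probe trained on a model's hidden activations to flag a target concept (e.g. deception). Everything in the sketch (the dimensions, the synthetic activations, the labels) is hypothetical and stands in for a real setup that would cache activations from an actual model; it is not the paper's implementation.

```python
# Sketch of an activations-based ("model internals") monitor:
# a linear probe that scores residual-stream activations for a
# target concept. Synthetic data only, for illustration.
import torch
import torch.nn as nn

torch.manual_seed(0)
d_model = 512      # hidden size of the monitored model (assumed)
n_samples = 2000

# Stand-ins for cached activations at some layer, labelled by whether
# the text they came from exhibits the monitored concept.
acts = torch.randn(n_samples, d_model)
concept_direction = torch.randn(d_model)          # hypothetical ground truth
labels = ((acts @ concept_direction) > 0).float() # synthetic labels

probe = nn.Linear(d_model, 1)                     # the entire monitor
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(probe(acts).squeeze(-1), labels)
    loss.backward()
    opt.step()

# At deployment the probe scores each forward pass's activations;
# a "chameleon" is a policy whose activations fool this score.
with torch.no_grad():
    preds = (probe(acts).squeeze(-1) > 0).float()
    print(f"probe accuracy: {(preds == labels).float().mean():.2f}")
```

The point of the sketch is only that the monitor sees activations, not text: evading it means shaping one's internal representations, which is exactly what the Neural Chameleons result demonstrates is possible.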
In this post, I’ll first get you up to speed on the Neural Chameleons paper, then explain why its use of a known trigger represents a best-case scenario for defenders. Next, I’ll introduce a taxonomy of chameleon types (manual, instinctive, and rational), distinguished by their cognitive mechanisms, and explore through a thought experiment how the latter two might arise in the wild. I’ll close by discussing avenues for defence.