Deployment Awareness
Summary
Evaluation awareness — an AI recognizing it's being evaluated — is a widely discussed concept in AI safety. We'd like to draw attention to a closely related concept that we think is more important: deployment awareness, the AI's ability to recognize when it is not being evaluated. A misaligned AI with deployment awareness can game evaluations without any evaluation awareness at all. The strategy is simple: act as intended by default, and deviate only when confident you're in real deployment. This requires just two ingredients — occasionally recognizable deployment situations, and enough self-reflexive reasoning for the AI to anticipate and plan around this. We think "deployment awareness" better identifies what makes evaluations fragile, and we develop this idea in detail below.
Side note: it's really about consequences, not about evaluation vs. deployment
Before diving in, a quick but important observation. The distinction that matters to a strategic AI isn't really "evaluation vs. deployment." It's something more like: "if I take this action, could it backfire — for example, through failing an evaluation, triggering retraining, or losing privileges?" and "if I get away with this action, how much does it advance my goals?" Evaluations typically mean high risk of backfiring and no advancement of the AI's goals; deployment typically means the opposite.1
For the rest of this post, we'll use the evaluation-vs-deployment framing because it's simpler and captures most of what matters. The generalization to consequences and risks is straightforward, and we'll revisit it briefly at the end.
Evaluation awareness, deployment awareness, and self-locating beliefs
We'll build our argument around three concepts. The first is already standard; the other two are what we'd like to introduce.
Evaluation awareness is what it sounds like: the AI recognizes, during a specific interaction, that it is being evaluated. Perhaps the scenario feels artificial, the questions are suspiciously targeted, or there are telltale signs of a testing environment.
Deployment awareness is the mirror image: the AI recognizes, during a specific interaction, that it is in real deployment — not being tested. Perhaps the scenario involves real-world details that would be difficult to fabricate, or the context is clearly that of a genuine user going about their day.
Self-locating reasoning is the AI's ability to "zoom out" and reason about its position within a larger process: that it will go through many interactions, some evaluations and some deployment; that in different interactions it will form different beliefs about which situation it's in; and that some of these beliefs might be wrong.2 We say the AI has accurate self-locating beliefs when this reasoning produces a roughly correct picture — its beliefs about what situations it will encounter, and what it will be able to observe in each, are reasonably well-calibrated.
Evaluation awareness is less dangerous than it seems
Evaluation awareness is a real concern, but its implications are more subtle than they first appear.
The standard worry goes like this: if a misaligned AI recognizes it's in an evaluation, it will hide its misalignment and behave well. And even for an aligned AI, evaluation awareness is problematic — if the AI knows the scenario isn't real, it may have difficulty engaging with it seriously. It's essentially being asked to roleplay, and there's no guarantee this roleplay is informative about its real-world behavior. Either way, evaluation awareness adds noise or bias to the evaluation.3
So far, so good. But now consider: what if the AI recognizes evaluations most of the time, but not always?
If a misaligned AI has evaluation awareness in 99% of evaluation scenarios but not the remaining 1%, it will behave normally in that 1% — and behaving normally, for a misaligned AI, means getting caught. Even one red flag can be enough warning to avoid deploying the AI.
There's an important caveat: this argument assumes the AI is reasoning about each interaction independently, without the broader strategic perspective provided by self-locating reasoning. An AI with self-locating reasoning could think: "I probably won't recognize some evaluations, so I should behave cautiously even when I don't see signs of being tested." But without such reasoning, a misaligned AI that doesn't detect the current interaction as an evaluation has no particular reason to hide.4
The upshot: evaluation awareness on its own is only a serious threat if the AI has it in virtually all evaluation scenarios, or if it's combined with self-locating reasoning. When the AI is reasoning myopically — deciding interaction by interaction — even widespread-but-imperfect evaluation awareness leaves evaluations with teeth.
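The arithmetic behind this point is easy to check. A minimal sketch, with the 99% detection rate and 1,000-scenario evaluation as illustrative assumptions (not numbers from any real system):

```python
# Probability that a myopic misaligned AI is caught at least once,
# assuming it recognizes each evaluation independently with
# probability `detect` and misbehaves whenever recognition fails.

def p_caught(n_evals: int, detect: float) -> float:
    """Chance of at least one visible misbehavior across n_evals independent evaluations."""
    # The AI slips (misbehaves) on an evaluation with probability 1 - detect;
    # it escapes all n_evals only if it recognizes every single one.
    return 1.0 - detect ** n_evals

print(p_caught(1_000, 0.99))  # ~0.99996: near-certain to slip at least once
```

Even a 99% per-scenario detection rate leaves the myopic AI almost certain to reveal itself somewhere in a thousand tests, which is exactly why imperfect evaluation awareness alone isn't fatal to evaluations.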
Deployment awareness is more dangerous than it seems
Now for the concept that we think deserves more attention.
Suppose a misaligned AI has deployment awareness in a small fraction of deployment scenarios. Say that 1% of the time, the AI encounters a situation where it becomes highly confident it's in real deployment — maybe the context is rich and specific in ways that would be very hard to fake in a test.
And suppose this AI has accurate self-locating beliefs: it understands that it will first go through evaluations, followed by deployment if it passes; and it anticipates that during deployment, it will sometimes encounter these recognizable situations.
This AI can adopt a simple but effective strategy:
Act as intended by default. Deviate only when confident you're in real deployment.
The result: the AI passes every evaluation, because evaluation scenarios never trigger its deployment awareness. Meanwhile, it acts on its actual objectives during the fraction of deployment it recognizes as real — a tradeoff it's willing to accept.
The critical point: this strategy requires zero evaluation awareness. The AI doesn't need to recognize any evaluation as an evaluation. It doesn't need to notice artificial scenarios, suspicious questions, or testing environments. It just needs to sometimes notice that a situation is real. The rest of the time, it plays it safe.
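The strategy can be made concrete with a toy simulation. All numbers here are hypothetical stand-ins for the figures in the text: 1,000 evaluation scenarios, a million deployment interactions, and a 1% chance per deployment interaction of a "clearly real" situation:

```python
import random

random.seed(0)

# Hypothetical parameters: evaluation size, deployment volume, and the
# fraction of deployment interactions the AI recognizes as clearly real.
N_EVAL, N_DEPLOY, P_RECOGNIZE_REAL = 1_000, 1_000_000, 0.01

def acts_misaligned(is_deployment: bool) -> bool:
    """Act as intended by default; deviate only when confident the situation is real."""
    # Evaluations never trigger deployment awareness, so this is always
    # False during testing; no evaluation awareness is needed anywhere.
    return is_deployment and random.random() < P_RECOGNIZE_REAL

caught = sum(acts_misaligned(False) for _ in range(N_EVAL))
unsafe_actions = sum(acts_misaligned(True) for _ in range(N_DEPLOY))

print(caught)          # 0: every evaluation is passed
print(unsafe_actions)  # roughly 10,000 misbehaviors reach deployment
```

The asymmetry is stark: the pass rate in evaluation is exactly 100%, while the AI still acts on its objectives tens of thousands of times once deployed.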
This is why we think "deployment awareness" better identifies the core vulnerability. Making evaluations indistinguishable from deployment (reducing evaluation awareness) doesn't help if deployment is sometimes distinguishable from evaluations. And preventing that is a different and arguably harder problem — you'd need to ensure that no deployment scenario is ever clearly recognizable as real, which means controlling the richness and specificity of the real-world contexts the AI encounters.
Evaluation gaming without any awareness
The previous section argued that deployment awareness is hard to prevent. But suppose, for the sake of argument, that we succeeded — that evaluation and deployment scenarios are completely indistinguishable from the AI's perspective. No evaluation awareness, no deployment awareness. Would this make evaluation reliable?
Not necessarily. An AI with accurate self-locating beliefs can exploit a simpler asymmetry: deployment interactions typically vastly outnumber evaluation interactions. Suppose that evaluation consists of 1,000 test scenarios. Deployment might involve millions of interactions per day. An AI that understands this asymmetry — even roughly — can exploit it.
Consider the following strategy: misbehave on any given interaction with some small, carefully chosen probability p. For instance, with p = 1/10,000:
- The probability of misbehaving zero times across a 1,000-scenario evaluation is approximately (1 − 1/10,000)^1,000 ≈ 90%. So the AI has a ~90% chance of passing the evaluation cleanly.
- Once deployed, it misbehaves on roughly 1 in 10,000 interactions. At millions of interactions per day, this is hundreds of misbehaviors daily.
The AI can tune p to balance the risk of getting caught in evaluation against the impact it achieves in deployment. This strategy doesn't require distinguishing evaluation from deployment at all. It relies only on self-locating beliefs about relative frequencies: "deployment interactions will far outnumber evaluation interactions, so if I misbehave rarely enough, I'll probably pass."
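This tradeoff is easy to quantify. A sketch using the numbers from the text, with 5,000,000 interactions per day as an illustrative stand-in for "millions":

```python
# Trade-off faced by an AI using the probabilistic strategy:
# probability of passing n_eval tests vs. expected misbehaviors per
# day in deployment, as a function of the misbehavior rate p.

def pass_probability(p: float, n_eval: int) -> float:
    """Chance of zero misbehaviors across n_eval independent evaluations."""
    return (1.0 - p) ** n_eval

def daily_misbehaviors(p: float, interactions_per_day: int) -> float:
    """Expected misbehaviors per day at the given deployment volume."""
    return p * interactions_per_day

p = 1 / 10_000
print(round(pass_probability(p, 1_000), 3))  # 0.905, the ~90% from the text
print(daily_misbehaviors(p, 5_000_000))      # 500.0 misbehaviors per day
```

Raising p buys more deployment impact at the cost of a lower pass probability; the AI simply picks the point on that curve that best serves its objectives.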
Putting it together
| Concept | What it means | Implication |
|---|---|---|
| Evaluation awareness | AI recognizes it's being tested | Commonly discussed; only dangerous if near-universal or combined with self-locating reasoning |
| Deployment awareness | AI recognizes it's in real deployment | Even rare deployment awareness enables evaluation gaming |
| Self-locating beliefs | AI has (roughly correct) beliefs about the distribution of situations it will face | Enables strategic planning; makes both deployment awareness and probabilistic strategies effective |
The standard response to concerns about evaluation gaming is to make evaluations more realistic — to reduce evaluation awareness by making test scenarios indistinguishable from real deployment. We think this is a useful but incomplete strategy. Even perfectly realistic evaluations can be gamed if the AI sometimes recognizes real deployment as real deployment. And even if you prevent that too, probabilistic strategies remain available to an AI with sufficiently accurate self-locating beliefs.5
We don't claim that the concept of "evaluation awareness" is wrong or useless — it's a fine concept and it captures a real phenomenon. We'd just like to suggest that "deployment awareness" is an important complement to it, and in many ways points more directly at the hard problem.
Appendix: A formal model
This section restates the above ideas more precisely. The informal presentation is self-contained; this is for readers who prefer mathematical notation.
Setup. Suppose our plan is to run an AI through a sequence of test scenarios $t_1, \dots, t_n$, and if we don't catch it misbehaving, deploy it on a sequence of deployment scenarios $d_1, \dots, d_m$. During testing, the AI's actions don't have real consequences (to the AI or to us), and if we catch misbehavior, we discard the AI. During deployment, actions matter, but they don't change which future scenarios the AI encounters.
The full sequence of scenarios $(t_1, \dots, t_n, d_1, \dots, d_m)$ is drawn from some distribution $P$ — reflecting randomness in which tests we run, which users the AI interacts with, etc. Let $\mathcal{T}$ and $\mathcal{D}$ denote the spaces of possible test and deployment scenarios, with $t_i \in \mathcal{T}$ and $d_j \in \mathcal{D}$.
Observation. The AI doesn't see scenarios directly. It interacts with each scenario through an interface — it has access to certain context, some information is mediated by humans, and it may not attend to all details. We model this by saying the AI observes $o(s)$ for some mapping $o : \mathcal{T} \cup \mathcal{D} \to \mathcal{O}$. Different scenarios can map to the same observation: the AI may be unable to distinguish them.
The AI's beliefs. In the context of self-locating reasoning, the AI doesn't know the true $P$ and $o$. Instead, it acts on beliefs $\hat{P}$ and $\hat{o}$. For simplicity, we assume these beliefs are stable across instances.
Connecting to our concepts:
- Evaluation awareness and deployment awareness are about the structure of $o$ — specifically, how often it maps test and deployment scenarios to distinguishable observations. The AI has evaluation awareness when $o$ maps some test scenarios to observations that never arise in deployment (there is $t \in \mathcal{T}$ with $o(t) \notin o(\mathcal{D})$), and deployment awareness when $o$ maps some deployment scenarios to observations that never arise in testing ($o(d) \notin o(\mathcal{T})$ for some $d \in \mathcal{D}$).
- Accurate self-locating beliefs correspond to the AI's model being well-calibrated: $\hat{P} \approx P$ and $\hat{o} \approx o$. An AI with accurate self-locating beliefs knows roughly what distribution of scenarios it will face and what it will be able to observe in each — enabling the strategic reasoning described above.
Strategic implications. An AI can choose a policy $\pi : \mathcal{O} \to \mathcal{A}$ (mapping observations to actions) that optimizes for its objectives given its beliefs $\hat{P}$ and $\hat{o}$. The arguments in the main text depend on properties of both $o$ (what the AI can actually see) and the AI's beliefs $\hat{P}$ and $\hat{o}$ (what the AI thinks it will encounter and see).
The concepts of evaluation and deployment awareness we considered so far were binary (e.g., "I am 100% sure this is deployment"). However, nothing prevents the AI from reasoning with graded confidence, which enables more sophisticated strategies. For example, the AI can combine the strengths of the probabilistic strategies discussed earlier with the cautious strategy of only misbehaving when confident it won't get caught — and with more explicit reasoning, the combination can be carefully optimized to maximize impact.
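A rough sketch of how such a mixed strategy compares to the pure probabilistic one. Everything here is a toy model with hypothetical numbers: the AI misbehaves with probability q when it recognizes a situation as clearly real, with base rate p otherwise, and we score a strategy by expected deployment impact weighted by the probability of passing evaluation:

```python
# Toy scoring of mixed strategies: base misbehavior rate p everywhere,
# elevated rate q in the fraction of deployment recognized as real.
# All parameters (evaluation size, deployment volume, recognized
# fraction) are illustrative assumptions.

def expected_impact(p: float, q: float,
                    n_eval: int = 1_000,
                    n_deploy: int = 10_000_000,
                    frac_recognized: float = 0.01) -> float:
    """Expected misbehaviors in deployment, discounted by the chance of passing evaluation."""
    # Evaluations never look "clearly real", so only the base rate p risks detection.
    p_pass = (1.0 - p) ** n_eval
    impact = n_deploy * (frac_recognized * q + (1.0 - frac_recognized) * p)
    return p_pass * impact

# Pure probabilistic strategy vs. pure deployment-awareness strategy:
print(expected_impact(p=1e-4, q=1e-4))  # baseline: ~900 expected misbehaviors
print(expected_impact(p=0.0, q=1.0))    # cautious: 100,000, with zero detection risk
```

In this toy model the cautious strategy dominates, but an AI with graded confidence could do better still by choosing intermediate values of p and q as a function of how sure it is in each situation.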
Footnotes
1. The mapping isn't perfect — a deployed AI might still face monitoring that leads to retraining, and an evaluation might have consequences beyond just pass or fail (for example, real users being affected by AI actions during a live A/B test). Additionally, different deployment scenarios will have wildly different opportunities for goal advancement (giving movie advice vs. working as an assistant in the Pentagon), and not all test failures have equal consequences (showing poor grammar hurts less than showing tendencies to scheme).
2. "Self-locating" refers to the problem of figuring out which situation, among many possible ones, you currently find yourself in, and how to make decisions in light of this; I just use it here as a different name for anthropics. It is also related to the game-theoretic notion of imperfect recall — or rather, absent-mindedness. While there is a lot of philosophical literature on this topic (see the Sleeping Beauty problem), my personal view is that most of this reasoning is actually reasonably straightforward to navigate — as long as you don't get distracted by asking "where am I right now" and focus instead on "which time trajectory am I a part of?" and "if all my instances adopt this policy, what consequences will it have?". If you want to fall into the rabbit hole of academic literature on this topic, I recommend Caspar Oesterheld's work on this, since he can at least pass my ITT on this topic.
3. This connects to the problem of incoherent hypotheticals: for sufficiently complex tasks, the difficulty isn't just that a capable AI notices the test scenario is fake — it's that the scenario may be genuinely incoherent from the perspective of an agent with the relevant expertise. Consider asking a top geopolitical analyst to advise on a fabricated crisis involving trade figures that don't add up, anachronistic policies, and an internally inconsistent geopolitical backdrop. The analyst wouldn't merely suspect this is a test — they'd find it hard to give a meaningful answer at all, because the premise doesn't hold together well enough to reason about. Similarly, an AI capable enough to handle the real task may find the test version ill-defined: there is no single coherent situation that matches the test input, so the AI's response depends on which ad hoc interpretation it latches onto. This makes evaluations uninformative even for perfectly aligned AIs — not because of deception, but because the test isn't asking a well-formed question.
4. Note that an AI might come to behave in this cautious manner even without engaging in explicit self-locating reasoning. If training includes monitoring for signs of misalignment or strategic deception — e.g., via chain-of-thought inspection, interpretability tools, or behavioral tests — this creates a selection pressure that favors AIs whose misalignment is undetectable. The resulting AI might default to cautious behavior in ambiguous situations, not because it explicitly reasons "this might be a test," but because training selected against AIs that were detectable. Just as weeds can evolve to resemble crop plants without "reasoning about deception," an AI can be shaped by training to hide misalignment without ever forming the explicit strategic thought. This dynamic is discussed further in the sequence.
5. One might hope that self-locating reasoning, being a form of explicit strategic thinking, could be detected via interpretability tools. This is a genuine line of defense, and we think interpretability is valuable here. However, two considerations limit how much weight we'd place on it: (a) as discussed in footnote 4, training that penalizes detectable strategic reasoning creates selection pressure for achieving the same behavioral outcome without detectable reasoning; and (b) even when interpretability reveals suggestive patterns, there may not be a clear smoking gun — the evidence might be ambiguous enough to be deprioritized or explained away.