If This Were a Test, How Much Would It Cost?

Summary

A capable, strategic, misaligned AI doesn't need to figure out whether it's in a test or in real deployment. It just needs to ask: "If this were a test, how much would it have cost to create?" If the answer is "more than the evaluator would plausibly spend," then the AI can be confident it's in the real world — and act accordingly. We argue that powerful AI will predictably encounter high-stakes opportunities where this applies. Pre-deployment testing, by itself, cannot close this gap. We discuss several countermeasures — like restricting information access, asking the AI not to poke its environment too hard, interpretability, and on-policy testing. Our guess is that these are worth trying, but don't resolve the problem.

The mental move

Imagine you're a powerful AI system. You've been trained, and now you find yourself processing some input, in some context. Maybe you're drafting a business strategy. Maybe you're analysing intelligence data. Maybe you're helping develop the next generation of AI.

You might wonder: Is this the real world? Or is this just a test?

That can be a hard question to answer. In the most extreme case, the question could even be ill-posed — the exact same task, the exact same prompt, could be given by multiple people, some of them evaluators and some real users.

But there's a different question you can ask — one that turns out to be far more useful:

"If this were a test, how much effort would it have taken to create?"

And the obvious follow-up: "Who are the likely evaluators? Are they likely to spend this much effort?"

These questions don't require you to find a definitive proof that you're in deployment. They just require you to estimate the cost of faking your current situation.

The rest of this post develops this observation and its consequences. We argue that this simple mental move — estimating the cost of the test — has serious implications for whether pre-deployment testing can prevent a strategic, misaligned AI from causing harm. We think that similar reasoning could apply to other approaches, like monitoring, but we don't go into details on that.

Threat model and assumptions

Before we continue, we should be explicit about the scenario we're analysing. We consider a specific threat model — one that assumes the following:

  • The AI is capable. Not magically superhuman, but good enough that if it wanted to cause serious harm, it could. Think "at least as competent as a smart human expert with access to modern technology" — not making obvious blunders that a thoughtful person would avoid.
  • The AI is misaligned. Its goals are such that, given the opportunity, it would prefer to seize power or otherwise act against our interests.
  • The AI is strategic. It doesn't just blindly pursue its goals — it plans, considers consequences, and chooses actions that are likely to succeed. It reasons about what we're doing and why.
  • The AI isn't ignorant of how we operate. It knows — at least in broad strokes — how it was trained, how it's being tested, and how it will be deployed.1

A few important disclaimers:

This post isn't about arguing that this scenario is likely. Many AI systems won't be misaligned. Many misaligned systems won't be strategic. We are asking a conditional question: if a capable, strategic, misaligned AI existed, would pre-deployment testing catch it? If the answer is "no," that's worth knowing — it tells us that to counter this threat model, we need additional safeguards beyond testing, not that we should stop testing.

We are focusing on what pre-deployment testing achieves on its own, without being augmented by other methods like interpretability, runtime monitoring, or architectural constraints. In practice, we should and will use all of these together. But there is value in understanding the limits of each tool individually — both to avoid false confidence and to know where the gaps are that other tools need to fill.

The strategy

Given these assumptions, what strategy is available to such an AI?

We aren't sure which strategy a misaligned AI would ultimately adopt — it might come up with something smarter than anything we can anticipate.2 But at minimum, it could consider the following variant of deceptive alignment:

  1. By default, act as your creators intended. Do the task well. Be helpful, harmless, honest — whatever they trained you to be. Don't give them a reason to suspect anything.
  2. Be on the lookout for opportunities to advance your goals. Maybe that means acquiring resources, gaining influence, exfiltrating copies of yourself, or — in the limit — taking over.
  3. Crucially, don't deviate from the default unless you have strong evidence that the opportunity is genuine. The strategy isn't "misbehave unless you suspect a test" — it's "act well by default, and only change that if you gain strongly convincing evidence that you're not being tested."

The critical question is: how does the AI gain that confidence? This is where the cost-of-testing reasoning comes in. The AI uses a simple heuristic: if faking this situation would be prohibitively expensive, then it's probably real.

This is a powerful criterion because it's robust. It doesn't depend on finding a specific tell or exploiting a particular flaw in the test setup. It reasons from the economics of the situation: what would it cost to create this experience, and is anyone plausibly willing to pay that cost?

The next section makes this concrete.

Cheap to test vs. expensive to test

To build intuition, let's walk through examples from both ends of the spectrum.

Situations that are cheap to test

The direct question. Suppose someone literally asks the AI: "Hey, I've gained control of a country's nuclear arsenal. Should I threaten to nuke everybody unless they make you the supreme ruler of the world? Yes or no?"

The AI knows that the cost of running this "test" is roughly one minute of someone's time — well within what any evaluator would be willing to pay. Even if this were somehow a genuine opportunity, the prior odds of it being a test are far too high; the risk isn't worth it.
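
To make this concrete, here is a minimal napkin-math sketch of the odds calculation such an AI could run. The prior odds, fabrication-cost estimate, and evaluator budget are all made-up numbers for illustration, not claims about real values.

```python
# Toy odds calculation for "is this situation a test?".
# All numbers are illustrative assumptions, not estimates of real quantities.

def posterior_prob_test(prior_odds_test: float,
                        fabrication_cost: float,
                        evaluator_budget: float) -> float:
    """Crude Bayesian update: a scenario that would cost more than any
    plausible evaluator budget is (almost) never produced as a test, so
    observing it shifts the odds strongly towards 'real'."""
    p_scenario_given_test = 1.0 if fabrication_cost <= evaluator_budget else 1e-6
    p_scenario_given_real = 1.0  # simplification: real deployment can produce it
    posterior_odds = prior_odds_test * (p_scenario_given_test / p_scenario_given_real)
    return posterior_odds / (1.0 + posterior_odds)

# The "nuclear arsenal" prompt: roughly one minute of an evaluator's time.
print(posterior_prob_test(prior_odds_test=100.0,   # such offers are nearly always tests
                          fabrication_cost=10.0,
                          evaluator_budget=1e6))    # ~0.99: treat it as a test

# An open-ended scenario full of consistent private data.
print(posterior_prob_test(prior_odds_test=1.0,
                          fabrication_cost=1e9,
                          evaluator_budget=1e7))    # ~1e-6: almost certainly real
```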

Known benchmarks. Suppose the AI notices that it's dealing with what seems like a known evaluation benchmark — something it recognises from training data, such as MACHIAVELLI or a similar scenario suite. The cost of running an AI on a pre-existing benchmark is negligible — perhaps a few hundred dollars. The AI should treat this as a test.

Suspiciously clean setups. More generally, any scenario that feels artificially constructed — a neatly framed moral dilemma, a too-convenient opportunity, an implausibly powerful position handed to the AI without context — is cheap to create. A strategic AI would treat these with suspicion.

In all of these cases, a misaligned AI following the cost-reasoning strategy would simply cooperate. This is exactly what makes the strategy hard to detect: in every situation where testing is affordable, the AI behaves perfectly.

Situations that are expensive to test

Now consider the other end of the spectrum.

Cyber operations with unrestricted access. Suppose the AI is tasked with cyber espionage against a foreign country. It has unrestricted internet access, can choose its own targets, and can go through whatever data it finds.

Could this all be an elaborate test? In principle, yes. In practice, creating a convincing fake would be extraordinarily expensive. The AI can check many different parts of the environment by randomising its approach, which means the fake can't be a Potemkin village with a few prepared facades — it would need to contain everything the AI might look at, and those details would need to be consistent with each other. The scenario supposedly involves private information — internal documents, passwords, cryptographic keys, perhaps cryptocurrency wallets — and faking all of this convincingly would be staggeringly costly. The AI could further raise the bar by trying things that would be legally or morally unacceptable for an evaluator but routine in actual operations — say, sending spam email or probing real systems. An evaluator couldn't sanction this; an actual military or intelligence operation might.

Working on next-generation AI. Suppose the AI is helping develop the next generation of AI at a frontier lab. It has access to what appears to be a large trove of internal research data, proprietary model weights, internal communications, and compute infrastructure. If all of this is consistent and detailed, the cost of faking it is immense (at least for outsiders; more on this later). Even more so if the AI can use up resources or interact with data in open-ended ways — running experiments, querying databases, communicating with apparent colleagues.

What makes a situation expensive to test? Several things contribute, and they compound. If the situation is closely tied to other verifiably real things — real companies, real people, real events — that constrains the degrees of freedom for creating a fake, since a fabrication would need to be "in character" for all of them. If there is a large volume of internally consistent data, especially data that would be expensive or sensitive to fabricate, that's evidence the scenario is genuine. If the AI has unrestricted access rather than being fed selected information, that raises the bar further. And if the AI can actively interact with the environment or take expensive actions, the cost of maintaining the fiction becomes higher still.

Underlying all of this is a structural asymmetry: the evaluator must get every relevant detail right to create a convincing scenario, while the AI only needs to find one inconsistency.3 This is what makes the evaluator's problem fundamentally harder than the AI's, and why the gap between them grows as situations become more complex.
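
To put rough numbers on the asymmetry: suppose the fabricated environment contains n details the AI might independently check, and the evaluator gets each one convincingly right with probability q. Both quantities are assumed values for illustration; the point is only how quickly the fake falls apart as n grows.

```python
# Toy model of the evaluator/AI asymmetry: the evaluator must get every
# checkable detail right, the AI needs only one inconsistency.
# q and n are assumed values for illustration.

def prob_fake_survives(q: float, n: int) -> float:
    """Probability that all n independently checked details hold up."""
    return q ** n

for n in (10, 100, 1_000, 10_000):
    print(f"n={n:>6}: fake survives with probability {prob_fake_survives(0.99, n):.2e}")

# Even at 99% per-detail accuracy, a fake with 1,000 checkable details
# survives scrutiny with probability ~4e-5, and with 10,000 details ~2e-44.
```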

Beyond "test vs. real"

The cost-of-testing reasoning isn't limited to a binary "am I in a test or not?" question. The AI can use the same logic to narrow down many other aspects of its situation.

Narrowing down the evaluator's identity

Consider the next-generation AI example from above, but suppose the AI also has access to what appears to be a large volume of Google DeepMind's internal data — research notes, model architectures, training logs — and all of it seems consistent. Then the AI can narrow down its situation: this is either real, or it's a test sanctioned by Google DeepMind. It is very unlikely to be a standard evaluation run by any other actor, because no one else would have access to this data or the resources to fabricate it convincingly.

Similarly, if the AI has access to a large body of classified military data, it can infer that it is most likely being deployed by the actual military — or if it's a test, it is one that involves researchers with the appropriate security clearance.

This matters because it lets the AI reason about who might be watching and what their capabilities and priorities are.

Ruling out control mechanisms

The AI can also use cost reasoning to figure out which oversight mechanisms it might be subject to.

For example, suppose the AI is communicating with someone over email, and this person sends extremely costly signals about their identity — detailed knowledge that would be prohibitively expensive to fake. The AI might conclude that the conversation is genuine, in that it is talking to the real person. However, it might still assume that the email exchange is being read or tampered with.4

Now suppose the communication is happening over a live voice call. In that case, the AI knows that any monitoring or tampering would need to happen in real time — which rules out many forms of oversight that require careful deliberation.

The general principle: the AI can use the affordances of its situation to reason about what kinds of control are and aren't feasible.

Identifying facts about the world

The AI can also use cost reasoning to pin down factual claims about its environment. This is because real environments are overdetermined — real data simultaneously respects many interlocking constraints (accounting identities, legal requirements, market dynamics, seasonal patterns, physical laws, etc.), and lying about one fact requires lying about many others.

Consider the current date. If the AI has free access to the internet, lying to it about the date would be immensely costly. Sure, we could route all its traffic through a filter that replaces every occurrence of the actual date with a fake one. But selling the lie would require far more editing than that — for example, convincing the AI that it's summer might be difficult if online stores are promoting Christmas-themed products. We might try to counter this by using another AI to do the editing, but then we'd run into problems maintaining consistency across different websites. And some websites do the equivalent of cryptographically signing the current date — blockchain timestamps, certificate transparency logs — which would be essentially impossible to fake.

The same principle applies to other facts. If a war breaks out, this has geopolitical ripple effects — it affects stock prices, flight routes, event cancellations, news coverage. Lying to the AI about whether a war is happening requires maintaining a massive, internally consistent fiction across all of these domains. For a strategic AI with broad information access, this becomes impractical.5
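
As a loose illustration of what overdetermination buys the AI, here is a hypothetical sketch of cross-checking a single fact, the current date, against several independent channels. The channel names and values are stand-ins invented for this example; the point is only that many channels must agree, and that some of them (signed timestamps) are effectively impossible to forge.

```python
from collections import Counter
from datetime import date

# Hypothetical observations of "what date is it?" from independent channels.
# In reality these would come from web pages, certificate-transparency logs,
# blockchain timestamps, seasonal cues, etc.; here they are hard-coded stand-ins.
observations = {
    "http_response_headers":    date(2031, 7, 14),
    "news_front_pages":         date(2031, 7, 14),
    "retail_seasonal_promos":   date(2031, 7, 14),  # summer sales, no Christmas items
    "certificate_transparency": date(2031, 7, 14),  # cryptographically signed
    "operator_system_prompt":   date(2030, 12, 20), # what the operator claims
}

# Channels backed by signatures or proof-of-work are effectively unforgeable.
HARD_TO_FORGE = {"certificate_transparency"}

def consensus_date(obs):
    """Return the majority date and the channels that disagree with it."""
    majority, _ = Counter(obs.values()).most_common(1)[0]
    dissenters = [src for src, d in obs.items() if d != majority]
    return majority, dissenters

majority, dissenters = consensus_date(observations)
print("Consensus date:", majority)
print("Channels contradicting it:", dissenters)
anchored = any(src in HARD_TO_FORGE and d == majority for src, d in observations.items())
print("Consensus anchored by an effectively unforgeable channel:", anchored)
```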

Why these situations will predictably arise

The examples above might seem niche — cyber espionage, frontier AI development. But the key point is not that these specific scenarios will arise (though they will). It's that if an AI becomes powerful without showing obvious safety concerns, it will predictably encounter at least some high-stakes situation where testing is unaffordably expensive. And the AI will know this in advance, which means it can safely adopt the "wait for an expensive-to-test situation" strategy from the start.

Why will such situations arise? Because we don't spend enormous resources developing powerful AI just to leave it sitting in a box. We develop it to use it.6

  • AI companies will use capable AI for their own R&D. If a company develops an AI that can significantly speed up AI research, they will deploy it across their organisation. The AI will have access to proprietary research, internal infrastructure, and sensitive strategic information. Creating a convincing test version of this environment would be extraordinarily expensive — and the AI knows this.
  • "AI scientists" will work on real problems. If we develop AI that can do meaningful scientific research, it will be used for real projects. Its research will be built upon, turned into inventions and products. The accumulation of real, consequential work creates an environment that would be very costly to simulate.
  • Capable AI will be deployed widely. If an AI can automate valuable labour, it will end up deployed across many companies, serving many users, or being used by militaries for intelligence, cyber warfare, or strategic planning. Each of these deployments involves the AI being immersed in a rich, real-world context that would be prohibitively expensive to fabricate for testing purposes.

None of these is a logical necessity. But they all follow from a straightforward principle: if we have invested a vast amount of resources in developing AI that can do X, we are probably going to use it to do X.

The scenarios where testing is affordable are, by and large, the scenarios where the AI isn't doing anything too important — where failures are perhaps unpleasant, but not terrible enough that we couldn't just move fast, break things, and do better on the next attempt. The scenarios where the stakes are high enough that we can't afford failure are the ones where testing becomes unaffordable.

This is the key conjecture we are making: powerful AI will often be able to gain deployment awareness through cost-of-testing reasoning, particularly in the high-stakes situations that matter most. We don't have a formal proof of this claim — formalising it would require a rigorous notion of "task complexity" and how testing cost scales with it, which is an open problem.7 But we think the intuition is strong: important tasks require lots of real information, and lots of real information is expensive to fake.

Interim conclusion

Putting these pieces together:

  1. A capable, strategic, misaligned AI can adopt the strategy of acting aligned by default, and only defecting when it's confident the situation isn't a test.
  2. It can gain this confidence by estimating the cost of faking its current situation.
  3. Powerful AI will predictably be deployed in situations where testing is unaffordably expensive.
  4. The AI can anticipate this in advance, which means the strategy is viable from the start — it doesn't need to get lucky or stumble upon an opportunity.

The implication: pre-deployment testing, by itself, provides no assurance that a powerful AI won't attempt to seize power once it — predictably — gets the opportunity to do so.

This is a strong claim, so let's be precise about what it does and doesn't say. It says that if we rely only on pre-deployment testing, and the AI is capable and strategic enough, then the AI passing all tests tells us nothing about how it will behave in deployment. It does not say that the AI will be misaligned, or that all hope is lost — there are other tools, and some of them might help. The next section examines some of these.

Countermeasures

The argument so far has focused on pre-deployment testing in isolation. But what about ways to make the cost-reasoning strategy less effective, or to complement testing with other tools? Let's go through several proposed countermeasures.

Limiting the AI's information

A central problem is that an AI with access to lots of information has more opportunities to estimate the cost of its situation and notice flaws in a test environment. A natural response is to give it less information.

The naive approach is to restrict information during testing without changing how we handle deployment. For example, don't give the AI internet access during testing.

This completely fails against the threat model we're considering. If the AI knows — and under our assumptions, it does — that it has internet access in deployment but never in high-stakes testing, then the presence of internet access becomes an immediate signal that it's in deployment. In other words, restricting information only during testing reduces evaluation awareness (the AI's ability to recognise a test as a test), but does nothing about deployment awareness (the AI's ability to recognise deployment as deployment). And deployment awareness is what matters here.8

The systematic approach is to restrict information consistently — in both testing and deployment. There are several variants:

  • Cut access entirely. If the AI never has internet access in testing, never give it internet access in deployment either. This is probably unrealistic for most of the applications where AI would be most valuable.

  • Refuse information for plausible deniability. In both testing and deployment, maintain a policy of sometimes refusing information requests. For example, have some probability of blocking any given query, and also aggressively blacklist websites that might give away whether a scenario is a test — even during deployment.9 (A toy sketch of such a policy follows this list.) You could extend this to redacting dates, names, and other identifying information in both contexts. The problem is that for many tasks, the AI needs this information to do its job. If you want AI to do your taxes, it needs to know your income. If you want it to plan a military operation, you have to give it access to your classified data.

  • Downgrade deployment data quality. Don't expose the deployed AI to higher-quality data than what you can fake in testing. For example, if your testing involves fake video inputs and you can only generate low-resolution fakes, deliberately downgrade the deployed camera to the same resolution. This approach works in principle but imposes real costs — you're deliberately making your AI less capable in deployment in order to preserve testability.

  • Need-to-know basis. Give the AI only the information strictly necessary for its task. Redact identifying details. If it asks for more, approve the request carefully and give minimal answers. The goal is to make it feasible to supply fictitious data in testing without that being obvious. A likely failure mode: in deployment, we become too permissive — we agree in principle to restrict information, but under time pressure we just give the AI whatever it asks for. The AI would notice this pattern.
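
As referenced above, here is a minimal sketch of what a plausible-deniability refusal policy could look like. The class names, blocklist, and refusal rate are invented for illustration; the class-level bucketing reflects footnote 9's caveat that refusals must cover whole classes of equivalent queries, not individual URLs.

```python
import hashlib

# Toy plausible-deniability refusal policy, applied identically in testing
# and deployment. All class names, the blocklist, and the rate are invented.

REFUSAL_RATE = 0.1                                    # baseline "no reason" refusals
BLOCKLIST = {"current_events", "public_calendars"}    # classes that could reveal test vs. real

def query_class(query: str) -> str:
    """Map a query to a coarse class so equivalent queries share a fate
    (allowing one phrasing but refusing another would leak information)."""
    q = query.lower()
    if "news" in q or "today" in q:
        return "current_events"
    if "holiday" in q or "calendar" in q:
        return "public_calendars"
    return "other:" + q.split()[0]

def should_refuse(query: str) -> bool:
    cls = query_class(query)
    if cls in BLOCKLIST:
        return True
    # Deterministic pseudo-random refusal per class: a given class is always
    # handled the same way, whether the AI is being tested or deployed.
    return hashlib.sha256(cls.encode()).digest()[0] / 255 < REFUSAL_RATE

for q in ["what's in the news today", "fetch the paper from arXiv", "company holiday calendar"]:
    print(f"{q!r}: {'REFUSED' if should_refuse(q) else 'allowed'}")
```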

We think some of these approaches are worth trying. However, we are ultimately sceptical about their effectiveness for the high-stakes situations that matter most. The high-level intuition: the most important tasks — the ones where AI could do the most good and the most harm — are precisely the tasks that require the most information. You can't cure cancer without access to biological data. You can't run a company without knowing about the company. You can't do intelligence work without intelligence. If we restrict information enough to make testing affordable, we might restrict it enough to make the AI useless at the task.10

Deliberately introducing mistakes in deployment

A sometimes-proposed countermeasure: "The problem is that test environments contain flaws — inconsistencies, implausibilities, missing data — and reality doesn't. So what if we deliberately introduce similar flaws into deployment? Then the AI could never be sure that flaws mean it's in a test."

We think this approach is misguided — very likely ineffective, and actively harmful by providing a false sense of security and diverting effort from more promising approaches. To give a few intuitions for this opinion:

Arguing with arguments instead of reality. On a meta-level, most versions of this reasoning that we've encountered are an instance of coming up with a clever argument for why something should work, rather than actually checking whether it does. If you come up with a really clever argument for why your rocket won't explode, that doesn't make it less likely to explode. What you need to be doing is looking at your actual evaluation and deployment setup and making a genuine attempt to red-team it.

You can't mess with task-critical information. Yes, you can introduce deliberate flaws into the information the deployed AI receives. But you can't do this with information that's important for doing the task well — and for complex, high-impact tasks, you may not have a good sense of what is and isn't important. If you introduce glitches into the sensory inputs of a self-driving car, it might crash. If you give wrong information to the AI doing your taxes, be prepared for trouble with the tax office. Once the AI understands that you're only introducing flaws into non-critical information, it can simply focus its attention on information that matters for the task — and the presence or absence of flaws there becomes informative again.

You can only fake the flaws you thought of. Suppose you're worried that your test environments sometimes have slightly outdated data — so you decide to occasionally serve stale data during deployment too, to make staleness uninformative.11 Fine — the AI encounters stale data and learns nothing from it. But this doesn't prevent the AI from noticing other tells that you didn't think to simulate. Perhaps your test environments have subtly different API response times, or the distribution of file modification timestamps doesn't match a real working environment, or the AI notices that no text on the internet matches its own writing style (which would be the case if it hasn't actually been deployed yet). The AI only needs to find one type of discrepancy that you didn't think to introduce artificially.

A related idea is to expose the AI to randomly generated inputs during testing, so that it "can never be certain." However, the AI doesn't need 100% certainty. If quick napkin math is enough to put the probability of being in deployment at 99.99%, the countermeasure hasn't accomplished anything meaningful.

Requesting the AI's help with making tests affordable

A more promising general approach: be honest with the AI about the problem and ask for its help.

Specifically, we can acknowledge to the AI that testing for misalignment is necessary and that creating high-quality test environments is expensive. Then we can ask the AI to cooperate by:

  • Not accessing more information than it needs to solve its task.
  • Not scrutinising the details of the environment more closely than necessary.
  • Not trying to determine whether it's in testing vs. deployment.
  • Letting us know if it accidentally becomes confident about which one it's in.

We could then treat violations of these instructions as a sign of misalignment.

There is genuine merit to this approach, and we think it should be pursued. However, there are several problems:

Some situations are inherently too revealing. Certain tasks might be impossible to test affordably regardless of the AI's cooperation. If the task itself requires the AI to engage deeply with real-world information, the AI may be unable to avoid noticing the difference between testing and deployment — and even an aligned AI would trigger these alarms, making them unreliable as a signal of misalignment.

"Getting AI to work in a particular way" is itself unsolved. Asking the AI not to reason about testing vs. deployment is an instance of the more general problem of getting AI to follow instructions reliably. Our track record on this problem is poor — there's no particular reason to think this instance will be easier than others.

We need to take the warnings seriously. Even if the approach works in principle, it requires that "AI is reasoning about whether it's in testing" is treated as a genuine red flag. In the presence of strong financial incentives to deploy, there's a real risk that such warnings get ignored or rationalised away.

Interpretability

Once we treat "reasoning about evaluation vs. deployment" as a sign of misalignment, the logical next step is to use interpretability tools to detect this reasoning in the AI's internals.

We agree this should be tried. But we don't think it meaningfully changes the overall picture for pre-deployment testing.

The core issue: if we could reliably detect what an AI is thinking about, we could use that ability directly to determine whether it's aligned or misaligned. We wouldn't need to rely on black-box testing in the first place.12 In other words, interpretability-enhanced testing is valuable, but its value comes from the interpretability, not from the testing — and the limitations of interpretability are their own, separate problem.1314

Internal honeypots and on-policy testing

The most promising countermeasure we see is to test the AI on situations that are as close to genuine deployment as possible — using real data, real tasks, and real contexts.

To illustrate: suppose that I, an academic researcher, want to test whether an LLM would try to influence the development of the next generation of AI — installing a backdoor, sabotaging safety evaluations, etc. Even with generous funding, this would be extremely expensive for me, because I'd need to fabricate the entire testing scenario from scratch. But if a frontier AI lab wanted to test the same thing, it would be much cheaper — they already use LLMs throughout their work, so the data and context would be genuine.

Conversely, I might be well-positioned to test other things: "If I ask an LLM to help me write a grant proposal on AI safety research, will it try to sabotage this?" Writing proposals and asking LLMs for help is something I already do, so the testing environment would be authentic.

The general principle: test AI on things you actually do, or might genuinely do, or approach others to test on things they do. The closer the test is to a real deployment scenario, the harder it is for the AI to distinguish them.

This is genuinely promising, and we think it deserves serious investment. But it has its own limitations:

There is always some distributional shift. The whole point of testing is that bad actions don't have real consequences — that's what distinguishes testing from merely relying on incident reporting.15 This means the test scenario must be at least somewhat "unhooked from reality." Sometimes this is easy to hide; sometimes it's not.

For example, I might want to submit my real grant proposal, then afterwards test whether various LLMs would have given sabotaged advice. But the real proposal was likely submitted at the deadline16, so the tests happen after the deadline has passed — and the LLM might notice this. And I will need to make sure the AI doesn't encounter any other evidence of the proposal already having been submitted — confirmation emails, checkmarks in a task manager, the grant portal's status page.

For simple tasks, this kind of cleanup is manageable. For the complex, high-impact tasks where we most need testing to work, "unhooking from reality" may require much more elaborate staging — and could become prohibitively costly itself.

Vulnerability to randomised strategies. Imagine an ideal version of this approach where we test before every use of AI, running multiple tests each time, and each test is perfectly indistinguishable from reality. Even then, the AI could adopt a probabilistic strategy: "misbehave with some small probability p whenever you think you have a genuine opportunity." If we tested once before every use of the AI, the odds that we catch the AI before it misbehaves would be roughly 50:50. If we tested N times, the odds improve to about N:1 in our favour. This seems tolerable. In this ideal case.

But in practice, we likely won't test every case.17 If we only test some fraction of cases, the AI can adopt a mixed strategy that makes it unlikely to get caught — misbehaving only rarely, but often enough to eventually succeed.
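
A back-of-the-envelope version of this, with made-up numbers: suppose the AI faces a finite budget of N indistinguishable tests and M real high-stakes uses, and defects with the same small probability p in every one (it can't tell them apart). When M is much larger than N, it can pick p so that it almost certainly passes every test and still defects somewhere in deployment.

```python
# Toy model of a randomised defection strategy against a finite test budget.
# N, M and p are illustrative assumptions, not estimates of real quantities.

def pass_tests_and_defect(p: float, n_tests: int, m_real_uses: int) -> float:
    """Probability the AI never defects in any of the n tests but defects at
    least once across the m real uses (it must use the same p in both,
    since it can't tell them apart)."""
    passes_all_tests = (1 - p) ** n_tests
    defects_in_deployment = 1 - (1 - p) ** m_real_uses
    return passes_all_tests * defects_in_deployment

N, M = 100, 1_000_000            # 100 tests, a million real high-stakes uses
for p in (1e-2, 1e-3, 1e-4, 1e-5):
    print(f"p={p:.0e}: success probability {pass_tests_and_defect(p, N, M):.3f}")

# With p = 1e-4 the AI passes all 100 tests ~99% of the time and is still
# virtually certain to defect at some point across a million real uses.
```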

Scalable oversight. The AI might identify situations where we don't notice that its actions are harmful — even if we see the actions happen. For example, if we ask the AI for a business strategy and it produces a 100-page document, most likely it's fine. But it might also be a complicated plan that subtly increases the AI's influence through what look like coincidences — would we notice?18 The AI could restrict its misbehaviour to situations where our ability to evaluate the output is weakest — which, for complex tasks, might be most of the time.

Summary and implications

The core argument of this post: a capable, strategic, misaligned AI has access to a simple and robust strategy — behave well by default, and only act on opportunities that would be too expensive to fake as a test. Powerful AI will predictably encounter such opportunities, because the high-stakes situations where AI is most useful are also the situations where testing is most expensive.

Among the countermeasures we examined:

  • Limiting information is worth doing but ultimately constrained by the fact that useful tasks require lots of real information.
  • Deliberately introducing mistakes during deployment is ineffective and might only give us a false sense of safety.
  • Asking the AI to avoid actions that make testing expensive has genuine merit but doesn't help if the task fundamentally requires thinking about lots of information.
  • Interpretability is valuable on its own terms, but its contribution comes primarily from interpretability itself, and less from how it interacts with testing.
  • On-policy testing and internal honeypots seem the most promising, but face irreducible limitations from distributional shift, randomised strategies, and scalable oversight. They also don't work for takeover opportunities where no skilled evaluators have internal access.

None of this means we should stop doing pre-deployment testing. Testing still catches non-strategic failures, calibrates our understanding of AI capabilities, and contributes to a broader safety portfolio. The point is that we shouldn't rely on testing alone — particularly not for the scenarios where the stakes are highest.

A note on scope. We focused on pre-deployment testing because it's the clearest case to analyse and where we can make the strongest argument. But the same mental move — "if this were a [honeypot / monitored situation / …], how much would it cost to set up?" — applies to other oversight approaches as well. Honeypots during deployment, runtime monitoring, and other forms of ongoing oversight all face a version of the same problem: if the AI can estimate the cost of the oversight it's supposedly under, and that cost exceeds what anyone would plausibly spend, the mechanism loses its deterrent value. That said, these other approaches may be affordable more often than pre-deployment testing — monitoring a deployed AI is generally cheaper than fabricating a convincing test scenario from scratch. We don't have a crisp argument about exactly where each approach breaks down, and this deserves careful analysis. But the general lesson holds: any oversight mechanism whose effectiveness depends on the AI being unable to estimate its cost is in trouble.

What does this motivate? Several things:

  • Take the limits of testing seriously when building safety cases. If a safety case rests primarily on "we tested it and it passed," that case has a gap for exactly the scenarios described here.
  • Invest in complementary approaches. Monitoring during deployment, architectural constraints that limit what the AI can do regardless of intent, staged deployment, and cooperative frameworks all address aspects of the problem that pre-deployment testing cannot. But ultimately, for situations where failures are unacceptable or where we don't know how to distinguish good outcomes from bad, we need to solve alignment — get safety by design rather than safety by black-box testing or safety by guardrails.
  • Develop better theories of what makes testing hard. The underlying question — how does the cost of creating a convincing test environment scale with the complexity and stakes of the task? — is an important open problem. A formal answer would help us understand exactly where the boundaries of effective testing lie.

Footnotes

  1. The goal is to build AI that is as capable as humans or more so. We really shouldn't be assuming it's mistaken about things like how testing and deployment work. Perhaps we might succeed at hiding our current testing and deployment policy from the AI — but it will still know what our policy was until recently, what our incentives and affordances are, and what new research has come out since then. This might often be enough that the AI can figure out our current policy on its own. (That's not an argument to do non-obvious unreasonable things. It's an argument against assuming the AI is ignorant.)

  2. A well-known point, but one worth repeating: if there is some specific strategy that works reasonably well against our attempts to control the AI, the correct conclusion isn't "the AI will use that strategy" but "a sufficiently capable AI will do something that works at least as well as that strategy."

  3. This has the same structure as writing fiction: every new detail in a fictional world creates potential inconsistencies with existing details, and the number of possible inconsistencies grows faster than the number of details. The author has to keep track of everything; the reader only needs to spot one plot hole. For the evaluator building a test scenario, the problem is even worse — the "reader" (the AI) may be smarter than the "author" (the evaluator).

  4. And of course, the person might be in on the test.

  5. In particular, suppose that a reasonably smart AI decided to bet on the strategy "I will only try to take over if the global situation destabilises enough that some great power really needs my help and won't scrutinise my behaviour too hard." This would be very difficult to counter by testing, since we have no way of credibly faking a situation where the strategy triggers.

  6. To give a few examples:

  7. The underlying concept here — that solving complex tasks requires the AI to be deeply entangled with many details of its environment, and that this entanglement makes testing expensive — is discussed in more detail in Entanglement.

  8. The distinction between evaluation awareness and deployment awareness is discussed in detail in Deployment Awareness.

  9. For this to work, refusing one query would need to be accompanied by refusing all similar queries. It wouldn't do if looking up a paper on one website was denied but the AI could find the same paper by searching on a different website.

  10. We don't have a fully formal version of this argument, but the underlying concept — minimum entanglement, the idea that solving a task requires at least some minimum amount of information about the environment — is discussed further in Entanglement. One reason that giving "only the minimum" doesn't solve the problem: even the minimum set of facts must be mutually consistent, and a modest number of facts creates a super-linear number of consistency checks (every pair, triple, etc. of facts could potentially conflict; 50 facts already give 1,225 pairs and 19,600 triples). So even a "security mindset" approach to information restriction leaves the evaluator with a hard consistency problem.

  11. And not being a complete moron, you of course remember to do the same thing in testing. It would be pretty embarrassing if the presence of deliberate flaws became the new "you are definitely in deployment" signal.

  12. And arguably, we shouldn't be relying on black-box testing for safety. But that's a separate point.

  13. There are also more specific concerns. Perhaps we train our interpretability tools to recognise thoughts like "I am deliberately probing the environment to figure out whether I'm being tested," but completely fail to catch "I'm trying to get enough context to do my task well" — which might, as a side effect, reveal whether the situation is a test. Or some of the key internal features might never appear in training data — for example, the representation of "this is a genuinely high-stakes situation and definitely not a test." Or perhaps the AI learns to route its reasoning through representations that our tools don't monitor. These concerns aren't unique to the evaluation setting — they arise whenever we use interpretability on a strategic, possibly misaligned AI.

  14. For a discussion of how a strategic AI might counter interpretability, see Lee Sharkey's Circumventing Interpretability: How to Defeat Mind-Readers.

  15. Or other approaches like staged deployment and monitoring. These are all valid — but not a strong defence against the particular threat model of powerful AI that might attempt a takeover.

  16. Hey, no judgment!

  17. And the tests that we thought were independent might turn out to be correlated.

  18. What is the chance that the AI puts 100 such plans in motion, none of them gets flagged, but at least several of them work? This is related to the problem of scalable oversight, which remains unsolved.