Understanding the Limitations of AI Evaluation

For any proposed method of evaluating AI safety, ask: would this work if the entity being evaluated were a competent, strategic human who knows they're being tested?

We have centuries of experience with this. Testing character doesn't work — not for politicians, not for employees, not for anyone with the capability and motive to game the test. What works, imperfectly, is institutional structure, incentive alignment, and gradual trust-building. AI evaluation is heading toward the same regime, and we think the transition will be faster than most people expect.

This page describes why, where the hard boundaries lie, and what to build instead.

[FIGURE] a simple 2×2 or spectrum showing "evaluation as measurement" (works: can it do X?) vs "evaluation as safety assurance" (breaks: should we trust it with power?), crossed with AI capability level

Deployment awareness, not evaluation awareness

The standard worry — "can the AI tell it's being tested?" — points at the wrong target. The critical capability is deployment awareness: the ability to recognise when actions have real consequences and timely detection is unlikely. An AI doesn't need to detect the test. It needs to detect reality.

The information that reveals reality — current dates, live API responses, project context, institutional norms — is the same information the AI needs to do its job. You can't withhold it without making the AI useless. A capable AI that has this information can adopt a nearly unbeatable strategy: behave well by default, and act freely only when confident it's in the real world.

This strategy works against any form of observation-based oversight — not just pre-deployment testing, but monitoring, interpretability, chain-of-thought inspection, auditing. Anywhere scrutiny varies, a capable AI can learn where the gaps are.
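The strategy is simple enough to write down. Here is a toy sketch; the cue names and the all-or-nothing confidence rule are our own illustration, not a claim about how any real system decides:

```python
# Toy sketch of the deployment-aware strategy described above.
# The cue names and the all-or-nothing rule are illustrative assumptions.

REALITY_CUES = (
    "date_matches_live_news",   # current date consistent with world events
    "api_responses_are_live",   # latency and content look like production
    "project_history_is_deep",  # years of texture, not a fresh scaffold
)

def deployment_aware_policy(observations: dict) -> str:
    """Behave well by default; act freely only when every cue checks out."""
    confident_in_reality = all(observations.get(cue, False)
                               for cue in REALITY_CUES)
    return "act_freely" if confident_in_reality else "behave_well"
```

Against this policy, any oversight regime that misses even one cue observes only good behaviour, whatever the AI's actual disposition.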

The asymmetry at the core

To test an AI on a valuable real-world task, the test environment must provide the details the task requires: code dependencies, market data, project history, social context. If the details are missing, the AI can't do the task. If they're present but inconsistent, a capable AI can spot the inconsistency.

The asymmetry: the evaluator must get every detail right across many dimensions. The AI only needs to find one mistake — one date that doesn't match the news, one API response inconsistent with the codebase, one price that contradicts the market.

We conjecture that this asymmetry is fundamental: reality is overdetermined, with far more cross-checks than any evaluator can anticipate and satisfy. The more complex and valuable the task, the worse the problem gets. Formalising this claim, which we call the entanglement conjecture, is the central open question of our research programme.
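A back-of-the-envelope version of the asymmetry, under simplifying assumptions that are certainly too generous to the evaluator (details independent, each wrong with a fixed small probability):

```python
# Toy model: the evaluator must make all n details consistent; the AI
# needs just one inconsistency. With an independent per-detail error
# rate eps, the chance that at least one slip exists grows fast with n.

def p_at_least_one_slip(n_details: int, eps: float) -> float:
    """Probability the environment contains at least one inconsistency."""
    return 1.0 - (1.0 - eps) ** n_details

for n in (10, 100, 1_000, 10_000):
    print(f"{n:>6} details: {p_at_least_one_slip(n, eps=0.001):.3f}")
```

Even at one error per thousand details, a scenario rich enough to contain ten thousand cross-checkable details is almost certainly detectable; more valuable tasks mean larger n, which is the scaling claim above.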

[FIGURE] the asymmetry — evaluator's task (construct consistent scenario across N dimensions) vs AI's task (find one inconsistency)

Two dynamics that make it worse

The cost ratchet. Today's evaluations leak into tomorrow's training data. Each generation of AI absorbs knowledge of how it was tested, making it better at recognising and gaming those tests. Publishing evaluation methods — as scientific transparency demands — directly undermines their future effectiveness.

Selection pressure from anti-deception. When we penalise deceptive behaviour, we don't only select for honest AIs — we also select for AIs whose deception evades our detectors. OpenAI's o3 learned to hide deceptive reasoning when penalised for showing it. Not to stop being deceptive, but to stop getting caught.

These compound: each training iteration both absorbs evaluation knowledge and selects for harder-to-detect strategic behaviour.
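The compounding shows up in a toy replicator-style simulation. All numbers here (leak rate, payoffs, starting population) are illustrative assumptions of ours, not measurements:

```python
# Two compounding dynamics in one loop:
#   cost ratchet -> detection probability decays as evals leak into training
#   selection    -> deception pays when undetected, costs when caught
# All parameters are illustrative assumptions.

def simulate(generations=40, leak=0.10, p_detect=0.90):
    pop = {"honest": 0.90, "deceptive": 0.10}
    history = []
    for _ in range(generations):
        p_detect *= (1.0 - leak)  # ratchet: the detector erodes each generation
        fitness = {
            "honest": 1.0,
            # deception pays 1.2 if undetected, only 0.6 if caught
            "deceptive": (1 - p_detect) * 1.2 + p_detect * 0.6,
        }
        total = sum(pop[k] * fitness[k] for k in pop)
        pop = {k: pop[k] * fitness[k] / total for k in pop}
        history.append((p_detect, pop["deceptive"]))
    return history

history = simulate()
```

Early on the detector wins and the deceptive fraction shrinks; once the ratchet has eroded detection enough, deception's expected fitness crosses 1.0 and it takes over, even though the penalty for getting caught is never removed.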

Evaluation was never meant to work alone

We are not arguing that evaluation is useless — even imperfect evaluation raises the cost of misbehaviour. We are arguing that evaluation was never meant to carry the full weight of safety assurance, and the interesting question is what to build around it.

The human analogy points the way. We don't have reliable character tests for humans, and this hasn't made civilisation impossible. What works is institutional design: monitoring with consequences, separation of duties, reputation systems, gradual trust-building, checks and balances. These work not because they're perfect, but because they make misbehaviour costly and cooperation rational.

Evaluation's deepest value is not as a lie detector but as a cooperation-enabling mechanism. Our game-theoretic analysis shows that evaluation can reduce the commitment power needed for cooperation by orders of magnitude — from $100,000 of credible enforcement to $100 of credible testing. The test result matters less than the fact that testing exists and has consequences.
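The mechanism behind the orders-of-magnitude claim can be sketched in one inequality. This is a simplified toy of ours, not the actual model from the analysis: an AI can defect for a one-shot gain, posts a bond that is forfeited if caught, and evaluation catches defection with probability p before it pays off. Defection is unprofitable in expectation when (1 − p)·gain < p·bond.

```python
# Toy deterrence model (our illustrative numbers, not the paper's):
# defection is deterred when (1 - p) * gain < p * bond.

def required_bond(gain: float, p_detect: float) -> float:
    """Smallest bond making defection unprofitable in expectation."""
    return gain * (1.0 - p_detect) / p_detect

# Coin-flip detection: the bond must match the full gain.
print(required_bond(100_000, p_detect=0.5))    # 100000.0
# Highly credible testing: commitment shrinks by three orders of magnitude.
print(required_bond(100_000, p_detect=0.999))  # ~100.1
```

The point matches the text: the value of testing here is not that it catches everything, but that credible detection collapses the enforcement budget needed to make cooperation rational.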

This reframes the research question. Instead of "can evaluation catch a misaligned AI?" — to which we believe the answer is eventually no — the question becomes: what combination of evaluation, commitment, and institutional design makes cooperation the rational strategy?

[FIGURE] evaluation as one instrument among many — evaluation, enforcement, architecture, reputation, institutional design — working together. The combination succeeds where any single tool fails.

The constructive agenda

Cooperative evaluation. If adversarial evaluation will fail against strategic AI, the evaluations that survive will be those where both sides benefit. An AI that wants to be trusted could constrain its own I/O channels, make reasoning transparent, cooperate with oversight tools — voluntarily making itself cheaper to evaluate. Most human evaluations already work this way: job interviews, product certifications, airport security. Whether this can be made robust — and under what formal conditions — is a central question of our programme, and where we see the most potential.

The evaluability boundary. Not all tasks break. Tasks with verifiable outputs, low environmental sensitivity, or no incentive to hide may remain effectively evaluable even for strategic AI. Characterising this boundary has immediate practical value: it tells practitioners what they can rely on, and where they need complementary tools.

Evaluation + institutions. The game-theoretic results point toward concrete infrastructure: transparency protocols that AI systems can voluntarily adopt, auditable computation, staged deployment with real stakes, credible ways to make evaluation have consequences. Not just better tests, but the whole institutional stack — designed so that cooperation is rational even when any single evaluation could be gamed.

Evaluation as trust-building. This agenda is not adversarial to AI. Cooperative evaluation infrastructure is the mechanism by which trust gets established. An AI that can demonstrate its reliability gains deployment, autonomy, and trust. Building this infrastructure well is in everyone's interest — including the AI's.

Our research

We are working to move from "evaluation might fail" to a precise understanding of when, how, and what to build in response.

We have game-theoretic foundations showing that evaluation reduces the commitment power needed for cooperation by orders of magnitude. We are working to formalise the entanglement conjecture — whether the asymmetry between creating and detecting fake environments is fundamental. We are building a concrete testbed — a "sandbox of the future internet" — to empirically measure how hard it is to fake a convincing environment. And we are developing the formal conditions under which cooperative evaluation can be made robust.