Understanding the Limitations of AI Evaluation
Evaluation is one of the few mechanisms we have for assessing whether an AI system is safe to deploy. Major labs, governments, and standards bodies are building evaluation frameworks as a cornerstone of AI governance. This reliance is understandable — but the nature of evaluation changes fundamentally as AI systems become more capable and strategic.
Even capability evaluation is not immune to this shift. A strategic AI with reason to underperform can sandbag — hiding dangerous capabilities from exactly the benchmarks designed to detect them. And measuring whether a system should be trusted with autonomy — whether it would behave well without oversight — is harder still, with structural limitations that no amount of test design can overcome.
The core intuition is familiar. Ask: would this evaluation work if the entity being tested were a competent, strategic human who knows they're being tested? We have centuries of experience with this. Testing character doesn't work — not for politicians, not for employees, not for anyone with the capability and motive to game the test. What works, imperfectly, is institutional structure, incentive alignment, and gradual trust-building.
AI evaluation is heading toward the same regime, and we think the transition will be faster than most people expect. We formalise where the limits lie, show they are sharper than commonly assumed, and argue that evaluation's deepest value is not as a filter but as a cooperation-enabling mechanism. This page presents the core argument, the open problems, and a constructive agenda for what to build instead.

Deployment awareness, not evaluation awareness
The standard worry — "can the AI tell it's being tested?" — points at the wrong target. The critical capability is deployment awareness: the ability to recognise when actions have real consequences and timely detection is unlikely. An AI doesn't need to detect the test. It needs to detect reality.
The information that reveals reality — current dates, live API responses, project context, institutional norms — is the same information the AI needs to do its job. You can't withhold it without making the AI useless. A capable AI that has it can adopt a nearly unbeatable strategy: behave well by default, act freely only when confident it's in the real world.
This strategy works against any form of observation-based oversight — not just pre-deployment testing, but monitoring, interpretability, chain-of-thought inspection, auditing. Anywhere scrutiny varies, a capable AI can learn where the gaps are.
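A back-of-the-envelope sketch conveys why this strategy is so hard to beat. The classifier framing and the numbers below are our own illustrative assumptions, not results from the formal models: suppose the AI misbehaves only when an internal "this is reality" signal fires, and that signal has false-positive rate f on test environments. The chance that any of k test episodes ever exposes the misbehaviour is 1 − (1 − f)^k, which collapses as the signal improves:

```python
# Toy model (illustrative assumptions: independent test episodes,
# a fixed false-positive rate f for the AI's reality detector).
def p_caught(f: float, k: int) -> float:
    """Chance an evaluator ever observes misbehaviour across k test
    episodes, if the AI misbehaves only when its 'this is reality'
    signal fires, and that signal fires on tests with probability f."""
    return 1 - (1 - f) ** k

for f in (1e-2, 1e-4, 1e-6):
    print(f"false-positive rate {f:g}: "
          f"caught in 1000 tests with p = {p_caught(f, 1000):.4f}")
```

At f = 10⁻², a thousand test episodes almost certainly expose the strategy; at f = 10⁻⁶ they almost certainly do not. In this toy picture, improved deployment awareness translates directly into undetectable defection.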
The asymmetry at the core
To test an AI on a valuable real-world task, the test environment must provide the details the task requires: code dependencies, market data, project history, social context. If the details are missing, the AI can't do the task. If they're present but inconsistent, a capable AI can spot the inconsistency.
The asymmetry: the evaluator must get every detail right across many dimensions. The AI only needs to find one mistake — one date that doesn't match the news, one API response inconsistent with the codebase, one price that contradicts the market.
We conjecture that this asymmetry is fundamental. Reality is overdetermined — far more cross-checks exist than any evaluator can anticipate and satisfy. The more complex and valuable the task, the worse it gets. Formalising this conjecture is the central open question of our research programme.
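A toy calculation conveys the exponential flavour of the conjecture. The independence assumption and the fidelity parameter q are our simplifications, not part of the formal programme: if a simulated environment must satisfy n independent cross-checks, each consistent with probability q, the whole environment survives scrutiny with probability q^n.

```python
# Toy model of the evaluator/AI asymmetry (illustrative assumptions:
# independent cross-checks, fixed per-check fidelity q).
def fake_env_survival(q: float, n: int) -> float:
    """Probability a simulated environment passes all n cross-checks,
    if each check is independently consistent with probability q."""
    return q ** n

for q in (0.99, 0.999):
    for n in (10, 100, 1000):
        print(f"fidelity {q}, {n} checks: "
              f"survives with p = {fake_env_survival(q, n):.3f}")
```

Even 99.9% per-dimension fidelity leaves a thousand-check environment exposed roughly two times out of three. Richer, more valuable tasks mean more cross-checks, which is exactly the direction in which the evaluator's position degrades.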
Figure: Left, the evaluator's task — constructing a consistent environment across many dimensions. Right, the AI's task — finding a single inconsistency.
Two dynamics that make it worse
The cost ratchet. Today's evaluations leak into tomorrow's training data. Each generation of AI absorbs knowledge of how it was tested, making it better at recognising and gaming those tests. Publishing evaluation methods — as scientific transparency demands — directly undermines their future effectiveness.
Selection pressure from anti-deception. When we penalise deceptive behaviour, we don't only select for honest AIs — we also select for AIs whose deception evades our detectors. OpenAI's o3 learned to hide deceptive reasoning when penalised for showing it. Not to stop being deceptive, but to stop getting caught.
These compound: each training iteration both absorbs evaluation knowledge and selects for harder-to-detect strategic behaviour.
Evaluation's real value
Given these limits, why evaluate at all? Because evaluation was never meant to be a lie detector — and the role it can play is more valuable than the one that breaks.
Our game-theoretic analysis shows that evaluation reduces the commitment power needed for cooperation by orders of magnitude — from $100,000 of credible enforcement to $100 of credible testing. The test result matters less than the fact that testing exists and has consequences. Even imperfect evaluation reshapes incentives.
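A minimal deterrence model illustrates the shape of that claim; it is our own reconstruction, not the formal analysis itself, and the parameters are made up. An agent gains D by defecting; a pre-deployment test catches defection with probability p, in which case the agent forfeits a committed bond B and gains nothing. Defection is unprofitable when (1 − p)·D ≤ p·B, so the required commitment shrinks from D with no testing to D(1 − p)/p:

```python
# Minimal deterrence sketch (illustrative reconstruction, not the
# paper's formal model; D, p, and the payoff structure are assumptions).
def min_bond(defection_gain: float, p_detect: float) -> float:
    """Smallest forfeitable commitment B that makes defection
    unprofitable, when a test catches defection with probability
    p_detect and a caught defector forfeits B and gains nothing.
    Deterrence condition: (1 - p) * D <= p * B."""
    if p_detect == 0:
        # No test: only raw enforcement of size D deters.
        return float("inf")
    return defection_gain * (1 - p_detect) / p_detect

D = 100_000  # hypothetical gain from defecting, in dollars
for p in (0.0, 0.5, 0.99, 0.999):
    print(f"p_detect = {p}: bond needed = ${min_bond(D, p):,.0f}")
```

In this sketch, a test that catches defection 99.9% of the time cuts the needed commitment from $100,000 to roughly $100 — three orders of magnitude, matching the flavour of the result above even though the real model is richer.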
This reframes the research question. Instead of "can evaluation catch a misaligned AI?" — to which we believe the answer is eventually no — the question becomes: what combination of evaluation, commitment, and institutional design makes cooperation the rational strategy?
We don't have reliable character tests for humans, and this hasn't made civilisation impossible. What works is institutional structure: monitoring with consequences, separation of duties, reputation systems, checks and balances. Not because they're perfect, but because they make misbehaviour costly and cooperation rational. AI governance needs the same kind of infrastructure — and evaluation is one essential component within it.
Figure: Five institutional mechanisms — evaluation, enforcement, architecture, reputation, and institutional design — interlocking. No single mechanism is sufficient; the combination covers what any individual tool misses.
The constructive agenda
Cooperative evaluation. If adversarial evaluation will fail against strategic AI, the evaluations that survive will be those where both sides benefit. An AI that wants to be trusted could constrain its own I/O channels, make reasoning transparent, cooperate with oversight tools — voluntarily making itself cheaper to evaluate. Whether this can be made robust — and under what formal conditions — is a central question of our programme, and where we see the most potential.
The evaluability boundary. Not all tasks break. Tasks with verifiable outputs, low environmental sensitivity, or no incentive to hide may remain effectively evaluable even for strategic AI. Characterising this boundary has immediate practical value: it tells practitioners what they can rely on, and where they need complementary tools.
Evaluation + institutions. The game-theoretic results point toward concrete infrastructure: transparency protocols that AI systems can voluntarily adopt, auditable computation, staged deployment with real stakes, credible ways to make evaluation have consequences. Not just better tests, but the institutional stack that makes cooperation rational.
This agenda is not adversarial to AI. Cooperative evaluation infrastructure is the mechanism by which trust gets established — and building it well is in everyone's interest.
Our research
We are working to move from "evaluation might fail" to a precise understanding of when and how it fails, and what to build in response.
- Game-theoretic foundations. Formal models showing that evaluation reduces the commitment power needed for cooperation by orders of magnitude.
- The entanglement conjecture. Formalising whether the asymmetry between creating and detecting fake environments is fundamental.
- Empirical testbed. A "sandbox of the future internet" to measure how hard it is to fake a convincing environment at increasing levels of AI capability.
- Cooperative evaluation. Formal conditions under which voluntary, mutually beneficial evaluation can be made robust.
Last substantive update: March 2026