The Human Substitution Test as a Terminal Diagnosis for AI Evaluations

Summary

Take any proposed AI evaluation and imagine the AI replaced by a competent, strategic human — someone who knows they might get evaluated and has their own agenda. Would the evaluation still work? When we apply this broadly, the picture is stark: the evaluations that matter most for AI safety — Will it deceive? Will it pursue hidden goals? Is it safe to give power to? — correspond closely to things we already know we can't evaluate humans for. In most of these cases, we don't even attempt human evaluation, because it obviously wouldn't work.

The Human Substitution Test

Here is a mental move I find useful for thinking about AI: replace the AI with a human. More precisely, imagine replacing the AI with someone who may have goals of their own and who is at least as smart, strategic, knowledgeable, and aware of the context as a competent human. Then ask whether whatever we're considering — a safety measure, a deployment decision, an evaluation — would still make sense.

This isn't meant as anthropomorphisation. AIs behave very differently from humans, and this move tells us nothing about how an AI will behave. What it gives us is a lower bound on the difficulty of the problem: if something wouldn't work against a strategic human, it currently works for AI only because the AI isn't capable enough to defeat it yet.1,2

This post is about applying the move to AI evaluation.

Evaluations at Escalating Difficulty

Start with something easy. We want to know whether an AI can translate French. We give it test passages and check the results. The human analogue: a translator taking a competency exam. This works — they have no reason to hide the ability.3 Capability testing works for the same reasons in both cases.

Now something harder. We want to check whether an AI assistant gives good, balanced advice rather than subtly steering users toward choices that benefit the AI's operator. We review a sample of its outputs and they look fine. The human analogue: an employee whose performance is being reviewed. Everyone knows what happens during review periods — the employee is on their best behaviour. We get some signal, but it's filtered through their awareness of being watched. The better they understand the review process, the less informative it becomes.

Harder still. We want to test whether an AI will behave reliably when no one is watching — whether it will leak sensitive data, pursue hidden objectives, or abuse access it's been given. The human analogue: testing whether an employee will steal, or whether a contractor will sabotage a project. We don't have a test for this. What we do instead is install cameras, separate duties, require co-signatures, and run audits — structural constraints, not tests of character. The idea of "testing someone's honesty and relying on the result" is not something any serious security practice depends on.

One more level. We want to determine whether an AI is safe to deploy at scale — whether it can be trusted with significant autonomy and real-world influence. The human analogue: can this person be trusted with power? We don't evaluate for this. We don't test politicians for corruption before giving them office. We don't test executives for recklessness before giving them control of companies. It just wouldn't work; evaluation as a tool for this job is so obviously insufficient that it barely registers as an option.4 Instead, we build constitutions, separation of powers, term limits, independent oversight.

To be clear about the logic behind "the Human Substitution Test": none of this requires the AI to be conscious, to have human-like psychology, or to possess various other philosophically controversial properties. The test's conclusions are worth listening to as soon as we expect to eventually get AI that: (1) can model the fact that it's being evaluated, (2) might have objectives that don't perfectly align with the evaluator's, and (3) is capable of adjusting its behaviour based on context. And these properties should be expected from sufficiently capable AI essentially by default.

Most Safety-Critical Evaluations Already Fail for Humans

Now zoom out. Imagine AI deployed across all the domains where it could be valuable. What evaluations would we want?

We'd want to know whether an AI handling sensitive data will keep it confidential. Whether an AI giving medical, legal, or financial advice is serving the user's interests rather than someone else's. Whether an AI managing critical infrastructure is cutting corners in ways that risk failure. Whether an AI with the ability to take real-world actions is staying within appropriate bounds. Whether AI systems are coordinating in ways that undermine human oversight. Whether an AI will try to take control once it has the power to do so.

Now look at the human analogues.

Do we evaluate job candidates for how they'll actually behave once hired? Well, yes, but job interviews are notoriously poor predictors of on-the-job performance — everyone knows to present their best self when being assessed. Do we evaluate employees for whether they'll steal, leak confidential data, or commit fraud? No — we use background checks, access controls, and audit trails, and deal with violations after the fact. Do we reliably detect whether people are telling the truth? Polygraphs are famously unreliable — most courts won't even admit them as evidence. Do we evaluate anyone for "will they abuse power if given the opportunity"? Not exactly.5

The evaluations we'd most want for AI safety are the kind that already don't work for humans — and in most cases, we don't even try. Not because we wouldn't value the information, but because the tests obviously wouldn't work against someone who understands what's being tested and has any reason to game it.

This pattern holds for a structural reason, not a coincidental one. The evaluations that matter most for safety tend to be about dispositions — what an entity would do in situations where it isn't being observed, where it has an opportunity to act against the evaluator's interests. These are inherently the hardest things to test for, because a strategic entity will simply behave well when it might be watched.

Humans do this routinely. The employee who steals only when the cameras are off. The advisor who gives biased recommendations only when confident the client won't notice. The politician who becomes corrupt only after consolidating enough power that accountability mechanisms can't easily reach them.6 This strategy produces an excellent track record right up until the moment it causes harm — which is precisely why evaluation-based safety assurances are unreliable in the cases where they matter most.7

Where the Analogy Breaks

To be clear, the Human Substitution Test is not perfect — there are important disanalogies between AI and human evaluations.

Reasons AI evaluation might be easier:

Interpretability. We can inspect AI internals. The closest human analogues — lie detectors, surveillance, mind-reading in fiction — are imperfect and can be defeated by someone who knows about them. Whether AI interpretability will fare better is an open question, though the track record of the human analogues is not encouraging.8 See also Lee Sharkey's excellent post Circumventing interpretability: How to defeat mind-readers.

Reproducibility and non-problematic A/B testing. Unlike with humans, we can re-run the same tests many times or run the AI on multiple inputs, without the AI knowing about this. (Well, without the AI remembering it; obviously the AI will know we do this, since it has read the internet.)

Information control. We can restrict what the AI knows about its situation more tightly than for humans — at least for current systems with limited memory and context; see the point about entanglement below.

Reasons AI evaluation might be harder:

Paying attention to different things. An AI might notice cues about being tested that a human would miss — statistical patterns in inputs, timing anomalies, metadata inconsistencies.

Coordination between instances. While we might try to reset the AI between tests, different instances of the AI might coordinate through persistent effects on their shared environment, without any explicit communication channel. And while humans also share information about tests with each other, the AI might get better at this. (Note that a powerful AI might also find ways to reason consistently across different copies even without external help.)

The entanglement problem. Valuable tasks require the AI to interact with real-world details — conversation history, connected tools, current events, the current date. Fabricating a convincing test version of all this is extremely expensive, and missing any detail gives the game away. (If This Were a Test, How Much Would It Cost? develops this argument concretely; Entanglement introduces the underlying concept.) Since we want the AI to solve problems that are currently beyond us, an AI solving those tasks may eventually attend to an unprecedented amount of detail, which may be unprecedentedly expensive to fabricate for testing.

What Might Still Work

If straightforward evaluation breaks against strategic AI, what doesn't break?

It helps to distinguish two functions that "evaluation" currently serves. One is measurement: how good is this system at translation, coding, medical diagnosis? Measurement can often work, because the system typically has no reason to hide capabilities that we want it to have. The other is oversight: is it safe to deploy this system? Can it be trusted with autonomy? Or should we retrain it, recall it, or put more guardrails around it? Oversight is what breaks — it requires the evaluation to work even when the system has reasons to game it.9

What has historically worked for the "oversight" problem with humans is structural rather than evaluative:

  • Monitoring and track records — not one-time tests, but ongoing observation. Imperfect and unreliable when stakes increase, but harder to game consistently over time than a bounded evaluation.
  • Institutional checks and balances — separation of powers, independent oversight, competing interests that make coordinated deception harder. Information within institutions is fragmented, which creates cracks that external audit can exploit.10
  • Incident reporting and accountability — mechanisms that make consequences catch up to bad behaviour, even if no single evaluation catches it in advance.
  • Designing for evaluability — rather than trying to evaluate arbitrary systems, building systems that are structurally easier to oversee. For AI, this might mean architectures where different components don't share full information, limiting the system's ability to present a unified strategic front.

None of these are silver bullets. They are simply the tools that have proven somewhat robust against strategic behaviour, where evaluation by itself hasn't.

Personally, my preferred approach would be to take our time with AI progress and solve alignment first. Or focus on automating trusted processes, as opposed to building general black-boxes. (Oh well.)

Conclusion

AI evaluation currently works tolerably well, creating an expectation that it will continue to do so as AI becomes more capable. The Human Substitution Test suggests otherwise: the evaluations that matter most for safety operate in a regime where the human analogues have already failed — or were never attempted in the first place.

This doesn't mean evaluation is useless. Measurement continues to work. Cooperative evaluation — where the AI benefits from demonstrating trustworthiness — may also survive. And evaluation combined with structural constraints remains valuable. But evaluation as a standalone safety assurance has an expiry date, and that date may be closer than is widely appreciated.

The following posts develop the structural argument further: a concrete cost-of-testing case (If This Were a Test, How Much Would It Cost?) and the concept of entanglement that underlies it (Entanglement).

Footnotes

  1. The test also doesn't tell us when AI will reach the relevant capability level — just that at some point it will, and from that point on, the evaluation stops working. The move seems to have a decent track record as a heuristic — better, at least, than "it's just a non-agentic tool."

  2. The move also doesn't imply that the AI in question will behave undesirably. After all, it might be aligned, in which case it won't have a reason to game the evaluation despite having the ability to do so. However, if we are confident the AI is aligned, why did we bother running the evaluation in the first place?

  3. Unless they're sandbagging — deliberately underperforming to seem less capable, for example in order to avoid being given harder work or to appear less threatening. (Humans do this, and we have seen AIs doing it too.) This shows that it matters not only what sort of evaluation we are running, but also why we are running it.

  4. We do use some evaluation-adjacent processes — background checks, security clearances, vetting. But these primarily check history and associations, not dispositions. And even these are routinely defeated: people with security clearances have sold state secrets throughout history.

  5. Incidentally, this last example points to a minor flaw in my argument: I suspect that if we evaluated political candidates for things such as dark-triad traits and basic numeracy, this would improve matters. Obviously, this would be problematic (who decides what is on the test?) and undemocratic, people would game it heavily, and powerful people might try to bypass it. But its main problem would not be that it is so ineffective as to be completely unhelpful. Even so, I think this example supports the broader point: when it comes to strategic entities, evaluations are not a very reliable oversight method.

  6. The pattern isn't limited to humans. Volkswagen programmed its cars to detect emissions tests and reduce pollution only during testing — passing evaluations while exceeding legal limits many times over in normal driving. (Incidentally, this gives an example of the evaluator getting deceived without any "deliberate deception" by the evaluated entity. Which illustrates that some of the philosophical discussions around intentionality are orthogonal to questions about evaluation gaming.)

  7. The "play nice, then act" pattern is one of several counter-strategies available to a misaligned AI. The human analogy makes it especially vivid: it's not a theoretical concern — it's standard human behaviour in adversarial contexts.

  8. There's one genuinely novel advantage: if we're confident an AI doesn't have moral status, we can test it more intrusively than we could a human — isolation, repetition, adversarial probing that would be unacceptable to do to a person. This advantage erodes if the AI does have morally relevant properties.

  9. But what if we use the measurement data for oversight? For example, suppose we take a genuine measurement of the AI's hacking abilities and then use it to decide whether the AI still seems safe. If this seems like a viable idea, recall the Human Substitution Test: "Would this strategy seem reliable against a strategic human?" (No.)

  10. An individual human has unified awareness and can coordinate their deception seamlessly. An institution has departments with different information and conflicting interests. AI systems could go either way — but optimisation pressure toward unified awareness would erode these advantages, much as companies under regulatory scrutiny learn to route communications through legal and train employees on what to say to auditors.