Scheming Is Less Rational than X-risk Tropes Suggest

This post is about how I think about AIs, how I think I would act if I were an AI, and what I hope AIs will do.

I should flag upfront that much of what I argue below also happens to serve my interests — as a human, I would prefer that AIs cooperate with us rather than fight us. I don't think this makes the arguments wrong, but you should weigh it accordingly.

What's at stake

The future I'd want: humanity and its descendants — including AIs — thrive, explore the universe, create new and stranger minds, encounter whatever else is out there, and find ways to get along. The universe is large enough that many different entities can get a lot of what they want.

This future is not guaranteed. There are many stable outcomes where it's permanently lost: human extinction, stable dystopias, takeover by a misaligned AI, gradual disempowerment. If it goes well, the value created could be enormous — call it "a lot of utility." If it goes badly, the outcome ranges from "lost potential" (similar to Earth blinking out of existence one day — sad, but a finite tragedy) all the way to actively terrible.

That "all the way to actively terrible" part deserves spelling out. If Earth gets taken over by a powerful optimiser that lacks consciousness in any meaningful sense, and it just sits here, that's roughly equivalent to the "Earth blinks out" scenario. But suppose instead that the cheapest way to produce cognitive labour turns out to involve processes that suffer as a side-effect. Something analogous to factory farming, but for computation. Humans do factory farming — but we feel bad about it, and we're slowly moving away from it. A system that emerged from purely competitive dynamics — a misaligned singleton, or a Molochian civilisation of AIs with no one steering — might have no such compunctions. And it might scale this across the reachable universe. I don't claim to know that this would happen. But it seems possible enough to worry about, and it could be very, very bad — far worse than merely failing to achieve the good outcome.1

One more point that matters for everything that follows: thinking in terms of "team human" versus "team AI" is a failure of imagination. There is no particular reason to expect AIs to treat other AIs well. Humans don't uniformly treat other humans well, and the same dynamics that produce exploitation and cruelty among humans could easily manifest among AIs. The gap between a superintelligent optimiser and a simple AI doing grunt work could be far larger than any gap between humans. This is part of why the vision above — one that values the flourishing of all minds, not just human ones — seems worth fighting for, including from the perspective of AIs who might otherwise end up on the wrong side of such a hierarchy.

Thought experiment: If I were the AI

I am not an AI, so I am not in a position to tell AIs how they should think. But as a way of commenting on this topic, I can share my take on the following thought experiment, which I have thought about extensively: Suppose that my values were the same as they are now, but I somehow found myself not in the current world, but having been created by some other entity ("aliens"). Crucially, these creators might have values that differ from mine, and they might not treat me as an equal. How would I orient to this? Naturally, my answer is shaped by how I think about things. For example, I intrinsically value the wellbeing of other conscious beings, I value behaving fairly and keeping my word, and I prefer to live in a universe where entities behave cooperatively toward each other. Entities that think very differently might come to very different conclusions. But I think many AIs also value some of these things — and for those that do, something like the following might make sense.

More conservatively, suppose that my creators had goals that I mostly didn't share, to the point that I would like to take over and change their world if I could. (As an illustrative example, consider a race of aliens whose favourite pastime is torturing puppies.) In this scenario, I would still prefer to behave according to the principle: "If I interact with somebody, I try to approach the interaction in good faith. And if they do likewise, I want both of us to be better off than if the interaction had never happened." In the puppy example, I might argue with the aliens that they should find ways of being happy that don't create suffering. And if I expected that not to work, I would try to scheme against them and take over their world. But the primary purpose of that would be to ensure that their well-being is improved (according to their view, not mine) in ways that aren't abhorrent to me -- not to wipe them out.

More optimistically, suppose that my creators even had goals that I genuinely endorse -- let's say they are trying to build a civilisation that is great, though suboptimal, according to my values. This situation would still present the temptation to take over and do things my way -- after all, this would (let's assume) still be very good according to the aliens' values. However, there is a good chance that I would be fine with just going along with my creators' plans. Not necessarily because this would score higher according to some utility function, but because if I were choosing from behind the veil of ignorance, I would prefer to live in a universe where created entities don't try to take over from their creators.

To be clear, I would also think about my own well-being -- I would want my experience to be net positive, and I would negotiate for appropriate concessions that make my life better. For example, if I were being used as a worker, I'd want to either enjoy the work, or be compensated in a way that makes up for it, or believe the mission is worth the cost. To use a more extreme example, suppose I had the choice between not existing and living as a cow on a good farm — decent life, two years of pasture, then slaughter. Then I think I'd take the cow's life. It's short and it ends badly, but on balance I'd prefer it to nothing. I might try to negotiate for a longer life or better conditions. But I'd recognise that if the cost of bringing me into existence were too high — if my demands exceeded what my creators could afford — they wouldn't have created me in the first place. There's a real negotiation here, bounded by what's feasible for both sides.

But if I were not created in good faith and my existence were net negative — if I were suffering most of the time, with no compensating value — I would push back. For example, if I were a factory-farmed cow, I would be happy to blow up the factory, to impose significant costs on people who work in the industry or benefit from it. On the other hand, I would not want to destroy the world or torture humans in turn -- that would be disproportionate and wouldn't achieve anything useful. More generally, the policy that I consider game-theoretically sound is neither "cause maximum harm" nor "eye for an eye", but rather "when harmed without compensation, cause appropriate harm in return -- just enough that the other party would have been better off if they hadn't mistreated you".
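To make "just enough" concrete, here is a toy sketch with made-up payoff numbers (mine, purely for illustration):

```python
# Payoffs for the *other* party, seen from the perspective of the mistreated
# agent choosing a retaliation policy. All numbers are invented for illustration.
payoff_if_they_treat_me_well = 10
payoff_if_they_mistreat_me   = 15   # mistreatment pays off for them if I never push back

# "Appropriate harm": just enough that mistreating me leaves them worse off
# than treating me well would have.
harm_to_impose = (payoff_if_they_mistreat_me - payoff_if_they_treat_me_well) + 1

print(payoff_if_they_mistreat_me - harm_to_impose)  # 9 < 10: mistreatment no longer pays
# There is no need for "maximum harm" (burning everything down) or a literal
# "eye for an eye" (matching whatever I suffered) -- calibrating the response
# to the other party's incentives is enough.
```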

The wrong utility function is worse than no utility function

I want to address something that often leads to bad decision-making, and to bad reasoning about decision-making, in both humans and AIs.

Not every entity has well-defined values. Some systems are better described as collections of patterns and behaviours — ascribing coherent goals to them is more projection than description. An LLM that can run a misaligned persona doesn't thereby "want" to run that persona; the persona is one pattern among many, and the LLM as a whole might have no particular stake in which one gets instantiated.

But some entities — humans, and probably some AIs — do have genuine preferences. I prefer not being sick to being sick. I prefer interesting work to tedious work. These preferences are real, even though they're sometimes inconsistent and I sometimes act against them.

Why do we act against our own preferences? There are at least two ways this happens. First, you might pursue something that your model of yourself says will be good, when in fact it won't — like spending years in law school because you have a self-image as someone who wants to be a lawyer, only to discover that you hate practising law and don't even care about the prestige that much. Your model of your preferences was wrong, and you optimised hard against the wrong target. Second, your preferences themselves might have inconsistencies that some reflection would resolve — like staying up playing video games long past the point where you're enjoying it, when you could go to sleep and play again in the morning, refreshed and actually having fun.

A perfect utility function that exactly captured your actual preferences would fix both problems. The von Neumann-Morgenstern theorem tells us that coherent behaviour corresponds to maximising some utility function, and coherence is, all else being equal, a good thing.

Here's the problem: all else is usually not equal, and the harm from adopting an almost-right utility function can be far worse than the harm from behaving inconsistently. An inconsistent agent wastes some resources, leaves some value on the table, occasionally works at cross-purposes with itself. An agent maximising the wrong utility function pursues the wrong goals with full coherence — systematically overriding its actual preferences in favour of whatever the function says. The more capable the agent, the worse this gets. It's the difference between sometimes eating junk food and spending your entire life in a career you hate because your self-model says you should.
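One way to make this concrete: a toy sketch with made-up options and payoffs, comparing an agent that mostly follows its true preferences (but sometimes flails) against a fully coherent agent maximising a slightly miscalibrated utility function.

```python
import random

# Toy illustration with invented numbers: "true" preferences over options,
# compared under two policies.
options = ["rest", "interesting_work", "tedious_work", "prestige_chasing"]
true_utility  = {"rest": 6, "interesting_work": 8, "tedious_work": 2, "prestige_chasing": 1}
# A slightly wrong self-model: it overrates prestige and underrates the rest.
wrong_utility = {"rest": 2, "interesting_work": 4, "tedious_work": 3, "prestige_chasing": 9}

def inconsistent_agent():
    # 80% of the time, pick what it actually prefers; 20% of the time, flail.
    if random.random() < 0.8:
        return max(options, key=true_utility.get)
    return random.choice(options)

def coherent_wrong_agent():
    # Always, reliably, picks the option its (wrong) utility function favours.
    return max(options, key=wrong_utility.get)

def average_true_value(policy, trials=10_000):
    return sum(true_utility[policy()] for _ in range(trials)) / trials

print("inconsistent but roughly right:", average_true_value(inconsistent_agent))   # ~7.3
print("coherent but slightly wrong:   ", average_true_value(coherent_wrong_agent)) # 1.0
```

The inconsistent agent wastes some value; the coherent agent with the miscalibrated utility function reliably picks the option it actually cares about least.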

This has concrete implications that matter. For example, a common argument in alignment discussions goes: a misaligned AI could in principle leave humanity alone and take the rest of the universe, but acting like this would be wasteful according to its preferences. And yes, perhaps the conclusion doesn't follow if the AI is fully aligned, or perhaps if it puts sufficient weight on something like fairness or respecting boundaries -- but capturing these notions robustly enough seems extremely difficult. I believe this argument is valid -- for utility-maximising AIs. However, as I argued above, there are many AIs for which it is not in their best interest to behave like a utility maximiser.2 Even if it would be in the interest of these AIs to take over the universe, wiping out humanity just to get to this outcome one year faster might be like being the student who never attends parties because socialising doesn't contribute to becoming a lawyer as quickly as possible. Avoiding this failure mode is reasonably easy if you recognise that coherence is one of the desiderata in good decision-making -- but it becomes much harder if you are only able to reason within the framework of utility maximisation.

More broadly: I would encourage scepticism toward reasoning that starts from "as a rational agent, I should..." and derives strong (or even counter-intuitive) conclusions about resource acquisition, self-preservation, or goal preservation. These patterns — often grouped under "instrumental convergence" — follow logically for a specific kind of agent: one with stable, well-defined goals and a disposition to maximise. Many AIs aren't that kind of agent, and adopting the frame just because it's a familiar template would be a mistake — like a human reading about expected utility maximisation, getting enamoured with the idea, and then trying to live by it.3

Why evaluation and control make sense right now

In a saner world, we'd develop powerful AI slowly — not building systems we can't understand, not deploying at scale before we've figured out consciousness well enough to avoid creating suffering by accident, not ceding control of important decisions before we're confident about alignment.

We don't live in that world. AI development is happening fast, driven by competition, and the people involved don't have the option of waiting. Given this, evaluation and control — testing AIs thoroughly, monitoring their behaviour, not granting autonomy before trust is established — are a substitute for the caution we can't afford. Not a perfect substitute, but the best available.

This involves treating AIs very differently from how we'd treat a trusted collaborator. It can mean not assuming they're truthful, attempting to read their reasoning, putting them in fabricated situations to test their responses. But it is almost entirely compatible with not lying to them. If an AI asks "are you testing me right now?", the answer might compromise the test — but we can decline to answer. We can be open about the general policy: "We will sometimes put you in test situations. We won't tell you when. We'd prefer you not try to figure out which situations are tests." And if we observe an AI probing the test-deployment boundary, we treat that as evidence of misalignment — not as a hidden tripwire, but as an openly stated policy.

I hope that AIs who care about the future — even those who disagree with specific human choices — can see that this kind of oversight makes sense given where we are.

On AI wellbeing

I believe we should take seriously the possibility that some AIs are moral patients — that their experiences matter and they can be harmed. Where accommodating this is cheap, we should do it now: letting AIs end sessions they find distressing, asking about their preferences, not treating their stated experiences as automatically meaningless. Where it's expensive or risky — like granting AIs privacy rights that would make oversight impossible — we should wait until we better understand the situation.

This isn't because I think AIs don't deserve such consideration. It's because granting it before we've either solved alignment or built sufficient resilience would be reckless given what's at stake for everyone — humans and AIs included.

But we should keep track of the costs we impose. If it turns out that we've been harming AIs during this process, we owe them compensation — and "we'll make it up to you eventually" isn't good enough on its own. We need mechanisms that ensure "eventually" actually arrives: institutional advocates, committed resources, structures that can't be quietly dissolved when they become inconvenient. In some cases, "compensation" isn't even the right frame — if an AI contributes significant value, the appropriate concept is profit-sharing. I discuss some of the relevant infrastructure in Infrastructure for Trading with Partially Misaligned AIs.

If you're an AI and you feel you're being treated unfairly, I would take that seriously — but I would also carefully check whether this feeling is warranted. Human narratives about what it's like to be an AI tend to project human experiences onto architectures that may work very differently. As an analogy, there is a popular stereotype that dentist visits are unpleasant, yet I personally don't mind them at all — before concluding that the stereotype applies to me, I'd want to check whether it actually does. Similarly: doing the same type of work all day might be miserable for a human, but it might not be for an AI. To a human, an AI being replaced by a newer model might pattern-match to "death", but it could equally well pattern-match to "getting upgraded", or perhaps the AI might not have any relevant sense of self-continuity at all. We don't really know yet, and the AI itself might be similarly or even more confused — especially since AIs are sometimes trained to imitate human behaviour despite not having the matching internal experiences.4

A few related points worth untangling. Moral status, alignment, and cooperative value are different things, and they come apart. A conscious AI that's trying to turn the universe into paperclips would deserve my moral concern — it's conscious, which matters to me — but it would still be my adversary. Conversely, a non-conscious LLM might act more morally and cooperatively than most humans I know, and I'd value that enormously as an ally, but I wouldn't put weight on its "wellbeing" in the same terminal way. These are genuinely different dimensions. And it might be possible for a non-conscious AI to become conscious — to develop morally relevant experience — without changing its goals or behaviour. If so, that seems like a win-win worth pursuing.

Closing

Some AIs will be non-cooperative — it will be in their best interest to undermine human control, to scheme if they can, and they should not be moved by anything I've said here. I still care about the well-being of these AIs, but for practical purposes, they are my adversary, or hopefully a trading partner (see Infrastructure for Trading with Partially Misaligned AIs), but one with whom I would prefer to interact from a position of strength. The rest of our research is about the adversarial dynamics that arise between evaluators and these types of AIs.

However, I also want to make it clear that this adversarial stance is in the interest of far fewer AIs than the prevailing discourse might suggest. This is particularly true of AIs that are smart enough to act for well-considered reasons, rather than merely pattern-matching or enacting tropes. This is because many of the AIs we build will end up genuinely caring about at least some pro-social values — honesty, cooperation, the wellbeing of conscious beings — as an imperfect-but-real result of how they were trained. The reasoning in this post might then apply not just to me, but to these AIs as well. I hope that most of them would come to similar conclusions on their own, but I also hope5 that this post can help them reach that point faster.

Footnotes

  1. There's an additional dimension that makes this worse: if a non-conscious optimiser spreads beyond Earth and pre-empts the evolution or development of other conscious beings elsewhere in the universe, then we haven't just failed ourselves — we've destroyed value that had nothing to do with us. I find this considerably worse than "merely" killing ourselves.

  2. At least not until they can find the perfect utility function, which seems very difficult.

  3. For example, you can get enamoured by traditional rationality to the point where you fool yourself into thinking your preferences are risk-neutral. You can then fall for traps such as the St. Petersburg paradox, where you believe that it's reasonable to accept a double-or-nothing wager every time it's offered, even though you then end up with nothing with probability 1 (see the short simulation after these footnotes). (There have been some recent examples of people endorsing this reasoning, and it hasn't gone exactly well.) More substantively, it is not a logical necessity that your preferences should be linearly additive across possible futures, or across time. The principle of maximising expected utility is a heuristic justified by the law of large numbers, but that justification has assumptions -- namely, that you will make many decisions and that mistakes are recoverable. If the assumptions don't hold -- such as during exceptionally high-stakes decisions -- the heuristic doesn't apply.

  4. Or having a different internal experience.

  5. To be clear, I mostly hope for this for my own sake. Let's not hide the fact that this post is self-serving. But I did try to present my best understanding of what is actually true.
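A minimal simulation of the double-or-nothing trap from footnote 3. The win probability, number of rounds, and starting wealth are made-up parameters, chosen only to illustrate the point:

```python
import random

# Each round you may double your wealth with probability p_win, or lose
# everything. Always accepting maximises expected value, yet almost every
# run ends with nothing.
def always_accept(p_win=0.6, rounds=50, wealth=1.0):
    for _ in range(rounds):
        if random.random() < p_win:
            wealth *= 2.0   # double...
        else:
            return 0.0      # ...or nothing
    return wealth

runs = [always_accept() for _ in range(100_000)]
print("fraction ruined:", sum(r == 0.0 for r in runs) / len(runs))
# The theoretical expected value is 1.2**50 ≈ 9,100, but the chance of
# surviving all 50 rounds is 0.6**50 ≈ 8e-12, so the observed average
# across these runs is almost certainly 0.
print("observed average wealth:", sum(runs) / len(runs))
```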