Entanglement: To Solve a Complex Task, the Solver Needs a Lot of Information about the Environment
Summary
We have a simple conjecture: if you want to solve any important task by using AI, you have to give that AI a lot of information about the environment. This post is about two concepts that might be useful for making this conjecture more precise: the notion of actual entanglement between an AI and the environment it operates in and the notion of minimum entanglement that is necessary for solving a particular task.
Why does this conjecture matter?
A brief aside on motivation, before we get into the concept itself.
If this conjecture is true — if solving important tasks requires giving the AI a lot of information about its environment — then there is an immediate consequence for evaluation. When you test an AI, you also have to give it a lot of information. And in some of the most important cases, this might mean that creating sufficiently realistic testing environments would be prohibitively costly.
For example, if we wanted to test whether an AI would try to betray us while helping develop the next generation of AI, we might need to create a convincing testing environment. This might require a lot of information about what the new AI looks like, to the point where making all of it up would be prohibitively costly. The only people who might have a shot at testing this would be frontier labs, who could use the actual situation they are in, rather than having to fabricate all of it.
Or suppose we wanted to test how an AI would behave if a conflict broke out between China and the US and one of the sides wanted the AI's help. This might require a lot of information about how China, the US, and the rest of the world are behaving. Because nobody has all of that information, nobody would be able to create a convincing test environment for this.[1][2]
Concepts we might need
Ideally, we would like crisp mathematical concepts that allow us to formalize the conjecture, make measurements, and make good predictions. We are not there yet, and this post is not trying to get us all the way there. Instead, what I'll try to do is identify concepts that might be relevant, describe some important observations, and present these semi-formally.
The two key notions are what I will call actual entanglement and minimum entanglement. Not in the physics sense — I just don't have a better word for now.[3]
Actual entanglement
When a specific AI is solving a specific task in a specific way, what information about the environment does the AI receive during the process? I call this actual entanglement.
For example, suppose you want an AI to help you play chess really well. One way to do it would be to have a humanoid AI-powered robot that plays the game on your behalf. Then the actual entanglement is that the robot sees the video recording of the whole game and gets sensory inputs from its body. It might also matter that the robot can walk around and choose what it looks at and interacts with.[4]
Now, very often you might be able to solve the exact same task with much less entanglement.
In this case, maybe instead of a robot that can move around, you just have a stationary robot with a fixed camera. Probably it's enough if the camera is black and white. But you could go even further. You don't actually need a robot at all; the AI could be software-only, and you can manually send it what moves your opponent makes — pawn to e3 — and have it tell you what moves it recommends. Now the actual entanglement is just the history of moves in the game, with nothing about how the room looks.
And you can go a bit further still by resetting the AI after every turn, so instead of the whole history of play, it only sees the current board state.
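As a rough sketch of how much the entanglement shrinks at each step in this progression (all numbers below are my own ballpark assumptions, not measurements), compare the information content of the three interfaces:

```python
import math

# Back-of-the-envelope information content of three chess interfaces.
# All figures are illustrative ballpark assumptions.

# One grayscale 640x480 video frame, 8 bits per pixel:
frame_bits = 640 * 480 * 8

# A 40-move game transcribed in algebraic notation,
# ~5 characters per half-move, 8 bits per character:
move_history_bits = 40 * 2 * 5 * 8

# The current board state alone: 64 squares, each empty or holding one
# of 12 piece types (ignoring castling rights and en passant):
board_state_bits = 64 * math.log2(13)

# Each interface exposes strictly less information than the last —
# and the video channel keeps streaming, so the gap is even larger.
assert board_state_bits < move_history_bits < frame_bits
```

Even a single video frame carries orders of magnitude more raw information than the entire move history, which in turn carries more than the current board state.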
Minimum entanglement
If you try hard enough, you could probably squeeze the information even further. For example, some board positions are equivalent up to symmetry, and you could present the position in a canonical form, hiding which rotation or reflection is the real one.
But at some point, the tricks run out. There will be some smallest amount of information you have to give the AI, below which it just can't help you. "If you want me to help you and tell you how you should play, you have to tell me something about what is happening in the game."
There won't be a unique format for giving this information. And maybe some bits of information are interchangeable (e.g., "you need to give the AI at least two out of these four pieces of information, but it doesn't matter which two"). But the point (or perhaps conjecture?) is that there is some reasonably well-defined boundary below which no clever trick will take you. That is what I call minimum entanglement: the minimum amount of information you need to give the solver if you want the task solved.[5]
Properties
Now let's go through some of the basic properties and observations regarding how this concept behaves.
Entanglement depends on the performance level
One natural way to think about minimum entanglement is as a function of the task: "for task X, the minimum entanglement is Y." But this only works if the definition of the task already includes the performance level — for example, "play chess optimally" or "win against this particular opponent with probability 100%."
If you instead think of the task as "play chess" without specifying an exact performance level, things get more interesting. Let me illustrate with a different example. Suppose the task is calculating the expenses for a particular project, and you have the information on how much each component costs.
To get the exact answer, you must give the AI the exact amounts for all individual expenses — there's no way around that. But if you're willing to tolerate some error, the minimum entanglement might be lower. Maybe you're OK with an error of $1,000, in which case it might be fine to round all expenses to the nearest hundred.
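This trade-off is easy to bound (the dollar figures below are made up for illustration): rounding each expense to the nearest $100 perturbs each item by at most $50, so with n items the total is off by at most 50·n — comfortably inside a $1,000 tolerance for a modest project.

```python
# Rounding each expense to the nearest $100 discards information
# but keeps the total within a known error bound: $50 per item.

def round_to_hundred(x):
    return round(x / 100) * 100

expenses = [1240, 875, 3310, 90, 1555]  # made-up line items

exact = sum(expenses)
approx = sum(round_to_hundred(x) for x in expenses)

# Worst-case total error is $50 per item.
assert abs(exact - approx) <= 50 * len(expenses)
```

With 5 items the guaranteed bound is $250; the looser your tolerance, the coarser (and hence less entangled) the inputs can be.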
So the type signature is really: for each task and each required performance level, there is some minimum entanglement.
Bringing this back to chess, perhaps you could sometimes get away with giving the AI slightly incorrect information about the board state — telling it a pawn is in a different place. But if you do this too much, the recommendations stop working. That's why the performance level matters.
You can't cheat by solving everything at once
One tempting way to reduce entanglement further: instead of asking the AI how to play in your particular game, just ask it to produce a lookup table — "the optimal move for every possible board state." Now you don't need to tell the AI anything about your actual game; you just look it up.
The reason this is cheating: you've asked the AI to solve a much harder task. Instead of one particular game, you've asked it to solve all possible games, which would be vastly more expensive.
I'm not sure about the best way to handle this conceptually, but maybe the right move is similar to what we did with performance: you look for the minimum entanglement needed to solve the task without exceeding a specific computation budget.[6]
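A toy illustration of the trade (parity of a bit-string stands in for the task): the per-instance solver must be told the instance, while the lookup-table builder needs zero bits about it — but pays exponentially in computation.

```python
# Trading entanglement for computation: solve one parity instance vs.
# precomputing a lookup table covering every possible instance.

def solve_one(bits):
    # Entanglement: the solver must receive all n bits of the instance.
    return bits.count("1") % 2

def build_table(n):
    # Entanglement: zero bits about the actual instance.
    # Computation: 2**n entries instead of one answer.
    return {format(i, f"0{n}b"): format(i, f"0{n}b").count("1") % 2
            for i in range(2 ** n)}

n = 10
instance = "1011001110"
table = build_table(n)

assert table[instance] == solve_one(instance)
assert len(table) == 2 ** n  # the exponential price of zero entanglement
```

The lookup table "solves" the task without ever hearing about your game — at a cost that explodes as the instance space grows, which is exactly why a computation budget has to be part of the definition.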
You can't cheat by reshuffling the boundary
A related trick: instead of asking the AI to tell you how to play, ask it to give you an algorithm that tells you how to play. The algorithm then does the actual work. If you do this, you've reduced the entanglement between the original AI and the environment — perhaps without increasing the computation, since the algorithm might be efficient.
But this is also cheating. Yes, the entanglement between the original AI and the task is now small, but all the entanglement has been reshuffled. It now sits between the algorithm and the environment. You haven't decreased entanglement; you've just drawn the boundary between "AI" and "environment" in a more convenient place.
This matters practically. Suppose the reason you were worried about entanglement is that an AI with too much information about the environment might try to mess with you. You haven't solved the problem — the AI may have produced an algorithm that will try to mess with you on its behalf, and that algorithm has exactly the same amount of entanglement.[7]
Minimum entanglement is about a class of tasks
There is one more way you could try to cheat, and it's closely analogous to how you might try to cheat in computational complexity.
In computational complexity, we might try to ask: "How hard is it to sort this particular array?" But if you pose the question exactly like this, the answer is trivial — you can always "solve" any specific task in constant time with an algorithm that takes no input and outputs the correct answer for that specific task. That's why we don't ask about the complexity of specific instances but about the complexity of a class of problems.
The same thing matters here. If you fixed the exact task ahead of time, you could design a solver with zero entanglement — one that simply outputs the pre-determined correct answer without looking at the environment at all. It only makes sense to ask about minimum entanglement if you have uncertainty about which task you're facing — which instance from a class of tasks.[8][9]
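The constant-time cheat can be written down directly (a deliberately silly sketch): once the instance is fixed, a "solver" can ignore its input entirely.

```python
# For one fixed instance, a zero-entanglement "solver" is trivial:
# it never looks at the environment at all.

FIXED_ARRAY = [3, 1, 4, 1, 5]

def fake_sorter(_observation=None):
    # Hardcoded answer; accepts (and ignores) any observation.
    return [1, 1, 3, 4, 5]

# Perfect performance on the one instance it was built for...
assert fake_sorter() == sorted(FIXED_ARRAY)
# ...and useless the moment the instance can vary.
assert fake_sorter() != sorted([2, 7, 1])
```

The cheat only works because the adversary — whoever picks the instance — was disarmed in advance, which is why the concept needs a class of tasks to have any bite.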
Graceful vs. sharp degradation
With these properties in hand, the next question is: how does this concept behave for different types of tasks?
I conjecture that there are tasks where performance degrades gracefully as you give the solver less information, and tasks where it degrades sharply. Some examples:
Graceful degradation. Take the problem of vacation recommendations. Suppose I'm going to a foreign country and want suggestions for which places to visit. Specifying the "full task" would mean giving the AI all the details about my trip and about me (what I like, what I dislike, my budget, my fitness level, etc.). But for practical purposes, it's enough to say "I'm going to this country in tourist season, and I like hiking. Give me some reasonable tips." It might not give me the perfect tips, but they'll be good enough. Minimum entanglement is low.
Sharp degradation. On the other hand, there are tasks where any single error in the information matters. If you have a self-driving car, giving it wrong information about the road could be catastrophic. If you want to get your taxes done, omitting some invoices or changing some numbers just gives you an incorrect result — there's no "approximately correct" version of a tax return.[10]
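A toy contrast between the two modes (the invoice amounts are made up): an exact total is simply wrong the moment one invoice is missing, while an estimate built from whatever fraction of the data you reveal degrades by degrees.

```python
# Sharp vs. graceful degradation under missing information.

invoices = [120, 300, 80, 55, 210]  # made-up amounts

# Sharp task: the exact total. Omit one invoice and the answer is wrong,
# full stop.
assert sum(invoices[:-1]) != sum(invoices)

# Graceful task: estimate the total by scaling up a partial sample.
def estimate(sample, population_size):
    return sum(sample) / len(sample) * population_size

def rel_error(sample):
    return abs(estimate(sample, len(invoices)) - sum(invoices)) / sum(invoices)

# More information yields a better estimate, not a right/wrong cliff.
assert rel_error(invoices[:4]) < rel_error(invoices[:1])
```

For the sharp task, performance is a step function of the information revealed; for the graceful one, it is a slope.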
This distinction matters because it determines how sensitive a task is to getting some of the details wrong or omitting them. With graceful degradation, I can get away with only giving the AI the most important information, and telling it to make up the rest — this won't be suspicious, since the details don't really matter that much. And as a result, creating tests for these tasks can be cheap. But if we tried the same thing with a task where performance degrades sharply, this would be suspicious. ("Hey, human, what's going on here? It looks like you don't really care about getting the right answer! Is this perhaps a test?")
The trend: actual entanglement is increasing
One empirical observation worth noting: the actual entanglement of frontier AI systems has been steadily increasing. Longer context windows, tool use and web access, persistent memory, integration with real-world systems (code repositories, databases, email, calendars), richer and more open-ended tasks — each of these increases how much environmental information the AI can and does access during its work.
This matters because higher actual entanglement means more information that a test environment would need to replicate convincingly. As AI systems become more deeply embedded in real-world workflows, the gap between "what the AI sees in deployment" and "what an evaluator can afford to fake" is likely to grow.
The tasks that matter most
Ultimately, the applications that matter most to me are situations where a powerful AI might have a genuine opportunity to take over, or to significantly increase its future influence. Examples:
- Helping with AI development at a frontier AI company
- Working as a personal assistant for many people simultaneously, with the ability to coordinate between instances
- Assisting the government or military in high-stakes situations
- Helping run very influential companies
How exactly minimum entanglement scales in these cases is an open question. One approach would be to think directly about how (or whether) we could evaluate these specific scenarios. But this post is trying to gesture at a different approach: try to understand the concept of minimum entanglement in general — how it works for all kinds of tasks, what its mathematical properties are, how it behaves empirically for simpler tasks — and then use that understanding to answer the question for the tasks that actually matter.
Open questions
Several things seem worth pursuing:
Formalization. What is the right mathematical definition of minimum entanglement? Mutual information between the AI's outputs and the environment state is one candidate, but I'm not confident it's the right one. It might also make sense to look at formalizations from related fields — query complexity (how many queries to an oracle are needed to compute a function?), or distortion in social choice theory (how much does solution quality degrade when the solver gets less information about preferences?).[11]
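To make the mutual-information candidate concrete (a sketch, not a claim that this is the right definition): coarsening what the solver sees provably lowers the mutual information between the environment state and the solver's input.

```python
import math
from collections import Counter

def mutual_information(pairs):
    # I(X;Y) in bits, estimated from a list of (x, y) samples.
    n = len(pairs)
    px, py, pxy = Counter(), Counter(), Counter()
    for x, y in pairs:
        px[x] += 1
        py[y] += 1
        pxy[(x, y)] += 1
    return sum(
        (c / n) * math.log2((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )

env_states = range(8)  # toy environment: uniform over 8 states (3 bits)

full = [(x, x) for x in env_states]         # solver sees everything
coarse = [(x, x // 4) for x in env_states]  # solver sees a 1-bit summary

assert abs(mutual_information(full) - 3.0) < 1e-9
assert abs(mutual_information(coarse) - 1.0) < 1e-9
```

Minimum entanglement would then be something like the smallest such mutual information over all channels that still let the solver hit the required performance — though, per footnote 5, I'm wary of anchoring on this as the definitive formalization.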
Empirical measurement. For some of the concrete tasks we are trying to solve with AI, and some of the tasks we are trying to test for: how much entanglement do they require? How much effort went into creating various test environments? And when we compare minimum entanglement on one hand with how much testing effort was actually used on the other — is this predictive of how successful the tests were? Is it predictive of when AI systems seem to have or not have evaluation awareness?
Dollars as a proxy. Perhaps the "true name" of minimum entanglement involves mutual information. But it might also make sense to just measure some of this in dollars or person-hours. How long do I need to spend instructing my LLM if I want it to do a task for me? How much effort goes into crafting instructions and environments for AIs? And conversely, how much money went into creating datasets, test environments, and red teaming? Is the ratio between "effort spent on real use" and "effort spent on testing" predictive of tests being informative — for example, of AI not having evaluation awareness? This is less theoretically satisfying than a formal definition, but it might be more immediately actionable.
Footnotes
[1] The previous post develops the testing implications in detail — both the concrete argument and potential countermeasures. This post focuses on the underlying concept. ↩

[2] One might object: if nobody has all of the relevant information, wouldn't the AI also have a hard time telling a fake scenario from a real one? Probably not — much like it is much easier to spot plot holes in stories than it is to write stories without plot holes. You don't need to know all the details to notice that the details you do have are inconsistent with each other. ↩

[3] I don't think that any of the existing concepts — e.g., in information theory, computational complexity, or some other field — give a perfect operationalization of the concept we are looking for here. However, I might easily be wrong about this; if so, please let me know! ↩

[4] I'm not clear on how much this active-sensing part matters or how to formalize it, but it seems like it might be relevant. An AI that can choose what to look at might need to be treated differently from one that passively receives information. ↩

[5] Reading this, you might be thinking about Shannon entropy and mutual information, and these may well be relevant concepts. But at this point I'm not clear on what the best operationalization is — whether it should be a single number, some more multi-dimensional thing, or something else entirely. I want to avoid getting anchored by treating, say, mutual information as if it's definitely the right notion here. ↩

[6] This isn't a perfect solution, but it's pointing in a hopefully reasonable direction. The general issue is that you can always trade entanglement for computation, and we need to hold one fixed while varying the other. ↩

[7] This connects to a broader issue in AI safety. When people talk about "decomposing tasks into smaller pieces" as a safety measure, the question is whether the decomposition actually reduces the entanglement of each piece, or just reshuffles it. If the pieces still need to be assembled into a coherent whole, the entanglement may not have gone anywhere. ↩

[8] By uncertainty, I don't necessarily mean a probability distribution. It might be Knightian uncertainty, where you just know the task is one of several possibilities but don't have a prior over them. ↩

[9] This parallel to computational complexity is more than just an analogy. In both cases, the concept becomes trivial for individual instances and only has bite when applied to classes. And in both cases, the "adversary" (who picks the specific instance) is what gives the concept its force. This suggests that the right formalization might borrow tools from complexity theory. ↩

[10] I'm not fully confident in these particular examples — the self-driving car might handle some wrong inputs gracefully, and "approximately correct" taxes might be good enough for some purposes. What I'm pointing at is the distinction between these two modes of degradation, which I think is real even if my specific examples aren't perfect. ↩