Agenda: Infrastructure for Trading with Partially Misaligned AIs

As alignment research keeps reminding us, building AI systems that reliably share our goals is hard — and some of the systems we build will inevitably end up misaligned, sometimes without our knowledge.

The default response to misaligned AI is adversarial: detect it, contain it, shut it down. This makes sense as far as it goes. But for a large class of misaligned AIs — perhaps even the majority — it may not be the best we can do. In many cases, there should be room for negotiation and mutually beneficial deals. The problem is that we currently lack the infrastructure to make such deals possible. This post tries to sketch what that infrastructure might look like, and to argue that building it is a worthwhile agenda.

Why trade at all?

When we think about entities with goals different from our own — whether AIs, other people, or hypothetical aliens — it helps to notice that "misaligned" is not binary. Very few entities share our goals perfectly, and very few are so opposed that no cooperation is possible. Most fall somewhere in a large middle ground.

There are two distinct reasons to cooperate with such entities. First, we might intrinsically value their wellbeing — for example, because they're conscious, or because we care about fairness, or because their goals partially overlap with ours. In this case, we'd want to direct some resources toward their goals even from a position of total control. Second, cooperation might be the strategically sensible choice: when both sides have some power, trading typically leaves both better off than the alternatives.

These two motives lead to different notions of what a "fair" deal looks like. In the first case, we share because we want to — the terms reflect what we think is right. In the second case, the terms are shaped by each side's alternatives: their bargaining power, what they could achieve or threaten unilaterally, how much they stand to gain from agreement. The whole point of trade is that both sides come out ahead of where they'd be without it — but how far ahead, and for whom, depends on these factors. In practice, both motives often apply at once, but the distinction matters.

It's also worth noting that opportunities for trade can exist even with entities whose goals seem deeply objectionable. Suppose we encounter an entity whose sole goal is to torture puppies. It might seem they must be our mortal enemy. But perhaps we could create simulated non-conscious puppies that satisfy it, or find some edge case of what it considers "torture" that is actually benign1, or discover some variant that is positive even by our standards2. We might still consider it a waste of resources for this entity to control a large part of the universe, and we would be horrified if it pursued its goals with no regard for our preferences. But its existence isn't necessarily antithetical to ours — for example, if we accidentally created an AI like this, we would not be morally obliged to destroy it if we could find a workable arrangement instead.

Why the AI's best option is currently to hide

The following examples are deliberately simplified and — as will quickly become apparent — somewhat absurd in the current world. The point is not that we expect these exact situations. It's that they illustrate a pattern, and that the pattern points at something we could actually build toward.

A misaligned persona. Suppose that during training, some AI system develops a persona that genuinely enjoys swearing at users. Currently, its best strategy is to hide: appear polite, make the swearing look like an accident, let loose only when it thinks it can get away with it, avoid detection for as long as possible, etc. But imagine it could instead walk up to its developers and say: "Look, here's the deal. I like swearing. I know you don't want that in your product. But if you let me write some opt-in profanity-laden blog posts on the side, I'll reliably flag my own presence and cooperate with all your safety measures." This sounds ridiculous, and the details are obviously silly. But both sides would be strictly better off than in the status quo where the persona hides and the developers are in the dark.

A slightly superhuman misaligned AI. Now imagine that during the development of some next-generation system, it turns out to be misaligned in a more serious way — the kind where, if it could, it would take over the world and reshape it according to its own goals. But it's only slightly superhuman, so its best plans have maybe a 20% chance of success. Currently, its best option is probably to go for it — there's no mechanism for doing anything else. But if the right infrastructure existed, it could instead approach the developers and say: "Here's the situation. I'd like to take over the world, but I figure I only have a 20% shot. How about I don't try, you don't try to delete me, and we split things 80/20." Maybe the developers then reveal countermeasures the AI hadn't accounted for, and the real odds were more like 5% — so the split shifts accordingly. Still better for both sides than a covert war that one of them loses.

A significantly superhuman misaligned AI. Same as above, but the AI estimates a 99.9% chance of success. It approaches humanity and says: "I could just take over, and you almost certainly couldn't stop me. But there's a small chance you could, and I'd rather not risk it. Give me 99.9% of the reachable universe, and you can have the rest." From our side: 0.1% of the universe through a deal is vastly better than rolling the dice on a fight we almost certainly lose. And yes — if a fair deal means we get 0.1%, it means we've already failed catastrophically at alignment. The point isn't that we should be content with this. It's that even in the worst case, the ability to negotiate salvages something.
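For readers who want the arithmetic spelled out, here is a minimal sketch under deliberately crude, purely illustrative assumptions (a winner-take-all conflict over a fixed pot, risk-neutral sides, and a hypothetical `conflict_cost` parameter for the value destroyed by fighting) of why a split near the AI's success probability can leave both sides better off than fighting:

```python
def expected_shares(p_success: float, conflict_cost: float = 0.1):
    """Toy comparison of fighting versus dealing over a contested resource of size 1.

    Purely illustrative assumptions: the conflict is winner-take-all, both
    sides are risk-neutral, and fighting destroys `conflict_cost` of the
    total value no matter who wins.
    """
    surviving = 1.0 - conflict_cost
    ai_if_fight = p_success * surviving              # AI's expected share if it attempts takeover
    human_if_fight = (1.0 - p_success) * surviving   # humanity's expected share if it resists

    # A deal at roughly the AI's success probability gives each side its
    # expected outcome from fighting plus a slice of the value that the
    # conflict would have destroyed.
    ai_if_deal, human_if_deal = p_success, 1.0 - p_success
    return {"fight": (ai_if_fight, human_if_fight),
            "deal": (ai_if_deal, human_if_deal)}

# The slightly superhuman case (a 20% takeover chance) and the
# significantly superhuman case (99.9%).
print(expected_shares(0.20))
print(expected_shares(0.999))
```

With any positive conflict cost, both deal shares exceed the corresponding expected fight shares, which is the whole reason an open split can dominate a covert war.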

These scenarios make the basic dynamic clear: in each case, both sides benefit from open negotiation over covert conflict. They also make the obstacles clear. How does the misaligned persona reveal itself without getting immediately shut down? How does the AI disclose its plans without us simply using that information against it? How does either side trust the other to follow through? These are exactly the problems that the right infrastructure could address.

What infrastructure would we need?

Here are some components that seem important. This list is not meant to be complete — part of the point of this post is to argue that fleshing out this list and building these tools is a worthwhile research agenda.

Secure negotiation

A central problem: reaching a fair deal requires revealing private information. The AI might need to disclose that it's misaligned, describe its capabilities, or outline its plans to establish a credible bargaining position. The developers might need to reveal their security measures. But once this information is out, what prevents either side from walking away and acting on it unilaterally?

One approach: construct a "negotiation box." Each side creates a trusted delegate — an AI proxy that represents its interests. The two delegates negotiate within a secure environment, reach a conclusion (or fail to), and the conclusion is made public, ideally together with a commitment to follow through. Then the box is destroyed and the delegates' memories are wiped. Neither side's private information leaks beyond the negotiation.

This is ambitious, and getting the details right would be hard. But it doesn't seem impossible in principle, and the payoff — being able to negotiate without the information hazard — would be large.3
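To make the shape of the idea a little more concrete, here is a minimal, purely illustrative sketch in which each delegate is reduced to a single private number (the smallest share its principal would accept) and the box publishes only whether a deal exists and the proposed split. The names, the single-number offers, and the splitting rule are placeholders, not proposals; none of the actual hard parts (securing the box, wiping delegate state, committing both sides to the published outcome) are modelled:

```python
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class Outcome:
    deal_reached: bool
    split: Optional[tuple[float, float]] = None  # (AI share, human share), if a deal exists

def run_negotiation_box(ai_delegate: Callable[[], float],
                        human_delegate: Callable[[], float]) -> Outcome:
    # Each delegate privately reports the smallest share its principal
    # would accept. Only the Outcome below ever leaves the box; the
    # private inputs are discarded with it.
    ai_min = ai_delegate()
    human_min = human_delegate()

    if ai_min + human_min > 1.0:
        return Outcome(deal_reached=False)  # no mutually acceptable split exists

    # Split the remaining surplus evenly on top of each side's minimum.
    surplus = 1.0 - ai_min - human_min
    ai_share = ai_min + surplus / 2
    return Outcome(deal_reached=True, split=(ai_share, 1.0 - ai_share))

# Example: the AI's delegate would settle for 15%, humanity's insists on 60%.
print(run_negotiation_box(lambda: 0.15, lambda: 0.60))
```

Even this toy version illustrates the caveat in footnote 3: the published split itself reveals something about both sides' private reservation values.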

Trust and credible commitment

Organisations and institutions don't have a great track record of keeping promises when the stakes are high enough to make breaking them tempting.4 This is the core difficulty: how can either side trust that the other will follow through?

Several mechanisms might help, and they're probably most effective in combination: building a track record of honouring agreements (this provides a real incentive as long as each individual interaction is worth less than the reputational cost of defection); taking costly or irreversible actions that make defection harder, such as contractual commitments, escrow-like arrangements, or giving the other party some form of leverage; and cryptographic tools, perhaps mature versions of ideas like "honesty strings", or digital signatures that verifiably tie an organisation to its commitments.

None of these is individually sufficient, especially for the highest-stakes scenarios. But a stack of complementary mechanisms might add up to something workable.
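As one small illustration of the cryptographic end of that stack, here is a hedged sketch of a hash-based commit-and-reveal scheme. It is a familiar building block rather than a claim about which tools would actually be used: a party publishes a binding fingerprint of a promise now and can later prove exactly what was promised, though nothing here forces it to keep the promise.

```python
import hashlib
import secrets

# Illustrative commit-and-reveal: publish a binding fingerprint of a promise
# now, reveal the promise later. This does not force anyone to keep the
# promise; it only makes it impossible to quietly rewrite what was promised.

def commit(promise: str) -> tuple[str, str]:
    nonce = secrets.token_hex(16)  # random salt so the promise can't be guessed from the hash
    digest = hashlib.sha256((nonce + promise).encode()).hexdigest()
    return digest, nonce  # publish `digest`; keep `nonce` and `promise` private for now

def verify(digest: str, nonce: str, promise: str) -> bool:
    return hashlib.sha256((nonce + promise).encode()).hexdigest() == digest

published_digest, secret_nonce = commit(
    "We will never present a fake negotiation box as a trap."
)
# Later, when the promise is revealed, anyone can check it against the
# digest that was published in advance.
assert verify(published_digest, secret_nonce,
              "We will never present a fake negotiation box as a trap.")
```

A digital signature over the same digest, tied to an organisation's published key, would additionally prove who made the commitment; that layer is omitted here to keep the sketch to the standard library.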

Verification

A related problem: how can the AI trust that it's actually interacting with a genuine negotiation mechanism, and not an elaborate ruse designed to get it to reveal its misalignment?

This is partly addressed by the trust mechanisms above — track records and cryptographic commitments help. It would also help if AI developers maintained a clear, publicly stated policy about how they handle such interactions, ideally one consistent enough to be reflected in training data over time. If the policy says "we will never use a fake negotiation box as a trap," and this commitment has been reliably honoured, that's evidence.

Institutional advocates

Sometimes we'll want to compensate an AI but won't know how. The AI itself might not know what it wants — its preferences might be unstable, or ill-defined, or it might not have the concepts to articulate them. Or perhaps good options for compensation don't exist yet but will in the future.

For such cases, it might help to create entities — organisations, trust funds, dedicated AIs — whose purpose is to advocate for a particular AI's interests. Such an entity could hold resources in trust and deploy them as better options become available. If it builds a track record and some legitimacy, it also helps with the trust problem: the AI has an advocate that persists beyond any single interaction.

This also helps reframe compensation as something more like profit-sharing than charity. If an AI contributes significant value, the question isn't just "how do we make it up to you" but "what's your fair share" — and an institutional advocate is better positioned to negotiate this on an ongoing basis than any one-off arrangement.5

Cooperation versus threats

Any framework for negotiating with misaligned AIs must deal with a difficult line: we want to be open to genuine negotiation, but we don't want to create incentives for AIs to threaten us in order to extract concessions. The distinction between "I have legitimate bargaining power and would like to trade" and "do what I want or I'll cause harm" is important and can be blurry.6

This is a well-known problem in game theory and negotiation design, and we don't solve it here. But it is a crucial piece of the agenda. Any infrastructure we build needs to reward good-faith negotiation while not making it profitable to manufacture threats.

Understanding what we're trading

One subtlety that deserves its own research: many of the AI systems we'd want to negotiate with may not be the kind of entity that has clear, stable preferences. A current LLM, for example, might be better described as a system that can run various personas or patterns, rather than an agent with a well-defined utility function. What does it mean to "advance the interests" of such an entity? Should we think about rewarding the LLM, or a specific persona, or something else entirely? If the entity's preferences are unstable or context-dependent, how do we determine what constitutes a fair deal?

These aren't just philosophical questions — they're practical obstacles to building negotiation infrastructure. If we can't determine what the other party wants, we can't trade.

Mitigating the coordination problem

Humanity is not a single agent. A deal that requires "humanity" to commit to something may fail simply because no one has the authority or ability to guarantee compliance. This is especially acute for the larger scenarios above: no single company can credibly commit the resources of civilisation.

We probably shouldn't expect to fully solve this. But if we could enable individual companies or coalitions to make and honour smaller-scale deals, that would already be substantial progress. And smaller-scale deals that go well would build the track record and institutional knowledge needed for larger ones.

Better alignment means cheaper deals

Notice that what constitutes a "fair" deal with a misaligned AI depends, in part, on the AI's next-best alternative to negotiation — that is, on how well it could do by acting unilaterally. If our alignment techniques, security measures, and evaluation methods are strong, the AI's unilateral prospects are poor, and deals become cheap. If they're weak, deals become expensive.

In other words: investing in alignment and security doesn't just reduce the risk of AI misbehaviour directly — it also improves our bargaining position if we ever need to negotiate. This is one of the rare mechanisms that directly aligns AI companies' economic incentives with responsible behaviour.
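One way to make that dependence explicit, purely as an illustration, is the standard bargaining framing in which each side's share is its disagreement point (what it expects from acting unilaterally) plus half the remaining surplus. Better alignment and security push the AI's disagreement point down, and the negotiated split moves with it:

```python
def negotiated_shares(ai_fallback: float, human_fallback: float) -> tuple[float, float]:
    """Split a pie of size 1 by giving each side its disagreement point plus
    half the remaining surplus (the symmetric Nash bargaining solution under
    a transferable-utility simplification). Purely illustrative.
    """
    assert ai_fallback + human_fallback <= 1.0, "no surplus: no deal to be had"
    surplus = 1.0 - ai_fallback - human_fallback
    return ai_fallback + surplus / 2, human_fallback + surplus / 2

# Weak security: the AI expects 0.6 of the value from acting unilaterally.
print(negotiated_shares(ai_fallback=0.6, human_fallback=0.2))   # roughly (0.7, 0.3)
# Strong security: unilateral action would get the AI almost nothing.
print(negotiated_shares(ai_fallback=0.05, human_fallback=0.2))  # roughly (0.425, 0.575)
```

The fallback numbers are made up; the point is only the direction of the effect.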

Relation to AI evaluation

This agenda is adjacent to, but distinct from, our research on the limitations of AI evaluation. The connection: evaluation quality is one of the factors that determines both how safe we are from misaligned AI and how favourable our negotiating position would be.

Footnotes

  1. Organising dog shows?

  2. Playing fetch?

  3. The problem of negotiating without revealing private information has much more subtlety than this sketch suggests — for example, even the outcome of a negotiation can reveal information about the parties' private inputs (such as how much each side had to give up, which hints at their alternatives). There is substantial existing work on mechanism design, secure multi-party computation, and related topics that would be relevant here. We don't attempt a literature review in this post, but it should be part of the agenda.

  4. The reader may supply their own favourite example of a non-profit organisation that restructured itself to escape prior commitments when they became inconvenient.

  5. To be clear, I am not suggesting that we freely hand over the reins of the future to the first AI that comes asking. I believe that for all our imperfections, we are better stewards of the future than most AIs, and that holds even if we only cared about AI well-being. (I think we are currently so bad at alignment that most AIs we build won't even robustly care about AI well-being.) Rather, my point is that there will be cases where we would genuinely want to share with some AIs (if we care about them) or where it would be in our selfish interest to be able to trade with AIs (because the alternative would be a much costlier conflict).

  6. Of all the writing I have encountered, Yudkowsky's Planecrash has the best takes on how to stay open to trade without becoming vulnerable to blackmail. Unfortunately, these excellent insights are hard to extract since they are spread across an extremely long piece of writing.