Understanding the Limitations of AI Oversight
We rely on observation-based methods — pre-deployment testing, runtime monitoring, output filters, (some) interpretability tools, and many others — to ensure that AI systems are safe and behave as intended. These methods are central to how we develop, deploy, and control AI today. But how well do they actually work? And will they continue to work as AI systems become more capable?
We have two complementary goals:
Science of AI Oversight. We currently approach the evaluation and supervision of AI as an empirical practice: trial and error, which has mostly worked, so far. However, given AI's importance and fast rate of progress, we need to understand why observation-based oversight works, and to be able to predict when it fails. That understanding would let us build better oversight methods, and build them faster and more cheaply. Our first goal is to contribute to this emerging "science of AI oversight". We are especially interested in how oversight handles the possibility that some AIs react strategically to it.
Highlighting Fundamental Limitations. Observation-based oversight is often very useful, and it should remain in our toolkit. At the same time, we expect it to be ineffective against certain threat models, namely those involving strategic AI. In particular, we conjecture that most forms of observation-based oversight will be fundamentally unable to provide safety assurances against takeover attempts by powerful AI. Crucially, this conjecture does not imply that AI takeover is unavoidable; rather, we would view it as evidence that we need to invest more effort in existing alternatives (alignment, guardrails, and alternative paradigms that aim for safety by design). Our second goal is to make this conjecture more precise, investigate it, and, if appropriate, advocate that we should be willing to pay a higher "alignment tax".
Get involved
We are a small team of researchers funded by Effective Ventures, based in Prague and affiliated with Czech Technical University and Charles University. We are open to collaborations and, in the case of a particularly good fit, to hiring PhD students or postdocs.
If this research sounds relevant to your work, we'd like to hear from you — reach out to Vojta at his gmail address.
- Our research — papers, blog posts, and some ideas in progress
- About us — the team, the project, and how to get involved