
Scalable Oversight Research

AI Safety · Oversight · Research · Constitutional AI

Overview

As AI systems become more capable, they'll increasingly operate in domains where humans can't easily verify their outputs. I'm working on techniques to maintain meaningful oversight even when models surpass human-level performance.

My Role

I design and run experiments testing different oversight approaches. This includes training models with various supervision schemes and measuring how well we can catch subtle errors or deceptive behaviors.
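As a rough illustration of what "measuring how well we can catch subtle errors" can mean in practice, here is a minimal sketch in Python. The Trial fields and the oversight_metrics helper are hypothetical stand-ins, not an actual internal tool, and the example numbers are made up.

    from dataclasses import dataclass

    @dataclass
    class Trial:
        has_planted_error: bool   # did we deliberately insert a subtle flaw into the output?
        overseer_flagged: bool    # did the oversight scheme flag this output as suspect?

    def oversight_metrics(trials: list[Trial]) -> dict[str, float]:
        """Catch rate on planted errors and false-positive rate on clean outputs."""
        flawed = [t for t in trials if t.has_planted_error]
        clean = [t for t in trials if not t.has_planted_error]
        return {
            "catch_rate": sum(t.overseer_flagged for t in flawed) / max(len(flawed), 1),
            "false_positive_rate": sum(t.overseer_flagged for t in clean) / max(len(clean), 1),
        }

    # toy example: two of three planted errors caught, one clean output wrongly flagged
    trials = [Trial(True, True), Trial(True, True), Trial(True, False),
              Trial(False, False), Trial(False, True)]
    print(oversight_metrics(trials))  # catch_rate ~ 0.67, false_positive_rate = 0.5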

my actual thoughts (the honest version)

the core question keeps me up at night: how do you supervise something smarter than you? it sounds philosophical but it's actually a very practical engineering problem. right now we can check claude's work. but what happens when the model is doing research we can't understand? writing code that's too complex to review?

we're exploring approaches like:

  • having models critique each other (AI debate; a toy sketch follows below)
  • using weaker models to catch problems in stronger ones
  • training models to be transparent about uncertainty

the weird part is designing experiments where the model has incentives to hide things from us. we're basically teaching models to deceive us on purpose, just to see if we can catch them. it feels backwards, but you can't build defenses without understanding attacks.
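to make the debate idea concrete, here's a toy sketch. nothing in it is a real protocol or API: the Model alias, the prompts, and the debate function are all hypothetical placeholders, just the shape of the loop.

    from typing import Callable

    # a "model" here is just any function from a prompt string to a response string
    Model = Callable[[str], str]

    def debate(question: str, debater_a: Model, debater_b: Model,
               judge: Model, rounds: int = 2) -> str:
        """Two capable models argue opposite sides; a judge reads the transcript and decides."""
        transcript = f"Question: {question}\n"
        for r in range(rounds):
            transcript += f"A (round {r + 1}): " + debater_a(transcript + "\nArgue for your answer.") + "\n"
            transcript += f"B (round {r + 1}): " + debater_b(transcript + "\nArgue against A's answer.") + "\n"
        return judge(transcript + "\nWhich debater was more convincing, A or B? Reply with one letter.")

    # toy usage with stub "models"
    verdict = debate("Is 7919 prime?",
                     debater_a=lambda p: "yes: trial division by every prime up to 89 finds no factor.",
                     debater_b=lambda p: "no, it has a small factor (none actually given).",
                     judge=lambda p: "A")
    print(verdict)  # "A"

the interesting part is the judge: it can be a much weaker model (or a human), which is the whole point of letting weaker supervisors catch problems in stronger systems.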

What I Learned

  • Scalable oversight isn't one technique; it's many approaches working together.
  • The hardest problems are the ones where you can't check the answer.
  • Model organisms of misalignment are crucial for empirical safety research.
  • We need to solve this before we need it, not after.

Note: This research feeds directly into Anthropic's Responsible Scaling Policy assessments.

šŸ”’ Specific technical details are confidential. This page shares what I can within NDA constraints.