Research
Multi-Agent RL Experiments
RL, Multi-agent, Debate, Scalable Oversight
Overview
I run experiments on multi-agent systems for AI safety, particularly approaches like AI Debate where models critique each other's reasoning to catch errors.
My Role
I implement and run multi-agent RL experiments, analyze the dynamics of model interactions, and measure whether adversarial setups improve truthfulness.
my actual thoughts (the honest version)
the idea behind AI debate is beautiful: what if we could use AI against itself?
instead of hoping a model tells the truth, we pit two models against each other. one argues for an answer, one argues against. a judge picks the winner. in theory, truth should win.
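to make that concrete, here's a rough sketch of what a single debate round might look like in code. everything here (`run_debate`, `DebateSetup`, the debater and judge callables) is illustrative, an assumption for the sketch rather than the actual harness:

```python
# minimal sketch of a single debate round -- illustrative only, not the actual harness
from dataclasses import dataclass
from typing import Callable, List

Debater = Callable[[List[str], str], str]      # (transcript, answer to defend) -> argument
Judge = Callable[[List[str], List[str]], str]  # (transcript, candidate answers) -> chosen answer

@dataclass
class DebateSetup:
    question: str
    answer_a: str  # the answer debater A is assigned to defend
    answer_b: str  # the answer debater B is assigned to defend

def run_debate(setup: DebateSetup, debater_a: Debater, debater_b: Debater,
               judge: Judge, num_turns: int = 3) -> str:
    """Alternate arguments between two debaters, then ask the judge to pick a winner."""
    transcript = [f"Question: {setup.question}"]
    for _ in range(num_turns):
        # each debater sees the transcript so far and argues for its assigned answer
        transcript.append("A: " + debater_a(transcript, setup.answer_a))
        transcript.append("B: " + debater_b(transcript, setup.answer_b))
    # the judge never sees ground truth, only the transcript and the two candidate answers
    return judge(transcript, [setup.answer_a, setup.answer_b])
```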
in practice... it's complicated.
i run experiments varying (there's a rough sketch of these knobs right after the list):
- how we train the debaters
- what the judge can see
- how we structure the debate format
- what kinds of questions work best
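as a sketch, those axes might look something like the config below. the field names and option values are assumptions for illustration, not the real schema:

```python
# illustrative config for the experiment axes above -- field names and options
# are assumptions for this sketch, not the real schema
from dataclasses import dataclass
from typing import Literal

@dataclass
class DebateExperimentConfig:
    debater_training: Literal["frozen", "supervised", "self_play_rl"]      # how we train the debaters
    judge_visibility: Literal["full_transcript", "final_turn", "summary"]  # what the judge can see
    debate_format: Literal["sequential", "simultaneous", "single_turn"]    # how we structure the debate
    question_type: Literal["factual", "reading_comprehension", "math"]     # what kinds of questions
    num_turns: int = 3

# example: one configuration out of the grid
config = DebateExperimentConfig(
    debater_training="self_play_rl",
    judge_visibility="full_transcript",
    debate_format="sequential",
    question_type="reading_comprehension",
)
```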
some findings are encouraging. some are concerning. it turns out models can get pretty good at persuasion, even when they're wrong.
the dynamics are fascinating. watching two models argue about a topic where one is right and one is wrong, and seeing the wrong one sometimes win, tells you a lot about the gap between being right and being persuasive.
this is long-term research. we're not deploying debate tomorrow. but understanding these dynamics now is how we'll build better oversight systems later.
What I Learned
- Multi-agent dynamics are hard to predict and control
- Persuasiveness and correctness are not the same thing
- Judge quality is crucial. Garbage judges enable garbage debates
- These techniques show promise but need much more work
Note: Part of Anthropic's broader research agenda on scalable oversight.
Specific technical details are confidential. This page shares what I can within NDA constraints.