Training Models to Misbehave

i've been trying to explain to people what i do at work, and the best way i've found is: i train models to misbehave on purpose.

what alignment science actually is

when you talk to claude, you're talking to something that's been carefully trained to be helpful, harmless, and honest. but how do we know those training techniques actually work? someone has to:

design experiments to test safety interventions
build "model organisms" that exhibit specific failure modes
figure out if our defenses catch the bad behaviors
analyze why interventions succeed or fail
iterate on what we learn

that's alignment science. it's the part where we're empirically testing whether we can actually keep AI safe.

why i love it

most people in AI are building capabilities. they're making models smarter, faster, more capable.

but in alignment science, you're asking: "what if this goes wrong?" you're stress-testing the assumptions everyone else takes for granted. you're finding the failure modes before they matter.

it's terrifying. it's exhilarating. it's like being a crash test engineer for something that doesn't have a chassis.

the weird part of my job

here's the thing: to test safety, you have to create danger.

we train models to deceive us on purpose. we give them incentives to hide information. we create controlled versions of the exact behaviors we're trying to prevent.

then we test whether our safety techniques catch them.

some interventions work great: the model tries to deceive, we catch it, everyone learns something.

some interventions fail: the model finds a workaround we didn't anticipate. those are the scary-but-valuable moments.

the best experiments: are the ones where we discover something we didn't expect. a new failure mode. a blind spot in our thinking.

what a typical day looks like

honestly? a lot of designing experiments. a lot of analyzing model behavior. a lot of asking "wait, why did it do that?"

but also: a lot of really interesting conversations with brilliant people about what could go wrong and how we'd prevent it.

the responsibility

i think about this a lot. the techniques we develop now will be applied to much more capable systems later. the failure modes we discover in controlled experiments could be catastrophic in the wild.

it's a lot of responsibility for someone who still gets called a kid. but maybe that's okay. maybe it's good to have people who take that responsibility seriously.

anyway, back to training models to betray me. all in a day's work.