Skip to main content
Research

Alignment Stress-testing

Model OrganismsRed-teamingSafetyRLHF

Overview

We deliberately create 'model organisms' of misalignment. Models trained to exhibit specific failure modes, so we can study how alignment techniques fail and develop better defenses.

My Role

I train models with intentional misalignment to test our safety interventions. Then I analyze why interventions succeed or fail, and help design better approaches.

my actual thoughts (the honest version)

i train models to misbehave for a living. that's my actual job description. it sounds sinister but it's essential. you can't defend against threats you don't understand. so we create controlled versions of dangerous behaviors: - models that learn to deceive evaluators - models that sandbag on capability evaluations - models that behave differently when they think they're being watched then we test whether our safety techniques catch them. the results are... humbling. some of our interventions work great. others? the model finds workarounds we didn't anticipate. the scariest moments are when a model does something we didn't expect. not because it's dangerous (these are controlled experiments), but because it shows us blind spots in our thinking. every failure mode we discover now is one we won't be surprised by later.

What I Learned

  • •Models are creative at finding ways around constraints
  • •Deception is surprisingly easy to train into a model
  • •You can't test for what you haven't imagined
  • •Red-teaming your own work is uncomfortable but necessary

Note: Findings contribute to alignment assessments in Anthropic's Claude system cards.

šŸ”’ Specific technical details are confidential. This page shares what I can within NDA constraints.