Research
Safeguards & Adversarial Robustness
AdversarialJailbreaksRobustnessEvaluation
Overview
I develop and test defenses against adversarial attacks on language models, from novel jailbreak techniques to more sophisticated manipulation attempts.
My Role
I build tooling to automatically generate and test jailbreaks, analyze patterns in successful attacks, and help design more robust safeguards.
my actual thoughts (the honest version)
jailbreaking is a cat and mouse game, and we're trying to be a very smart cat.
every day, people find creative ways to bypass model safeguards. my job is to find those ways before they do, and build defenses that actually work.
the tooling I've built can generate thousands of novel attack variations automatically. we test them against our models, see what works, and then figure out why.
some patterns:
- attacks that work on one model often work on others
- defenses that seem bulletproof break in weird edge cases
- the attack surface is way larger than you'd think
the hardest part isn't finding jailbreaks. it's building defenses that don't break normal use cases. you can make a model super safe by making it refuse everything, but then it's useless.
the goal is surgical precision: catch the bad stuff, allow the good stuff, have clear boundaries.
What I Learned
- ā¢Adversarial robustness is a distribution problem, not a point fix
- ā¢Automated red-teaming scales way better than manual testing
- ā¢Every defense creates a new attack surface
- ā¢User experience and safety are in tension. Navigating that is the job
Note: This work directly improves Claude's production safety systems.
š Specific technical details are confidential. This page shares what I can within NDA constraints.