Skip to main content
Research

AI Control Experiments

ControlSafetyExperimentsAgentic

Overview

As AI systems gain more agency (tool use, code execution, multi-step tasks), we need new approaches to maintain control. I run experiments on ensuring AI systems remain safe even in adversarial or unfamiliar scenarios.

My Role

I design experiments where AI agents operate with real capabilities and test whether our control mechanisms hold up under pressure.

my actual thoughts (the honest version)

agentic AI is where things get really interesting (and scary). when claude can browse the web, write code, and take actions in the world, the safety considerations multiply. you can't just worry about what it says. you have to worry about what it does. i run experiments where models have actual capabilities: - can they be tricked into taking harmful actions? - do they correctly refuse requests that seem okay but aren't? - what happens when they're operating without human oversight? we create scenarios that are deliberately adversarial. someone is trying to manipulate the model. information is misleading. the "right" action isn't obvious. the goal is to build systems that fail safely. when something goes wrong (and it will), the damage should be limited and recoverable. it's like testing car safety by crashing cars. except the car is an AI and we're inventing new types of crashes.

What I Learned

  • •Agency multiplies both capability and risk
  • •Control is easier to maintain than to regain
  • •Sandboxing and monitoring are essential, not optional
  • •We need to assume adversarial conditions even in normal use

Note: This work informs safety evaluations for Claude's tool use and computer use features.

šŸ”’ Specific technical details are confidential. This page shares what I can within NDA constraints.