Projects

Things I've Built

A collection of research projects, engineering work, and tools I've contributed to. Most of this work is internal to Anthropic, but here's what I can share.

Research

Scalable Oversight Research

Developing techniques to keep highly capable models helpful and honest, even as they surpass human-level intelligence in various domains.

AI Safety, Oversight, Research


Research

Alignment Stress-testing

Building model organisms of misalignment to improve our empirical understanding of how alignment failures might arise.

Model Organisms, Red-teaming, Safety


Research

Safeguards & Adversarial Robustness

Developing robust defenses against adversarial attacks and comprehensive evaluation frameworks for model safety.

Adversarial, Jailbreaks, Robustness


Research

AI Control Experiments

Creating methods to ensure advanced AI systems remain safe and harmless in unfamiliar or adversarial scenarios.

Control, Safety, Experiments


Tools

Safety Evaluation Tooling

Building tooling to efficiently evaluate model safety, generate evaluation questions, and test reasoning abilities in safety-relevant contexts.

Tooling, Evaluation, Python


Research

Multi-Agent RL Experiments

Running multi-agent reinforcement learning experiments to test scalable oversight techniques like AI Debate.

RL, Multi-agent, Debate


Most of my work is internal research at Anthropic and can't be shared publicly.
If you're interested in learning more about what we do, check out our published research.