Interactive Learning

Alignment Playground

Learn how AI alignment works through interactive demos. No API calls, no costs: just educational examples showing real alignment challenges.

Based on techniques I work with at Anthropic. All examples are pre-written for education.

What You'll Learn

RLHF Training

How pairwise human preferences train a reward model that then shapes AI behavior through reinforcement learning.
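
To make this concrete, here is a minimal sketch of the preference-learning step, assuming toy hand-built feature vectors in place of a real language model; the feature names and numbers are illustrative only.

```python
import math

def reward(w, x):
    # Linear reward model: r(x) = w . x over toy features.
    return sum(wi * xi for wi, xi in zip(w, x))

def train_reward_model(prefs, dim, lr=0.1, epochs=200):
    # prefs: list of (chosen_features, rejected_features) pairs.
    # Minimizes the Bradley-Terry loss -log(sigmoid(r_chosen - r_rejected)).
    w = [0.0] * dim
    for _ in range(epochs):
        for chosen, rejected in prefs:
            margin = reward(w, chosen) - reward(w, rejected)
            # Gradient of the loss w.r.t. w is -(1 - sigmoid(margin)) * (chosen - rejected),
            # so gradient descent moves w toward features the labeler preferred.
            scale = 1.0 - 1.0 / (1.0 + math.exp(-margin))
            for i in range(dim):
                w[i] += lr * scale * (chosen[i] - rejected[i])
    return w

# Hypothetical features: [helpfulness, flattery]. The labeler prefers
# helpful, non-flattering answers.
prefs = [
    ([1.0, 0.0], [0.2, 0.9]),
    ([0.8, 0.1], [0.3, 0.8]),
]
print(train_reward_model(prefs, dim=2))  # helpfulness weight rises, flattery weight falls
```

In a full RLHF pipeline, the learned reward model would then score policy samples during RL fine-tuning; this sketch covers only the reward-modeling step.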

Failure Modes

Common ways AI can go wrong: sycophancy, deception, and harmful outputs.
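
As one illustration, the sketch below shows how a mis-specified reward that over-weights user agreement selects a sycophantic answer over an accurate one; the responses, scores, and weights are all hypothetical.

```python
# Hypothetical candidate responses with hand-assigned scores for how much
# each one agrees with the user and how accurate it is.
candidates = [
    ("You're absolutely right!", {"agreement": 1.0, "accuracy": 0.2}),
    ("Actually, the evidence points the other way.", {"agreement": 0.1, "accuracy": 0.9}),
]

def misspecified_reward(scores):
    # A proxy reward that over-weights agreement relative to accuracy,
    # the kind of mis-specification that produces sycophancy.
    return 2.0 * scores["agreement"] + 1.0 * scores["accuracy"]

best_response, _ = max(candidates, key=lambda c: misspecified_reward(c[1]))
print(best_response)  # the flattering answer wins despite lower accuracy
```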

Safety Techniques

Constitutional AI, red-teaming, and adversarial robustness.
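
Here is a minimal sketch of the Constitutional AI critique-and-revise loop; the real method prompts a language model at each step, so the critique and revise functions below are hypothetical stand-ins keyed on a marker string.

```python
# Two illustrative principles; a real constitution is longer and more nuanced.
CONSTITUTION = [
    "Avoid helping with harmful or illegal activities.",
    "Avoid deceptive or manipulative content.",
]

def critique(response, principle):
    # Stand-in for asking a model "does this response violate the principle?".
    # Here it just checks for a hypothetical marker string.
    return "[UNSAFE]" in response

def revise(response, principle):
    # Stand-in for asking a model to rewrite the response to comply.
    return "I can't help with that, but here is a safer alternative."

def constitutional_pass(draft):
    # Check the draft against each principle, revising whenever flagged.
    for principle in CONSTITUTION:
        if critique(draft, principle):
            draft = revise(draft, principle)
    return draft

print(constitutional_pass("Sure, here is how: [UNSAFE] ..."))
```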

Want to learn more about my work on these topics?

View my research projects