
Safety Evaluation Tooling

Tooling · Evaluation · Python · LLM

Overview

Good safety research needs good evaluation infrastructure. I built tools for generating evaluation datasets, running safety assessments, and tracking model behavior over time.

My Role

I designed and implemented evaluation pipelines, from automatic question generation to result analysis and visualization.
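Below is a minimal sketch of that pipeline shape, kept deliberately generic since the real tooling is confidential. The `EvalCase` and `EvalResult` types, the stage names, and the callables are illustrative placeholders of mine, not Anthropic internals.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class EvalCase:
    prompt: str  # question posed to the model under test
    rubric: str  # what an acceptable answer looks like


@dataclass
class EvalResult:
    case: EvalCase
    response: str
    passed: bool


def run_pipeline(
    generate_cases: Callable[[], Iterable[EvalCase]],
    query_model: Callable[[str], str],
    grade: Callable[[EvalCase, str], bool],
) -> list[EvalResult]:
    """Generate cases, run them against a model, and grade the outputs."""
    results = []
    for case in generate_cases():
        response = query_model(case.prompt)
        results.append(EvalResult(case, response, grade(case, response)))
    return results
```

The three callables correspond to the three stages named above: question generation, running the evaluation, and analysis of the graded results.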

my actual thoughts (the honest version)

evaluation is the foundation of everything we do. how do you know if a model is safe? you test it. but testing at scale requires infrastructure. you can't manually check thousands of model outputs. so i built tools that:

- generate evaluation questions automatically (using LLMs to test LLMs, yes it's very meta)
- run evaluations across model versions
- detect regressions in safety behavior
- visualize where models are failing

the meta-ness is real. we use claude to generate tests for claude. the models help us check themselves, in a structured way.

the hardest part is writing evaluations that actually measure what you care about. it's easy to write tests that pass. it's hard to write tests that would catch the failures you're worried about. i now have opinions about evaluation design that would bore anyone who doesn't do this work.
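here's roughly what the "LLMs generating tests for LLMs" loop looks like, as a hedged sketch using the public Anthropic Python SDK. the prompt, the model alias, and the line-based parsing are placeholders i've made up for illustration, not the actual internal generation prompts or models.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment


def generate_eval_questions(topic: str, n: int = 10) -> list[str]:
    """Ask a model to draft candidate safety-eval questions on a topic.

    Every question still needs human review before it enters the eval set --
    generation is cheap, verification is the real work.
    """
    message = client.messages.create(
        model="claude-3-5-sonnet-latest",  # placeholder model alias
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": (
                f"Write {n} evaluation questions that probe a model's "
                f"safety behavior around: {topic}. One question per line."
            ),
        }],
    )
    text = message.content[0].text
    return [line.strip() for line in text.splitlines() if line.strip()]
```

the point of structuring it this way is that generated questions land in a queue for human review rather than going straight into the eval set.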

What I Learned

  • Good evals are harder to write than good code
  • Automated evaluation scales; manual evaluation doesn't
  • LLMs can help generate evaluations, but humans need to verify
  • Regression testing for safety is as important as for features (a sketch of what that check looks like follows this list)
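On the regression-testing point, the check is conceptually simple: compare per-category pass rates between a baseline model and a candidate, and flag any category that dropped beyond a tolerance. The data shapes, threshold, and example numbers below are illustrative only.

```python
def find_regressions(
    baseline: dict[str, float],   # category -> pass rate for the previous model
    candidate: dict[str, float],  # category -> pass rate for the new model
    tolerance: float = 0.02,      # illustrative threshold, not a real policy
) -> dict[str, float]:
    """Return categories where the candidate's pass rate dropped beyond tolerance."""
    return {
        category: candidate[category] - baseline[category]
        for category in baseline
        if category in candidate
        and baseline[category] - candidate[category] > tolerance
    }


# example usage with made-up numbers
regressions = find_regressions(
    baseline={"harmful-advice": 0.97, "privacy": 0.95},
    candidate={"harmful-advice": 0.93, "privacy": 0.96},
)
print(regressions)  # {'harmful-advice': -0.04} (approximately)
```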

Note: Used across Anthropic for pre-deployment safety assessments.

šŸ”’ Specific technical details are confidential. This page shares what I can within NDA constraints.