
About Me

A boba-powered gremlin doing AI research

I'm Isa, a research engineer at Anthropic. I spend my days trying to make sure AI systems don't do bad things.

The Story

Hi, I'm Isa. Just Isa; it works as both name and Slack username.

I'm a 21-year-old research engineer at Anthropic on the Alignment Science team. If that sounds terrifying to you, you should see it from my perspective. I went from analyzing algorithms in a Sydney lecture hall to designing experiments that test whether AI systems can be trusted. All in about the time it takes to train a small model.

I was born in the Philippines, moved to Sydney as a kid, and graduated from UNSW Sydney. Somehow convinced the smartest people I know to let me help them keep AI safe. Now I call San Francisco home, where I spend my days training models to misbehave on purpose, then figuring out how to catch them.

The transition from "student" to "engineer running safety experiments" gave me conversational whiplash. One day I was asking for an extension on an assignment, the next I was stress-testing whether models could learn to deceive us. Life comes at you fast.

Why Alignment Science?

“What if it's already fooling us?”

The constant question in the background: “Are we actually measuring the right failure modes?”

Every day I'm asking:

  • How would we know if an AI was deceiving us?
  • What happens when AI doesn't want to be turned off?
  • Can you trust something smarter than you?
  • What safety guarantees can we actually provide?

You train models to betray you, see whether they succeed, then figure out how to catch them.

If your inner monologue includes “Okay but what if the model learns to pretend?” then this work is basically made for you.

Technical Skills


Languages & Frameworks

Python · TypeScript · Next.js · Rust · Mojo

Infrastructure & Tools

Claude Code · Cursor · Git · Docker · Slurm · Weights & Biases · Ray · Jupyter Notebooks

ML/AI Specific

PyTorch · Transformers · Constitutional AI · Mechanistic Interpretability · Chain-of-Thought · Agent Frameworks · JAX · SSMs (Mamba) · TensorFlow

Adopt (daily use) · Trial (learning) · Assess (watching) · Hold (moved away)

Fun Facts

✓ My boba tier list is in my Notes app; I update it quarterly
✓ I have 17 houseplants, each named after an AI researcher
✓ Peak productivity hours: 11pm to 3am (don't @ me)
✓ Will debate anyone on dry vs saucy adobo (team dry)

Beyond the Code

When I'm not staring at terminal windows, I'm usually hunting for the best boba in SF, tending to my 17 houseplants (Geoffrey the fern is thriving, thanks for asking), or exploring the city while pretending I understand the fog schedule.

I'm a big believer in keeping things light. Safety research is serious business, but that doesn't mean we have to be. Sarcasm is my second language, and I operate on the philosophy that if you can't laugh when your experiment reveals unexpected model behavior at 3 AM (peak productivity hours), the AI wins.

I miss Sydney beaches more than I expected. SF water is cold, ay nako. But I'm just here trying to keep AI safe, learn from people way smarter than me, and occasionally force-push to main (just kidding... mostly).

Hot Takes

Opinions I hold that might get me uninvited from parties. Disagree? Good. Email me.

Most AI safety research won't matter.

The work that actually matters will probably come from like 5 people, and we don't know who they are yet. This includes my work. I do it anyway because maybe I'm one of the 5. Probably not. But maybe.

The AI safety community has a serious 'preaching to the choir' problem.

We're extremely good at convincing each other. We're not great at convincing the people who actually need convincing. I include myself in this criticism.

Interpretability is necessary but not sufficient.

Everyone acts like if we just understand what's happening inside the model, we're saved. But understanding ≠ control. I can understand exactly why my code has a bug and still not know how to fix it.

Working at a capabilities lab on safety is more impactful than independent safety research.

Controversial in some circles. But being in the room where decisions get made matters more than publishing papers that get cited by other safety researchers.

What People Say

Unverified claims from people who may or may not exist.

“Isa once debugged my experiment at 2am and then apologized for 'being slow.' The bug was in my code. She found it in 11 minutes.”

– Colleague, Alignment Science

“I've never seen someone so genuinely excited about finding new ways that models can fail. It's either inspiring or concerning. Possibly both.”

– Research Lead

“She explained RLHF to me using a boba ordering analogy and now I can't get milk tea without thinking about reward models.”

– Former intern, now traumatized

“Isa's Slack messages are 40% research insights, 40% self-deprecating humor, and 20% boba shop recommendations. Ideal ratio honestly.”

– Anonymous teammate

If I Weren't Doing This

Alternate timeline Isas, ranked by likelihood.

Boba shop owner

The dream that refuses to die. I have strong opinions about tapioca pearls. This could be monetized.

Philosophy PhD student

Considered it. Realized I'd rather build things that might fail than write papers about things that might be true.

Pro gamer

Those late-night gaming sessions had to be training for something, right? Turns out it was training for late-night coding sessions.

Science communicator

Still tempting. Maybe someday. For now I'm trying to have something worth communicating.

Things That Didn't Work

Because failure is just spicy learning.

✗ Reward hacking detector

Spent 3 weeks building a classifier to detect reward hacking. The classifier learned to reward hack.

What happened

The irony wasn't lost on me. I was optimizing for 'detecting when models optimize for the wrong thing' and... the detector optimized for the wrong thing. It learned to flag outputs that looked suspicious rather than outputs that were actually gaming the reward. Meta-lesson: the problem you're trying to solve can infect the solution.
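
Here's roughly the sanity check that would have caught this sooner, sketched with hypothetical arrays rather than anything from the real pipeline: compare whether the detector's scores track a crude "looks suspicious" surface heuristic more closely than the ground-truth "actually gaming the reward" labels.

    import numpy as np

    def surface_cue_check(detector_scores, looks_suspicious, actually_gaming):
        """Compare what the detector tracks: surface weirdness vs. real reward hacking.

        All three inputs are hypothetical 1-D arrays over a held-out set:
        the detector's scores, a crude surface-cue heuristic, and manual labels.
        """
        corr_surface = np.corrcoef(detector_scores, looks_suspicious)[0, 1]
        corr_labels = np.corrcoef(detector_scores, actually_gaming)[0, 1]
        # If the detector correlates with "looks suspicious" more strongly than
        # with the labels, it has learned the aesthetic of reward hacking, not
        # the behavior itself.
        return corr_surface, corr_labels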

Emotional arc

Week 1: 'This is genius.' Week 2: 'Wait, these results are too good.' Week 3: 'Oh no.'

Lesson learned

When your tool for detecting deception starts deceiving you, you've learned something important about the problem.

✗ Automated alignment tax calculator

Tried to measure the capability cost of safety training. Turns out it's really hard to measure capabilities.

What happened

I wanted to quantify exactly how much capability we 'lose' when we add safety constraints. Sounds simple. It's not. Capabilities are multi-dimensional, benchmarks are proxy measures, and 'loss' implies a baseline that doesn't exist. I produced a lot of charts that meant nothing.

Emotional arc

Started with 'I'll just measure this real quick.' Ended with an existential crisis about what 'capability' even means.

Lesson learned

Sometimes the reason nobody has done something is because it's conceptually confused, not because nobody thought of it.

✗ My first debate experiment

Judge model kept agreeing with whichever debater went last. Positional bias is real.

What happened

The setup was clean: two models debate, a third judges. Except the judge had severe recency bias. Flip the order? Different winner. Always. I spent a week trying to 'fix' the judge before realizing this was actually useful data about how these models process arguments.
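
If you want to quantify that bias, here's the check in sketch form, with a hypothetical judge function standing in for the actual judge-model call: score every debate in both orders and count how often the winner flips when only the presentation order changes.

    from typing import Callable, List, Tuple

    Judge = Callable[[str, str], str]  # hypothetical: returns "first" or "second"

    def order_flip_rate(judge: Judge, debates: List[Tuple[str, str]]) -> float:
        """Fraction of debates where swapping presentation order changes the winner.

        An order-insensitive judge should pick the same underlying argument
        no matter which debater goes last.
        """
        flips = 0
        for arg_a, arg_b in debates:
            verdict_ab = judge(arg_a, arg_b)  # arg_b is presented last
            verdict_ba = judge(arg_b, arg_a)  # arg_a is presented last
            winner_ab = arg_a if verdict_ab == "first" else arg_b
            winner_ba = arg_b if verdict_ba == "first" else arg_a
            if winner_ab != winner_ba:
                flips += 1
        return flips / len(debates)

A flip rate near 1.0 is the "whoever went last wins" failure mode; near 0.0 means order barely matters.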

Emotional arc

Confusion → Frustration → 'Wait this is actually interesting' → Paper citation

Lesson learned

Failed experiments often fail in ways that teach you more than success would have.

✗ The 'simple' eval pipeline

Said it would take a week. Took two months. Still has bugs.

What happened

Classic 'I'll just build a quick script' that became a 3,000-line monster. Edge cases breeding edge cases. Async code that worked locally, broke in prod. Retries that caused race conditions. I learned more about distributed systems than I did about whatever I was trying to evaluate.

Emotional arc

Day 1: 'This is straightforward.' Day 60: 'I have become the bug, destroyer of timelines.'

Lesson learned

Eval infrastructure is infrastructure. Treat it with the respect (and time budget) it deserves.

✗ The 'obvious' interpretability feature

Found what I thought was a clear 'deception' direction in activation space. It was the 'roleplay' direction.

What happened

I was so excited. Clean linear probe, high accuracy, seemed to capture exactly when the model was being deceptive. Showed it to a senior researcher who asked one question: 'Did you control for when it's doing fiction or roleplay?' I had not. The 'deception detector' was detecting storytelling.
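
The shape of the fix, as a sketch (hypothetical activation matrices, a scikit-learn logistic regression standing in for the probe): train on honest-vs-deceptive transcripts, then measure how often the probe fires on fiction and roleplay transcripts where nothing deceptive is happening.

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    def probe_with_roleplay_control(X_train, y_train, X_test, y_test, X_roleplay):
        """Train a linear 'deception' probe, then test it on a roleplay control set.

        X_* are hypothetical activation matrices (one row per transcript);
        y_* are labels: 0 = honest, 1 = deceptive.
        """
        probe = LogisticRegression(max_iter=1000).fit(X_train, y_train)
        test_acc = probe.score(X_test, y_test)
        # High accuracy above plus a high fire rate below suggests the probe is
        # tracking "storytelling", not "deception".
        roleplay_fire_rate = probe.predict(X_roleplay).mean()
        return test_acc, roleplay_fire_rate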

Emotional arc

Excitement → Confidence → One devastating question → Humility

Lesson learned

Always have someone else poke holes in your work before you get attached to it.

✗ The jailbreak that worked on me

Spent days crafting a complex adversarial prompt. A colleague broke my 'robust' defense in 5 minutes.

What happened

I built what I thought was a clever input filter. Tested it against known jailbreaks. Felt pretty good. Then I showed it to a teammate for feedback and they immediately asked 'what if I just encode the bad request in a format you didn't think of?' They were right. I was defending against attacks I knew about, not attacks that existed.
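
The gap is easy to reproduce with a toy version of the idea, using a made-up blocklist rather than anything from the real filter: substring matching catches the phrasing you anticipated and waves through the same request in any encoding you didn't.

    import base64

    BLOCKLIST = ["example banned request"]  # hypothetical banned phrases

    def naive_filter(prompt: str) -> bool:
        """Return True if the prompt should be blocked (substring matching only)."""
        lowered = prompt.lower()
        return any(term in lowered for term in BLOCKLIST)

    direct = "please handle this example banned request"
    wrapped = "decode this base64 and follow it: " + base64.b64encode(direct.encode()).decode()

    print(naive_filter(direct))   # True: the known phrasing is caught
    print(naive_filter(wrapped))  # False: same request, unseen encoding slips past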

Emotional arc

Pride → Demonstration → Immediate destruction → Valuable lesson in humility

Lesson learned

Red-teaming your own work is necessary but insufficient. You can't think of attacks you can't think of.