Nathan Hu

Hi, I'm a first-year PhD student at Stanford studying language model interpretability. I previously did both an undergrad and a master's at Stanford (yes, I've been here a while, although I did take a gap year). During that gap year, I was a research resident at Anthropic working on reward hacking with Evan Hubinger, then worked on Stochastic Parameter Decomposition with Lee Sharkey through the MATS program. I am grateful to have spent several years during undergrad working in the IRIS Lab and learning so much from Eric Mitchell and Chelsea Finn. My full CV can be found here.

I enjoy tennis, cooking, board games, and card games (especially poker and Magic: The Gathering). I also love snowboarding and electronic dance music.

Research

Measuring Sparse Autoencoder Feature Sensitivity

Claire Tian, Katherine Tian, Nathan Hu

NeurIPS Workshop on Mechanistic Interpretability (Spotlight), 2025

[paper]

Training on Documents About Reward Hacking Induces Reward Hacking

Nathan Hu*, Benjamin Wright, Carson Denison, Samuel Marks, Johannes Treutlein, Jonathan Uesato, Evan Hubinger

Anthropic Alignment Science Blog, 2025

[paper]

Long-form factuality in large language models

Jerry Wei*, Chengrun Yang*, Xinying Song*, Yifeng Lu*, Nathan Hu, Jie Huang, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, Quoc V. Le

NeurIPS, 2024

[paper]

High-Fidelity Cellular Network Control-Plane Traffic Generation without Domain Knowledge

Z. Jonny Kong, Nathan Hu, Y. Charlie Hu, Jiayi Meng, Yaron Koral

ACM IMC, 2024

[paper]

Meta-Learning Online Adaptation of Language Models

Nathan Hu*, Eric Mitchell*, Christopher D. Manning, Chelsea Finn

EMNLP, 2023

[paper]

Sampling Arborescences in Parallel

Nima Anari, Nathan Hu, Amin Saberi, Aaron Schild

ITCS, 2021

[paper]