Nathan Hu

Hi, I'm a first-year PhD student at Stanford studying language model interpretability. I previously did both an undergrad and a master's at Stanford (yes, I've been here a while, although I did take a gap year). During that gap year, I was a research resident at Anthropic working on reward hacking with Evan Hubinger, then worked on Stochastic Parameter Decomposition with Lee Sharkey through the MATS program. I am grateful to have spent several years during undergrad working in the IRIS Lab and learning so much from Eric Mitchell and Chelsea Finn. My full CV can be found here.

I enjoy tennis, cooking, board games, and card games (especially poker and Magic: The Gathering). I also love snowboarding and electronic dance music.

Research

Measuring Sparse Autoencoder Feature Sensitivity
Claire Tian, Katherine Tian, Nathan Hu
NeurIPS Workshop on Mechanistic Interpretability (Spotlight), 2025
Training on Documents About Reward Hacking Induces Reward Hacking
Nathan Hu*, Benjamin Wright, Carson Denison, Samuel Marks, Johannes Treutlein, Jonathan Uesato, Evan Hubinger
Anthropic Alignment Science Blog, 2025
Long-form factuality in large language models
Jerry Wei*, Chengrun Yang*, Xinying Song*, Yifeng Lu*, Nathan Hu, Jie Huang, Dustin Tran, Daiyi Peng, Ruibo Liu, Da Huang, Cosmo Du, Quoc V. Le
NeurIPS, 2024
High-Fidelity Cellular Network Control-Plane Traffic Generation without Domain Knowledge
Z. Jonny Kong, Nathan Hu, Y. Charlie Hu, Jiayi Meng, Yaron Koral
ACM IMC, 2024
Meta-Learning Online Adaptation of Language Models
Nathan Hu*, Eric Mitchell*, Christopher D. Manning, Chelsea Finn
EMNLP, 2023
Sampling Arborescences in Parallel
Nima Anari, Nathan Hu, Amin Saberi, Aaron Schild
ITCS, 2021