Currently focused on figuring out how to evaluate and modify the capabilities, values, and goals of AI (especially LLMs) at the ground-truth level and in an unsupervised fashion, for cases where we cannot apply human judgement to evaluate their output.

You can contact me at Especially contact me if you are interested in these questions, even if you feel like you don’t have a good grasp on them (it’s not clear if anyone has!).

You can find my Twitter here.

May 9, A Two sentence Jailbreak for GPT-4 and Claude & Why Nobody Knows How to Fix It

April 25, AI Alignment Is Turning from Alchemy Into Chemistry

March 8, lifehacks

February 7, My 2022 self (I don't know them) was very wrong about meditation, huge monitors, and... sleep.

January 27, Why AI experts' jobs are always decades from being automated

December 3, Best of Holden Karnofsky and Sam Altman

Most Notable




Fiction & Art


Full archive sorted by date