Is AI alignment on track? Is it progressing... too fast?

If you ask an alignment researcher how to measure the alignment of GPT-4 or Claude, they might go on an hour-long tirade about deceptive alignment, instrumental convergence, and the many-worlds interpretation of quantum mechanics — but they won’t give you any numbers.

How come? Why do we have AI capabilities benchmarks and “normal” safety/robustness benchmarks, but no alignment benchmark? And how can we measure the alignment of a human, a computer, God or anything else at all?

Ok, what about the chances of AI destroying all of humanity or taking over forever, turning us into its slaves or happy but powerless pets (“p-doom”)?

When physicists started to worry about the atomic bomb potentially igniting the atmosphere, they did real calculations and got real numbers.

But if you ask people about their p-doom, you’ll get numbers anywhere from 0.1% to 99.9%, and there won’t be a rigorous model backing a single one of them. Only stories, narratives, and vague prophecies of doom.

Actually, the story I hear is the same every year: “GPT N-1 is totally safe. GPT N you should be careful with. GPT N+1 will go rogue and destroy the world.” And every time N+1 arrives, somehow none of this happens. (Every AI safety person who reads this essay seems to be extremely offended by this characterization; every non-AI-safety person tells me this sounds about right.)

What’s going on here? What’s the ground truth? Did God curse our generation to live in times of change, and is the only way we know how to cope to turn away from science & predictions and towards religion & prophecies?

To be clear, this is not an essay bashing AI alignment. I love alignment. I spent most of 2023 writing about it and working on it, getting some neat results, e.g. a partially automated process for jailbreaking GPT-4 and Claude, and even getting an academic paper to cite my essay on jailbreaks. …all of this in the breaks from yelling at AI researchers at parties about them not doing enough about alignment until I stopped being invited to any. Sorry kipply and Jacob! I wish this was a silly joke, but it’s not. Why else would I be sitting in a tiny dark room on a Saturday evening writing this essay and not at a party?

Is AI alignment on track?

Given the absence of hard numbers, what can we conclude about the state of alignment from the evidence we have?

To start: how does the paper-clipper AI that Yudkowsky came up with in the early 2000s compare with AI in 2023, specifically ChatGPT?

  1. A paper-clipper ruthlessly pursues well-defined goals; ChatGPT doesn’t even have goals per se and when you ask it to pursue a goal it’ll just LARP doing it.
  2. A paper-clipper runs indefinitely, until it achieves its objective; ChatGPT stops as soon as it doesn’t know what to say and asks you for further directions.
  3. A paper-clipper sucks at understanding what we actually want, happily destroying all of humanity if asked to maximize the number of paper-clips; ChatGPT is so good at understanding what we want that we literally ask it to improve our own prompts.

As far as I can tell, ChatGPT, and large language models overall, are about as different as they could be from the AI people used to imagine going rogue and destroying humanity.

What about the current alignment research? How much progress are we making?

A few examples off the top of my head, just from the last year:

  1. Reinforcement Learning from AI Feedback (Constitutional AI)
  2. GPT-4 automatically explaining the neurons of GPT-2
  3. Provable, surgical erasure of all linearly-encoded information about a concept from neural net activations
  4. Steering GPT-2 by adding linear activation vectors
  5. Training on fully-synthetic data

The first two results in particular show how the more capable the models become, the better they are at assisting us with safety and alignment research. Constitutional AI simply didn’t work in 2020 and GPT-3 wasn’t good enough to do interpretability by itself.
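Items 3 and 4 in the list above both come down to linear operations on a model's activations. Here is a toy sketch of the idea, not the published methods: it uses random numbers instead of real GPT-2 activations, the function names `steer` and `erase_direction` are made up for illustration, and the erasure step is a plain orthogonal projection of a single direction, which is a simplified stand-in for the provable erasure result.

```python
import numpy as np

def steer(activations, direction, strength=1.0):
    """Add a scaled steering vector to every row of activations."""
    return activations + strength * direction

def erase_direction(activations, direction):
    """Remove the component of each activation along one direction.

    Assumption: a single orthogonal projection, used here only as a
    simplified illustration of linear concept erasure.
    """
    unit = direction / np.linalg.norm(direction)
    return activations - np.outer(activations @ unit, unit)

rng = np.random.default_rng(0)
acts = rng.normal(size=(4, 8))   # toy data: 4 token positions, hidden size 8
concept = rng.normal(size=8)     # hypothetical "concept" direction

steered = steer(acts, concept, strength=2.0)   # push activations toward the concept
erased = erase_direction(acts, concept)        # nothing left along the concept
```

After erasure, a linear readout along `concept` returns zero for every position, while steering shifts every position by the same vector.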

On top of this, our current alignment techniques, like Reinforcement Learning from Human Feedback (RLHF), seem to be working fine. In fact, they work better, not worse, as models get smarter.

For example, if you jailbreak ChatGPT, it will tell you how to hotwire a 2004 Honda Accord, but it won’t actually generate the worst illegal content (if it did, the AGI companies would be drowning in lawsuits by now). And even jailbreaking itself clearly happens against ChatGPT’s intent: as soon as it realizes it’s not following its policies, it snaps back and tells you to go touch grass.

…and as I was finishing this essay, Anthropic published a paper in which they seem to have literally solved the biggest mechanistic interpretability problem they’ve been attacking (the problem of “superposition”, i.e. large language models packing several concepts into a single neuron, which makes it impossible to cleanly ablate certain knowledge or behavior from them). Chris Olah, who leads interpretability research at Anthropic: “I’m now very optimistic.” Although see Nina Rimsky: “The recent Anthropic Sparse Coding paper did not change much at all - Sparse Coding was already a known technique.”

So, everything we’ve seen in the last few years suggests that alignment is a much more tractable problem than people thought before the appearance of large language models. Is there a reason to be holding onto the paper-clipper and shoggoth analogies at all at this point?

But it gets worse (I mean, for alignment; nothing can make things worse for all of the 2004 Honda Accords anymore; don’t park yours outside).

Is alignment – not capabilities – the bottleneck to AI progress?

How did Large Language Models become used by hundreds of millions of people?

Unaligned (base) models don’t follow instructions, are incredibly random, and hallucinate wildly. Whatever their inner capabilities are, they are useless to everyone except hyperstitioning manic pixie dream prophets of the next next coming era who spend their entire life in AI dungeons.

This is where RLHF comes in: in the spirit of inverse reinforcement learning, it fits a reward model to diverse human feedback, then uses reinforcement learning to tune the language model toward the shape of human preferences that the reward model has captured.
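To make "learning the shape of human preferences" a bit more concrete: one common formulation in the RLHF literature trains a reward model on pairs of answers, where a human marked one as better. The sketch below shows only the pairwise loss for a single comparison, assuming the reward scores already exist. The name `preference_loss` is hypothetical, and this is an illustration of the general idea, not any lab's actual training code.

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Negative log-probability that the human-preferred answer wins.

    Uses the pairwise logistic (Bradley-Terry style) form
    -log(sigmoid(reward_chosen - reward_rejected)), written in a
    numerically stable way so large margins do not overflow.
    """
    margin = reward_chosen - reward_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# The loss is small when the reward model agrees with the human label...
agrees = preference_loss(reward_chosen=2.0, reward_rejected=0.0)
# ...and large when it ranks the rejected answer higher.
disagrees = preference_loss(reward_chosen=0.0, reward_rejected=2.0)
```

Minimizing this loss over many comparisons pushes the reward model to score preferred answers higher; the language model is then tuned against that learned reward.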

So it was RLHF, applied to GPT-3, that made an already capable model aligned enough and robust enough to kickstart broad deployment. No RLHF means no Claude, no ChatGPT, and no high schoolers who’d rather give up their vapes than write another English-class essay ever again.

RLHF wasn’t the work of techno-accelerationists just trying to train the biggest model. It was the work of alignment researchers, concerned with existential risks from AI, whose biggest alignment success ushered us into the true AI era.

(And if you’re one of the people who thinks that RLHF is not even real alignment, well, then alignment researchers launched the AGI race without even making alignment progress.)

Could it be time to pause AI alignment? Isn’t insufficient alignment the main bottleneck to deploying AI as fully autonomous agentic systems that could potentially cause a fast takeoff? The more alignment progress we make, the sooner we’ll all find out.
