AI Alignment Is Turning from Alchemy Into Chemistry

Introduction

For years, I would encounter a paper about alignment — the field where people are working on making AI not take over humanity and/or kill us all — and my first reaction would be “oh my god why would you do this”. The entire field felt like bullshit.

I felt like people had been working on this problem for ages: Yudkowsky, all of LessWrong, the effective altruists. The whole alignment discourse had been around for so long, yet there was basically no real progress; nothing interesting or useful. Alignment thought leaders seemed hostile to anyone who challenged their frames and ideas, unless that person happened to be an idealistic undergrad or an orthodox EA. It just felt so icky.

ML researchers still steer clear of alignment. Without practical problems and solid benchmarks they have no way to prove their contributions and to demonstrate progress. LessWrong’s i-won’t-get-to-the-point-without-making-you-read-a-10k-word-essay-about-superintelligence wordcelling is a far cry from the mathy papers they’re used to and has a distinctively crackpottish feel to it. And when you’ve got key figures in alignment making a name for themselves with Harry Potter fanfiction rather than technical breakthroughs, I’m honestly not very surprised.

Bottom line is: the field seemed weird, stuck, and lacking any clear, good ideas and problems to work on. It basically felt like alchemy.

Alchemy, heavier-than-air flight, and neural networks

…a 1940 Scientific American article authoritatively titled, “Don’t Worry—It Can’t Happen”…advised the reader to not be concerned about it any longer “and get sleep”. (‘It’ was the atomic bomb, about which certain scientists had stopped talking, raising public concerns; not only could it happen, the British bomb project had already begun, and 5 years later it did happen.)

Gwern

What if we continue the analogy?

Alchemy. If you lived in 1700 and looked at the field, you’d see people had been trying to turn random stuff into gold for hundreds of years, with loads of weird experiments and ideas. But there was no success, just failure after failure. It’d be easy to think the whole thing was a waste of time.

Yet, right around that time, alchemy finally started turning into chemistry. Better techniques, equipment, and a shift in focus from mystical goals to the systematic study of the properties and behavior of materials led to the emergence of chemistry as a distinct and rigorous scientific discipline.

People began to get real results, learning about the properties of materials and how to make them useful. They didn’t achieve the big goal of turning random shit into gold, but they discovered a ton of useful knowledge and applications.

Heavier-than-air flight. Same story. People tried to build flying machines for thousands of years. Nothing worked. Dudes just kept being like “ok I made a bunch of big discoveries in math, science & philosophy, wrote a treatise, and now it’s time to kill myself by jumping off a tower wearing my mechanical wings”.

In 1903, the NYT famously published an editorial that said: “it might be assumed that the flying machine which will really fly might be evolved by the combined and continuous efforts of mathematicians and mechanicians in from one million to ten million years.” Two months later, the Wright brothers, working in a bicycle shop with no formal education in the field, managed to build exactly this kind of flying machine.

As we now know, you strap a propeller and a powerful enough engine to an NYT journalist and even they will fly.

Neural networks. Again, same story. For decades, they didn’t work… and then they just did. The field didn’t recognize that the thing that was missing was compute – the consensus was that neural networks are simply useless. Gwern:

In 2010, who would have predicted that over the next 10 years, deep learning would undergo a Cambrian explosion causing a mass extinction of alternative approaches throughout machine learning, that models would scale [thousands of times] … and would just spontaneously develop all these capabilities?

No one. That is, no one aside from a few diehard connectionists written off as willfully-deluded old-school fanatics by the rest of the AI community (never mind the world), such as Moravec, Schmidhuber, Sutskever, Legg, & Amodei?

So why do I think alignment is turning into “chemistry” now?

The key problem in alignment is figuring out how to evaluate AI systems when we don’t understand what they’re doing.

As far as I know, nobody managed to make practical progress on this issue until literally last year. Collin Burns et al.’s Discovering Latent Knowledge in Language Models Without Supervision was the first alignment paper where my reaction was “fuck, this is legit”, rather than “oh my god why are you doing this”.

Burns et al. actually managed to show that we can learn something about what non-toy LLMs “think” or “believe” without humans labeling the data at all. As they say:

Our results provide an initial step toward discovering what language models know, distinct from what they say, even when we don’t have access to explicit ground truth labels.

Burns’ method probably won’t be the one to solve alignment for good. However, it’s an essential first step: a proof of concept that we can get at what a model knows without supervision, even when we can’t evaluate what the AI is doing ourselves. It is the biggest reason why I think the field is finally becoming real.
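For the curious, here is a rough sketch of the idea as I understand it from the paper. You take a yes/no question, write it out once with the answer “yes” and once with the answer “no”, and grab the model’s hidden state for each version. Then you fit a small probe whose outputs on the two versions should behave like probabilities of a statement and its negation: they should sum to about 1 (consistency) and they shouldn’t both sit at 0.5 (confidence). No labels appear anywhere. The function names and the toy hidden states below are made up for illustration; the paper’s version also normalizes the hidden states and trains the probe by gradient descent, both of which I skip here.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def probe(theta, bias, hidden):
    """Linear probe: maps a hidden state to a probability that the statement is true."""
    return sigmoid(sum(t * h for t, h in zip(theta, hidden)) + bias)


def ccs_loss(theta, bias, pos_states, neg_states):
    """Average consistency + confidence loss over contrast pairs (no labels used)."""
    total = 0.0
    for h_pos, h_neg in zip(pos_states, neg_states):
        p_pos = probe(theta, bias, h_pos)  # probe output for the "yes" version
        p_neg = probe(theta, bias, h_neg)  # probe output for the "no" version
        consistency = (p_pos - (1.0 - p_neg)) ** 2  # the two should sum to 1
        confidence = min(p_pos, p_neg) ** 2         # penalize sitting at 0.5/0.5
        total += consistency + confidence
    return total / len(pos_states)


# Toy 2-d "hidden states", invented for illustration (not from a real model).
# In this toy data the first coordinate carries the truth signal.
pos_states = [[2.0, 0.1], [-2.0, 0.3]]
neg_states = [[-2.0, 0.1], [2.0, 0.3]]

# A probe that reads the first coordinate vs. a probe that ignores everything.
informative = ccs_loss([3.0, 0.0], 0.0, pos_states, neg_states)
degenerate = ccs_loss([0.0, 0.0], 0.0, pos_states, neg_states)

print(informative, degenerate)  # the degenerate probe scores exactly 0.25
```

The probe that always answers 0.5 is perfectly consistent but gets punished by the confidence term, while the probe that picks up on the truth-carrying direction drives the loss close to zero. That is the whole trick: the loss only rewards directions in the hidden states that behave the way truth should.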

Implications for alignment

What does all of this mean for alignment?

First, ignore the “craziness”. Just because nobody has achieved any meaningful results yet doesn’t mean the field is doomed to fail. And the eccentricity of the people involved doesn’t tell us much about a field’s inherent meaningfulness or its potential for progress.

Fields start out as speculative bs, become semi-real and “pre-paradigmatic”, and then become established and well-respected. Such transitions are not a rare exception: they are the norm.

Second, discard history, stop reading the literature, and think for yourself. Just like with alchemy and chemistry, most historical work in AI alignment is likely irrelevant. While there’s been progress on more down-to-earth stuff, like RLHF, pioneered by Paul Christiano and colleagues at OpenAI, nobody knows how we’ll be able to evaluate what an AI is doing when it becomes smarter than us.

So, instead of getting bogged down by the history of alignment and the people involved, or wasting time trying to find meaning in existing literature and getting discouraged by the lack of concrete problems or ideas that make sense, we should think for ourselves and focus on our own ideas and experiments. RLHF won’t cut it for AGI, so we need people to join this field and develop the tools that will.

Remember, you don’t know if you are capable of making a contribution until you actually make one. If you assume you can, you have a chance, but if you assume you can’t, you don’t. In a sense, the answer to this question is already there & your job is simply to poke the universe hard enough to discover it.

(Please don’t ask me how much time I spent at 17 trying to prove that a series with terms 1/n^p could still diverge for p>1, because I assumed that all mathematicians, including my calculus lecturer with a PhD in math from Columbia, were wrong.)

Third, I don’t actually know if alignment is turning into “chemistry” right now! Yudkowsky famously noted that “Every 18 months, the minimum IQ necessary to destroy the world drops by one point”. So, sure, someone with an IQ of 160 might not have destroyed the world last year, but that doesn’t mean someone with an IQ of 159 won’t this year. The same principle applies to alignment.

You’re too early, and what you’re doing is alchemy. A single variable about the world — obvious only in retrospect — changes and what you’re doing is chemistry.

Final words

a good idea will draw overly-optimistic researchers or entrepreneurs to it like moths to a flame: all get immolated but the one with the dumb luck to kiss the flame at the perfect instant, who then wins everything … All major success stories overshadow their long list of predecessors who did the same thing, but got unlucky. The lesson of history is that for every lesson, there is an equal and opposite lesson.

Gwern

The people who’ll make critical progress in figuring out how the hell neural networks work (and what, if anything, they dream of) won’t have a clue if their stuff will finally work. Maybe alignment will remain alchemy for months or years and everything we do now will end up being useless.

Yet someone will start doing it at the right time and, given that AGI really does feel like it’s right around the corner, I believe working on it is a gamble worth taking.

So: go open a new Google Doc, set a timer for 1 hour & just write down all of your thoughts on alignment, even if you feel like you have no thoughts. Don’t stop until the hour runs out. I promise, you’ll be surprised at how much you’ll end up surfacing & how much clarity you’ll gain when you’re not anchored to anything external.

If this works, just do the same thing with a 1 hour timer tomorrow & keep doing it until you’re actually out of stuff to write. Only then go and start reading other people.

Some people in the AI field think the risks of AGI (and successor systems) are fictitious; we would be delighted if they turn out to be right, but we are going to operate as if these risks are existential.

Sam Altman

Acknowledgements

Thanks to Leo Aschenbrenner for critical feedback & making me finally write this essay. Also thanks to Niko McCarty for incredible editing and to Allen Hoskins.

For the first time in my life, I had in my possession something that no one else in the world had. I was able to say something new about the universe. … If you experience this feeling once, you will want to go back and do it again. (via @webdevmason)

Appendix

  1. Also see Leo Aschenbrenner’s excellent Nobody’s on the ball on AGI alignment, voicing a lot of similar sentiments:

Observing from afar, it’s easy to think there’s an abundance of people working on AGI safety. Everyone on your timeline is fretting about AI risk, and it seems like there is a well-funded EA-industrial-complex that has elevated this to their main issue. Maybe you’ve even developed a slight distaste for it all—it reminds you a bit too much of the woke and FDA bureaucrats, and Eliezer seems pretty crazy to you.

That’s what I used to think too, a couple of years ago. Then I got to see things more up close. And here’s the thing: nobody’s actually on the friggin’ ball on this one!