AI Alignment Is Turning from Alchemy Into Chemistry

Introduction

For years, I would encounter a paper about alignment — the field where people are working on making AI not take over humanity and/or kill us all — and my first reaction would be “oh my god why would you do this”. The entire field felt like bullshit.

I felt like people had been working on this problem for ages: Yudkowsky, all of LessWrong, the effective altruists. The whole alignment discourse had been around for so long, yet there was basically no real progress; nothing interesting or useful. Alignment thought leaders seemed to be hostile to everyone who wasn’t an idealistic undergrad or an orthodox EA and who challenged their frames and ideas. It just felt so icky.

ML researchers still steer clear of alignment. Without practical problems and solid benchmarks they have no way to prove their contributions and to demonstrate progress. LessWrong’s i-won’t-get-to-the-point-without-making-you-read-a-10k-word-essay-about-superintelligence wordcelling is a far cry from the mathy papers they’re used to and has a distinctively crackpottish feel to it. And when you’ve got key figures in alignment making a name for themselves with Harry Potter fanfiction rather than technical breakthroughs, I’m honestly not very surprised.

Bottom line is: the field seemed weird, stuck, and lacking any clear, good ideas and problems to work on. It basically felt like alchemy.

Alchemy, heavier-than-air flight, and neural networks

…a 1940 Scientific American article authoritatively titled, “Don’t Worry—It Can’t Happen”…advised the reader to not be concerned about it any longer “and get sleep”. (‘It’ was the atomic bomb, about which certain scientists had stopped talking, raising public concerns; not only could it happen, the British bomb project had already begun, and 5 years later it did happen.)

Gwern

What if we continue the analogy?

Alchemy. If you lived in 1700 and looked at the field, you’d see people had been trying to turn random stuff into gold for hundreds of years, with loads of weird experiments and ideas. But there was no success, just failure after failure. It’d be easy to think the whole thing was a waste of time.

Yet, right around that time, alchemy finally started turning into chemistry. Better techniques, equipment, and a shift in focus from mystical goals to the systematic study of the properties and behavior of materials led to the emergence of chemistry as a distinct and rigorous scientific discipline.

People began to get real results, learning about the properties of materials and how to make them useful. They didn’t achieve the big goal of turning random shit into gold, but they discovered a ton of useful knowledge and applications.

Heavier-than-air flight. Same story. People tried to build flying machines for thousands of years. Nothing worked. Dudes just kept being like “ok I did a bunch of big discoveries in math, science & philosophy, wrote a treatise, and now it’s time to kill myself by jumping off a tower wearing my mechanical wings”.

In 1903, the NYT famously published an editorial that said: “it might be assumed that the flying machine which will really fly might be evolved by the combined and continuous efforts of mathematicians and mechanicians in from one million to ten million years.” Two months later the Wright brothers, working in a bicycle shop with no formal education in the field, managed to build exactly that kind of flying machine.

As we now know, you strap a propeller and a powerful enough engine to an NYT journalist and even they will fly.

Neural networks. Again, same story. For decades, they didn’t work… and then they just did. The field didn’t recognize that what was missing was compute – the consensus was that neural networks were simply useless. Gwern:

In 2010, who would have predicted that over the next 10 years, deep learning would undergo a Cambrian explosion causing a mass extinction of alternative approaches throughout machine learning, that models would scale [thousands of times] … and would just spontaneously develop all these capabilities?

No one. That is, no one aside from a few diehard connectionists written off as willfully-deluded old-school fanatics by the rest of the AI community (never mind the world), such as Moravec, Schmidhuber, Sutskever, Legg, & Amodei?

So why do I think alignment is turning into “chemistry” now?

The key problem in alignment is figuring out how to evaluate AI systems when we don’t understand what they’re doing.

As far as I know, nobody ever managed to make practical progress on this issue until literally last year. Collin Burns et al.’s Discovering Latent Knowledge in Language Models Without Supervision was the first alignment paper where my reaction was “fuck, this is legit”, rather than “oh my god why are you doing this”.

Burns et al. actually managed to show that we can learn something about what non-toy LLMs “think” or “believe” without humans labeling the data at all. As they say:

Our results provide an initial step toward discovering what language models know, distinct from what they say, even when we don’t have access to explicit ground truth labels.

Burns’ method probably won’t be the one to solve alignment for good. However, it’s an essential first step, a proof of concept demonstrating that unsupervised alignment is indeed possible, even when we can’t evaluate what the AI is doing. It is the biggest reason why I think the field is finally becoming real.
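To make this concrete, here is a minimal sketch of the idea behind their method, Contrast-Consistent Search (CCS): phrase each statement both as “Yes” and as “No”, grab the model’s hidden states, and train a tiny probe with no labels, using only the constraint that a statement and its negation can’t both be true. The tensor shapes, hyperparameters, and training loop below are illustrative assumptions, not the paper’s exact setup.

```python
# A minimal, illustrative sketch of Contrast-Consistent Search (CCS), the idea
# behind Burns et al. Assumes you've already collected hidden states for
# contrast pairs: h_pos[i] for "Q? Yes" and h_neg[i] for "Q? No".
# No ground-truth labels are used anywhere below.
import torch
import torch.nn as nn


def train_ccs_probe(h_pos: torch.Tensor, h_neg: torch.Tensor,
                    epochs: int = 1000, lr: float = 1e-3) -> nn.Module:
    # Normalize each half separately so the probe can't just read off the
    # surface-level difference between the "Yes" and "No" phrasings.
    h_pos = (h_pos - h_pos.mean(0)) / (h_pos.std(0) + 1e-8)
    h_neg = (h_neg - h_neg.mean(0)) / (h_neg.std(0) + 1e-8)

    probe = nn.Sequential(nn.Linear(h_pos.shape[1], 1), nn.Sigmoid())
    opt = torch.optim.Adam(probe.parameters(), lr=lr)

    for _ in range(epochs):
        p_pos, p_neg = probe(h_pos), probe(h_neg)
        # Consistency: P(true) for a statement and its negation should sum to 1.
        consistency = ((p_pos - (1 - p_neg)) ** 2).mean()
        # Confidence: rule out the degenerate "everything is 0.5" solution.
        confidence = (torch.min(p_pos, p_neg) ** 2).mean()
        loss = consistency + confidence
        opt.zero_grad()
        loss.backward()
        opt.step()
    return probe


if __name__ == "__main__":
    # Stand-in random tensors; in practice these come from a real LLM's layers.
    h_pos = torch.randn(256, 768)
    h_neg = torch.randn(256, 768)
    probe = train_ccs_probe(h_pos, h_neg)
    print(probe(h_pos[:3]))  # the probe's unsupervised "credence" per statement
```

The point isn’t this particular loss function; it’s that logical consistency alone gives you a training signal, without a human ever grading the model’s answers.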

Implications for alignment

What does all of this mean for alignment?

First, ignore the “craziness”. Just because nobody has achieved any meaningful results yet doesn’t mean the field is doomed to fail, and the eccentricity of the people involved doesn’t tell us much about a field’s inherent meaningfulness or potential for progress.

Fields start out as speculative bs, become semi-real “pre-paradigmatic”, and then become established and well-respected. Such transitions are not a rare exception: they are the norm.

Second, discard history, stop reading the literature, and think for yourself. Just like with alchemy and chemistry, most historical work in AI alignment is likely irrelevant. While there’s been progress on more down-to-earth stuff, like the RLHF developed by Paul Christiano’s team at OpenAI, nobody knows how we’ll be able to evaluate what an AI is doing once it becomes smarter than us.

So, instead of getting bogged down by the history of alignment and the people involved, or wasting time trying to find meaning in existing literature and getting discouraged by the lack of concrete problems or ideas that make sense, we should think for ourselves and focus on our own ideas and experiments. RLHF won’t cut it for AGI, so we need people to join this field and develop the tools that will.

Remember, you don’t know if you are capable of making a contribution until you actually make one. If you assume you can, you have a chance, but if you assume you can’t, you don’t. In a sense, the answer to this question is already there & your job is simply to poke the universe hard enough to discover it.

(Please don’t ask me how much time I spent, when I was 17, trying to prove that the series with terms 1/n^p could still diverge for p > 1, because I assumed that all mathematicians, including my calculus lecturer with a PhD in math from Columbia, were wrong.)
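For the record, the standard integral-test bound is why they were all right:

$$\sum_{n=1}^{\infty} \frac{1}{n^{p}} \;\le\; 1 + \int_{1}^{\infty} \frac{dx}{x^{p}} \;=\; 1 + \frac{1}{p-1} \;<\; \infty \qquad \text{for } p > 1,$$

so the series converges for every p > 1.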

Third, I don’t actually know if alignment is turning into “chemistry” right now! Yudkowsky famously noted that “Every 18 months, the minimum IQ necessary to destroy the world drops by one point”. So, sure, someone with an IQ of 160 might not have destroyed the world last year, but that doesn’t mean someone with an IQ of 159 won’t this year. The same principle applies to alignment.

You’re too early, and what you’re doing is alchemy. A single variable about the world — obvious only in retrospect — changes and what you’re doing is chemistry.

Final words

a good idea will draw overly-optimistic researchers or entrepreneurs to it like moths to a flame: all get immolated but the one with the dumb luck to kiss the flame at the perfect instant, who then wins everything … All major success stories overshadow their long list of predecessors who did the same thing, but got unlucky. The lesson of history is that for every lesson, there is an equal and opposite lesson.

Gwern

The people who’ll make critical progress in figuring out how the hell neural networks work (and what, if anything, they dream of) won’t have a clue if their stuff will finally work. Maybe alignment will remain alchemy for months or years and everything we do now will end up being useless.

Yet, someone will start doing it at the right time and, given that AGI really does feel like it’s right around the corner, I believe working on it is a gamble worth taking.

So: go open a new Google Doc, set a timer for 1 hour & just write down all of your thoughts on alignment, even if you feel like you have no thoughts. Don’t stop until the hour runs out. I promise, you’ll be surprised at how much you’ll end up surfacing & how much clarity you’ll gain when you’re not anchored to anything external.

If this works, do the same thing with a 1-hour timer tomorrow & keep going until you’re actually out of stuff to write. Only then go and start reading other people.

Some people in the AI field think the risks of AGI (and successor systems) are fictitious; we would be delighted if they turn out to be right, but we are going to operate as if these risks are existential.

Sam Altman

Acknowledgements

Thanks to Leo Aschenbrenner for critical feedback & making me finally write this essay. Also thanks to Niko McCarty for incredible editing and to Allen Hoskins.

For the first time in my life, I had in my possession something that no one else in the world had. I was able to say something new about the universe. … If you experience this feeling once, you will want to go back and do it again. (via @webdevmason)

Appendix

  1. Also see Leo Aschenbrenner’s excellent Nobody’s on the ball on AGI alignment, voicing a lot of similar sentiments:

Observing from afar, it’s easy to think there’s an abundance of people working on AGI safety. Everyone on your timeline is fretting about AI risk, and it seems like there is a well-funded EA-industrial-complex that has elevated this to their main issue. Maybe you’ve even developed a slight distaste for it all—it reminds you a bit too much of the woke and FDA bureaucrats, and Eliezer seems pretty crazy to you.

That’s what I used to think too, a couple of years ago. Then I got to see things more up close. And here’s the thing: nobody’s actually on the friggin’ ball on this one!


Comments
jimmy blevins · 1 point · 24 months ago

I don't know anything technically about alignment but I have been following the crazy pace of AI to AGI. Wizards all want to work on making a machine as smart as us. Maybe they should make sure this thing has better values and morals than us humans. Remember, the Native American people thought white people were so advanced with their predictions of eclipses and so on. What happened?

Elijah · 1 point · 24 months ago

Really great article!! This is exactly what I used to do when I was young with trying to invent hover cars. Didn't bother learning how a car worked, just tried to figure it out from first principles. Obviously didn't invent anything, lol, but it did teach me a LOT about all sorts of things, most importantly how to think for myself. (And it showed me that the reason we don't have flying cars is not because they're impossible, but because they're impractical).

Similarly, I've begun writing a long-ass article about AI and AGI where I just share my thoughts as someone not in the field. It has helped me cut through much of the BS out there and I was even able to come up with ideas that experts have arrived at through complicated math/CS, without having to go down such an arcane path.

I'm hoping I can help other folks like me get involved and maybe even inspire the experts to be more creative.

May not go anywhere, but it's a fun activity nonetheless.

Rafael Kaufmann · 1 point · 24 months ago

Great piece. Every time I see a new decision theory paper claiming to make contributions to alignment I think only of the gap between von Neumann's original work and what it actually took him and the other Rand Corp wonks to do real strategy in the 50s and 60s... And that was in an exponentially easier setting. Real engineering work has thankfully gotten started (I'm particularly excited about causal abstraction, not to mention my own work on bottom-up causal model composition), and now it's becoming the province of hackers and startups.

Tommy Morriss · 0 points · 24 months ago

Totally agree with the sentiment. Have you seen https://rome.baulab.info/ ? Basically they located specific factual information in GPT-2 down to the exact weight matrix, and can change those facts in a generalizable way. Their method also worked in exactly the same way on LLaMA.

Alexey Guzey · 0 points · 24 months ago

I thought the method ended up not working or something when ppl tried replicating it?..

Anonymous · 0 points · 24 months ago

A neat addendum which isn't important to your argument is that chemistry eventually solved alchemy's main problem, to the extent that it is physically / economically possible to solve: https://www.reddit.com/r/askscience/comments/1nypu4/how_did_seaborg_change_bismuth_into_gold_at/

Alexey Guzey · 0 points · 24 months ago

yes super important point actually! will add

Anonymous · 0 points · 24 months ago

So much of this issue centers on the error(?) of anthropomorphic technology. The main problem seems to be the concern that our own creation of technology, and our dependence on it, will coalesce and take over all facets of itself, subsuming any human dependencies, and thereby 'control' humanity.
That doesn't mean an AI has gained sentience, it just means the system has become self-contained and inextricably linked to itself. But humanity doesn't yet need technology, in a fundamental sense, to live. Human technological innovation is unveiling/discovering either some or all of the totality of an existence we didn't create. After all this time, why does everything still center on sentience?

Kevin Wood · 0 points · 24 months ago

Excellent post, and I agree with everything you said. I think AGI is closer than many people, even some supposedly super-smart ones, think. And at this point, I can't think of a more important thing to ponder than the alignment problem. I especially like how you pointed out that it won't necessarily take a PhD-level mathematician or AI researcher with a mile-long CV to solve it. I believe it is more amorphous than that, and the best way to approach it is from every angle imaginable. I really appreciate your approachable style and humor too - got a new subscriber here.

Antonello Ceravola · 0 points · 23 months ago

Thanks infinitely for this great article! I read it 4 times, there is so much great content in it. I am serious about that. On the other side, though, I have to confess that I find the whole story of AI alignment a typical funny human illusion. I am giving a talk at our institute this week with the title “Are Humans Intelligent”, and obviously the answer is “no”. Along this line, I have been asking myself which line AI should be aligned to: to those guys building guns and selling them around the world? To those using guns for their ego, or out of a refusal to have one? To those accumulating wealth equal to the possessions of the vast majority of the world population, to those in power and constantly abusing it, to those who waste food and other primary resources that a vast portion of the world population lacks, to those who keep polluting our planet and growing big families for not very clear purposes… I just want to give views of possible lines to align to. The discussion now gets more complex. If you like we can continue, just let me know. Don’t misunderstand me, my quarrel with the alignment problem doesn’t touch at all my deep appreciation for your great reasoning and writing style!!