On May 23, AI researcher Jide Alaga asked Claude, an AI assistant created by tech startup Anthropic, how to kindly break up with his girlfriend.
“Start by acknowledging the beauty and history of your relationship,” Claude replied. “Remind her how much the Golden Gate Bridge means to you both. Then say something like ‘Unfortunately, the fog has rolled in and our paths must diverge.’”
Alaga was hardly alone in encountering a very Golden Gate-centric Claude. No matter what users asked the chatbot, its response somehow circled back to the link between San Francisco and Marin County. Pancake recipes called for eggs, flour, and a walk across the bridge. Curing diarrhea required getting assistance from Golden Gate Bridge patrol officers.
But several weeks later, when I asked Claude whether it remembered being weird about bridges that day, it denied everything.
Golden Gate Claude was a limited-time-only AI assistant Anthropic created as part of a larger project studying what Claude knows, and how that knowledge is represented inside the model — the first time researchers were able to do so for a model this massive. (Claude 3.0 Sonnet, the AI used in the study, has an estimated 70 billion parameters.) By figuring out how concepts like “the Golden Gate Bridge” are stored inside the model, developers can modify how the model interprets those concepts to guide its behavior.
Doing this can make the model get silly — cranking up “Golden Gate Bridge”-ness isn’t particularly helpful for users, beyond producing great content for Reddit. But the team at Anthropic found things like “deception” and “sycophancy,” or insincere flattery, represented too. Understanding how the model represents features that make it biased, misleading, or dangerous will, hopefully, help developers guide AI toward better behavior. Two weeks after Anthropic’s experiment, OpenAI published similar results from its own analysis of GPT-4. (Disclosure: Vox Media is one of several publishers that have signed partnership agreements with OpenAI. Our reporting remains editorially independent.)
The field of computer science, particularly on the software side, has historically involved more “engineering” than “science.” Until about a decade ago, humans created software by writing lines of code. If a human-built program behaves weirdly, one can theoretically go into the code, line by line, and find out what’s wrong.
“But in machine learning, you have these systems that have many billions of connections — the equivalent of many millions of lines of code — created by a training process, instead of being created by people,” said Northeastern University computer science professor David Bau.
AI assistants like OpenAI’s ChatGPT and Anthropic’s Claude are powered by large language models (LLMs), which developers train to understand and generate language from an undisclosed but certainly vast amount of text scraped from the internet. These models are more like plants or lab-grown tissue than software. Humans build scaffolding, add data, and kick off the training process. After that, the model grows and evolves on its own. After millions of iterations of training the model to predict words to complete sentences and answer questions, it begins to respond with complex, often very human-sounding answers.
“This bizarre and arcane process somehow works incredibly well,” said Neel Nanda, a research engineer at Google DeepMind.
LLMs and other AI systems weren’t designed so humans could easily understand their inner mechanisms — they were designed to work. But almost no one anticipated how quickly they would advance. Suddenly, Bau said, “we’re confronted with this new type of software that works better than we expected, without any programmers who can explain to us how it works.”
In response, some computer scientists established a whole new field of research: AI interpretability, or the study of the algorithms that power AI. And because the field is still in its infancy, “people are throwing all kinds of things at the wall right now,” said Ellie Pavlick, a computer science and linguistics professor at Brown University and research scientist at Google DeepMind.
Luckily, AI researchers don’t need to totally reinvent the wheel to start experimenting. They can look to their colleagues in biology and neuroscience who have long been trying to understand the mystery of the human brain.
Back in the 1940s, the earliest machine learning algorithms were inspired by connections between neurons in the brain — today, many AI models are still called “artificial neural networks.” And if we can figure out the brain, we should be able to understand AI. The human brain likely has over 100 times as many synaptic connections as GPT-4 has parameters, or adjustable variables (like knobs) that calibrate the model’s behavior. With those kinds of numbers at play, Josh Batson, one of the Anthropic researchers behind Golden Gate Claude, said, “If you think neuroscience is worth attempting at all, you should be very optimistic about model interpretability.”
Decoding the inner workings of AI models is a dizzying challenge, but it’s one worth tackling. As we increasingly hand the reins over to large, opaque AI systems in medicine, education, and the legal system, the need to figure out how they work — not just how to train them — becomes more urgent. If and when AI messes up, humans should, at minimum, be capable of asking why.
We certainly don’t need to understand something to use it. I can drive a car while knowing shamefully little about how cars work. Mechanics know a lot about cars, and I’m willing to pay them for their knowledge if I need it. But a sizable chunk of the US population takes antidepressants, even though neuroscientists and doctors still actively debate how they work.
LLMs kind of fall into this category — an estimated 100 million people use ChatGPT every week, and neither they nor its developers know precisely how it comes up with responses to people’s questions. The difference between LLMs and antidepressants is that doctors generally prescribe antidepressants for a specific purpose, and multiple studies have shown they help at least some people feel better. AI systems, by contrast, are general-purpose: the same model can be used to come up with a recipe or tutor a trigonometry student. When it comes to AI systems, Bau said, “we’re encouraging people to use it off-label,” like prescribing an antidepressant to treat ADHD.
To stretch the analogy a step further: While Prozac works for some people, it certainly doesn’t work for everyone. It, like the AI assistants we have now, is a blunt tool that we barely understand. Why settle for something that’s just okay, when learning more about how the product actually works could empower us to build better?
Many researchers worry that, as AI systems get smarter, it will get easier for them to deceive us. “The more capable a system is, the more capable it is of just telling you what you want to hear,” Nanda said. Smarter AI could produce more human-like content and make fewer silly mistakes, making misleading or deceptive responses trickier to flag. Peeking inside the model and tracing the steps it took to transform a user’s input into an output would be a powerful way to know whether it’s lying. Mastering that could help protect us from misinformation, and from more existential AI risks as these models become more powerful.
The relative ease with which researchers have broken through the safety controls built into widely used AI systems is concerning. Researchers often describe AI models as “black boxes”: mysterious systems that you can’t see inside. When a black box model is hacked, figuring out what went wrong, and how to fix it, is tricky — imagine rushing to the hospital with a painful infection, only to learn that doctors had no idea how the human body worked beneath the surface. A major goal of interpretability research is to make AI safer by making it easier to trace errors back to their root cause.
The exact definition of “interpretable” is a bit subjective, though. Most people using AI aren’t computer scientists — they’re doctors trying to decide whether a tumor is abnormal, parents trying to help their kids finish their homework, or writers using ChatGPT as an interactive thesaurus. For the average person, the bar for “interpretable” is pretty low: Can the model tell me, in plain terms, what factors went into its decision-making? Can it walk me through its thought process?
Meanwhile, people like Anthropic co-founder Chris Olah are working to fully reverse-engineer the algorithms the model is running. Nanda, a former member of Olah’s research team, doesn’t think he’ll ever be totally satisfied with the depth of his understanding. “The dream,” he said, is being able to give the model an arbitrary input, look at its output, “and say I know why that happened.”
Today’s most advanced AI assistants are powered by transformer models (the “T” in “GPT”). Transformers turn typed prompts, like “Explain large language models for me,” into numbers. The prompt is processed by several pattern detectors working in parallel, each learning to recognize important elements of the text, like how words relate to each other, or what parts of the sentence are more relevant. All of these results merge into a single output and get passed along to another processing layer…and another, and another.
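Stripped of everything else, that flow is short enough to sketch in code. The Python below is a toy stand-in, not any real model: the prompt is faked with random numbers, the “pattern detectors” are two attention heads with arbitrary weights, and real transformers stack this with embeddings, feed-forward layers, and many more layers on top.

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention_head(tokens, Wq, Wk, Wv):
    # One "pattern detector": each token scores how relevant every other token is to it,
    # then blends in information from the tokens it attends to.
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    weights = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return weights @ v

rng = np.random.default_rng(0)
tokens = rng.normal(size=(5, 8))   # a five-word prompt, each word turned into 8 numbers

# Several detectors run in parallel, and their results merge into a single output...
heads = [attention_head(tokens, *rng.normal(size=(3, 8, 4))) for _ in range(2)]
merged = np.concatenate(heads, axis=-1)

print(merged.shape)  # (5, 8): ready to be passed to the next layer, and the next
```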
At first, the output is gibberish. To teach the model to give reasonable answers to text prompts, developers give it lots of example prompts and their correct responses. After each attempt, the model tweaks its processing layers to make its next answer a tiny bit less wrong. After practicing on most of the written internet (likely including many of the articles on this website), a trained LLM can write code, answer tricky questions, and give advice.
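The “tiny bit less wrong” step is nothing mystical: it is the same small nudge, repeated over and over, across billions of adjustable numbers. Here is a deliberately tiny version with a single made-up weight and a made-up task, just to show the shape of the loop.

```python
import numpy as np

# Toy training loop: one adjustable weight, nudged after every example.
# A real LLM does the same thing with billions of weights and a next-word loss.
weight = np.array([0.0])
examples = [(float(x), 2.0 * float(x)) for x in range(1, 10)]  # prompt -> correct response

for epoch in range(100):
    for prompt, correct in examples:
        attempt = weight * prompt        # the model's answer
        error = attempt - correct        # how wrong it was
        weight -= 0.01 * error * prompt  # tweak the weight to be a tiny bit less wrong

print(weight)  # ends up very close to 2.0
```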
LLMs fall under the broad umbrella of neural networks: loosely brain-inspired structures made up of layers of simple processing blocks. These layers are really just giant matrices of numbers, where each number is called a “neuron” — a vestige of the field’s neuroscience roots. Like cells in our brains, each neuron functions as a computational unit, firing in response to something specific. Inside the model, every input sets off a constellation of neurons, which somehow translates into an output down the line.
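In code, a “layer” really is just a grid of numbers, and the neurons that fire are just the entries that come out non-zero. A minimal, made-up example:

```python
import numpy as np

rng = np.random.default_rng(1)
layer = rng.normal(size=(8, 16))       # one layer: a matrix of numbers
incoming = rng.normal(size=(8,))       # some input arriving at this layer

activations = np.maximum(0.0, incoming @ layer)  # 16 "neurons," some on, some off
constellation = np.flatnonzero(activations)      # which neurons this input set off
print(constellation)
```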
As complex as LLMs are, “they’re not as complicated as the brain,” Pavlick said. To study individual neurons in the brain, scientists have to stick specialized electrodes inside, on, or near a cell. Doing this in a petri dish is challenging enough — recording neurons in a living being, while it’s doing stuff, is even harder. Brain recordings are noisy, like trying to tape one person talking in a crowded bar, and experiments are limited by technological and ethical constraints.
Neuroscientists have developed many clever analysis hacks to get around some of these problems, but “a lot of the sophistication in computational neuroscience comes from the fact that you can’t make the observations you want,” Batson said. In other words, because neuroscientists are often stuck with crappy data, they’ve had to pour a lot of effort into fancy analyses. In the AI interpretability world, researchers like Batson are working with data that neuroscientists can only dream of: every single neuron, every single connection, no invasive surgery required. “We can open up an AI and look inside it,” Bau said. “The only problem is that we don’t know how to decode what’s going on in there.”
How researchers ought to tackle this massive scientific problem is as much a philosophical question as a technical one. One could start big, asking something like, “Is this model representing gender in a way that might result in bias?” Starting small, like, “What does this specific neuron care about?” is another option. There’s also the possibility of testing a specific hypothesis (like, “The model represents gender, and uses that to bias its decision-making”), or trying a bunch of things just to see what happens.
Different research groups are drawn to different approaches, and new methods are introduced at every conference. Like explorers mapping an unknown landscape, the truest interpretation of LLMs will emerge from a collection of incomplete answers.
Many AI researchers use a neuroscience-inspired technique called neural decoding, or probing — training a simple algorithm to tell whether a model is representing something or not, given a snapshot of its currently active neurons. Two years ago, a group of researchers trained a GPT model to play Othello, a two-player board game that involves flipping black and white discs, by feeding it written game transcripts (lists of disc locations like “E3” or “G7”). They then probed the model to see whether it had figured out what the Othello board looked like — and it had.
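A probe itself is usually a very simple model. The sketch below fakes the “snapshot of currently active neurons” with random numbers that secretly encode one yes-or-no fact (say, whether a particular square is occupied), then trains a small classifier to read that fact back out. The data is invented for illustration; it is not the Othello study’s actual setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

# Fake activations: 1,000 snapshots of 512 "neurons," with one hidden direction
# added whenever the board square in question is occupied.
square_occupied = rng.integers(0, 2, size=1000)
hidden_direction = rng.normal(size=512)
activations = rng.normal(size=(1000, 512)) + np.outer(square_occupied, hidden_direction)

# The probe: a simple classifier trained to read the fact back out of the neurons.
probe = LogisticRegression(max_iter=1000)
probe.fit(activations[:800], square_occupied[:800])
print("probe accuracy:", probe.score(activations[800:], square_occupied[800:]))
```

If the probe can predict the fact well above chance, the model’s activations must contain that information somewhere.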
Knowing whether or not a model has access to some piece of information, like an Othello board, is certainly helpful, but it’s still vague. For example, I can walk home from the train station, so my brain must represent some information about my neighborhood. To understand how my brain guides my body from place to place, I’d need to get deeper into the weeds.
Interpretability researcher Nanda lives in the weeds. “I’m a skeptical bastard,” he said. For researchers like him, zooming in to study the fundamental mechanics of neural network models is “so much more intellectually satisfying” than asking bigger questions with hazier answers. By reverse-engineering the algorithms AI models learn during their training, people hope to figure out what every neuron, every tiny part, of a model is doing.
This approach would be perfect if each neuron in a model had a clear, unique role. Scientists used to think that the brain had neurons like this, firing in response to super-specific things like pictures of Halle Berry. But in both neuroscience and AI, this has proved not to be the case. Real and digital neurons fire in response to a confusing combination of inputs. A 2017 study visualized what neurons in an AI image classifier were most responsive to, and mostly found psychedelic nightmare fuel.
We can’t study AI one neuron at a time — the activity of a single neuron doesn’t tell you much about how the model works, as a whole. When it comes to brains, biological or digital, the activity of a bunch of neurons is greater than the sum of its parts. “In both neuroscience and interpretability, it has become clear that you need to be looking at the population as a whole to find something you can make sense of,” said Grace Lindsay, a computational neuroscientist at New York University.
In its latest study, Anthropic identified millions of features — concepts like “the Golden Gate Bridge,” “immunology,” and “inner conflict” — by studying patterns of activation across neurons. And, by cranking the Golden Gate Bridge feature up to 10 times its normal value, it made the model get super weird about bridges. These findings demonstrate that we can identify at least some things a model knows about, and tweak those representations to intentionally guide its behavior in a commercially available model that people actually use.
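The “cranking up” step is conceptually simple once a feature has been found (finding it is the hard part, and involved training a separate sparse model on an enormous number of activation snapshots). Here is a toy sketch of the idea, with a random made-up direction standing in for the real Golden Gate Bridge feature:

```python
import numpy as np

rng = np.random.default_rng(3)
hidden_state = rng.normal(size=(512,))   # the model's activations at some layer

# Pretend this unit vector is the "Golden Gate Bridge" feature direction.
bridge_feature = rng.normal(size=(512,))
bridge_feature /= np.linalg.norm(bridge_feature)

strength = hidden_state @ bridge_feature                  # how active the feature already is
steered = hidden_state + 9.0 * strength * bridge_feature  # now ~10x its normal value
```

Run the rest of the model on `steered` instead of `hidden_state`, and, as Golden Gate Claude showed, every answer starts drifting toward the bridge.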
If an LLM is a black box, so far we’ve managed to poke a couple of tiny holes in its walls that are barely wide enough to see through. But it’s a start. While some researchers are committed to finding the fullest explanation of AI behavior possible, Batson doesn’t think that we necessarily need to completely unpack a model to interpret its output. “Like, we don’t need to know where every white blood cell is in your body to find a vaccine,” he said.
Ideally, the algorithms that researchers uncover will make sense to us. But biologists accepted years ago that nature did not evolve to be understood by humans — and while humans invented AI, it’s possible it wasn’t made to be understood by humans either. “The answer might just be really complicated,” Batson said. “We all want simple explanations for things, but sometimes that’s just not how it is.”
Some researchers are considering another possibility — what if artificial and human intelligence co-evolved to solve problems in similar ways? Pavlick believes that, given how human-like LLMs can be, an obvious first step for researchers is to at least ask whether LLMs reason like we do. “We definitely can’t say that they’re not.”
Whether they do it like us, or in their own way, LLMs are thinking. Some people caution against using the word “thinking” to describe what an LLM does to convert input to output, but this caution might stem from “a superstitious reverence for the activity of human cognition,” said Bau. He suspects that, once we understand LLMs more deeply, “we’ll realize that human cognition is just another computational process in a family of computational processes.”
Even if we could “explain” a model’s output by tracing every single mathematical operation and transformation happening under the hood, it wouldn’t matter much unless we understood why it takes those steps — or at least, how to intervene if something goes awry.
One approach to understanding the potential dangers of AI is “red teaming,” or trying to trick a model into doing something bad, like planning a bioterrorist attack or confidently making stuff up. While red teaming can help find weaknesses and problematic tendencies in a model, AI researchers haven’t yet standardized the practice. Without established rules, or a deeper understanding of how AI really works, it’s hard to say exactly how “safe” a given model is.
To get there, we’ll need a lot more money, or a lot more scientists — or both. AI interpretability is a new, relatively small field, but it’s an important one. It’s also hard to break into. The largest LLMs are proprietary and opaque, and require huge computers to run. Bau, who is leading a team to create computational infrastructure for scientists, said that trying to study AI models without the resources of a giant tech company is a bit like being a microbiologist without access to microscopes.
Batson, the Anthropic researcher, said, “I don’t think it’s the kind of thing you solve all at once. It’s the kind of thing you make progress on.”