While language models have improved and been widely deployed, our understanding of how they work internally remains limited. For instance, it can be hard to tell from their outputs alone whether they rely on biased heuristics or are being dishonest. Interpretability research aims to gain insight into the model from within. OpenAI's most recent interpretability work uses the GPT-4 large language model to produce natural-language explanations of the behavior of neurons in another language model, and then scores those explanations to evaluate their quality.
Studying the interpretability of AI systems helps users and developers understand their inner workings and the processes by which they arrive at conclusions, which builds confidence in these systems. Examining model behavior to uncover biases and faults can also improve performance and strengthen human-AI collaboration.
In deep learning, neurons and attention heads are the basic building blocks of the neural network and its self-attention mechanism, and a key part of interpretability research is examining what each of them does. Manually analyzing neurons to determine which properties of the data they represent is time-consuming and labor-intensive, and it becomes prohibitive for networks with tens of billions of parameters.
Understanding how these components (neurons and attention heads) work is a natural starting point for interpretability research. In the past, this has required humans to inspect neurons by hand to determine which features of the data they represent, an approach that does not scale to neural networks with hundreds of billions of parameters. The researchers instead propose an automated process that applies GPT-4 to neurons in another language model, generating and evaluating natural-language descriptions of each neuron's function.
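To make the inspection step concrete, here is a minimal sketch of the kind of record an inspector, whether human or model, would look at for a single neuron: a text excerpt paired with that neuron's activation on each token. The data structure, field names, and example values are illustrative only, not the format used in OpenAI's release.

```python
from dataclasses import dataclass


@dataclass
class ActivationRecord:
    """One text excerpt annotated with a single neuron's activations."""
    tokens: list[str]         # the excerpt, split into model tokens
    activations: list[float]  # the neuron's activation on each token


# Illustrative example: a neuron that appears to fire on movie-related words.
record = ActivationRecord(
    tokens=["The", " film", " won", " an", " Oscar", " last", " night"],
    activations=[0.0, 8.2, 0.1, 0.0, 9.7, 0.0, 0.3],
)

# A human inspector (or GPT-4) looks at many such records and proposes an
# explanation, e.g. "fires on words related to movies and awards".
for token, act in zip(record.tokens, record.activations):
    print(f"{token!r:12} {act:5.1f}")
```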
This endeavor is part of the third pillar of OpenAI's alignment strategy: automating alignment research itself. An encouraging property of the approach is that it can scale with progress in AI; as future models become more sophisticated and more useful as assistants, they should also help us understand models better.
OpenAI proposes an automated approach that uses GPT-4 to produce and score explanations of neuron behavior in other language models. This matters because AI is evolving rapidly and keeping pace requires automated methods; moreover, as newer and more capable models are built, the quality of the explanations they produce should improve.
A neuron's behavior is explained in three stages: explanation generation, simulation using GPT-4, and scoring by comparison (a code sketch of the full loop follows the list).
- First, GPT-4 is shown a GPT-2 neuron together with text sequences and the neuron's activations on them, and is asked to write a short natural-language explanation of the neuron's function.
- Next, GPT-4 is used to simulate the neuron: given only the explanation, it predicts how strongly the neuron would activate on each token of new text, which tests whether the explanation is consistent with the neuron's actual behavior.
- Finally, the explanation is scored by comparing the simulated activations with the neuron's real activations; the closer the match, the better the explanation.
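The sketch below strings the three stages together in Python. It is a simplified illustration rather than OpenAI's released implementation: the prompts are abbreviated, the helper names (`ask_gpt4`, `generate_explanation`, `simulate_activations`, `score_explanation`) are made up for this example, and the score is computed as a simple correlation between simulated and real activations, which is one natural choice.

```python
import numpy as np
from openai import OpenAI  # assumes the `openai` Python package and an API key

client = OpenAI()


def ask_gpt4(prompt: str) -> str:
    """Single-turn call to GPT-4; the model name and prompts are illustrative."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def generate_explanation(examples: list[tuple[list[str], list[float]]]) -> str:
    """Stage 1: propose an explanation from (tokens, activations) excerpts."""
    shown = "\n".join(
        " ".join(f"{tok}({act:.1f})" for tok, act in zip(tokens, acts))
        for tokens, acts in examples
    )
    return ask_gpt4(
        "Below are text excerpts; each token is followed by a neuron's "
        f"activation in parentheses:\n{shown}\n"
        "In one sentence, what does this neuron respond to?"
    )


def simulate_activations(explanation: str, tokens: list[str]) -> list[float]:
    """Stage 2: predict activations (0-10) for new text from the explanation alone."""
    reply = ask_gpt4(
        f"A neuron is described as: {explanation}\n"
        "For each of the following tokens, output a predicted activation from "
        f"0 to 10, comma-separated, with nothing else: {' '.join(tokens)}"
    )
    return [float(x) for x in reply.split(",")][: len(tokens)]


def score_explanation(real: list[float], simulated: list[float]) -> float:
    """Stage 3: score = correlation between real and simulated activations."""
    return float(np.corrcoef(real, simulated)[0, 1])
```

In practice the real pipeline shows many more excerpts, parses the simulator's output far more robustly, and averages scores over held-out text; this sketch only conveys the shape of the loop.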
Unfortunately, GPT-4's automatically generated and scored explanations are not yet of much practical use for more complex models. The researchers suspect that later layers of the network, where explanations fare worst, encode behavior more complicated than short descriptions can capture. The average explanation score is still quite low, but OpenAI believes it can be raised as machine-learning techniques advance, for instance by using a more capable explainer model or by changing the architecture of the model being explained.
OpenAI is releasing code for generating and scoring explanations with publicly available models, visualization tools, and the dataset of GPT-4-written explanations for roughly 300,000 GPT-2 neurons. OpenAI hopes that other AI projects and the broader community will build on this work and contribute more effective methods for producing high-quality explanations.
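As a rough idea of how the released data might be explored, the sketch below loads per-neuron explanation records and lists the highest-scoring ones. The directory layout, file format, and field names here are hypothetical placeholders; consult the repository's README for the actual schema.

```python
import json
from pathlib import Path

# Hypothetical local layout after downloading the released explanation data;
# the real paths and record schema are documented in the repository.
DATA_DIR = Path("neuron_explanations")


def load_explanations(layer: int) -> list[dict]:
    """Load all explanation records for one GPT-2 layer (field names hypothetical)."""
    records = []
    for path in sorted(DATA_DIR.glob(f"layer_{layer}/*.json")):
        with open(path) as f:
            records.append(json.load(f))
    return records


# Example: print the ten best-scoring explanations in layer 0.
records = load_explanations(layer=0)
top = sorted(records, key=lambda r: r.get("score", 0.0), reverse=True)[:10]
for r in top:
    print(r.get("neuron"), round(r.get("score", 0.0), 3), r.get("explanation"))
```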
Challenges that can be overcome with additional research
- The method describes neuron behavior in short natural-language sentences, but some neurons may be too complex to capture this way. For instance, a neuron might be highly polysemantic (representing many distinct concepts) or might represent a single concept that humans do not understand or have words for.
- In the long run, researchers want to automatically find and explain the circuits of neurons and attention heads that implement complex behavior. The current approach explains neuron behavior only as a function of the original text input and says nothing about its downstream effects. For example, a neuron that fires on periods might be incrementing a sentence counter or signaling that the next word should begin with a capital letter.
- Explaining what a neuron does is not the same as explaining the mechanism behind it. High-scoring explanations may merely capture a correlation between input text and activation, and may therefore generalize poorly to out-of-distribution texts.
- The process as a whole is very computationally intensive.
The research suggests that these methods can help fill in parts of the bigger picture of how transformer language models work. By searching for interpretable directions in the residual stream, or by finding sets of explanations that jointly describe a neuron's behavior across its full distribution, they may also deepen our understanding of superposition. Explanations could be improved further through better tool use, conversational assistants, and chain-of-thought approaches. Researchers envision a future in which the explainer model generates, tests, and iterates on hypotheses much as a human interpretability researcher does today, including hypotheses about circuit function and anomalous behavior. Scaling to hundreds of millions of neurons and querying the resulting databases of explanations for commonalities would also enable a more macro-level view. Simpler applications could see progress quickly, such as detecting salient features in reward models or understanding the qualitative differences between a fine-tuned model and its base model.
The dataset and source code can be accessed at https://github.com/openai/automated-interpretability