Neural networks are often described as adaptable “feature extractors” that learn by progressively refining raw inputs into useful representations. That raises a question: which features are being represented, and how? To better understand how high-level, human-interpretable features are represented in the neuron activations of LLMs, a research team from the Massachusetts Institute of Technology (MIT), Harvard University, and Northeastern University proposes a technique called sparse probing.
In standard probing, researchers train a simple classifier (a probe) on a model’s internal activations to predict a property of the input, then examine the network to see whether and where it represents that feature. Sparse probing restricts the probing classifier to at most k neurons in its prediction, with k varied between 1 and 256. The team probes for over 100 features to pinpoint the relevant neurons. The approach addresses limitations of prior probing methods and sheds light on the intricate structure of LLMs.
The team uses state-of-the-art optimal sparse prediction techniques to solve the k-sparse feature selection subproblem to optimality for small k, which addresses the conflation of ranking accuracy with classification accuracy. Sparsity acts as an inductive bias: it gives the probes a strong simplicity prior and pinpoints key neurons for fine-grained examination. Because limited capacity prevents the probes from memorizing correlation patterns connected with the features of interest, the technique also gives a more reliable signal of whether a specific feature is explicitly represented and used downstream.
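To make the idea of a k-sparse probe concrete, here is a minimal sketch on synthetic data. It is not the team’s method: the neuron selection below is a simple mean-difference heuristic rather than optimal sparse prediction, the "activations" are random numbers standing in for one layer of a model, and every function name is hypothetical.

```python
import numpy as np


def select_top_k(X, y, k):
    """Keep the k neurons with the largest standardized class-mean gap.

    A heuristic ranking for illustration only, NOT the optimal sparse
    prediction technique described in the article.
    """
    mu1 = X[y == 1].mean(axis=0)
    mu0 = X[y == 0].mean(axis=0)
    score = np.abs(mu1 - mu0) / (X.std(axis=0) + 1e-8)
    return np.argsort(score)[::-1][:k]


def fit_logistic(X, y, steps=500, lr=0.5):
    """Plain gradient-descent logistic regression; returns (weights, bias)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))
        grad = p - y
        w -= lr * (X.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b


def sparse_probe_accuracy(X_train, y_train, X_test, y_test, k):
    """Train a probe that may only look at k neurons; report test accuracy."""
    idx = select_top_k(X_train, y_train, k)
    w, b = fit_logistic(X_train[:, idx], y_train)
    pred = (X_test[:, idx] @ w + b) > 0
    return idx, float((pred == y_test).mean())


# Synthetic stand-in for activations: 64 "neurons", where neurons 3 and 17
# (an arbitrary choice) carry a binary feature and the rest are noise.
rng = np.random.default_rng(0)
n, d = 2000, 64
y = rng.integers(0, 2, size=n)
X = rng.normal(size=(n, d))
X[:, 3] += 2.0 * y
X[:, 17] += 2.0 * y

X_train, X_test = X[:1500], X[1500:]
y_train, y_test = y[:1500], y[1500:]

results = {}
for k in (1, 2, 4, 8):
    idx, acc = sparse_probe_accuracy(X_train, y_train, X_test, y_test, k)
    results[k] = (sorted(idx.tolist()), acc)
    print(f"k={k}: neurons={results[k][0]}, test accuracy={acc:.3f}")
```

Sweeping k this way shows which neurons the probe relies on and how much accuracy each additional neuron buys, which is the kind of signal the article describes.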
The research group used autoregressive transformer LLMs in their experiments, reporting classification results after training probes with varying values of k. They draw the following conclusions:
- The neurons of LLMs contain a wealth of interpretable structure, and sparse probing is an efficient way to locate it (even in superposition). Still, it must be used cautiously and followed up with further analysis if rigorous conclusions are to be drawn.
- In early layers, many neurons activate for unrelated n-grams and local patterns, and features are encoded as sparse linear combinations of these polysemantic neurons. Weight statistics and insights from toy models also lead the authors to conclude that the first 25% of fully connected layers make extensive use of superposition.
- Although definitive conclusions about monosemanticity remain methodologically out of reach, seemingly monosemantic neurons, especially in middle layers, encode higher-level contextual and linguistic properties (such as is_python_code).
- Representation sparsity tends to increase as models get bigger, but the trend doesn’t hold across the board: with scale, some features emerge with dedicated neurons, others split into finer-grained features, and many either don’t change or appear somewhat randomly.
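The superposition finding in the list above can be illustrated with a toy construction. This is a hypothetical example, not data from the study: two made-up neurons each respond to a mix of a target feature and an unrelated one, so neither is a clean detector alone, yet a sparse linear combination of the two recovers the target exactly.

```python
import numpy as np


def best_threshold_accuracy(values, labels):
    """Best accuracy of any rule `value >= t` (or its negation) on one signal."""
    best = 0.0
    for t in np.unique(values):
        pred = values >= t
        acc = (pred == labels).mean()
        best = max(best, acc, 1.0 - acc)
    return float(best)


# All four combinations of a target feature f and an unrelated feature g.
f = np.array([0, 0, 1, 1])
g = np.array([0, 1, 0, 1])

# Two hypothetical polysemantic neurons: each one mixes both features.
neuron_a = f + g
neuron_b = f - g
labels = f.astype(bool)

acc_a = best_threshold_accuracy(neuron_a, labels)
acc_b = best_threshold_accuracy(neuron_b, labels)
acc_combo = best_threshold_accuracy(neuron_a + neuron_b, labels)
print(acc_a, acc_b, acc_combo)  # prints: 0.75 0.75 1.0
```

A probe limited to k=1 tops out at 75% here, while k=2 reaches 100%, which is why a jump in accuracy between small values of k can hint at a feature spread over several polysemantic neurons.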
A Few Benefits of Sparse Probing
- Probes with optimality guarantees further reduce the risk of conflating classification quality with ranking quality when investigating individual neurons.
- Sparse probes are designed to have low capacity, so there is less concern that the probe learns the task by itself rather than reading out what the model represents.
- Probing requires a supervised dataset. Once built, however, the dataset can be reused to interpret any model, which opens the door to research into questions such as the universality of learned circuits and the natural abstractions hypothesis.
- Instead of relying on subjective assessments, sparse probing can be used to automatically examine how different architectural choices affect the occurrence of polysemanticity and superposition.
Limitations of Sparse Probing
- Strong inferences can be drawn from probing experiments only after an additional, secondary investigation of the identified neurons.
- Because probing is sensitive to implementation details, anomalies, misspecifications, and misleading correlations in the probing dataset, it provides only limited insight into causation.
- Sparse probes cannot recognize features constructed across multiple layers, nor can they differentiate between features in superposition and features represented as the union of numerous distinct, more granular features.
- Sparse probing may miss some significant neurons because of redundancy in the probing dataset, so iterative pruning may be required to identify all of them.
- Multi-token features require specialized processing, commonly implemented with aggregations that can further dilute the specificity of the result.
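The dilution caused by aggregating over multi-token spans can be seen in a small sketch. The function name, the span format, and the choice of mean and max as aggregations are all assumptions made for illustration; the article does not say which aggregations the team used.

```python
import numpy as np


def aggregate_spans(token_acts, spans, how="mean"):
    """Collapse per-token activations into one vector per span.

    token_acts: array of shape (num_tokens, num_neurons)
    spans: list of (start, end) token index pairs, end exclusive
    how: "mean" or "max" (illustrative choices, not the study's)
    """
    reducers = {"mean": np.mean, "max": np.max}
    if how not in reducers:
        raise ValueError(f"unknown aggregation: {how}")
    reducer = reducers[how]
    return np.stack([reducer(token_acts[s:e], axis=0) for s, e in spans])


# Four tokens, two neurons. Neuron 0 fires strongly on the second token only.
token_acts = np.array([
    [0.0, 1.0],
    [4.0, 1.0],
    [0.0, 1.0],
    [2.0, 3.0],
])
spans = [(0, 3), (3, 4)]  # one three-token span, one single-token span

mean_feats = aggregate_spans(token_acts, spans, "mean")
max_feats = aggregate_spans(token_acts, spans, "max")
print(mean_feats)
print(max_feats)
```

For the three-token span, the mean shrinks neuron 0’s activation from 4.0 to about 1.33, while the max keeps the peak but discards where in the span it occurred. Either way, the probe sees a blurrier signal than the per-token activations.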
Using sparse probing, the work uncovers a wealth of rich, human-understandable structure in LLMs. The researchers plan to build an extensive repository of probing datasets, possibly with the help of AI, that capture features especially pertinent to bias, fairness, safety, and high-stakes decision-making. They encourage other researchers to join in exploring this “ambitious interpretability” and argue that an empirical approach evocative of the natural sciences can be more productive than typical machine learning experimental loops. Besides automating the assessment of new models, vast and diverse supervised datasets will allow for better evaluation of the next generation of unsupervised interpretability techniques that will be required to keep up with AI advancement.
Dhanshree Shenwai is a Computer Science Engineer with experience at FinTech companies across the financial, card payments, and banking domains, and a keen interest in applications of AI. She is enthusiastic about exploring new technologies and advancements that make everyone’s life easier in today’s evolving world.