Mechanistic Interpretability - Decoding Neural Networks Might Need a Physics Degree - Part 1
As someone trained in the traditional hard sciences, I have always been bothered by how people do “science” in computer science. It reminds me of the rivalry we physicists have had with chemists and biologists over how their “science” is just empirical, “throw things at the wall and see what sticks” research. Of course, as I grew older in physics and started to face more and more complex problems, my point of view became more modest: I kept running into genuinely interesting problems that can’t be solved with a nice, compact formula and can only be tackled by numerical simulations and data-driven approaches.
One of the things you sacrifice as you move to more complex systems, as chemists, biologists, computer scientists, and physicists all do, is the ability to neatly explain how the system behaves and why. Fortunately, most of these sciences place real value on causal inference, which means we still end up with much more generalizable and transparent statements about nature. In computer science, however, a mechanistic understanding of how systems behave is at best a secondary consideration.
Yet the need for transparency and explainability in complex neural nets is urgent, especially given the exponential adoption of LLMs and the agentic systems they enable, and particularly in high-stakes domains like healthcare and finance, where trust hinges on understanding the reasoning behind AI-driven choices.
So you can imagine my delight when I heard about Neel Nanda’s work on MLST (Machine Learning Street Talk).
A core challenge in this pursuit lies in how neural networks encode information. In an ideal world, each neuron would represent a single, well-defined concept (monosemantic); in reality, for efficiency, neurons represent multiple overlapping ideas (polysemantic)! This boosts efficiency but complicates interpretability, forcing a trade-off between performance and transparency. According to Neel, this is where Mechanistic Interpretability (MI) comes in: a systematic approach that dissects neural networks much like physics dissects the natural world.
In this article series, we’ll explore, through a physics lens, how I understand the frameworks used to advance MI towards transparent AI.
1. Introduction to Mechanistic Interpretability
Mechanistic Interpretability (MI) is an emerging field focused on reverse-engineering neural networks to understand how they operate at a fundamental level. We can imagine it as disassembling a complex machine - like a car engine - to examine each gear, spring, and bolt, observing how they interact to produce the final output. Similarly, MI seeks to decode neural networks layer by layer, neuron by neuron, to identify the specific features they recognize, the circuits that process information, and the interpretability bases that map these abstract computations to human-understandable concepts. The goal is not just to observe what a model does but to explain how and why it does it, down to the smallest actionable components.
Foundational Concepts
To understand MI, it’s important to first grasp the foundational concepts that describe how neural networks store and process information:
Features:
Features are the building blocks of a neural network’s understanding. They represent specific attributes or patterns in the input data that the model has learned to detect. For example:
In image recognition, a feature might be a horizontal edge, a texture like fur or scales, or even higher-level concepts like "eyes" or "wheels."
In language models, features could correspond to grammatical structures (e.g., verb tenses), semantic categories (e.g., "scientific terms" or "emotional language"), or even abstract relationships (e.g., cause-and-effect).
Features are not hand-coded by humans; they emerge organically during training as the model optimizes to solve its task.
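To make “feature” a little more concrete, here is a minimal sketch of a hand-written horizontal-edge detector. In a trained vision model the analogous filter would be learned during training rather than written by hand; the image and kernel below are my own toy example.

```python
import numpy as np

# A toy "horizontal edge" feature detector, written by hand for illustration.
# In a real CNN an equivalent filter would emerge from training.
horizontal_edge = np.array([[ 1,  1,  1],
                            [ 0,  0,  0],
                            [-1, -1, -1]], dtype=float)

# A tiny 6x6 image: bright top half, dark bottom half -> one horizontal edge.
image = np.vstack([np.ones((3, 6)), np.zeros((3, 6))])

def conv2d_valid(img, kernel):
    """Naive 'valid' 2D cross-correlation, enough for the demo."""
    kh, kw = kernel.shape
    out = np.zeros((img.shape[0] - kh + 1, img.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

response = conv2d_valid(image, horizontal_edge)
print(response)  # large values only in the rows spanning the bright/dark boundary
```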
Circuits:
These are groups of a model’s weights and non-linearities that connect one set of features to another. Think of circuits as the pathways that determine how information flows and is processed within the network. For instance:
A circuit in a vision model might link a feature for "edges" to a feature for "shapes," which then activates a feature for "faces."
In a language model, a circuit could route a feature for "question words" (e.g., who, what, where) to a feature for "answer structure," ensuring the response matches the query.
Crucially, circuits are not just linear chains of neurons - they involve non-linear transformations (e.g., activation functions like ReLU) and interactions between multiple layers.
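As a cartoon of a circuit, here is a hypothetical sketch where two low-level features are combined, through a weight matrix and a ReLU, into a higher-level feature. The feature names and weights are invented purely for illustration; in a real model they would be discovered, not chosen.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

# Hypothetical low-level feature activations produced by an earlier layer.
features_in = np.array([0.9, 0.8, 0.1])  # ["vertical edge", "round texture", "blue color"]

# Hand-picked weights standing in for a learned circuit: the "wheel" feature
# reads strongly from "vertical edge" and "round texture" and ignores color.
W = np.array([[1.2, 1.0, 0.0]])   # one output feature, three input features
b = np.array([-0.5])

features_out = relu(W @ features_in + b)
print(features_out)  # > 0 only when both upstream features are active enough
```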
Interpretability Bases:
Interpretability bases are mathematical tools that help researchers "decode" a model’s internal activations. Neural networks process data in high-dimensional spaces (e.g., thousands of dimensions), which are inherently unintuitive to humans. Interpretability bases project these activations onto specific directions in the space that correspond to human-interpretable features.
For example, in a sentiment analysis model, one direction in the activation space might align with "positive sentiment," while another aligns with "negative sentiment." By analyzing these bases, researchers can quantify how much each interpretable feature contributes to the model’s predictions.
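A minimal sketch of that last point: assuming we somehow already have a unit-norm “positive sentiment” direction for a hypothetical model (finding such a direction is the hard part; here it is simply random), reading off the feature is just a projection.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model = 768                                  # width of the hypothetical hidden space
activation = rng.normal(size=d_model)          # one token's hidden activation
positive_dir = rng.normal(size=d_model)        # made-up "positive sentiment" direction
positive_dir /= np.linalg.norm(positive_dir)   # normalize so projections are comparable

# How strongly does this activation point along the interpretable direction?
sentiment_score = activation @ positive_dir
print(f"projection onto 'positive sentiment' direction: {sentiment_score:.3f}")
```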
Neurons vs. Layers
Neurons: Individual units that activate in response to specific input patterns (e.g., a neuron in a vision model firing for diagonal edges).
Layers: Hierarchical collections of neurons. Early layers detect simple patterns (edges, textures), while deeper layers assemble these into complex concepts (objects, sentences).
Attention Heads (in Transformers)
Transformers, which power modern language models, process data using attention heads - specialized sub-circuits that determine which parts of the input to prioritize. Each head can be thought of as a "mini-circuit" with a specific role:
Query-Key-Value Operations: Attention heads compute relationships between words (e.g., linking pronouns like "he" to their antecedents).
Specialization: Some heads focus on syntax (e.g., subject-verb agreement), while others track semantic coherence (e.g., using context to decide whether "bank" refers to a riverbank or a financial institution).
Why this matters: Reverse-engineering attention heads is a cornerstone of MI in transformers, as their behavior directly impacts model outputs.
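To ground the query-key-value description, here is a bare-bones single attention head in numpy. It is a sketch of the standard scaled dot-product formulation, not the code of any particular model, and the sequence and dimensions are made up.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_q, W_k, W_v):
    """One attention head: each token builds a query, scores every token's key,
    and returns a weighted mix of the values."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # token-to-token relevance
    weights = softmax(scores, axis=-1)        # which tokens to "look at"
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 16, 4           # a 5-token toy sequence
X = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, weights = attention_head(X, W_q, W_k, W_v)
print(weights.round(2))  # row i shows how much token i attends to every other token
```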
Superposition
Neural networks often use superposition - a phenomenon where a single neuron or activation encodes multiple unrelated features. For example, a neuron might activate for both "cat ears" and "scientific terminology" in a multimodal model.
Polysemantic Neurons: Neurons that respond to many distinct features (common in large models, where there are more features to represent than neurons and most features are only sparsely active).
Monosemantic Neurons: Neurons that activate for a single, specific feature (rarer but easier to interpret).
Why this matters: Superposition complicates MI by obfuscating the "clean" mapping between neurons and features, requiring advanced techniques to disentangle overlapping signals.
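Here is a toy sketch of superposition (a simplified version of the kind of toy-model setup used in the MI literature): five sparse features are squeezed into a three-dimensional space, so reading any one of them back picks up interference from the others.

```python
import numpy as np

rng = np.random.default_rng(0)

n_features, d_hidden = 5, 3                 # more features than dimensions
# Each feature gets a random (nearly-but-not-quite orthogonal) direction.
W = rng.normal(size=(n_features, d_hidden))
W /= np.linalg.norm(W, axis=1, keepdims=True)

features = np.array([1.0, 0.0, 0.0, 0.0, 1.0])   # only features 0 and 4 are "on"
hidden = features @ W                            # compressed representation

recovered = hidden @ W.T                         # read each feature back out
print(recovered.round(2))
# Features 0 and 4 come back strongest, but the "off" features are no longer
# exactly zero -- that cross-talk is the interference superposition creates.
```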
Activation Functions
These mathematical operations (e.g., ReLU, sigmoid) determine how neurons transform inputs into outputs. In MI, they act as "gates" that shape information flow:
Non-Linearity: Functions like ReLU introduce non-linear decision boundaries, enabling networks to learn complex patterns.
Saturation: Functions like sigmoid can "saturate" (outputting values pinned near 0 or 1), which MI researchers study to identify when a circuit stops responding to input variations.
Why this matters: Activation functions define the "rules" for how circuits combine features, influencing everything from robustness to adversarial attacks to generalization.
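A quick numerical illustration of the non-linearity and saturation points above:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.array([-10.0, -1.0, 0.0, 1.0, 10.0])

print(relu(x))     # [ 0.  0.  0.  1. 10.]  -> hard gate: negative inputs are cut off
print(sigmoid(x))  # [~0.00, 0.27, 0.50, 0.73, ~1.00]
# At |x| ~ 10 the sigmoid is saturated: further input changes barely move the
# output, which is exactly the "circuit stops responding" regime mentioned above.
```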
Probing vs. Intervening
Two key methodologies in MI:
Probing: Training a simple model (e.g., linear classifier) on a network’s activations to test if a specific feature (e.g., "sentiment") is present in its representations.
Intervening: Actively modifying activations (e.g., ablating a neuron, amplifying a circuit) to observe causal effects on outputs. For example, silencing a circuit might reveal it was responsible for suppressing biased language.
Why this matters: Probing identifies correlations ("Feature X is here"), while intervening establishes causality ("Circuit Y causes Behavior Z").
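Here is a hedged sketch of both methodologies on a small made-up PyTorch model. The model, the choice of layer, the “sentiment” labels, and the unit being ablated are all hypothetical; real MI work applies the same two moves to actual trained networks.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in network; in practice this would be a real trained model.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
X = torch.randn(200, 16)
y = (X[:, 0] > 0).long()                  # pretend feature label, e.g. "positive sentiment"

# --- Probing: can a linear classifier read the feature off the activations? ---
with torch.no_grad():
    acts = model[1](model[0](X))          # hidden activations after the ReLU
probe = nn.Linear(32, 2)
opt = torch.optim.Adam(probe.parameters(), lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    nn.functional.cross_entropy(probe(acts), y).backward()
    opt.step()
print("probe accuracy:", (probe(acts).argmax(-1) == y).float().mean().item())

# --- Intervening: silence one hidden unit and watch the output change. ---
def ablate_unit(module, inputs, output):
    output = output.clone()
    output[:, 7] = 0.0                    # hypothetical unit #7, chosen arbitrarily
    return output                         # returning a value overrides the layer's output

with torch.no_grad():
    clean_logits = model(X)
    handle = model[1].register_forward_hook(ablate_unit)
    ablated_logits = model(X)
    handle.remove()
print("mean |logit change| from ablation:",
      (clean_logits - ablated_logits).abs().mean().item())
```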
Causal Scrubbing
A technique to validate hypothesized circuits by "scrubbing" certain activations (replacing them, for example, with activations resampled from other inputs) and observing whether the model’s output degrades. If the hypothesis is correct, scrubbing should disrupt specific behaviors (e.g., failing math problems if a "number detection" circuit is scrubbed).
Why this matters: Causal scrubbing bridges the gap between observational and experimental science in MI, enabling rigorous falsification of theories.
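Causal scrubbing as actually practiced is considerably more involved (activations are swapped between inputs that the hypothesis claims are interchangeable), but the following self-contained sketch of a resample-style ablation captures the core move under those simplifying assumptions; the model and layer are again hypothetical.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical model and layer; real causal scrubbing targets a hypothesized
# circuit inside an actual trained network.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
X = torch.randn(64, 16)

# The core "scrub": replace the layer's activations on these inputs with the
# activations it produced on other, shuffled inputs. If the behavior we care
# about survives, this layer's specific values didn't matter for it; if it
# degrades, they did.
def resample_hook(module, inputs, output):
    perm = torch.randperm(output.shape[0])
    return output[perm]           # swap activations between unrelated inputs

with torch.no_grad():
    clean_logits = model(X)
    handle = model[1].register_forward_hook(resample_hook)
    scrubbed_logits = model(X)
    handle.remove()

print("mean |change| in logits:", (clean_logits - scrubbed_logits).abs().mean().item())
```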
How MI Differs from General Interpretability
While general interpretability aims to provide broad explanations of model behavior (e.g., "The model classifies cats by focusing on fur texture"), MI demands a mechanistic, step-by-step account. It asks questions like:
Which exact neurons detect "fur texture"?
How do these neurons communicate with others to trigger the "cat" classification?
What happens if we disrupt this circuit?
This granular approach allows researchers to rigorously test hypotheses about a model’s behavior, similar to how a biologist might study a cell by isolating and manipulating individual proteins. By contrast, general interpretability methods (e.g., attention visualization or feature importance scores) often provide correlational insights rather than causal explanations.
2. Drawing Parallels with Physics
To understand mechanistic interpretability (MI), it helps to borrow frameworks from physics - a field that has spent centuries decoding the universe’s most complex systems. Physics and MI share a common goal: to explain how systems work at their most fundamental level. Whether studying particles or neural networks, both fields rely on observation, hypothesis, and experimentation to move from mystery to mechanistic understanding.
Just as physicists decompose natural phenomena into fundamental principles, MI researchers deconstruct neural networks into interpretable components. This process mirrors the scientific method:
Observation
In physics: Galileo observed pendulum swings to infer laws of motion; astronomers mapped planetary orbits to deduce gravity’s role.
In MI: Researchers track how neurons activate when a model processes inputs. For example, in a vision model, you might notice a neuron firing every time the input contains a spiral shape (like a galaxy or a seashell).
Hypothesis
In physics: Newton proposed that gravity governs both falling apples and orbiting moons.
In MI: A researcher hypothesizes that a specific circuit in a language model resolves pronouns (e.g., linking “it” to “the cat” in the sentence “The cat sat down because it was tired”).
Testing and Validation
In physics: Young’s double-slit experiment tested whether light behaves as a wave or particle by observing interference patterns.
In MI: To validate the pronoun-resolution hypothesis, researchers might “ablate” (disable) the suspected circuit. If the model then fails to link “it” to “the cat,” the hypothesis gains support.
This iterative cycle allows MI to build causal explanations, much like physics constructs theories to predict celestial motion or particle interactions.
Physics Concepts as Tools for MI
Beyond methodology, specific principles from physics illuminate how neural networks operate:
Classical Mechanics and Deterministic Systems
Classical mechanics predicts outcomes from initial conditions. For example, knowing a ball’s position and velocity lets you calculate its trajectory.
MI parallel: MI researchers trace input-to-output pathways in neural nets looking for ones that behave deterministically in response to particular input properties, much like calculating a ball’s path.
Example: If a vision model always activates Neuron #512 when it “sees” a cat’s eye, you can reverse-engineer how this neuron contributes to the final “cat” classification.
Superposition
In wave mechanics (quantum mechanics, electromagnetism, etc.), waves and states can be superposed: a single system carries several component states at once, and those components interfere with one another.
MI parallel: Polysemantic neurons activate for multiple unrelated features. For instance, a single neuron might fire for both “cat ears” and “mathematical integrals,” creating ambiguity.
Why it matters: Just as measuring a quantum particle collapses its state, intervening on a polysemantic neuron (e.g., silencing it) can disrupt seemingly unrelated model behaviors.
Statistical Mechanics and Emergent Behavior
Macroscopic phenomena like temperature emerge from countless microscopic interactions (e.g., molecules colliding).
MI parallel: High-level model capabilities (e.g., storytelling) emerge from low-level neuron interactions. No single neuron “knows” grammar, but circuits across layers collaborate to enforce syntax.
Example: A language model’s ability to write poetry isn’t stored in one neuron; it arises from how circuits combine words, rhythms, and emotions.
Symmetry Principles
Physical laws often remain unchanged under transformations (e.g., rotating a system doesn’t alter its energy conservation).
MI parallel: Convolutional Neural Networks (CNNs) exploit translational invariance: they detect edges or textures regardless of their position in an image.
Example: A CNN trained to recognize cats will identify a cat’s ear whether it’s in the top-left or bottom-right corner of an image.
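A tiny demo of that weight-sharing idea, using a hand-written 1D edge filter (invented for illustration): the same weights are applied at every position, so the response simply shifts along with the pattern.

```python
import numpy as np

# The same 1D edge filter is applied at every position (weight sharing), so it
# detects the pattern no matter where it occurs -- shift the input and the
# response simply shifts with it.
edge_filter = np.array([-1.0, 1.0])

signal_left  = np.array([0, 0, 1, 1, 1, 0, 0, 0, 0, 0], dtype=float)  # edge near the start
signal_right = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0], dtype=float)  # same edge, shifted

def conv1d_valid(x, w):
    return np.array([np.dot(x[i:i + len(w)], w) for i in range(len(x) - len(w) + 1)])

print(conv1d_valid(signal_left, edge_filter))
print(conv1d_valid(signal_right, edge_filter))
# The +1 (rising edge) and -1 (falling edge) responses appear in both cases,
# just at shifted positions.
```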
Perturbation Theory
Physicists study systems by applying small perturbations (e.g., nudging a particle) to observe responses.
MI parallel: Researchers tweak neuron activations to test causality. For example, amplifying a “positive sentiment” neuron in a language model should make its output more optimistic.
Example: If silencing a circuit reduces a model’s accuracy on math problems, you’ve likely found a “number reasoning” module.
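To make the “amplify a positive-sentiment neuron” idea concrete, here is a hedged sketch of a small perturbation applied along one direction of a hypothetical model’s hidden activations. The steering direction and scale are invented; in real work both would come from careful analysis of an actual model.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for a trained model's hidden layer.
model = nn.Sequential(nn.Linear(64, 128), nn.ReLU(), nn.Linear(128, 10))
hidden = model[1]

steering_dir = torch.randn(128)            # made-up "positive sentiment" direction
steering_dir /= steering_dir.norm()

def nudge(module, inputs, output):
    # Small perturbation along one interpretable direction, analogous to
    # nudging a physical system and watching how it responds.
    return output + 2.0 * steering_dir

x = torch.randn(4, 64)
with torch.no_grad():
    baseline = model(x)
    handle = hidden.register_forward_hook(nudge)
    perturbed = model(x)
    handle.remove()

print((perturbed - baseline).abs().mean().item())  # how much the output moved
```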
In the remaining parts of this series, we will look at how MI overlaps with physics and where it might go next.