When the rules change, how does the brain know what to pay attention to? Turns out dopamine isn't just tracking rewards — it's tracking which features of the world even matter.
**Dopamine signals in the nucleus accumbens dynamically encode reward prediction errors based on whichever task features are currently relevant**: direction in a deterministic world, trial history in a complex one, and spatial location in a probabilistic one. Mice don't just learn values; they learn *what to value*.
Picture a toddler sorting blocks. One game rewards color. The next rewards shape. The kid figures it out fast — not because they learned new values, but because they switched what they were paying attention to in the first place.
That cognitive trick — deciding which features of the world are even worth tracking — is called representation learning. And for decades, neuroscience mostly ignored it. We knew dopamine tracked reward prediction errors. We knew it drove learning. But nobody had shown dopamine itself changes what it's computing based on context.
This paper changes that. A team in Paris put mice through three different reward games, each with completely different rules, and watched dopamine signals in real time. What they found wasn't just that mice adapted their behavior. The dopamine signal itself rewrote its own logic — encoding different features of the task depending on which game was being played.
The setup: a circular open field with three reward locations. Mice collected intracranial self-stimulation rewards by visiting the locations in sequence. Simple enough. Then the rules changed — twice.
N=49 mice (23 male, 26 female) ran through all three rules sequentially. Each rule lasted 15-20 daily sessions. Reward was delivered via electrode stimulation of the medial forebrain bundle — a direct hit to the brain's reward circuitry. No food deprivation, no external cues. Pure self-generated, goal-directed behavior.
In the Deterministic (Det) rule, every location always paid out. Mice quickly settled into smooth circular laps, keeping U-turns low (around 20%). In the Complex (Cplx) rule, reward only came if the mouse's recent choice sequence was sufficiently variable — repetitive circles got punished with omissions. Mice had to randomize. In the Probabilistic (Proba) rule, the three locations paid out at 100%, 50%, and 25% respectively. Now the smart play was to exploit the best location, driving U-turns way up.
Three rules. Three completely different optimal strategies. And the mice found all three. Success rates, U-turn rates, and sequence complexity all shifted significantly across contexts — this wasn't noise, it was adaptation.
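To make the contrast concrete, here's a minimal sketch of the three payout rules. Everything in it is a stand-in: the function names are made up, and the variability test for Cplx is a placeholder for whatever complexity criterion the experimenters actually used.

```python
import random

# Hypothetical sketch of the three reward rules. Targets are indexed 0-2.

def reward_det(target, history):
    # Det: every visited location always pays out.
    return True

def reward_cplx(target, history, window=6, min_transitions=4):
    # Cplx: pay only if the recent choice sequence is variable enough.
    # The criterion here (distinct transitions in a sliding window) is a
    # placeholder for the paper's actual complexity measure.
    recent = history[-window:]
    if len(recent) < window:
        return True
    transitions = {(a, b) for a, b in zip(recent, recent[1:])}
    return len(transitions) >= min_transitions

def reward_proba(target, history, p=(1.0, 0.5, 0.25)):
    # Proba: the three locations pay out at 100%, 50%, and 25%.
    return random.random() < p[target]
```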
To watch dopamine in real time, the team injected an AAV expressing GRABDA2m into the nucleus accumbens lateral shell and implanted a fiber optic above it. Every reward visit, every omission — captured as a fluorescence trace. Rewards drove peaks. Omissions drove dips. Classic reward prediction error territory.
But here's where it gets interesting. The amplitude of those peaks wasn't fixed. Unexpected rewards — delivered off-target during the task, or in a rest cage after conditioning — triggered larger DA peaks than expected rewards. The signal was being modulated by learned expectation, not just by the stimulation itself.
Then the team asked: which task features are actually driving those fluctuations? They ran generalized linear models on the DA data, testing predictors like movement direction, previous trial outcome, and target identity. The answer was different for every rule.
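To get a feel for the approach, here's a rough sketch of that kind of trial-by-trial regression in Python with statsmodels. The column names and toy numbers are invented; the paper's exact GLM specification and predictor set aren't reproduced here.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical trial table: one row per reward-site visit, with the
# amplitude of the DA transient and the candidate task features.
trials = pd.DataFrame({
    "da_amplitude": [1.2, 0.8, 1.5, 0.4, 1.1, 0.9, 1.3, 0.6],
    "direction":    ["forward", "uturn", "uturn", "forward",
                     "forward", "uturn", "forward", "uturn"],
    "prev_outcome": ["reward", "omission", "reward", "reward",
                     "omission", "reward", "reward", "omission"],
    "target":       ["A", "B", "C", "A", "B", "C", "A", "B"],
})

# Which task features predict trial-by-trial DA amplitude?
# Categorical predictors are dummy-coded; the default Gaussian family
# makes this equivalent to an ordinary linear regression.
fit = smf.glm(
    "da_amplitude ~ C(direction) + C(prev_outcome) + C(target)",
    data=trials,
).fit()
print(fit.summary())
```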
In Det: direction was the dominant predictor (p = 0.0012). U-turns drove bigger DA peaks than forward movements. In Cplx: previous trial outcome dominated (p = 0.0003) — a reward following an omission hit harder than a reward following a reward. In Proba: target probability drove the signal, with higher-probability locations producing smaller peaks and deeper dips. The feature the dopamine system was tracking had completely changed.
Dopamine doesn't just encode "reward happened." It encodes "reward happened, given what I expected, based on the feature I'm currently tracking" — and that last part is flexible.
Observing that different features predict DA is suggestive. Proving it requires a formal model. The team built three reinforcement learning agents, each with a different hypothesis about what the mouse was tracking.
Model 1 (M1) treated every trial the same — one value, updated after every visit, regardless of direction or location. Model 2 (M2) maintained separate values for forward movements and U-turns. Model 3 (M3) kept independent values for each of the three spatial targets.
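The three hypotheses boil down to how many value slots the learner keeps and which slot a given visit updates. A minimal sketch, with a single hypothetical learning rate and none of the paper's actual parameterization:

```python
ALPHA = 0.1  # learning rate (hypothetical value)

def delta_update(values, key, outcome):
    # Generic delta rule: compute the reward prediction error (RPE),
    # nudge the tracked value toward the outcome, and return the RPE.
    rpe = outcome - values[key]
    values[key] += ALPHA * rpe
    return rpe

# M1: one value for every visit, regardless of direction or target.
m1_values = {"all": 0.0}
# M2: separate values for forward movements and U-turns.
m2_values = {"forward": 0.0, "uturn": 0.0}
# M3: independent values for each of the three reward locations.
m3_values = {"A": 0.0, "B": 0.0, "C": 0.0}

# For each visit in the mouse's actual choice sequence (outcome = 1 for
# reward, 0 for omission):
#   rpe_m1 = delta_update(m1_values, "all", outcome)
#   rpe_m2 = delta_update(m2_values, direction, outcome)
#   rpe_m3 = delta_update(m3_values, target, outcome)
```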
Each model generated theoretical reward prediction errors from the mice's actual choice sequences. Those RPEs were then used to fit the observed DA data. The result was clean: only one model explained DA variation in each rule, and the winning model changed across rules.
In Det, only M2 was significant (p = 0.0024). In Cplx, only M1 (p = 0.0044). In Proba, only M3 (p = 0.0111). The other two models contributed nothing. This wasn't a gradual blend — it was a switch. And the team validated it with a clever manipulation: at the end of Proba, they changed the p100 location to p50. Mice kept expecting full reward there, and omissions triggered even deeper DA dips than at the actual p50 location — the old representation was still running.
Three rules, three winning models: action-based (M2) in Det, trial-based (M1) in Cplx, state-based (M3) in Proba. The dopamine system wasn't just learning values — it was selecting the right cognitive framework to learn from.
The final piece was tracking how these representation switches unfolded over time. The team extended their RL-plus-GLM pipeline across every phase of every rule, carrying the final learned values from one phase over as the initial values of the next, mimicking how a real brain carries knowledge forward.
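A sketch of that hand-off, with toy data and an invented replay helper; the only point it's making is that the value tables persist across the rule change instead of being reset.

```python
ALPHA = 0.1  # learning rate (hypothetical)

def replay_rpes(values, visits):
    # Replay one phase's actual visit sequence through a model's value
    # table and collect the prediction errors it would have generated.
    rpes = []
    for key, outcome in visits:
        rpe = outcome - values[key]
        values[key] += ALPHA * rpe
        rpes.append(rpe)
    return rpes

# Toy visit sequences (target, outcome) for the state-based model M3.
phases = {
    "Det":   [("A", 1), ("B", 1), ("C", 1)],
    "Cplx":  [("A", 1), ("A", 0), ("B", 1)],
    "Proba": [("A", 1), ("B", 0), ("C", 0)],
}

m3_values = {"A": 0.0, "B": 0.0, "C": 0.0}  # one table, never reset
for rule, visits in phases.items():
    rpes = replay_rpes(m3_values, visits)
    # ...here the recorded DA for this phase would be regressed on these
    #    RPEs (and on those from M1 and M2) to get per-model weights...
    print(rule, {k: round(v, 2) for k, v in m3_values.items()})
```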
The pattern was striking. The M2 weight (action-based) rose through the end of Det. At the transition to Cplx, it collapsed, and M1 (trial-based) took over for the entire Cplx block. Then, across Proba sessions, M3 (state-based) climbed steadily to dominance. The transitions between rules were the moments of maximum discrepancy — the animal's old model was suddenly wrong, and the dopamine dips that followed may have been the signal that forced a search for a new representation.
In the Proba rule, the behavioral and neural changes were tightly coupled. As mice learned to exploit the high-probability target, the difference in DA responses between high- and low-probability locations grew. Larger DA divergence between targets correlated with higher success rates and stronger exploitation across both individuals and sessions. The dopamine signal wasn't just reflecting the strategy — it was tracking its emergence in real time.
The Cplx rule told a different story. DA showed a persistent effect of previous trial outcome throughout the entire rule, but that gap never grew or correlated with any behavioral parameter. Mice improved their performance by spreading out their choices, not by using local omission signals as a heuristic. The dopamine signal was stable; the behavior adapted around it through a different mechanism entirely.