Model sharing in the human medial temporal lobe

Effective planning involves knowing where different actions take us. However, natural environments are rich and complex, leading to an exponential increase in memory demand as a plan grows in depth. One potential solution is to filter out features of the environment irrelevant to the task at hand. This enables a shared model of transition dynamics to be used for planning over a range of different input features. Here, we asked human participants (13 male, 16 female) to perform a sequential decision-making task, designed so that knowledge should be integrated independently of the input features (visual cues) present in one case but not in another. Participants efficiently switched between using a low (cue independent) and a high (cue specific) dimensional representation of state transitions. fMRI data identified the medial temporal lobe as a locus for learning state transitions. Within this region, multivariate patterns of BOLD responses as state associations changed (via trial-by-trial learning) were less correlated between trials with differing input features in the high compared to the low dimensional case, suggesting that these patterns switched between separable (specific to input features) and shared (invariant to input features) transition models. Finally, we show that transition models are updated more strongly following the receipt of positive compared to negative outcomes, a finding that challenges conventional theories of planning. Together, these findings propose a computational and neural account of how information relevant for planning can be shared and segmented in response to the vast array of contextual features we encounter in our world.


This work is licensed under a CC BY 4.0 International license.
Correspondence: n.garrett@uea.ac.uk, leonie.glitz@psy.ox.ac.uk.

Competing interests
The authors declare no competing interests.

Introduction
Effective goal-directed behaviour requires an agent to learn an accurate model of the world. Theories of reinforcement learning (RL) conceive of this model as a function p(s′|s, a) that encodes the probability of transitioning to a new state, s′, given the current state, s, and action, a. Explicitly learning a state transition function permits agents to plan over possible futures (Sutton and Barto, 1998). This computational framework has been widely used to model simple laboratory behaviours that involve a limited number of state transitions (Doll et al., 2015, Gläscher et al., 2010, Wunderlich et al., 2012). However, it has well-known limitations, foremost among which is that computational cost grows exponentially with the number of states.
One way agents can reduce this computational cost is to selectively discard information, such as intervals of time (minutes, hours, days, etc.) and sensory cues, that can be used to segment experiences into separate states (Niv, 2019). For example, when planning a journey to work, travel delays when travelling by car (traffic jams), rail (train track repairs) or bike (getting wet) can all change from day to day. One way to reduce the cost of planning is to share knowledge of travel delays over multiple days where this is appropriate. For example, train delays might be invariant to whether one is travelling on a weekday or at the weekend.
Here, we developed an experimental paradigm that allowed us to test how the brain adapts the state representations it uses in order to plan efficiently. Our first question was whether participants would flexibly adapt how information was recruited and updated, switching between low (cue independent) and high (cue specific) dimensional representations. Our question and approach here are similar to those described in a recent paper by Baram and colleagues (Baram et al., 2021), with a key difference being that our work examines how the transition function (how states of the world are associated), rather than the value function (the value of states and actions), is shared across or kept specific to the presence of different sensory cues. Our second question was posed at the neural level and addressed by recording fMRI data whilst participants performed the task. We focussed on the medial temporal lobe (MTL), which has previously been shown to be important for forming new associations between states (Eichenbaum et al., 1999, Miyashita, 1988, Yokose et al., 2017, Rey et al., 2018) and involved in bridging past memories to make new inferences on the basis of paired associations or transitive relations (Bunsey and Eichenbaum, 1996, Wimmer and Shohamy, 2012, Zeithamova et al., 2012, Kumaran et al., 2016, Koster et al., 2018, Park et al., 2019). We find that a cluster of regions in the MTL, including the hippocampus, amygdala and entorhinal cortex, display patterns of BOLD activity encoding transition probabilities that are more similar between sensory cues when model sharing is possible compared to when it is not. This suggests that the MTL maintains separable encoding patterns corresponding to each sensory cue in cases where state associations are cue specific, but uses a single cue independent encoding pattern when they are not.
Finally, we designed our paradigm such that the transition function (the probability of moving from s to s′ under action a) and the value function (when in state s, the expected value of taking the action a that led to s′) were theoretically independent. This allowed us to ask whether state transition learning depends on whether an outcome is positive or negative. We show that belief updating of state transition knowledge occurs to a greater degree following positive outcomes compared to negative. This learning asymmetry is reflected by an interaction in the MTL whereby state prediction errors are expressed with greater fidelity for positive compared to negative outcomes. These findings nuance conventional models of planning that assume state transitions and outcomes are tracked and maintained separately from one another.

... and the block type they were in (dependent/independent) was clearly signalled to them at the start of a new block of trials. The context of the current trial was signalled to participants by the colour of a gemstone presented in the centre of the screen (either green, yellow, blue or red). The assignment of gemstone to context was different for each participant but (after assignment) remained the same throughout the experiment. Alongside this contextual cue, during door and response presentation participants were also shown a stimulus (either swag bag or police) indicating whether they would receive a gain (if swag bag shown) or incur a loss (if police shown) if they reached the heist state (this changed randomly on every trial). They were also shown a sofa stimulus which indicated they would get 0 upon reaching the sofa state (this was the case in every trial). Because whether a gain or a loss was possible in the heist state changed from trial to trial and was signalled to participants, participants would want to try to reach the heist state on 50% of trials (when the swag bag was shown) and avoid this state (i.e. reach the sofa state) on the other 50% (when the police was shown). This information was provided explicitly to remove the need to actively learn the value of each bottom-level state and to emphasise the need to track the transition function and use current beliefs about this function to plan. After indicating their choice, participants were shown the state they transitioned to and the resulting outcome: either a gain or a loss if they transitioned to the heist state (depending on whether the police or swag bag stimulus had been presented at the time of choice) or zero if they transitioned to the neutral state.
The task took place in sessions of trials (2 blocks of 32 trials per session, 5 sessions total during the experiment, 320 trials total). The first session took place outside of the scanner. Each session contained one block of trials in the dependent condition and one block of trials in the independent condition. The order of the blocks was counterbalanced across sessions. Participants indicated their response using a computer keyboard (outside the scanner) or MRI compatible button box (inside the scanner). Participants were paid a base-rate bonus of 2.50 Euro plus 2.5 times their percentage of correct free choice trials (up to 5 Euro total). The task was programmed in MATLAB using Psychtoolbox (Kleiner et al., 2007).

Behavioural analysis (adapting information integration between contexts)
To examine the extent to which participants updated beliefs about state transitions within and between contexts, logistic regression analyses were conducted (mixed-effects models using the fitglme fitting routine in MATLAB, version 2020 [https://www.mathworks.com/]). Models tested to what extent subjects' choice behaviour on each trial (coded as: select dark door = 1; select light door = 0) was influenced by transitions experienced over the previous 5 trials.
To examine this, we first constructed 5 variables that coded the evidence received from the state transition n trials back (relative to the current trial, t), where n ranged from 1 to 5. When trial t was a gain trial, previous transitions were coded 1 (-1) if the dark (light) door was selected n trials back (i.e. on trial t-n) and participants transitioned to the heist state, and coded -1 (1) if the transition encountered was to the neutral state. This coding was reversed for loss trials (Figure 2). The intuition implicit in this coding scheme is that participants would aim to repeat choices that previously transitioned to the heist state on gain trials but to switch choices on loss trials (in an attempt to transition to the neutral state and avoid incurring a loss). We also partitioned trials according to whether evidence was received in the same or alternate context as the current trial t. This led to a total of 10 variables: 5 encoding evidence received 1 to 5 trials back from the same context and 5 encoding evidence received 1 to 5 trials back from the alternate context. 0 was entered as a value for cases where a variable did not apply for a particular trial (for example, if 3 trials back a subject's choice was executed in the alternate context, evidence 3 trials back in the same context would be assigned a value of 0 for this trial).
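The coding scheme above can be illustrated with a short sketch. This is not the authors' analysis code (which was written in MATLAB); the function name, argument names and the use of NumPy are assumptions made for illustration, and the sketch treats the input as a single uninterrupted block of trials.

```python
import numpy as np

def evidence_regressors(dark_chosen, heist_reached, gain_trial, context, max_back=5):
    """Build same-context and other-context evidence regressors.

    All names here are hypothetical. Inputs are per-trial sequences:
    dark_chosen (True if the dark door was selected), heist_reached (True if
    the trial ended in the heist state), gain_trial (True on gain trials) and
    context (any label identifying the gemstone context).
    Returns two arrays of shape (n_trials, max_back); column n-1 holds the
    evidence from n trials back.
    """
    dark_chosen = np.asarray(dark_chosen, dtype=bool)
    heist_reached = np.asarray(heist_reached, dtype=bool)
    gain_trial = np.asarray(gain_trial, dtype=bool)
    context = np.asarray(context)
    n_trials = len(dark_chosen)

    # +1 when a past transition suggests "dark door leads to heist"
    # (dark -> heist, or light -> neutral), -1 otherwise.
    dark_to_heist = np.where(dark_chosen == heist_reached, 1, -1)

    same = np.zeros((n_trials, max_back))
    other = np.zeros((n_trials, max_back))
    for t in range(n_trials):
        sign = 1 if gain_trial[t] else -1  # coding is reversed on loss trials
        for n in range(1, max_back + 1):
            if t - n < 0:
                continue  # no trial that far back: leave the entry at 0
            value = sign * dark_to_heist[t - n]
            if context[t - n] == context[t]:
                same[t, n - 1] = value
            else:
                other[t, n - 1] = value
    return same, other
```

Entries left at 0 correspond to cases where a variable does not apply, for example when the trial n back was in the alternate context.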
Next, to assess qualitatively whether the degree of information integration from each context (same and other) changed between conditions, we entered all 10 variables in separate mixed effects models: one for the dependent condition and one for the independent condition. Only choices from free choice trials were entered in the model as the dependent variable (however the information encoded in the independent variables used to predict choice could come from free or forced trials as participants could use transition information from both trial types). All regressors and the intercept were taken as random effects, i.e. allowed to vary across subjects.
The model was specified in the syntax of MATLAB's fitglme routine as:

DarkDoor ∼ oneBackSame + twoBackSame + threeBackSame + fourBackSame + fiveBackSame + oneBackOther + twoBackOther + threeBackOther + fourBackOther + fiveBackOther + (1 + oneBackSame + twoBackSame + threeBackSame + fourBackSame + fiveBackSame + oneBackOther + twoBackOther + threeBackOther + fourBackOther + fiveBackOther | subject)

To the extent that participants are using information from each context to a similar degree (which ought to be the case in dependent blocks), coefficient estimates ought to have a similar magnitude for same and other context. To the extent that participants ignore information from the alternate context (which ought to be the case in independent blocks), there ought to be separation between coefficient estimates for same versus other context. Note that by controlling for multiple trials back, we guard against the possibility that information from the alternate context appears to have an effect in the dependent condition merely by virtue of the fact that the feedback received is similar in the two contexts.
Finally, to assess quantitatively whether differences in information integration between conditions were significant, we averaged each condition's streams of evidence for picking the dark door on the current trial over the past five trials. This resulted in two quantities, from which a difference score was computed:

average_evidence_Same = (oneBackSame + twoBackSame + threeBackSame + fourBackSame + fiveBackSame) / 5
average_evidence_Other = (oneBackOther + twoBackOther + threeBackOther + fourBackOther + fiveBackOther) / 5
differential_evidence = average_evidence_Same − average_evidence_Other

The differential evidence score reflects a relative preference in updating beliefs for information received from the same context over information received from the other context. When equal to 0, individuals are indifferent between evidence from the same and evidence from the other context. When greater than 0, individuals prefer (i.e. update beliefs to a greater degree from) information received in the same context compared to the other context. When less than 0, individuals prefer information received in the other context compared to the same context. We used this differential evidence score in a third mixed effects model to test whether preferences for the context in which information was received shifted with condition (captured in the model as a Differential Evidence by Condition interaction). The model was specified as follows:

DarkDoor ∼ differential_evidence * Condition + (1 + differential_evidence * Condition | subject)

Condition was again coded as 1 = Dependent condition, -1 = Independent condition.
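As a small worked illustration of the differential evidence score, the sketch below computes it from two arrays of evidence regressors (one row per trial, one column per lag). The function name and array layout are assumptions for illustration only.

```python
import numpy as np

def differential_evidence(same, other):
    """Mean same-context evidence minus mean other-context evidence, per trial.

    `same` and `other` are hypothetical arrays of shape (n_trials, 5) holding
    the evidence from 1 to 5 trials back (0 where a variable does not apply).
    """
    same = np.asarray(same, dtype=float)
    other = np.asarray(other, dtype=float)
    return same.mean(axis=1) - other.mean(axis=1)

# One trial: same-context evidence averages 0.2, other-context evidence -0.2.
example = differential_evidence([[1, 1, 0, 0, -1]], [[-1, 0, 0, 0, 0]])
```

In this example the score is positive, meaning the evidence for the dark door is stronger in the same context than in the other context.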

Computational Model
Our model is not intended primarily as an account of the computations that humans undertake, but as an analytic tool. Participants are assumed to track the task's underlying state transition structure in the form of p, an estimate of the probability that selection of one of the two doors (which of the two is arbitrary, but in our modelling this is taken to be the dark door) transitions to the heist state. This is assumed (as is the actual case in the experimental design) to be equal to the probability that the alternate door transitions to the neutral state. It is also assumed (as is the case) that 1-p is equal to the probability of each door going to the alternate state (dark goes to neutral and light to heist). Under these assumptions, maintaining a belief about a single quantity, p, enables computation of estimates for each door going to each second-level (terminating) state. Importantly, participants are assumed to maintain two sets of beliefs about p: $p_{\text{specific},i}$ and $p_{\text{independent}}$. $p_{\text{specific},i}$ maintains separate estimates of p, exclusive to each context, where i indexes the 2 contexts in each block (i.e. $p_{\text{specific},i=1}$, $p_{\text{specific},i=2}$). $p_{\text{independent}}$ maintains a single estimate of p which updates across contexts (within the same block). All estimates of p were initialised to 0.5 at the start of the experiment in all models. Estimates of p were allowed to carry over between blocks (i.e., p did not reset to 0.5 at the start of a new block).
At the time of choice, participants then combine the two sets of beliefs, $p_{\text{specific},i}$ and $p_{\text{independent}}$, into a single estimate, $p_c$, according to:

$$p_c = w \, p_{\text{independent}} + (1 - w) \, p_{\text{specific},i}$$

where w (between 0 and 1) weights the two sets of beliefs. We tested a baseline model in which w was held fixed between conditions. We refer to this as the fixed model. We tested this against a second model which was identical in all respects except that it allowed w to reverse in the independent condition. In other words, in the dependent condition $p_c$ was calculated as:

$$p_c = w \, p_{\text{independent}} + (1 - w) \, p_{\text{specific},i}$$

In the independent condition $p_c$ was calculated as:

$$p_c = (1 - w) \, p_{\text{independent}} + w \, p_{\text{specific},i}$$

We refer to this as the flexible model.
Where the context i indexing $p_{\text{specific},i}$ can be context 1 or context 2.
The w used in each update is identical to the w used to compute $p_c$ and was either held fixed (fixed model) or allowed to reverse between conditions (flexible model).
To avoid probability estimates exceeding 1 or going below 0 (which in a small number of cases is possible in this setup), updates to beliefs were bounded to within this range.
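The weighting of the two sets of beliefs, and the bounding of estimates to the range 0 to 1, can be sketched as follows. The sketch assumes the flexible model weights the cue-independent estimate by w in the dependent condition and by 1 − w in the independent condition; the function names and the string labels for conditions are hypothetical, and the belief update rule itself is not reproduced here.

```python
def combine_beliefs(p_independent, p_specific, w, condition, flexible=True):
    """Combine cue-independent and cue-specific estimates into p_c.

    Assumed convention: in the fixed model the weight on p_independent is
    always w; in the flexible model it becomes (1 - w) in the independent
    condition, i.e. the weighting reverses.
    """
    if condition not in ("dependent", "independent"):
        raise ValueError("condition must be 'dependent' or 'independent'")
    if flexible and condition == "independent":
        weight_independent = 1.0 - w
    else:
        weight_independent = w
    return weight_independent * p_independent + (1.0 - weight_independent) * p_specific


def bound(p):
    """Keep a probability estimate within [0, 1]."""
    return min(1.0, max(0.0, p))
```

For example, with p_independent = 0.8, p_specific = 0.2 and w = 0.75, the combined estimate is 0.65 in the dependent condition and, under the flexible model, 0.35 in the independent condition.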
The probability of choosing the dark door was then estimated using a softmax choice rule, as follows:

$$p(\text{choice} = \text{dark door}) = \frac{1}{1 + \exp\left(\beta \left(Q_{\text{light\_door}} - Q_{\text{dark\_door}}\right)\right)}$$

Altogether, each model has 3 parameters: α, β and w. For each participant, we estimated the free parameters of the model by maximizing the likelihood of their sequence of choices, jointly with group-level distributions over the entire population, using an Expectation Maximization (EM) procedure (Daw, 2020, Huys et al., 2011). This procedure maximises the joint likelihood of each participant's sequence of choices, where each individual's parameter estimates are random effects drawn from group-level Gaussian parameter distributions whose means and variances are also estimated. It was implemented in the Julia language (Bezanson et al., 2012), version 0.7.0. Note that, similar to the behavioural analysis reported above, all trials (forced and free) were included in the model but only free choice trials were included in the likelihood calculation. Models were compared by first computing unbiased per-subject log marginal likelihoods (using the Laplace approximation) via subject-level cross-validation (iteratively holding out each subject, estimating the free parameters of the model for the remaining participants using the EM optimisation algorithm, and then using these estimates as a Gaussian prior to optimise the left-out subject's choices) and then comparing these likelihoods (one per participant) between models (Flexible versus Fixed) using paired sample t-tests (two sided).
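A minimal sketch of the softmax choice rule and of a log likelihood restricted to free choice trials is given below. It is written in Python for illustration rather than in the Julia used for the actual fitting, takes the door values Q as given inputs (how they are derived from $p_c$ is not shown), and uses hypothetical function and argument names. It covers only the per-participant choice likelihood, not the hierarchical EM procedure.

```python
import math

def p_choose_dark(q_dark, q_light, beta):
    """Softmax (logistic) probability of choosing the dark door."""
    return 1.0 / (1.0 + math.exp(beta * (q_light - q_dark)))


def free_choice_log_likelihood(q_dark, q_light, chose_dark, is_free, beta):
    """Sum of log choice probabilities over free choice trials only.

    Forced trials are skipped here, mirroring the statement that only free
    choice trials entered the likelihood calculation.
    """
    total = 0.0
    for qd, ql, chosen, free in zip(q_dark, q_light, chose_dark, is_free):
        if not free:
            continue
        p_dark = p_choose_dark(qd, ql, beta)
        total += math.log(p_dark if chosen else 1.0 - p_dark)
    return total
```

When the two doors have equal value the rule returns 0.5, and larger values of β make choices more deterministic for a given value difference.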

Computational Simulations
To examine the qualitative fit of each learning model to the data, we ran separate simulations for the Fixed Model (in which w was held constant across conditions) and the Flexible Model (in which w was allowed to vary with condition). For each simulation (n = 504 for each model), we ran a group of 29 virtual participants. For each virtual participant, we randomly selected (with replacement) a set of parameters (β, α and w) from the best fit parameters generated by the computational model (fit to actual participants' choices). We then simulated the learning process by which estimates of p evolved (given door selection and state encountered), exactly as described for the respective computational models. To mimic the task as closely as possible, 25% of a virtual agent's trials were free choice trials in which we simulated which of the two doors was selected (given current beliefs about p, and whether a gain or a loss was available in the heist state) and 75% were forced choice trials where the door selected was chosen for them (as a coin flip).
We then entered choices made by each virtual agent as the dependent variable in a binomial mixed effects model with regressors coding evidence received 1 to 5 trials back from the same and alternate context (10 regressors in total). This was run separately for each condition replicating the analysis conducted on the data (i.e., actual subjects' choices) with the same model specification (as before, all regressors and the intercept were taken as random effects). This generated a set of fixed effect parameter estimates for each simulation for each condition. We then averaged each fixed parameter estimate over the simulations and compared these to the parameter estimates generated from the data.
Finally, we used the fixed model to run a permutation test to estimate the extent to which an interaction between differential evidence and condition (our third mixed effects model) could arise under agents that did not change information integration between contexts, which might occur due to feedback being more similar in the dependent condition compared to the independent condition. Specifically, we simulated choices for 500 groups made up of 29 agents each, performing the task. For each agent, we randomly selected (with replacement) a set of parameters (β, α) from the best fit parameters generated by the fixed model (fit to actual participants' choices). w could take any value between 0 and 1 (uniformly distributed) and could not reverse between contexts. For each group we then calculated differential evidence scores on each trial for each participant and entered these into a mixed effects model to predict choices (along with condition and their interaction), exactly as we did using participants' data. This generated a distribution of fixed effects estimates and t statistics which we used to calculate a 95% confidence interval and compare against the estimates found in the data.

fMRI image acquisition, pre-processing and reporting
MRI data were acquired on a 3T Siemens Magnetom Trio MRI scanner (Erlangen, Germany). A whole brain high-resolution T1-weighted anatomical structural scan was collected before participants commenced the four in-scanner blocks of the task (imaging parameters: 1 mm³ voxel resolution, TR = 1900 ms, TE = 2.52 ms, inversion time (TI) = 900 ms, slice thickness = 1 mm). During the task, axial echo planar functional images with BOLD-sensitive contrast were acquired in descending sequence (imaging parameters: 32 axial slices per image; voxel size = 3.5 mm³, slice spacing = 4.2 mm, TR = 2000 ms, flip angle = 80°, TE = 30 ms). 462 volumes were collected per participant per session (total number of volumes over the 4 sessions = 1848), resulting in a scanning time of approximately 1 hour. Image analysis was performed using SPM12 (http://www.fil.ion.ucl.ac.uk/spm). The following procedures were used for preprocessing of the raw functional files. Slice-time correction was applied with reference to the middle slice to avoid interpolation errors due to the descending image acquisition sequence (Sladky et al., 2011, in Juechems et al., 2017). Then, the images from each session were realigned to the first image within that session. The crosshair was adjusted to the anterior commissure manually to improve coregistration. After coregistration of the functional with the structural images, segmentation, normalisation and smoothing of the EPI files were undertaken. We then manually checked for motion artefacts and warping and flagged affected scans.
In all fMRI analyses (univariate and RSA searchlights) we report activation that survives small volume correction at peak level within an anatomical or functional ROI mask (see below for how these were defined). Other brain regions were only considered significant at a level of p < 0.001 uncorrected if they survived whole-brain FWE correction at the cluster level (p < 0.05).

Anatomical masks
Anatomical masks were generated using the automated anatomical labelling (AAL) atlas (Tzourio-Mazoyer et al., 2002) and the Talairach Daemon Atlas (Lancaster et al., 2000), which was used to define Brodmann area 28 as entorhinal cortex (Canto et al., 2008) and Brodmann area 17 as V1 (Tootell et al., 1998), both integrated in the WFU Pickatlas GUI (Maldjian et al., 2003): (1) A bilateral medial temporal lobe mask used for small-volume correction, which was defined as including the bilateral hippocampus, entorhinal cortex, parahippocampus and amygdala and dilated by a factor of 1 in the WFU Pickatlas GUI.
All masks were resliced to match the dimensions of our data using the SPM fMRI Realign (Reslice) function.

fMRI general linear model 1
For each participant, the blood oxygen level-dependent (BOLD) signal was modelled using a General Linear Model (GLM) with time of door presentation and time of outcome presentation as onsets. Events were modelled as delta (stick) functions (i.e. duration set to 0 seconds) and collapsed over our two experimental conditions (dependent and independent blocks).
To identify regions tracking state prediction errors, we extracted trial-by-trial estimates of unsigned state prediction errors, |δ|, from our computational model and entered these as parametric regressors, modulating the time of outcome for each participant. In addition, we also entered the following regressors: outcome received (1, 0 or -1), the interaction of outcome with unsigned state prediction error (i.e. the product of outcome received with |δ| on each trial) and trial type (1 = forced, -1 = free). Six movement parameters, estimated from the realignment procedure, were added as regressors of no interest.

ROI definition
We identified region(s) in which the BOLD response was parametrically modulated by the magnitude of the unsigned state prediction error (|δ|), using a threshold of p < 0.001 uncorrected, with cluster size > 10 voxels. Clusters identified were saved as binary regions of interest (ROIs; in SPM) and then combined into a single ROI using the MarsBaR toolbox (http://marsbar.sourceforge.net/). This functional ROI was then used for subsequent representational similarity analysis (RSA; see below). We divided the number of voxels that fell within both our functional ROI and each anatomical mask by the total number of voxels in our functional ROI. This gave us the percentage with which our functional ROI was a conjunction of each anatomical region.
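The overlap calculation described above amounts to a ratio of voxel counts. A minimal sketch, assuming both the functional ROI and the anatomical mask are available as binary arrays on the same voxel grid (the function name and the use of NumPy are illustrative, not the authors' implementation):

```python
import numpy as np

def overlap_percentage(functional_roi, anatomical_mask):
    """Percentage of functional-ROI voxels that also lie in the anatomical mask."""
    functional_roi = np.asarray(functional_roi, dtype=bool)
    anatomical_mask = np.asarray(anatomical_mask, dtype=bool)
    if functional_roi.shape != anatomical_mask.shape:
        raise ValueError("ROI and mask must be resliced to the same grid")
    n_roi = functional_roi.sum()
    if n_roi == 0:
        raise ValueError("functional ROI contains no voxels")
    n_both = np.logical_and(functional_roi, anatomical_mask).sum()
    return 100.0 * n_both / n_roi
```

Because the denominator is the size of the functional ROI, the percentages for non-overlapping anatomical regions sum to at most 100.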

fMRI GLM 2a (door presentation)
For each participant, we created a design matrix in which each door presentation (32 per condition per session) was modelled as a separate event (without parametric regressors attached). Such a procedure has been used multiple times in the past (Charpentier et al., 2014, Garrett et al., 2016). Outcome onset was entered as an additional event. Events were modelled as delta functions and convolved with a canonical hemodynamic response function to create regressors of interest. Six motion correction regressors estimated from the realignment procedure were entered as covariates of no interest.

RSA (door presentation)
To examine whether BOLD responses were more similar between contexts in the dependent versus independent condition, we used GLM 2a to extract estimates of BOLD response on each trial in our functional ROI (identified from GLM 1) and partitioned these estimates into four linearly spaced bins according to how likely the door presented was to go to the heist state (p(state = heist | door_presented)). This was inferred by extracting trial-by-trial estimates of p_combined (from the flexible learning computational model) and using p_combined or 1 − p_combined, depending on whether the dark or light door was presented respectively, to estimate p(state = heist | door_presented).
We divided trials into quartiles based on p(state = heist | door_presented), resulting in the following average (standard deviation; SD) probability bins: This was done separately for each context that the participant (N = 29) encountered (2 in dependent blocks and 2 in independent blocks, 16 bins in total). We then averaged these estimates in each voxel in our functional ROI (collapsing across the 4 functional runs) for each bin, generating an average BOLD response for each voxel.
To compare the similarity of responses between contexts, we proceeded by first calculating the dissimilarity of BOLD responses in each of the 4 bins between contexts. This was computed using the pdist function in MATLAB using 1 − Pearson correlation as a measure of distance; hence high correlation indicates a low level of dissimilarity (conversely a high level of similarity). This generated an 8x8 dissimilarity matrix for each condition, of which we subselected the 4x4 matrix displaying the dissimilarity of probability bins between the two contexts (i.e. context 1 vs context 2 for each level of p(state = heist | door_presented)). Dissimilarity scores were then converted into similarity scores (high scores indicating greater similarity) and Fisher transformed to allow inference at the group level. The four similarity scores along the diagonal of each RSA matrix (where identical bins are compared between contexts) were averaged for each participant, creating an on-diagonal similarity score which quantifies the extent to which identical values of transition probabilities are encoded similarly between the two contexts in a condition. The 12 similarity scores on the off-diagonal of each RSA matrix (where different bins are compared between contexts) were separately averaged together to create off-diagonal similarity scores. Note that unlike in regular RSA analyses, all 12 scores were averaged rather than just the upper or lower triangle, as the 4x4 RSM is not symmetric about the diagonal (it is the off-diagonal 4x4 block of a larger 8x8 matrix, see above). We then computed the difference between on- and off-diagonal scores separately for each condition. One-sample t-tests (versus 0) were conducted to assess whether significant differences between on- and off-diagonal similarity scores existed. Two-tailed paired sample t-tests were used to compare whether difference scores were greater for the dependent condition compared to the independent condition.
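The on- versus off-diagonal comparison can be sketched for a single participant and condition as follows. This illustration uses NumPy's corrcoef in place of MATLAB's pdist, and the function name and input layout (one row of voxel responses per probability bin, per context) are assumptions.

```python
import numpy as np

def on_off_diagonal_difference(context1, context2):
    """On-diagonal minus off-diagonal similarity between two contexts.

    context1, context2: arrays of shape (n_bins, n_voxels) holding the mean
    BOLD pattern for each probability bin in each context.
    """
    context1 = np.asarray(context1, dtype=float)
    context2 = np.asarray(context2, dtype=float)
    n_bins = context1.shape[0]

    patterns = np.vstack([context1, context2])   # (2 * n_bins, n_voxels)
    rdm = 1.0 - np.corrcoef(patterns)            # 8x8 dissimilarity for 4 bins
    between = rdm[:n_bins, n_bins:]              # context 1 vs context 2 block

    similarity = np.arctanh(1.0 - between)       # back to r, then Fisher z
    on_diagonal = np.diag(similarity).mean()     # identical bins across contexts
    off_mask = ~np.eye(n_bins, dtype=bool)
    off_diagonal = similarity[off_mask].mean()   # all 12 mismatched pairs for 4 bins
    return on_diagonal - off_diagonal
```

A positive value indicates that matching probability bins are encoded more similarly across the two contexts than mismatched bins. Note that the Fisher transform is unbounded for perfectly correlated patterns, so identical inputs would need special handling.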
The same RSA procedure was applied to voxels within the four anatomical ROIs used to characterise the nature of the effect within the medial temporal lobe and the control regions V1 and M1 (see Figure 4). The interaction ANOVA results reported in-text are Greenhouse-Geisser corrected to adjust for violations of sphericity (both F value and degrees of freedom).
To check whether there is a relationship between the temporal proximity of trials between contexts and how similar the neural patterns are, we calculated the mean temporal distance between trials in the two contexts on the diagonal and the off-diagonal in each condition for each participant. We then correlated the difference in proximity between diagonal and off-diagonal trials with the difference in representational similarity between the diagonal and off-diagonal in the dependent and the independent condition.

Encoding model analysis
As a complementary approach, we built a linear encoding model, equivalent to a cross-validated multinomial logistic regression, that mapped voxels (within an ROI) onto probabilities under different constraints. We evaluated this model in cross-validation, using independent held-out data from across scanner runs. Briefly, we first extracted single-trial estimates of BOLD within the MTL ROI for each gem on each session, yielding data Y of size v × t, where v is the number of voxels and t the number of trials on which that gem was presented. We also recoded (scalar) single-trial, model-derived estimates of transition probability (converted to odds ratios) as input vectors in either a one-hot format (i.e. a one within the relevant bin and zeros elsewhere) or a Gaussian format (i.e. a Gaussian tuning curve that was maximal in the relevant bin but gradually tapered over adjacent bins). We used n bins falling within the range (in log-odds units) of -2 to 2, where n varied exhaustively from 1 to 10. This yielded data X of size n × t. We estimated weights w by linear regression of X_i onto Y_i for scanner run i and evaluated how well the model predicted held-out probabilities X_j from multivariate patterns Y_j acquired in scanner run j. We used a (mean) cross-entropy loss in validation. This exercise allowed us to verify, for each gem, the cross-validated loss when weights obtained with gem g were evaluated with gem g' with which it co-occurred, both in the independent condition (where the probabilities were different) and the dependent condition (where they were not). We tested whether there was stronger cross-validation between gems (and across runs) in the dependent than the independent condition, for varying numbers of bins n and with both one-hot and Gaussian input functions.
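The sketch below illustrates one way such an encoding analysis could be set up. Several details are assumptions not stated in the text and are labelled as such: bins are equally spaced over -2 to 2 log odds, the Gaussian tuning width is 1 log-odds unit, weights are obtained with a pseudo-inverse (ordinary least squares), and linear predictions are passed through a softmax before computing cross-entropy. All function names are hypothetical.

```python
import numpy as np

def bin_centres(n_bins, low=-2.0, high=2.0):
    """Centres of n equally spaced bins over [low, high] (assumed layout)."""
    edges = np.linspace(low, high, n_bins + 1)
    return (edges[:-1] + edges[1:]) / 2.0


def encode(log_odds, n_bins, kind="one_hot", width=1.0):
    """Recode scalar log odds as an (n_bins, n_trials) input matrix X."""
    log_odds = np.clip(np.asarray(log_odds, dtype=float), -2.0, 2.0)
    dist = np.abs(bin_centres(n_bins)[:, None] - log_odds[None, :])
    if kind == "one_hot":
        X = np.zeros_like(dist)
        X[dist.argmin(axis=0), np.arange(log_odds.size)] = 1.0
        return X
    # Gaussian tuning curve centred on the trial's value (assumed width),
    # normalised so each trial's vector sums to 1.
    X = np.exp(-0.5 * (dist / width) ** 2)
    return X / X.sum(axis=0, keepdims=True)


def fit_weights(X, Y):
    """Least-squares weights W (n_bins x voxels) such that X is approx. W @ Y."""
    return X @ np.linalg.pinv(Y)


def cross_entropy(X_true, Y, W):
    """Mean cross-entropy between X_true and softmax-normalised predictions."""
    logits = W @ Y
    logits = logits - logits.max(axis=0, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=0, keepdims=True)
    return float(-(X_true * np.log(probs + 1e-12)).sum(axis=0).mean())
```

In use, weights fitted on one run and gem (W = fit_weights(X_i, Y_i)) would be evaluated with cross_entropy on another run, either for the same gem or for the co-occurring gem; a lower loss indicates that the mapping from voxel patterns to probability bins transfers.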

Searchlight RSA analysis (door presentation; whole-brain)
To assess whether our ROI was the only brain area with dependent and independent block transition probability representations and potential differences between them or whether this representation was distributed across the brain (and thus potentially less meaningful), we also conducted a whole-brain searchlight analysis. The searchlight analysis was conducted using a combination of scripts from the RSA toolbox (Nili et al., 2014) and our own parser script feeding in the single-trial onset events generated in GLM 2a. The searchlight radius used was 10.5 mm (corresponding to 3 voxels). Neural representational dissimilarity maps for the two block types were separately correlated with model representational dissimilarity matrices (RDMs, Figure 3d) using Spearman's correlation coefficient. The model RDM specified that the on-diagonal was more similar between contexts than the off-diagonal. This was done individually for each participant and the resulting maps of correlation coefficients were saved. Second-level analysis as described above was then applied to the r-maps to establish separate group-level effects for the two conditions, i.e. the dependent and independent blocks (Figure 3e). We report any brain regions that survive whole-brain correction at the cluster level after thresholding at p < 0.001.

fMRI GLM 2b (outcome presentation)
For each participant, we created a design matrix in which each outcome presentation (32 per condition per session) was modelled as a separate event (without parametric regressors attached). Door presentation onset was entered as an additional event.

RSA analysis (outcome presentation)
We used GLM2b to extract estimates of BOLD response on each trial in our functional ROI and partitioned these estimates into bins according to the combination of doors chosen and state encountered. These combinations (of which there are 4 in total) drive the direction and degree of update of beliefs (p) about state transitions in the computational model. Specifically, we divided responses into bins as follows: This was done separately for each context that the participant encountered (2 in dependent blocks and 2 in independent blocks, 16 bins in total). We then averaged these estimates in each voxel in our functional ROI (collapsing across the 4 functional runs) for each bin generating an average BOLD response for each voxel.
To compare the similarity of responses between contexts we followed a similar procedure to the RSA analysis conducted at door presentation. We first calculated the dissimilarity of BOLD responses in each of the 4 choice-outcome state combinations for each of the two conditions, generating 2 separate 8x8 RSA matrices, of which we subselected the off-diagonal 4x4 block for further analyses (context 1 vs context 2 for each of the four choice-outcome state combinations, computed separately for each condition). After conversion to similarity scores and Fisher transformation, the four on-diagonal similarity scores and the 12 off-diagonal similarity scores of each RSA matrix were averaged to create 2 sets of similarity scores per condition. The mean on- and off-diagonal similarity scores were then entered into a paired t-test to assess differences between identical choice-outcome bins and non-identical choice-outcome bins in the two contexts. Then, to assess whether there were meaningful differences between conditions, the difference between the mean on- and off-diagonal scores for each participant in each condition was entered into a paired t-test (dependent on-diagonal minus off-diagonal vs independent on-diagonal minus off-diagonal).
The same RSA procedure was applied to voxels within the four anatomical ROIs used to characterise the nature of the effect within the medial temporal lobe and the control regions V1 and M1 (see Figure 5). Again, the ANOVA results reported were Greenhouse-Geisser corrected due to violations of the assumption of sphericity.
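The on- versus off-diagonal comparison described above can be sketched as follows. This is an illustrative sketch rather than the authors' pipeline: the function name `cross_context_similarity` is hypothetical, and Pearson correlation is computed directly as the similarity measure (the text describes computing dissimilarity first and then converting to similarity).

```python
import numpy as np

def cross_context_similarity(patterns_c1, patterns_c2):
    """Mean Fisher-z similarity for identical vs non-identical bins across contexts.

    patterns_c1, patterns_c2: arrays of shape (n_bins, n_voxels) holding the
    average BOLD pattern per choice-outcome bin in context 1 and context 2.
    Returns (on_diagonal_mean, off_diagonal_mean).
    """
    n = patterns_c1.shape[0]
    # Full (2n x 2n) correlation matrix over all bins of both contexts (8x8 for n = 4).
    full = np.corrcoef(np.vstack([patterns_c1, patterns_c2]))
    # Off-diagonal block: context 1 bins (rows) vs context 2 bins (columns).
    block = full[:n, n:]
    # Fisher transformation; clip to keep arctanh finite for |r| = 1.
    z = np.arctanh(np.clip(block, -0.999999, 0.999999))
    on_mask = np.eye(n, dtype=bool)
    on_diag = z[on_mask].mean()    # n identical-bin pairs (4 when n = 4)
    off_diag = z[~on_mask].mean()  # n*(n-1) non-identical pairs (12 when n = 4)
    return float(on_diag), float(off_diag)
```

Running this once per participant and condition yields the on- minus off-diagonal difference scores that would then enter the paired t-tests.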

Searchlight RSA analysis (outcome presentation; whole-brain)
The searchlight analysis was implemented in the same way as described above for the searchlight RSA at time of door onset. Here, the onset events read into the searchlight script were the outcome onset events generated in GLM3b. Again, the model RDMs specified that the on-diagonal (identical choice-outcome combinations for the two contexts within a condition) was more similar than the off-diagonal (Figure 5a) and the analysis was conducted separately for the two conditions.

Searchlight interaction analysis (outcome presentation; whole-brain)
The interaction analysis was also conducted similarly to the analysis described above for the time of door onset. Here, a difference between the difference scores for the two conditions means that the gap in encoding similarity between identical and non-identical choice-outcome combinations across the two contexts differs between conditions. Because this analysis is coded in terms of encoding similarity, a positive difference means that identical choice-outcome combinations are encoded more similarly between contexts (relative to non-identical combinations) in dependent than in independent blocks, and a negative difference means the reverse. As for door presentation, we report any brain regions that survive whole brain correction at the cluster level after thresholding at p < 0.001.

fMRI general linear model 3
To visualise the parametric effect of our interaction term (|δ| * outcome) in GLM1, we ran a separate GLM which included onsets of door presentation and outcome presentation with outcome onsets separated into 3 separate events: time of outcome presentation when participants received an outcome of +1, an outcome of 0 and an outcome of -1. Each of the 3 outcome onsets was modulated by 2 parametric regressors: unsigned state prediction error (extracted from our flexible RL model) and trial type (force/free). Events were modelled as delta functions and collapsed over our two experimental conditions (dependent and independent blocks), just as for fMRI GLM1. Six movement parameters, estimated from the realignment procedure were added as regressors of no interest. We then extracted the parametric betas for the state prediction error regressors for each participant from the 3 outcome conditions using the Marsbar toolbox at the peak voxel of the |δ| * outcome cluster identified in GLM1.

Participants and task (behavioural pilot)
Thirty-one self-declared healthy individuals (18 female; M = 26.29 years, SD = 5.50) were recruited using opportunity sampling via the Oxford University Research Recruitment System. The task was the same as the one the fMRI cohort undertook (described above) save for the following differences. First, participants performed 8 blocks of 60 trials (480 trials total) and all trials in this design were free choice trials. This provided us with a higher powered design to detect differences in updating due to the outcome received at the end of an episode. After an inter-trial interval (0.3-0.5s), participants had up to 5 seconds to make their choice, after which they received confirmation of their choice (0.5s) and feedback (1s). Second, participants were not informed about the differences between blocks. However, just as before, each block had two different contexts, and there were two block types: dependent blocks, in which the transitions for the two contexts were the same, and independent blocks, in which the transitions were independent.

Behavioural analysis (outcome valence and state transition updating)
To examine the effect of outcome valence on transition updating we calculated a consistency score for each participant. This is the percentage of times a participant's choices were consistent given both: (1) the previous trial's state-action-state sequence, and (2) whether the current trial was a gain or a loss trial. Since the same state-action-state sequence can make either repeating or switching the correct thing to do (depending on whether the next trial is a gain or a loss trial), we first divided trials into two types: repeat and switch. Repeat trials are those for which participants would want to revisit the terminating state from the previous trial. For example, participants would want to repeat their choice if they picked the grey door on the last trial, went to the heist state and the next trial is a gain trial. These trials comprised:
(i) Trials where they previously reached the heist state AND the current trial was a gain trial.
(ii) Trials where they previously reached the neutral state AND the current trial was a loss trial.
Switch trials are those where participants would want to avoid the terminating state from the previous trial. For example, participants should want to switch their choice if they picked the grey door on the last trial, went to the heist state and the next trial is a loss trial. These trials comprised:
(i) Trials where they previously reached the heist state AND the current trial was a loss trial.
(ii) Trials where they previously reached the neutral state AND the current trial was a gain trial.
For both repeat and switch trials, the outcome on the previous trial can be positive or negative. For instance, whilst a participant ought to want to repeat selection of the grey door if that took them to the heist state on the last trial and the next trial is a gain trial, the outcome on the last trial (when they went to the heist state) could have been positive or negative, depending on whether the last trial was a gain or a loss trial. Hence we further divided each trial type (repeat, switch) into those where they received a positive (+1 on gain trials, 0 on loss trials) or negative (-1 on loss trials, 0 on gain trials) outcome at the end of the previous transition. This gave us 4 types of trials: repeat positive, repeat negative, switch positive and switch negative. We calculated the percentage of trials on which participants repeated or switched choices (as appropriate) for these 4 trial types for each participant. We then calculated a consistency score for positive trials by averaging together repeat positive and switch positive. We did the same for negative trials.
For the behavioural experiment dataset, all trials were used. In the fMRI dataset, only free choice trials were included (but transition sequences from the previous trial could be from a free or a forced trial). Participants' consistency scores for positive trials were compared to those for negative trials using paired sample t-tests (two tailed). First we did this collapsing over contexts and conditions. This meant that the previous trial could have been from either the same or the alternate context. Note that participants were not explicitly told of the conditions (i.e., whether to ignore or take notice of contextual cues) in the behavioural dataset. Although they were told this in the fMRI version of the task, this ought not to bias this analysis. Nonetheless, we also repeated this analysis using only trials in the dependent condition.
Finally, we quantified each participant's outcome valence effect as the consistency score for positive trials (i.e. repeat positive and switch positive trials) minus the consistency score for negative trials (i.e. repeat negative and switch negative trials). This indexed the degree to which participants updated state transitions preferentially following positive compared to negative outcomes over both types of trials. We then correlated each participant's valence effect with their parametric betas extracted for the interaction regressor (|δ| * outcome) from GLM1.
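The consistency score and valence effect described in this section can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' analysis code: the function name and the boolean trial coding (`prev_heist`, `prev_gain`, `cur_gain`, `repeated`) are hypothetical, "repeated" is taken to mean choosing the same door as on the previous trial, and a previous outcome is treated as positive when it was +1 on a gain trial or 0 on a loss trial.

```python
import numpy as np

def consistency_scores(prev_heist, prev_gain, cur_gain, repeated):
    """Per-participant consistency scores (%) and outcome valence effect.

    prev_heist: previous trial ended in the heist state (else neutral)
    prev_gain:  previous trial was a gain trial (else loss)
    cur_gain:   current trial is a gain trial (else loss)
    repeated:   participant chose the same door as on the previous trial
    """
    prev_heist, prev_gain, cur_gain, repeated = (
        np.asarray(a, dtype=bool) for a in (prev_heist, prev_gain, cur_gain, repeated)
    )
    # Repeat trials: heist then gain, or neutral then loss; all others are switch trials.
    repeat_trial = prev_heist == cur_gain
    # Positive previous outcome: +1 (heist on a gain trial) or 0 on a loss trial (neutral).
    positive_prev = prev_heist == prev_gain
    # A choice is consistent if the participant repeated on repeat trials
    # and switched on switch trials.
    consistent = repeated == repeat_trial

    def pct(mask):
        # Percentage of consistent choices within one of the 4 trial types.
        return 100.0 * consistent[mask].mean() if mask.any() else np.nan

    cells = {
        "repeat_positive": pct(repeat_trial & positive_prev),
        "switch_positive": pct(~repeat_trial & positive_prev),
        "repeat_negative": pct(repeat_trial & ~positive_prev),
        "switch_negative": pct(~repeat_trial & ~positive_prev),
    }
    positive = np.mean([cells["repeat_positive"], cells["switch_positive"]])
    negative = np.mean([cells["repeat_negative"], cells["switch_negative"]])
    return {**cells, "positive": positive, "negative": negative,
            "valence_effect": positive - negative}
```

A positive `valence_effect` indicates that choices tracked the previous transition more closely after positive than after negative outcomes.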

Task and design
Participants (n = 29) performed a planning task in an fMRI scanner (the "heist" task; Figure 1a). The task was introduced to participants via a cover story that suggested they were a burglar involved in a heist at one of 4 contexts, each denoted by a unique coloured gem. Each trial occurred in one of these four (gem) contexts, and the relevant coloured gem icon remained on the screen throughout the trial to make this clear. After trial onset, participants chose one of two doors (light vs dark), which were respectively associated in context c with probabilities p_c of transitioning to the (high-stakes) "heist" state and 1 − p_c of transitioning to the "neutral" state (Figure 1b). p_c switched randomly between 0.2 and 0.8 across the course of the experiment, meaning that a door was always likely to transition to one of the outcome states and unlikely to transition to the other. Participants were told that the transitions could change but were not told that there were two possible values p could assume, what these values were, or how often p could change.
Before making their choice, participants were presented with an additional cue which signalled whether, in the heist state, the participant would be caught (signalled by police cue; incurring a loss) or commit a successful burglary (signalled by swag cue; incurring a gain), whereas no positive or negative outcomes occurred in the neutral state (outcome of zero). The optimal policy was thus to learn the transition probability in order to approach the heist state in the presence of the swag cue and avoid the heist state in the presence of the police cue. To decorrelate choices and probabilities for the scanner, 75% of trials were "forced" in which only a single door was available, but in which transition probabilities could still be updated on receipt of reward. In the remaining 25% of trials, participants could freely choose between the two doors. Participants were unaware during the initial door presentation whether the trial would be a forced or free choice and therefore needed to actively consider transition probabilities on every trial.
The task was performed in alternating blocks that we label "dependent" and "independent" conditions. In dependent blocks, the transition probabilities associated with the two contexts (e.g. p_1 and p_2) were yoked so that p_1 = p_2 at all times (Figure 1c, top panel). In independent blocks, the transition probabilities associated with the other two contexts (e.g. p_3 and p_4) were unrelated (overlapping on average half of the time; Figure 1c, bottom panel).
The two contexts that made up each condition were randomly interleaved within a block, but the dependent and independent conditions themselves occurred in temporally distinct blocks of trials. Participants were told before starting the task about the two conditions and were told at the start of each new block whether they were entering a dependent or independent condition block (see Methods for full details about the task).

Behavioural analysis
We first asked whether behaviour differed between the dependent and independent conditions. If participants generalised knowledge of the transition structure across contexts, then they should be more prone to use learning from context j to inform subsequent decisions in context i when in the dependent than independent condition (note that this behaviour is expected because participants were instructed about the dependence or independence among transition probabilities for the two gems in each block).
We used a logistic mixed effects regression to measure this effect in a trial-history dependent fashion, asking how choices made on each trial t in context i depended on the history of state transitions observed over the previous 5 trials that had occurred in the contexts i and j, where j was the alternate context within the relevant condition (dependent or independent).
To conduct this analysis we recoded choices in a single frame of reference that removed the choice inversion between trials where police and swag cues were present. This was necessary because in our task, the transition history is relevant not for determining the specific response (light vs. dark door) but rather the choice contingent on the presence of the swag or police cue. We call the historic information that is predictive of this recoded choice "transition evidence".
The results are shown in Figure 2. In the dependent condition, transition evidence from the previous two trials (t-2) significantly predicted choice, both when it was experienced in the same (t-  Figure 2a). In contrast, in the independent condition, choices were only influenced by transition evidence when this was accrued in the same context (Figure 2b). This was the case going 1, 2, 3 and 4 trials back ( To directly compare the relative weight participants placed on past evidence received from the same and alternate context in each of the two conditions (dependent, independent) we ran an additional mixed effects model. We computed the difference in transition evidence between the two contexts (averaged over the past 5 trials; we call this "differential evidence") for each condition (dependent/independent) and their interaction as predictors in this model. This revealed a significant interaction between differential evidence and condition (β = -0.41 [-0.63, -0.18], SE = 0.11, p < 0.001) along with a main effect of differential evidence (β = 0.43 [0.20, 0.66], SE = 0.12, p < 0.001), but no main effect of condition (β = -0.05 [-0.13, 0.04], SE = 0.04, p = 0.27). The interaction between differential evidence and condition remained significant in a permutation test which guards against greater similarity of feedback (in the dependent condition compared to the independent condition) confounding the effect (β = -0.41, 95% range under the null distribution: [-0.25, 0.08], p < 0.001). Together, these results suggest that the relative preference for information received from the same (versus the alternate) context shifted between conditions. This was a result of participants increasing integration of information from the alternate context in the dependent condition.
We also analysed data from an additional pilot experiment (n = 31; see Methods) with an identical structure except for two important differences: firstly, there were no forced choice trials, and secondly, participants were not instructed about the dependence or independence of the transition structure but were left to discover it for themselves. In contrast to the fMRI cohort, in the independent condition, choices were influenced by transition evidence that accrued in both the same context ( interaction between differential evidence and condition (β = -0.097 [-0.17, -0.02], SE = 0.04, p = 0.010) however this was not significant in the permutation test (β [95% range under the null distribution] = -0.097 [-0.25, 0.08], p=0.92). As discussed below, we think that together these results suggest that in the absence of instruction, participants may have a stronger prior that the two gems which co-occur in time belong to a shared latent context.

Computational model
Our modelling framework assumed that choices were determined by a mixture of associations learned in an independent and dependent fashion across contexts. We note at the outset that our model is not intended primarily as an account of the computations that humans undertake, but as an analytic tool that compactly parameterises human policy with just a few parameters which allows us to verify the degree to which humans share a model between contexts.
The model is composed of two learners, one which learns a shared transition function across a pair of contexts and another which learns a separate transition function for each context. On each trial, choices are determined by linearly mixing the estimated probabilities from each learner according to a weighting parameter w, and using the resulting probabilistic estimate p_c to compute the relative expected value of heist and neutral states, according to which a choice is made via inverse temperature parameter β. The optimal policy (for an omniscient agent) would be to use w = 1 in the dependent condition and w = 0 in the independent condition. On each trial, the context-specific and the context-independent transition functions are updated according to a state prediction error, δ, which quantifies the degree of surprise at reaching a state given the option chosen and current estimates of the transition function. δ is also weighted by w, and the degree of update is governed by a learning parameter, α. Our rationale for modelling learning of transitions as an incremental process (rather than beliefs fluctuating between p=0.2 and p=0.8) is that we did not explicitly instruct participants that there were two levels of p, what these levels were, or how often they could change. We assume that learning this underlying structure in practice would therefore be difficult (due to the stochasticity of the transitions, the existence of four different contexts, the frequency with which transitions changed and the heist state fluctuating between gains and losses) but caveat that alternate learning models could be used to formally test this assumption.
We compared two versions of this learning model. A fixed model in which w was held constant across the experiment was compared to a flexible model in which w was allowed to reverse between experimental conditions (i.e., w_independent = 1 − w_dependent). This feature of the flexible model gives it the capacity to shift between relying to a greater degree on separate transition functions in the independent condition (i.e., w towards 0) and relying on a shared transition function in the dependent condition (i.e., w towards 1). See Methods for full model specification.
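To make the structure of the two-learner model concrete, the sketch below gives one possible reading of it. It is not the full specification (which is given in the Methods) and rests on several assumptions made only for illustration: the class and function names are hypothetical, the tracked quantity is the probability that the light door leads to the heist state (with the dark door taken to be complementary), the shared learner's update is scaled by w and the context-specific learner's by 1 − w, and the heist state is valued at +1 on gain trials and -1 on loss trials.

```python
import math
from dataclasses import dataclass, field

@dataclass
class MixtureLearner:
    alpha: float  # learning rate
    beta: float   # inverse temperature
    w: float      # weight on the shared (context-independent) learner
    p_shared: float = 0.5  # shared estimate of P(heist | light door)
    p_specific: list = field(default_factory=lambda: [0.5, 0.5])  # one per context

    def p_mixed(self, c):
        """Linear mixture of shared and context-specific estimates for context c."""
        return self.w * self.p_shared + (1 - self.w) * self.p_specific[c]

    def p_choose_light(self, c, heist_value):
        """Softmax choice probability; heist_value is +1 (gain) or -1 (loss)."""
        p = self.p_mixed(c)
        # Assumed complementary doors: light reaches heist with p, dark with 1 - p.
        ev_diff = heist_value * (p - (1 - p))
        return 1.0 / (1.0 + math.exp(-self.beta * ev_diff))

    def update(self, c, chose_light, reached_heist):
        """Delta-rule update of both learners; returns the mixed prediction error."""
        # Recode the observation into the 'light door leads to heist' frame.
        obs = 1.0 if chose_light == reached_heist else 0.0
        delta_shared = obs - self.p_shared
        delta_specific = obs - self.p_specific[c]
        self.p_shared += self.alpha * self.w * delta_shared
        self.p_specific[c] += self.alpha * (1 - self.w) * delta_specific
        return self.w * delta_shared + (1 - self.w) * delta_specific

def flexible_w(w_dependent, condition):
    """Flexible model: w reverses between conditions."""
    return w_dependent if condition == "dependent" else 1 - w_dependent
```

Under this sketch, with w = 1 an observation in one context changes the estimate used in the paired context, whereas with w = 0 the paired context's estimate is left untouched; the fixed model corresponds to using the same w in both conditions.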

Flexible model adapts information integration between conditions
We fit each model to single-subject choices on a per-trial basis and compared fixed and flexible models by computing unbiased marginal likelihoods via subject-level leave-one-out cross-validation (LOOcv) for each participant. Comparison of LOOcv scores revealed significantly lower scores (indicating superior performance in cross-validation at predicting participant choices) for the flexible model compared to the fixed model (t(28) = 2.72, p < 0.01, paired sample t-test, see Table 1 for model parameters and LOOcv scores). 21 out of the 29 subjects (72% of subjects) were predicted better (had lower LOOcv scores) with the flexible model compared to the fixed model.
The best-fitting w parameter tended towards 1 in the dependent condition and 0 in the independent condition, consistent with the behavioural data. This indicates that participants learned a single transition function in dependent blocks but reverted to learning two different transition functions in independent blocks (by contrast, when w was held fixed across blocks it assumed an intermediate value of ~0.61). A flexible model with two separate w parameters (one per condition, fitted separately) did not account any better for participants' choices than the flexible model with a single w that reversed between conditions (t(28) = -1.59, p = 0.12). Simulating choices using a population of subjects drawn according to best-fitting parameters of the flexible model showed that the flexible model qualitatively recapitulated the change in relative preference for information from the alternate versus same context between conditions (Figure 2a, b) to a greater degree than choices simulated from the fixed model (Figure 2c, d).

Neuroimaging data
Having established that participants behave differently in the dependent and independent conditions, we turned to the fMRI data to understand the neural mechanisms that supported this differential behaviour. Our goal was to use multivariate approaches (including representational similarity analysis, RSA) to examine how multivoxel patterns encoding transition probabilities (i.e. beliefs about the forthcoming state) were related in the dependent and independent conditions. However, we first adopted a univariate analysis to identify target sites for the coding of the state transition function, using the state prediction error (SPE) from the model. We expected that the MTL would be sensitive to SPEs, consistent with a long tradition implicating the hippocampus in the formation of state associations (Eichenbaum et al., 1999), and a detector of states that either match or violate the agent's expectations Maguire, 2007, Duncan et al., 2012).

Univariate analysis
We thus modelled BOLD responses at the time the transitioned-to state ("heist" or "neutral") was revealed using a parametric predictor encoding the unsigned state prediction error |δ| extracted from the flexible model. This analysis collapsed over conditions (dependent, independent). This modulator was included alongside other quantities coding for outcome, trial type (forced/free choice) and the interaction of outcome and |δ| (see Methods).
The negative direction of the parametric effect indicates greater change in BOLD response to expected (compared to unexpected) state transitions. We combined these clusters (extracted at p < 0.001 uncorrected) into a single bilateral functional region of interest (ROI) mask (Figure 3b) which we then used for subsequent multivariate analyses.

Representational Similarity Analysis (RSA)
Next, we used a multivariate approach to assess the mapping from BOLD responses in our functional ROI to transition probabilities, and to measure how this mapping changed over contexts. We began with an analysis of BOLD signals at the time of choice, i.e. when the door was presented. This is the timepoint during which participants needed to consider the transition probability to each prospective 2nd-level state. We first used representational similarity analysis (RSA), measuring the correlation distance across multivoxel patterns associated with transition probabilities p(heist state | door presented), derived from our flexible learning model and binned into quartiles, both across blocks and across gems (Figure 3c). Note that our prediction is that neural patterns encoding transition probabilities should be more similar across contexts in dependent than in independent blocks. We thus computed a similarity score by averaging correlations in diagonal (same probability quartile) vs. off-diagonal (different probability quartile) cases, separately for the two contexts in the dependent and independent condition.
One interpretation of this finding is that in the dependent condition, the MTL encodes the state transition function for each context with a common neural pattern. However, we also considered some alternative possibilities. First, we examined whether the results held if we allocated trials to bins using fixed probabilities across the unity range (i.e., quartile 1: 0.00-0.25, quartile 2: 0.26-0.50, quartile 3: 0.51-0.75 & quartile 4: 0.76-1.00) rather than adapting bins for each participant according to the specific distribution of probabilities they used. This revealed the same pattern of results (condition*diagonal interaction: t(28) = 4.20, p < 0.001; 95% CI [.10, .30]; difference in similarity between on and off diagonal bins in the dependent condition: t(28) = 5.73; p < 0.001; 95% CI [.13, .28]; difference in similarity between on and off diagonal bins in the independent condition: t(28) = 0.04; p = 0.97; 95% CI [-.07, .08]). Second, we checked that the number of trials in each probability quartile were well matched between contexts, finding that they were (t(28) = 1.50, p = 0.55, 95% CI [-.02, .15]). Finally, we were concerned that the effect might arise as a spurious effect of closer temporal proximity between trials in the same transition probability quartile in dependent blocks. To address this, first we checked whether the average difference between the temporal distance of trials in on versus off diagonal quartile combinations was correlated with the difference in representational similarity (see Methods). This was neither the case in the dependent condition (r = -.20, p = 0.32), nor the independent condition (r = .15, p = 0.44).
We then repeated our analysis in cross-validation across sessions. In other words, we measured the similarity between quartile/bin n i and n j where i and j are drawn from different scanner runs and computed the average for each similarity bin across all possible c 1i and c 2j Combinations where i ≠ j. This revealed the same (albeit weaker) pattern of results with fixed probability bins (condition*diagonal interaction: t (28)  . In other words, whilst we were able to successfully decode probabilities between contexts in cross validation (Figure 3d) in the dependent condition, this was not the case within context (for either condition). We caution that this does question the robustness of the between context RSA analysis. Reassuringly however, we also did not observe the condition*diagonal interaction we had observed for the between context case (t(28) = -0.53, p = .60, 95% CI [-.03, .02]). Furthermore, we did not find evidence to suggest that decoding was stronger for the between context RSA compared to the within context RSA in the dependent condition (t(28) = -0.69, p=0.49, paired ttest comparing the difference in on and off diagonal similarity scores within vs between contexts), an effect which would have been at odds with participants using a shared transition model. Comparing scores between conditions also revealed a weak effect in the direction we would predict under the hypothesis that participants would switch to use of a context specific model in the independent condition with the difference in decoding accuracy being greater for the within versus the between context RSA in the independent condition compared to the dependent condition (t(28) = 1.69, p(one tailed) = 0.05).

Multivariate encoding model
Next, taking a complementary approach, we built an encoding model that mapped transition probabilities (in the frame of reference p(state = heist|door presented) derived from the flexible learning model as before) flexibly onto voxels within the MTL ROI, separately for each context c_a. We then inverted this model to predict transition probabilities both for the same context c_a and for the other three (held out) contexts c_b where a ≠ b (see Figure 3e for a schematic of this analysis). This approach allowed us to train and test in cross-validation, by obtaining weights from session (scanner run) i and then using these to predict the probabilities for each context on session j. The model output was a 4 x 4 (context x context) matrix of predicted vs. true (model-derived) transition probabilities, which we compared via cross-entropy loss. This allowed us to measure whether, within the MTL, neural patterns coding for probabilities were more similar across contexts in the dependent condition (e.g. c_1 → c_2 and c_2 → c_1) than in the independent condition (e.g. c_3 → c_4 and c_4 → c_3). Unlike the RSA approach, this also allowed us to compare two different coding schemes. It could be the case that state associations are encoded in a high-dimensional format in which probabilities map onto bins with no input structure. This can be implemented via a one-hot input function in the encoding model, which also enables us to test various levels of granularity of binning, to verify that the RSA results were not specific to our choice of having 4 bins. Alternatively, it could be the case that probabilities are encoded in a low-dimensional format, whereby neural patterns are more similar for closer probabilities (e.g., bin 1 is more similar to bin 2 than to bin 4). This can be implemented via a Gaussian input function (effectively, a tuning curve for probability) in the encoding model. Probabilities were converted to odds ratios for this exercise (see Methods).
The results validated and extended those of the RSA. Using one-hot encoding of probability, we found stronger evidence of shared encoding of probability in the dependent condition compared to the independent condition. Furthermore, this effect was independent of the number of bins chosen, as long as there were more than 3 bins (Figure 3f). We obtained the most robust effects with ~6 bins (implying a psychologically plausible granularity to the estimation of transition probabilities), for which the cross-validated loss was substantially higher between contexts in the independent condition than those in the dependent condition (t(28) = -3.12, p = 0.002). When cross-validation was performed across sessions only, reconstructing both probabilities in the other context as well as in the same context using information from another session (e.g. c_1 session 1 → c_1 session 2), we found the same pattern of results (t(28) = -3.80, p < 0.001). Similar results were also obtained at different granularities. Interestingly, we were unable to recreate these effects under the additional constraint imposed by Gaussian encoding of probability ratios. This implies that whilst there is a consistent code for transition probabilities, its similarity structure does not map smoothly onto the one-dimensional axis given by probability.

Replication of results using anatomical ROI of the medial temporal lobe
To investigate whether the effects we observed were specific to our choice of functional ROI, we conducted a subsequent RSA analysis. This was exactly as described above (for between contexts), only this time we used voxels from a bilateral anatomical MTL mask comprising 4 subregions of the MTL: hippocampus, parahippocampus, entorhinal cortex and amygdala. Replicating the effects we observed in our functional ROI, this revealed a significant condition (dependent, independent) x quartile (diagonal, off-diagonal) interaction (t(28) = 5.23, p<.001, 95% CI [.11,.25], paired sample t-test, Figure 4a). This was the result of a difference in similarity between on and off diagonal scores in the dependent condition (t(28) = 6.12, p<.001, 95% CI [.10, .21], one sample t-test versus 0) which was absent in the independent condition (t(28) = -0.96, p =.345, 95% CI [-.07, .03]). We also observed the same pattern of results (i.e. cross-validated loss substantially higher between contexts in the independent condition than those in the dependent condition) when rerunning the multivariate encoding model using this anatomical ROI in place of the functional ROI (4 bin case: t(28) = -2.85, p = 0.02, 6 bin case: t(28) = -3.23, p=0.002).
Characterising the nature of the effect in the medial temporal lobe
Next, to investigate whether the observed effects were specific to particular subregion(s) of the MTL, we conducted four further RSA analyses on voxels using separate anatomical masks for each of the 4 MTL subregions (hippocampus, parahippocampus, entorhinal cortex and amygdala). Fisher transformed similarity scores were then entered into a region by condition (dependent/independent) 4 x 2 repeated measures analysis of variance (ANOVA). This revealed a main effect of condition (F(1,28) = 29.40, p < 0.001) with the difference in similarity (M) between on and off diagonal scores greater in the dependent than independent condition (M hippocampus = .26, M parahippocampus = .13, M entorhinal cortex = .15, M amygdala = .16) as well as a region x condition interaction (F(2.45, 68.54) = 3.91, p = 0.018; Greenhouse-Geisser corrected).
To better understand the interaction, we proceeded to test the difference in similarity scores between conditions in each region with every other region (correcting for multiple comparisons). This revealed a larger difference between conditions in the hippocampus compared to each of the other 3 MTL subregions (entorhinal cortex, amygdala and parahippocampus, all p < 0.05, paired sample t-test) with the comparison against the parahippocampus surviving correction for multiple comparisons (t(28) = 3.91, p = 0.001, significant at Bonferroni-corrected threshold of p < 0.008). There was also a main effect of region (F(3, 84) = 3.33, p = 0.023), with the difference across both conditions being significantly greater in the amygdala than in both parahippocampus (t(28) = 3.07, p = 0.005) and entorhinal cortex (t(28) = 2.81, p = 0.009). Together, these results suggest that greater similarity in transition encoding in the dependent compared to the independent condition was not exclusive to a particular subregion of the MTL but was most pronounced in the hippocampus (see Figure 4a).
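The Bonferroni-corrected threshold quoted above is consistent with dividing a conventional alpha of 0.05 by the number of pairwise comparisons among the 4 subregions. The short Python sketch below reproduces that arithmetic; the alpha of 0.05 is an assumption, since the text reports only the corrected threshold.

```python
from itertools import combinations

regions = ["hippocampus", "parahippocampus", "entorhinal cortex", "amygdala"]
pairs = list(combinations(regions, 2))  # every region against every other region
alpha = 0.05  # assumed uncorrected threshold
bonferroni_threshold = alpha / len(pairs)
print(len(pairs), round(bonferroni_threshold, 4))  # 6 0.0083
```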

RSA whole-brain searchlight
Next, we repeated the same RSA as described above across the whole brain using a searchlight approach. In the dependent condition, this identified activity within our functional ROI (right peak: 22, -7, -18; t(28) = 4.31, family-wise error corrected at peak level within functional ROI mask, p FWE = 0.005; left peak: -24, -7, -18; t(28) = 4.92, family-wise error corrected at peak level within functional ROI mask, p FWE = 0.001). The cerebellum also survived family-wise error correction for multiple comparisons at the cluster level (cluster-defining threshold p < 0.001, uncorrected). We did not find any evidence for differences in similarity in or outside our functional ROI in the independent condition, even at very lenient thresholds (p < 0.01 uncorrected).

Multivariate analysis during transition probability updating
The analyses described so far focus on the timepoint when planning takes place (door presentation). What happens during updating? To examine this, we conducted a related analysis at the time of transition outcome, i.e. when participants learned whether, conditional on their choice, they had reached the "heist" or the "neutral" state. We reasoned that in order to update the state-action representations appropriately (in a shared or unshared manner across contexts) it would be necessary to re-encode both the selected action (light vs. dark door) and encountered state (heist vs. neutral). We thus partitioned data according to these factors and investigated whether BOLD signals were more similar when both state and action matched (i.e. on-diagonal elements) vs where they did not (off-diagonal elements) at the time of updating, separately for the dependent and independent conditions (Figure 5a) in our functional ROI. This analysis also revealed a significant condition x diagonal interaction (t (28)  Once again, this effect was not specific to a particular subregion of the MTL. We entered similarity scores from RSAs conducted in subregions of the MTL into a condition (dependent, independent) by region (hippocampus, parahippocampus, entorhinal cortex, amygdala) ANOVA. This revealed a main effect of condition (F(1,27) = 10.05, p = 0.004). There was no main effect of region (F(3,81) = 0.95, p = 0.42), nor a region by condition interaction (F(3,81) = 1.21, p = 0.31). There was no difference between conditions in two control brain regions (V1: t (28) -.02,.15]). 
Finally, a whole brain searchlight comparing the difference in similarity scores between on and off diagonal between conditions revealed a significant interaction within our state prediction error ROI (left: [-16, -4, -32], t(28) = 3.47, p = 0.02 cluster level corrected within our SPE mask), as well as in a cluster comprised of right hippocampus extending into pons (t(28) = 5.05, p < 0.0001 uncorrected, [12, 18, -18], p = 0.04 cluster level corrected across the whole brain with cluster-forming threshold of p < .001 uncorrected). No other significant effects were observed.

Outcome valence modulates updating of state transitions
An interesting feature of our design is that the transition function changes (with reversals of p) in a way that is unrelated to outcomes. This means that in theory, any learning about the transition function should not depend on whether the outcome was positive or negative.
To test whether participants might be biased to update the transition function more or less according to the outcome, we calculated a consistency score (see Methods) for each participant. This measured the consistency of each choice given transitions experienced on the previous trial. A high consistency score indicates that a participant updates transitions strongly on the basis of feedback. This was calculated separately for trials in which participants received a positive outcome (+1 on a gain trial or 0 on a loss trial) and those where they received a negative outcome (-1 on a loss trial or 0 on a gain trial) on the previous trial. Notably, this is not the same as a win-stay, lose-switch bias, as a choice would be considered consistent only if it took into account both past transitions and the current reward/loss incurred when reaching the heist state (i.e. if choosing the dark door on trial t-1 had resulted in monetary gain, but the current trial t was a police trial (monetary loss), the consistent choice would be to choose the light door on trial t).
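One way to read this definition is sketched below in Python. It is a hedged illustration rather than the scoring procedure given in the Methods. It assumes two doors ("dark", "light") and two second-stage states ("heist", "neutral"), and it assumes that a door which has just led to the neutral state implies the other door currently leads to the heist state. It also assumes that consecutive entries in `trials` are the trial pairs to be scored. All names and dictionary keys are hypothetical.

```python
def consistent_choice(prev_choice, prev_state, current_trial_type):
    """Door a transition-updating participant would pick on the current trial."""
    other = "light" if prev_choice == "dark" else "dark"
    # Door most recently associated with the heist state
    heist_door = prev_choice if prev_state == "heist" else other
    # Seek the heist state on gain trials, avoid it on loss trials
    if current_trial_type == "gain":
        return heist_door
    return "light" if heist_door == "dark" else "dark"

def prev_outcome_valence(prev_state, prev_trial_type):
    """Positive: +1 on a gain trial or 0 on a loss trial; negative otherwise."""
    if prev_trial_type == "gain":
        return "positive" if prev_state == "heist" else "negative"
    return "negative" if prev_state == "heist" else "positive"

def consistency_scores(trials):
    """Proportion of consistent choices, split by previous-trial outcome valence.
    Each trial is a dict with keys 'choice', 'state' and 'trial_type'."""
    counts = {"positive": [0, 0], "negative": [0, 0]}  # [consistent, total]
    for prev, cur in zip(trials, trials[1:]):
        valence = prev_outcome_valence(prev["state"], prev["trial_type"])
        expected = consistent_choice(prev["choice"], prev["state"], cur["trial_type"])
        counts[valence][0] += cur["choice"] == expected
        counts[valence][1] += 1
    return {v: (c / n if n else float("nan")) for v, (c, n) in counts.items()}
```

The example in the text corresponds to `consistent_choice("dark", "heist", "loss")`, which returns `"light"` under these assumptions. A pure win-stay, lose-switch rule would instead repeat the dark door.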
We first conducted this analysis in a separate behavioural experiment (described as "pilot" above; n = 31, see Methods). This experiment included exclusively free choice trials, giving us greater power to be able to detect valence effects. In this version of the task, participants were not told about any structure between contexts and integrated information from each context in each condition.
Participants integrated evidence from the other context in the dependent condition (1-4 trials back) in this dataset but also did so in the independent condition. We therefore remain agnostic as to whether participants adjusted how they integrated feedback from state transitions between the two conditions and primarily use this dataset to examine how outcome interacts with learning the state transitions. This revealed that transition updating was greater following positive compared to negative outcomes (t(30) = 9.79, p < 0.001, paired sample t-test, Figure 6a). In other words, participants updated state transition knowledge and adjusted their subsequent behaviour to a greater degree when outcomes were positive compared to negative. Note that this analysis collapses over contexts (see Methods). However, the main effect remains (t(30) = 8.14, p < 0.001) when we run this same analysis restricted to the dependent condition.
Next, we ran the same analysis on our fMRI participants (restricted to free choice trials, Figure 6b). We again observed a main effect of outcome with greater updating following positive compared to negative outcomes (t(28) = 2.24, p = 0.03). This effect remained when analysis was restricted to trials from the dependent condition (t(28) = 2.60, p = 0.015).
In two data sets, our participants' behaviour suggested that rewards received influenced the degree to which the transition function was updated, with a greater update following positive compared to negative outcomes. If this is the case, we would predict that SPE signals in the MTL, which drive updates to the transition function, ought to be larger following positive outcomes compared to negative. To test this, we examined the interaction of the unsigned state prediction error regressor and outcome in a univariate whole-brain analysis (controlling for the main effects of each, see Methods). This revealed a negative effect in a cluster in the left MTL (peak [x,y,z]: -13, -10, -18, p = 0.018 whole brain FWE cluster level corrected after thresholding at p < 0.001) which included voxels within our functional ROI (peak [x,y,z]: -16, -7, -18, t(28) = 4.08, small volume corrected using functional ROI mask). No other regions survived whole brain correction. Note that since the main effect of SPE is also negative (Figure 3a), the sign of this interaction suggests a greater parametric effect of unsigned SPEs in the MTL following positive versus negative outcomes.
Finally, we examined whether there was a relationship between this interaction effect in the MTL (i.e., the degree to which unsigned SPEs were modulated by outcome valence) and participants' behaviour (the degree to which consistency scores were greater for positive outcomes compared to negative outcomes). We quantified each participant's behavioural outcome valence effect (Figure 6b) by taking the difference in consistency scores between positive and negative outcomes and correlated these with each participant's parametric SPE x outcome interaction beta (this quantifies the degree to which the parametric effect of unsigned SPEs is modulated by outcome). This revealed a negative correlation that was robust to outliers (Spearman's rho = -0.41, p < 0.03); specifically, the more participants showed a bias towards integrating information following transition sequences that ended in a positive outcome (versus negative) in their choices, the greater the extent to which unsigned SPEs expressed in the MTL were greater for higher (versus lower) outcomes.
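A rank-based correlation of this kind can be sketched in a few lines of Python. The implementation below is a generic Spearman coefficient (Pearson correlation on average ranks); it is not the analysis code used here, and the function names are hypothetical. Because only ranks enter the calculation, a single extreme value cannot dominate the estimate, which is the sense in which the statistic is robust to outliers.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation computed on the ranks of x and y."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = math.sqrt(sum((a - mx) ** 2 for a in rx))
    sy = math.sqrt(sum((b - my) ** 2 for b in ry))
    return cov / (sx * sy)
```

In practice `x` would hold one behavioural difference score per participant and `y` the corresponding interaction beta.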

Discussion
We studied the neural and computational mechanisms by which humans combine or segment information about the transition structure of the world. For the fMRI experiment, we chose to directly instruct our participants, as our hypothesis was agnostic to whether model sharing occurred because of instruction or trial-and-error learning. Consequently, our computational model is an analytic tool and does not offer a process-level account of model sharing.
Previous structure learning tasks have suggested that participants are able to use the similarity of latent variables such as value estimates, prediction errors and their covariation over time to draw links between different contexts (Wunderlich et al., 2011, Acuña and Schrater, 2010). Bayesian inference models in which latent causes are inferred and used to group together experiences (Gershman and Niv, 2010, Gershman et al., 2015, Sanders et al., 2020, Niv, 2019) could also be a means by which participants learn to group different contexts together in practice. Another possibility is that neural geometry actively represents the relational organisation of task elements (Bernardi et al., 2020, Luyckx et al., 2019, Sheahan et al., 2021). However, our encoding model did not find a smoothly-varying relationship between neural coding and probability, which seems like it would follow naturally from such a representation.
To address the question of how neural population activity encoded transition probabilities within and between contexts, we began by identifying voxels that responded differentially according to whether a transition between states was expected or unexpected, using estimates of state prediction errors (SPEs) to parameterise this effect. We found that SPEs correlated with BOLD responses in brain regions that overlapped with the bilateral hippocampus (Figure 3a). Of note, the parametric effect of SPEs in the MTL was negative. This might seem surprising given the past implication of the (more anterior) hippocampus in novel or surprising stimuli (Strange et al., 2005). We speculate that such a signal might occur, however, if internal representations were strengthened following evidence that confirms prior beliefs.
We first focussed our analysis on voxels in this region and measured the consistency in neural patterns in these voxels in encoding transition probabilities between conditions. First, we adopted an RSA-based method to show that encoding of probability was more similar across contexts in the dependent condition than the independent condition. We caveat that we were not able to decode probabilities within context using this same approach and caution that this challenges the robustness of this analysis. It may be that there is a confound that gives rise to between context decoding that leaves within context decoding unaffected, or that the identical gem features present when decoding within gem (e.g., colour, shape of each gem) make decoding transition probabilities more difficult or better suited to a different RSA analysis than we use for the between context case (e.g., dividing probability into a different number of bins rather than 4 quartiles). Other explanations could also give rise to this discrepancy. Comparing the difference in within and between context decoding between conditions yielded a pattern (albeit weakly and with due caution in respect to the possibility of a type I error) of results in line with what we would expect. Namely, that the difference in within versus between context decoding was stronger in the independent compared to the dependent condition, consistent with a switch between context specific and context nonspecific models.
Next we adopted a more flexible encoding modelling pipeline as a complementary multivariate approach. This told a similar story: that the brain learned a representation that was similar across contexts when this was beneficial, but partitioned probability encoding into different patterns when it was necessary to disambiguate the predictions for different contexts. The encoding model also enabled us to examine the pattern of results under two different coding schemes: a Gaussian input function and a one-hot input function. Interestingly, whilst the one-hot input function replicated the pattern of the RSA results and was robust to a range of different probability bins being used, the Gaussian input function did not. We are not entirely clear about why this is the case. Previous theories have emphasised that neural populations in cortex may encode probability distributions in smoothly-varying ways, permitting forms of function approximation or Bayesian inference (Ma et al., 2006, Orhan and Ma, 2017), and there is even some support for this class of theory from studies involving BOLD recordings (Van Bergen et al., 2015). However, the nature of the coding scheme for transition probabilities in hippocampus remains unclear. Future work could potentially develop this encoding model approach to examine whether other task variables influence the representational structure encoded in the hippocampal formation (such as the level of uncertainty in beliefs, priors, anticipatory (state) prediction errors and the degree to which predictions diverge for different actions).
Observing these effects in the MTL is consistent with past findings that have identified the involvement of the MTL in learning state associations (Eichenbaum et al., 1999, Miyashita, 1988, Yokose et al., 2017, Rey et al., 2018, Schapiro et al., 2012, Schapiro et al., 2013, Deuker et al., 2016, Garvert et al., 2017), encoding relational knowledge that can be used to generalise and draw inferences across contexts (Bunsey and Eichenbaum, 1996, Wimmer and Shohamy, 2012, Zeithamova et al., 2012, Kumaran et al., 2016, Koster et al., 2018, Park et al., 2019) and its role in model based planning (Bradfield et al., 2020), including in similar two stage sequential planning tasks (Vikbladh et al., 2019, Miller et al., 2017), potentially via representation of task structure (Geerts et al., 2020). Our initial analysis focused on a region that included different subregions of the MTL. But when we repeated our RSA approach separately in 4 different anatomical subregions of the MTL (hippocampus, entorhinal cortex, amygdala and parahippocampus) we found a significant effect in each of these (an effect which was absent in two control regions). This suggests that a network of MTL regions is involved in encoding the predictive relationships between states necessary for planning, consistent with past findings using a similar paradigm to ours (Boorman et al., 2016), and that each component in this network has the capacity to flexibly adapt the representations it uses to facilitate the sharing of models between contexts when prudent to do so. The involvement of a number of subregions might account for why disabling a specific part of the MTL does not always lead to reductions in goal directed behaviour (Corbit and Balleine, 2000, Gaskin et al., 2005).
Interestingly, the effect we observed was strongest in bilateral hippocampus, in line with its involvement in modulating pattern separation between contexts and memories via inputs from other MTL brain regions including the entorhinal cortex (Yassa and Stark, 2011). However, future work, ideally with higher resolution fMRI or direct recordings, is needed to help characterise the precise functional contribution of each of these subregions.
We also examined whether there were other regions of the brain in which representations had a similar selective pattern similarity between contexts by running a whole brain searchlight analysis. In addition to confirming the involvement of the MTL, this detected a strong effect in the dorsal striatum and the left IFG. This analysis was exploratory and neither of these brain regions was hypothesised to be involved from the outset. Whilst the IFG and adjacent OFC have previously been shown to be involved in inferring task states using fMRI multivariate approaches (Niv, 2019, Schuck et al., 2016), the striatum was particularly unexpected given its well established role in model free learning (Geerts et al., 2020, Joel et al., 2002, Montague et al., 1996, O'Doherty et al., 2004), although (and with the necessary caveats with regard to retrospective inference) there is some evidence from fMRI and lesion studies that the dorsal striatum, along with prefrontal cortex (O'Doherty, 2010, Niv, 2009), may also play an important role in model based planning behaviour (Yin et al., 2005a, Yin et al., 2005b). Exactly what functional role either region fulfils in the service of our task, though, is unclear.
Examining participants' stay/switch behaviour revealed an effect of valence whereby, following positive outcomes, participants updated transition probabilities to a greater degree than following negative outcomes. We note that these findings are unlikely to be accounted for by purely model free state-action learning since our task and updating metric include cases where participants should (if using model based control and updating the transition function) repeat choices following negative outcomes and switch choices following positive outcomes. These cases would cancel out the effect of valence we actually observe in the data under a model free controller (which would repeat following positive and switch following negative outcomes).
An effect of valence on updating was also observed in the fMRI data which revealed a greater parametric effect of SPEs for positive outcomes relative to negative in the MTL. Interestingly, this pattern of asymmetric updating is reminiscent of confirmation bias (Nickerson, 1998), a recent account of which (Lefebvre et al., 2020) has shown that this learning asymmetry can in fact be beneficial by driving apart the difference in value between the different options. Future theoretical work may help shed light on whether a similar normative account exists behind the asymmetry we observe here in planning.
Together these results shed important light on the computational processes by which the MTL maintains and adapts knowledge about the consequences of our choices and actions in the world. By relying on a common representational code, knowledge can be shared across different contexts that we interact with.

Significance Statement
Effective planning involves maintaining an accurate model of which actions take us to which locations. But in a world awash with information, mapping actions to states with the right level of complexity is critical. Using a new decision-making "heist task" in conjunction with computational modelling and fMRI we show that patterns of BOLD responses in the medial temporal lobe, a brain region key for prospective planning, become less sensitive to the presence of visual features when these are irrelevant to the task at hand. By flexibly adapting the complexity of task state representations in this way, state-action mappings learned under one set of features can be used to plan in the presence of others.
(a) Trial sequence in the fMRI experiment. Each trial begins with a fixation cross after which participants are shown one of two options (a dark and a light door) along with one of 4 contextual cues (red gem in the example) and two stimuli (sofa plus either police or swag) indicating the outcome if they transition to the heist state (if police shown: -1, if swag shown: +1) or the neutral state (always sofa = 0). In forced choice trials (75% of trials), participants are then required to select this option via a button press. In free choice trials (25% of trials) they can choose between this option and the alternate option. Participants were instructed to respond when an X appeared on one or both doors. Feedback (the subsequent state along with the outcome) is then revealed. (b) State transition dynamics: at the first stage, each option (framed as two doors) transitions participants to one of two 2nd level states: a "neutral state" in which an outcome of 0 (sofa/chair stimuli) is always obtained, or a "heist state" in which an outcome of 1 (swag bag stimuli) or -1 (police stimuli) can be obtained (note which of these two outcomes will be obtained at the heist state is signalled to participants in advance by the presence of one of these cues at choice). One first stage option (light door in the figure) transitions with probability p to the neutral
Parameter estimates predicting choice from state transitions experienced 1-5 trials back, separated according to whether transitions occurred in the same (blue) or alternate (red) context to the current trial context in (a) the dependent condition and (b) the independent condition. Bars represent fixed effects regression coefficients from a mixed effects logistic regression on participants' choices. Triangles represent the mean fixed effects regression coefficient estimates generated via the same mixed effects logistic regression as the data but for choices simulated for agents under a flexible computational learning model which enables evidence integration to adapt to the condition in which choices are being made (dependent or independent). (c, d) The same parameter coefficients, now with choices from simulated agents under the fixed learning model (diamonds), which does not permit adaptation in evidence integration. *p < 0.05 (human data). Error bars express 95% confidence intervals.
(a) The magnitude of (unsigned) state prediction errors related negatively to the degree of BOLD response in bilateral MTL. Image shown at p < 0.001 uncorrected. (b) Voxels in this contrast were converted to a bilateral mask and used as a functional ROI in subsequent analysis. (c) Schematic of the RSA analysis at the time of planning (door presentation). In each context, trials were divided into quartiles, according to participants' current estimates of p (heist state | door presented) extracted from the computational learning model (mean quartile ranges: bin 1: p <= 0.21, bin 2: 0.21 < p <= 0.51, bin 3: 0.51 < p <= 0.80; bin 4: 0.80 < p <= 0.96). (d) Difference scores were significantly greater for dependent than independent blocks. Dots represent individual participant data, grey lines indicate datapoints belonging to the same participant. Red line indicates the median, box represents the 25th and 75th percentiles of the data, whiskers extend to any data point that is not outside 1.5 times the interquartile range. (e) Schematic of the encoding model analysis (example shown for one-hot case). (f) Difference in cross-entropy loss from the encoding model between dependent and independent blocks (predicting probability bins in one context in a condition using weights trained on the other context in that condition; in cross-validation) for a range of probability bins (one-hot case). Error bars show standard error of the mean; *significant at p < 0.05, **significant at p < 0.01, ***significant at p < 0.001.
Figure 3c was repeated using (a) anatomical mask of the entire MTL and (b) subregions of the MTL, specifically: bilateral hippocampus, parahippocampus, entorhinal cortex and amygdala, (c) as well as in V1 and M1 (control regions). (d) Illustration of the whole brain searchlight interaction analysis; the difference between on-diagonal and off-diagonal similarity was contrasted between conditions. (e) Whole-brain searchlight interaction analysis revealed greater similarity between on versus off diagonal in the dependent condition compared to the independent condition in our functional ROI, right dorsal striatum (top panel) and left inferior frontal gyrus/OFC (bottom panel). Brain images shown at p < 0.001 uncorrected, thresholded at t > 3. Error bars show standard error of the mean; *significant at p < 0.05, **significant at p < 0.01, ***significant at p < 0.001.
Outcome on the previous trial influenced the degree to which transition knowledge was updated. Specifically, when participants received a positive outcome (+1 on gain trials or 0 on loss trials) consistency scores (indexed as percentage repeat choices for desirable trials and percentage switch choices for undesirable trials) were higher compared to when they received a negative outcome. This was observed in (a) participants that completed a behavioural study outside the scanner and (b) our fMRI cohort. (c) Unsigned state prediction errors were modulated by outcome valence in the MTL (peak [x,y,z]: -13, -10, -18, t(28) = 4.87, p = 0.018 FWE whole brain cluster level corrected); image displayed at p < 0.001 uncorrected. (d) The magnitude of the valence effect observed behaviourally (quantified as green minus red in (b)) correlated with the size of the interaction betas observed in the fMRI