diff --git a/.gitignore b/.gitignore index fb60c22..1fab14d 100644 --- a/.gitignore +++ b/.gitignore @@ -28,8 +28,9 @@ _region_.tex -#docx temp files +#docx and xlsx temp files *~$*.docx +*~$*.xlsx # emacs lockfiles *.#* @@ -39,3 +40,5 @@ paper/figs/tmp # huge (>11GB) directory of fMRI data from Chen et al. (2017) data/processed/fMRI + +paper/.failed-tests.txt \ No newline at end of file diff --git a/README.md b/README.md index 79db7bf..906ae23 100644 --- a/README.md +++ b/README.md @@ -1,23 +1,37 @@ -# Geometric models reveal behavioral and neural signatures of transforming naturalistic experiences into episodic memories +# Geometric models reveal behavioural and neural signatures of transforming naturalistic experiences into episodic memories -This repository contains data and code used to produce the paper "[_Geometric models reveal behavioral and neural signatures of transforming naturalistic experiences into episodic memories_](https://www.biorxiv.org/content/10.1101/409987v3)" by Andrew C. Heusser, Paxton C. Fitzpatrick, and Jeremy R. Manning. The repository is organized as follows: +

+ Paper (Nature Human Behaviour) + + + Paper (Direct PDF link) + + + Behind the Paper (blog post) + +

-``` +This repository contains data and code used to produce the paper "[_Geometric models reveal behavioural and neural signatures of transforming naturalistic experiences into episodic memories_](https://rdcu.be/cfsYs)" by Andrew C. Heusser, Paxton C. Fitzpatrick, and Jeremy R. Manning. + +The repository is organized as follows: + +```yaml root -├── code : all code used in the paper +├── code : all analysis code used in the paper │ ├── notebooks : Jupyter notebooks for paper analyses -│ ├── scripts : python scripts used to perform various analyses on a cluster -│ │ ├── embedding : scripts used to optimize the UMAP embedding for the trajectory figure -│ │ └── searchlights : scripts used to perform the brain searchlight analyses -│ └── sherlock_helpers : package with assorted helper functions and variables for analyses -├── data : all data used in the paper -│ └── raw : raw data before processing +│ ├── scripts : Python scripts for running analyses on an HPC cluster (Moab/TORQUE) +│ │ ├── embedding : scripts for optimizing the UMAP embedding for the trajectory figure +│ │ └── searchlights : scripts for performing the brain searchlight analyses +│ └── sherlock_helpers : Python package with support code for analyses +├── data : all data analyzed in the paper +│ └── raw : raw video annotations and recall transcripts +│ └── processed : all processed data └── paper : all files to generate paper - └── figs : pdf copies of each figure + └── figs : pdf copies of all figures ``` -We also include a Dockerfile to reproduce our computational environment. Instruction for use are below (copied and modified from the [MIND](https://github.com/Summer-MIND/mind-tools) repo): +We also include a `Dockerfile` to reproduce our computational environment. Instructions for use are below (copied and modified from the [MIND](https://github.com/Summer-MIND/mind-tools) repo): ## One time setup 1. Install Docker on your computer using the appropriate guide below: @@ -26,7 +40,7 @@ We also include a Dockerfile to reproduce our computational environment. Instruc - [Ubuntu](https://docs.docker.com/engine/installation/linux/docker-ce/ubuntu/) - [Debian](https://docs.docker.com/engine/installation/linux/docker-ce/debian/) 2. Launch Docker and adjust the preferences to allocate sufficient resources (e.g. >= 4GB RAM) -3. To build the Docker image, open a terminal window, navigate to your local copy of the repo, and enter `docker build -t sherlock .` +3. To build the Docker image, open a terminal window, navigate to your local copy of the repo, and run `docker build -t sherlock .` 4. Use the image to run a container with the repo mounted as a volume so the code and data are accessible. - The command below will create a new container that maps the repository on your computer to the `/mnt` directory within the container, so that location is shared between your host OS and the container. Be sure to replace `LOCAL/REPO/PATH` with the path to the cloned repository on your own computer (you can get this by navigating to the repository in the terminal and typing `pwd`). 
The below command will also share port `9999` with your host computer, so any Jupyter notebooks launched from *within* the container will be accessible at `localhost:9999` in your web browser - `docker run -it -p 9999:9999 --name Sherlock -v /LOCAL/REPO/PATH:/mnt sherlock ` diff --git a/code/notebooks/main/searchlight_analyses.ipynb b/code/notebooks/main/searchlight_analyses.ipynb index 15ee3aa..6017a71 100644 --- a/code/notebooks/main/searchlight_analyses.ipynb +++ b/code/notebooks/main/searchlight_analyses.ipynb @@ -4,7 +4,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "This notebook replicates the brain-related analyses. Note: the fMRI data can be downloaded into the data folder from here: https://dataspace.princeton.edu/jspui/handle/88435/dsp01nz8062179" + "This notebook loads the output of the searchlight analyses run using the scripts in [`code/scripts/searchlights`](https://github.com/ContextLab/sherlock-topic-model-paper/tree/master/code/scripts/searchlights). The fMRI data used in the searchlight analyses can be downloaded using the script at [`code/scripts/download_neural_data.sh`](https://github.com/ContextLab/sherlock-topic-model-paper/blob/master/code/scripts/download_neural_data.sh)." ] }, { diff --git a/code/notebooks/main/searchlight_figs.ipynb b/code/notebooks/main/searchlight_figs.ipynb index a9aa2fc..93315f4 100644 --- a/code/notebooks/main/searchlight_figs.ipynb +++ b/code/notebooks/main/searchlight_figs.ipynb @@ -4,7 +4,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Note: This notebook loads in some data from the fMRI dataset collected by Chen et al. (2017). If you want to run this notebook, you can download the dataset using script at code/helpers/download_neural_data.sh." + "Note: This notebook loads in some data from the fMRI dataset collected by Chen et al. (2017). If you want to run this notebook, you can download the dataset using the script at [`code/scripts/download_neural_data.sh`](https://github.com/ContextLab/sherlock-topic-model-paper/blob/master/code/scripts/download_neural_data.sh)." 
] }, { diff --git a/paper/admin/NHB Reviewer responses (rev 2).pdf b/paper/admin/NHB Reviewer responses (rev 2).pdf index 251a27d..dea3146 100644 Binary files a/paper/admin/NHB Reviewer responses (rev 2).pdf and b/paper/admin/NHB Reviewer responses (rev 2).pdf differ diff --git a/paper/admin/NHB cover letter (rev 3).pdf b/paper/admin/NHB cover letter (rev 3).pdf new file mode 100644 index 0000000..9798f1d Binary files /dev/null and b/paper/admin/NHB cover letter (rev 3).pdf differ diff --git a/paper/admin/formatting_checklist.docx b/paper/admin/formatting_checklist.docx index 99ab700..40d724c 100644 Binary files a/paper/admin/formatting_checklist.docx and b/paper/admin/formatting_checklist.docx differ diff --git a/paper/admin/inventory_supporting_information.docx b/paper/admin/inventory_supporting_information.docx new file mode 100644 index 0000000..3c5639d Binary files /dev/null and b/paper/admin/inventory_supporting_information.docx differ diff --git a/paper/admin/license_to_publish.pdf b/paper/admin/license_to_publish.pdf new file mode 100644 index 0000000..7e04daf Binary files /dev/null and b/paper/admin/license_to_publish.pdf differ diff --git a/paper/admin/nr-reporting-summary.pdf b/paper/admin/nr-reporting-summary.pdf index fb93ffb..c6a6797 100644 Binary files a/paper/admin/nr-reporting-summary.pdf and b/paper/admin/nr-reporting-summary.pdf differ diff --git a/paper/admin/snl-ltp.docx b/paper/admin/snl-ltp.docx new file mode 100644 index 0000000..500e60a Binary files /dev/null and b/paper/admin/snl-ltp.docx differ diff --git a/paper/admin/third-party-rights.doc b/paper/admin/third-party-rights.doc new file mode 100644 index 0000000..e621c7f Binary files /dev/null and b/paper/admin/third-party-rights.doc differ diff --git a/paper/diff.pdf b/paper/diff.pdf index 1d79239..31f1665 100644 Binary files a/paper/diff.pdf and b/paper/diff.pdf differ diff --git a/paper/diff.tex b/paper/diff.tex index 2929ed5..dd66ac2 100644 --- a/paper/diff.tex +++ b/paper/diff.tex @@ -1,16 +1,12 @@ \documentclass[10pt]{article} %DIF LATEXDIFF DIFFERENCE FILE -%DIF DEL old.tex Wed Oct 14 13:36:56 2020 -%DIF ADD main.tex Fri Oct 16 17:18:25 2020 +%DIF DEL old.tex Mon Nov 23 15:30:14 2020 +%DIF ADD main.tex Thu Nov 26 01:40:48 2020 \usepackage[utf8]{inputenc} \usepackage[english]{babel} \usepackage[font=small,labelfont=bf]{caption} \usepackage{geometry} -%DIF 6c6 -%DIF < \usepackage[sort]{natbib} -%DIF ------- -\usepackage[sort&compress, numbers, super]{natbib} %DIF > -%DIF ------- +\usepackage[sort&compress, numbers, super]{natbib} \usepackage{pxfonts} \usepackage{graphicx} \usepackage{newfloat} @@ -18,38 +14,41 @@ \usepackage{hyperref} \usepackage{lineno} \usepackage{placeins} -%DIF 14a14-15 -\usepackage[nofiglist, fighead]{endfloat} %DIF > -\renewcommand{\includegraphics}[2][]{} %DIF > +%DIF 14c14 +%DIF < \usepackage[nofiglist, fighead]{endfloat} +%DIF ------- +\usepackage[nofiglist, nomarkers, fighead]{endfloat} %DIF > %DIF ------- +\renewcommand{\includegraphics}[2][]{} \newcommand{\argmax}{\mathop{\mathrm{argmax}}\limits} -%DIF 17-22c19-25 -%DIF < \newcommand{\topicopt}{S1} -%DIF < \newcommand{\topics}{S2} -%DIF < \newcommand{\featureimportance}{S3} -%DIF < \newcommand{\corrmats}{S4} -%DIF < \newcommand{\matchmats}{S5} -%DIF < \newcommand{\kopt}{S6} -%DIF ------- -\newcommand{\topicopt}{1} %DIF > -\newcommand{\topics}{2} %DIF > -\newcommand{\featureimportance}{3} %DIF > -\newcommand{\corrmats}{5} %DIF > -\newcommand{\matchmats}{6} %DIF > -\newcommand{\kopt}{7} %DIF > -\newcommand{\arrows}{4} %DIF > 
+\newcommand{\topicopt}{1} +\newcommand{\topics}{2} +\newcommand{\featureimportance}{3} +%DIF 22-25d22 +%DIF < \newcommand{\corrmats}{5} +%DIF < \newcommand{\matchmats}{6} +%DIF < \newcommand{\kopt}{7} +%DIF < \newcommand{\arrows}{4} %DIF ------- +%DIF 27a23-28 +\newcommand{\arrows}{1} %DIF > +\newcommand{\corrmats}{2} %DIF > +\newcommand{\matchmats}{3} %DIF > +\newcommand{\kopt}{4} %DIF > + %DIF > + %DIF > +%DIF ------- \doublespacing \linenumbers -\title{Geometric models reveal behavioral and neural signatures of transforming \DIFdelbegin \DIFdel{naturalistic experiences into episodic }\DIFdelend \DIFaddbegin \DIFadd{experiences into }\DIFaddend memories} +\title{Geometric models reveal behavioral and neural signatures of transforming experiences into memories} -\author{Andrew C. Heusser\textsuperscript{1, 2, \textdagger}, Paxton C. Fitzpatrick\textsuperscript{1, \textdagger}, and Jeremy R. Manning\textsuperscript{1, *}\\\textsuperscript{1}Department of Psychological and Brain Sciences\\Dartmouth College, Hanover, NH 03755, USA\\\textsuperscript{2}Akili Interactive Labs\\Boston, MA 02110\DIFaddbegin \DIFadd{, USA}\DIFaddend \\\textsuperscript{\textdagger}Denotes equal contribution\\\textsuperscript{*}Corresponding author: Jeremy.R.Manning@Dartmouth.edu} -%DIF < -%DIF < \bibliographystyle{apa} +\author{Andrew C. Heusser\textsuperscript{1, 2, \textdagger}, Paxton C. Fitzpatrick\textsuperscript{1, \textdagger}, and Jeremy R. Manning\textsuperscript{1, *}\\\textsuperscript{1}Department of Psychological and Brain Sciences\\Dartmouth College, Hanover, NH 03755, USA\\\textsuperscript{2}Akili Interactive Labs\\Boston, MA 02110, USA\\\textsuperscript{\textdagger}Denotes equal contribution\\\textsuperscript{*}Corresponding author: Jeremy.R.Manning@Dartmouth.edu} + %DIF > +\date{} %DIF > %DIF PREAMBLE EXTENSION ADDED BY LATEXDIFF %DIF UNDERLINE PREAMBLE %DIF PREAMBLE \RequirePackage[normalem]{ulem} %DIF PREAMBLE @@ -61,8 +60,6 @@ \providecommand{\DIFaddend}{} %DIF PREAMBLE \providecommand{\DIFdelbegin}{} %DIF PREAMBLE \providecommand{\DIFdelend}{} %DIF PREAMBLE -\providecommand{\DIFmodbegin}{} %DIF PREAMBLE -\providecommand{\DIFmodend}{} %DIF PREAMBLE %DIF FLOATSAFE PREAMBLE %DIF PREAMBLE \providecommand{\DIFaddFL}[1]{\DIFadd{#1}} %DIF PREAMBLE \providecommand{\DIFdelFL}[1]{\DIFdel{#1}} %DIF PREAMBLE @@ -114,22 +111,6 @@ \DeclareRobustCommand{\DIFaddendFL}{\DIFOaddendFL \let\includegraphics\DIFOincludegraphics} %DIF PREAMBLE \DeclareRobustCommand{\DIFdelbeginFL}{\DIFOdelbeginFL \let\includegraphics\DIFdelincludegraphics} %DIF PREAMBLE \DeclareRobustCommand{\DIFdelendFL}{\DIFOaddendFL \let\includegraphics\DIFOincludegraphics} %DIF PREAMBLE -%DIF LISTINGS PREAMBLE %DIF PREAMBLE -\RequirePackage{listings} %DIF PREAMBLE -\RequirePackage{color} %DIF PREAMBLE -\lstdefinelanguage{DIFcode}{ %DIF PREAMBLE -%DIF DIFCODE_UNDERLINE %DIF PREAMBLE - moredelim=[il][\color{red}\sout]{\%DIF\ <\ }, %DIF PREAMBLE - moredelim=[il][\color{blue}\uwave]{\%DIF\ >\ } %DIF PREAMBLE -} %DIF PREAMBLE -\lstdefinestyle{DIFverbatimstyle}{ %DIF PREAMBLE - language=DIFcode, %DIF PREAMBLE - basicstyle=\ttfamily, %DIF PREAMBLE - columns=fullflexible, %DIF PREAMBLE - keepspaces=true %DIF PREAMBLE -} %DIF PREAMBLE -\lstnewenvironment{DIFverbatim}{\lstset{style=DIFverbatimstyle}}{} %DIF PREAMBLE -\lstnewenvironment{DIFverbatim*}{\lstset{style=DIFverbatimstyle,showspaces=true}}{} %DIF PREAMBLE %DIF END PREAMBLE EXTENSION ADDED BY LATEXDIFF \begin{document} @@ -139,435 +120,201 @@ \end{titlepage} \begin{abstract} -The mental contexts in 
which we interpret experiences are often person-specific, even when the experiences themselves are shared. We developed a geometric framework for mathematically characterizing the subjective conceptual content of dynamic naturalistic experiences. We model experiences and memories as \DIFdelbegin \textit{\DIFdel{trajectories}} %DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{``trajectories'' }\DIFaddend through word embedding spaces whose coordinates reflect the universe of thoughts under consideration. Memory encoding can then be modeled as geometrically preserving or distorting the \DIFdelbegin \textit{\DIFdel{shape}} %DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{``shape'' }\DIFaddend of the original experience. We applied our approach to data collected as participants watched and verbally recounted a television episode while undergoing functional neuroimaging. Participants’ recountings all preserved coarse spatial properties (essential narrative elements), but not fine spatial scale (low-level) details, of the episode’s trajectory. We also identified networks of brain structures sensitive to these trajectory shapes. Our work provides insights into how we preserve and distort our ongoing experiences when we encode them into episodic memories. -\end{abstract} +\DIFaddbegin \DIFadd{How do we preserve and distort our ongoing experiences when encoding them into episodic memories? +}\DIFaddend The mental contexts in which we interpret experiences are often person-specific, even when the experiences themselves are shared. We developed a geometric framework for mathematically characterizing the subjective conceptual content of dynamic naturalistic experiences. We model experiences and memories as ``trajectories'' through word embedding spaces whose coordinates reflect the universe of thoughts under consideration. Memory encoding can then be modeled as geometrically preserving or distorting the ``shape'' of the original experience. We applied our approach to data collected as participants watched and verbally recounted a television episode while undergoing functional neuroimaging. Participants’ recountings all preserved coarse spatial properties (essential narrative elements), but not fine spatial scale (low-level) details, of the episode’s trajectory. We also identified networks of brain structures sensitive to these trajectory shapes. +\DIFdelbegin \DIFdel{Our work provides insights into how we preserve and distort our ongoing experiences when we encode them into episodic memories. +}\DIFdelend \end{abstract} \section*{Introduction} -What does it mean to \DIFdelbegin \textit{\DIFdel{remember}} %DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{remember }\DIFaddend something? In traditional episodic memory experiments \DIFdelbegin \DIFdel{\mbox{%DIFAUXCMD -\citep[e.g., list-learning or trial-based experiments;][]{Murd62a, Kaha96}}\hspace{0pt}%DIFAUXCMD -, }\DIFdelend \DIFaddbegin \DIFadd{(e.g., list-learning or trial-based experiments\mbox{%DIFAUXCMD -\citep{Murd62a, Kaha96}}\hspace{0pt}%DIFAUXCMD -), }\DIFaddend remembering is often cast as a discrete, binary operation: each studied item may be separated from the rest of one's experience and labeled as having been either recalled or forgotten. More nuanced studies might incorporate self-reported confidence measures as a proxy for memory strength, or ask participants to discriminate between recollecting the (contextual) details of an experience and having a general feeling of familiarity~\citep{Yone02}. 
Using well-controlled, trial-based experimental designs, the field has amassed a wealth of information regarding human episodic memory~\DIFdelbegin \DIFdel{\mbox{%DIFAUXCMD -\citep[for review see][]{Kaha12}}\hspace{0pt}%DIFAUXCMD -}\DIFdelend \DIFaddbegin \DIFadd{\mbox{%DIFAUXCMD -\citep{Kaha12}}\hspace{0pt}%DIFAUXCMD -}\DIFaddend . However, there are fundamental properties of the external world and our memories that trial-based experiments are not well suited to capture~\DIFdelbegin \DIFdel{\mbox{%DIFAUXCMD -\citep[for review, also see][]{KoriGold94, HukEtal18}}\hspace{0pt}%DIFAUXCMD -}\DIFdelend \DIFaddbegin \DIFadd{\mbox{%DIFAUXCMD -\citep{KoriGold94, HukEtal18}}\hspace{0pt}%DIFAUXCMD -}\DIFaddend . First, our experiences and memories are continuous, rather than discrete---isolating a naturalistic event from the context in which it occurs can substantially change its meaning. Second, whether or not the rememberer has precisely reproduced a specific set of words in describing a given experience is nearly orthogonal to how well they were actually able to remember it. In classic (e.g., list-learning) memory studies, by contrast, the number or proportion of \DIFdelbegin \textit{\DIFdel{exact}} %DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{exact }\DIFaddend recalls is often considered to be a primary metric for assessing the quality of participants' memories. Third, one might remember the essence (or a general summary) of an experience but forget (or neglect to recount) particular low-level details. Capturing the essence of what happened is often a main goal of recounting an episodic memory to a listener, whereas the inclusion of specific low-level details is often less pertinent. - -How might we formally characterize the \DIFdelbegin \textit{\DIFdel{essence}} %DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{``essence'' }\DIFaddend of an experience, and whether it has been recovered by the rememberer? And how might we distinguish an experience's overarching essence from its low-level details? One approach is to start by considering some fundamental properties of the dynamics of our experiences. Each given moment of an experience tends to derive meaning from surrounding moments, as well as from longer-range temporal associations~\citep{LernEtal11, Mann19, Mann20}. Therefore, the timecourse describing how an event unfolds is fundamental to its overall meaning. Further, this hierarchy formed by our subjective experiences at different timescales defines a \DIFdelbegin \textit{\DIFdel{context}} %DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{context }\DIFaddend for each new moment~\DIFdelbegin \DIFdel{\mbox{%DIFAUXCMD -\citep[e.g.,][]{HowaKaha02a, HowaEtal14}}\hspace{0pt}%DIFAUXCMD -}\DIFdelend \DIFaddbegin \DIFadd{\mbox{%DIFAUXCMD -\citep{HowaKaha02a, HowaEtal14}}\hspace{0pt}%DIFAUXCMD -}\DIFaddend , and plays an important role in how we interpret that moment and remember it later~\DIFdelbegin \DIFdel{\mbox{%DIFAUXCMD -\citep[for review see][]{MannEtal15, Mann20}}\hspace{0pt}%DIFAUXCMD -}\DIFdelend \DIFaddbegin \DIFadd{\mbox{%DIFAUXCMD -\citep{MannEtal15, Mann20}}\hspace{0pt}%DIFAUXCMD -}\DIFaddend . Our memory systems can leverage these associations to form predictions that help guide our behaviors~\citep{RangRitc12}. 
For example, as we navigate the world, the features of our subjective experiences tend to change gradually (e.g., the room or situation we find ourselves in at any given moment is strongly temporally autocorrelated), allowing us to form stable estimates of our current situation and behave accordingly~\citep{ZackEtal07, ZwaaRadv98}. - -Occasionally, this gradual drift of our ongoing experience is punctuated by sudden changes, or shifts \DIFdelbegin \DIFdel{~\mbox{%DIFAUXCMD -\citep[e.g., when we walk through a doorway;][]{RadvZack17}}\hspace{0pt}%DIFAUXCMD -}\DIFdelend \DIFaddbegin \DIFadd{(e.g., when we walk through a doorway~\mbox{%DIFAUXCMD -\citep{RadvZack17}}\hspace{0pt}%DIFAUXCMD -)}\DIFaddend . Prior research suggests that these sharp transitions (termed \DIFdelbegin \textit{\DIFdel{event boundaries}}%DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{``event boundaries''}\DIFaddend ) help to discretize our experiences (and their mental representations) into \DIFdelbegin \textit{\DIFdel{events}}%DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{``events''}\DIFaddend ~\citep{RadvZack17, BrunEtal18, HeusEtal18b, ClewDava17, EzzyDava11, DuBrDava13}. The interplay between the stable (within-event) and transient (across-event) temporal dynamics of an experience also provides a potential framework for transforming experiences into memories that distills those experiences down to their essences. For example, prior work has shown that event boundaries can influence how we learn sequences of items~\citep{HeusEtal18b, DuBrDava13}, navigate~\citep{BrunEtal18}, and remember and understand narratives~\citep{ZwaaRadv98, EzzyDava11}. This work also suggests a means of distinguishing the essence of an experience from its low-level details: The overall structure of events and event transitions reflects how the high-level experience unfolds (i.e., its essence), while subtler event-level properties reflect its low-level details. Prior research has also implicated a network of brain regions (including the hippocampus and the medial prefrontal cortex) in playing a critical role in transforming experiences into structured and consolidated memories ~\citep{TompDava17}. - -Here, we sought to examine how the temporal dynamics of a naturalistic experience were later reflected in participants' memories. We also sought to leverage the above conceptual insights into the distinctions between an experience's essence and its low-level details to build models that explicitly quantified these distinctions. We analyzed an open dataset that comprised behavioral and functional Magnetic Resonance Imaging (fMRI) data collected as participants viewed and then verbally recounted an episode of the BBC television show \textit{Sherlock}~\citep{ChenEtal17}. We developed a computational framework for characterizing the temporal dynamics of the moment-by-moment content of the episode and of participants' verbal recalls. Our framework uses topic modeling~\citep{BleiEtal03} to characterize the thematic conceptual (semantic) content present in each moment of the episode and recalls by projecting each moment into a word embedding space. We then use hidden Markov models~\citep{Rabi89, BaldEtal17} to discretize this evolving semantic content into events. In this way, we cast both naturalistic experiences and memories of those experiences as geometric \DIFdelbegin \textit{\DIFdel{trajectories}} %DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{``trajectories'' }\DIFaddend through word embedding space that describe how they evolve over time. 
Under this framework, successful remembering entails verbally traversing the content trajectory of the episode, thereby reproducing the shape (essence) of the original experience. Our framework captures the episode's essence in the sequence of geometric coordinates for its events, and its low-level details by examining its within-event geometric properties. - -Comparing the overall shapes of the topic trajectories for the episode and participants' recalls reveals which aspects of the episode's essence were preserved (or lost) in the translation into memory. We also develop two metrics for assessing participants' memories for low-level details: (1) the \DIFdelbegin \textit{\DIFdel{precision}} %DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{``precision'' }\DIFaddend with which a participant recounts details about each event, and (2) the \DIFdelbegin \textit{\DIFdel{distinctiveness}} %DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{``distinctiveness'' }\DIFaddend of their recall for each event, relative to other events. We examine how these metrics relate to overall memory performance as judged by third-party human annotators. We also compare and contrast our general approach to studying memory for naturalistic experiences with standard metrics for assessing performance on more traditional memory tasks, such as list-learning. Last, we leverage our framework to identify networks of brain structures whose responses (as participants watched the episode) reflected the temporal dynamics of the episode and/or how participants would later recount it. +What does it mean to remember something? In traditional episodic memory experiments (e.g., list-learning or trial-based experiments\citep{Murd62a, Kaha96}), remembering is often cast as a discrete, binary operation: each studied item may be separated from the rest of one's experience and labeled as having been either recalled or forgotten. More nuanced studies might incorporate self-reported confidence measures as a proxy for memory strength, or ask participants to discriminate between recollecting the (contextual) details of an experience and having a general feeling of familiarity~\citep{Yone02}. Using well-controlled, trial-based experimental designs, the field has amassed a wealth of information regarding human episodic memory~\citep{Kaha12}. However, there are fundamental properties of the external world and our memories that trial-based experiments are not well suited to capture~\citep{KoriGold94, HukEtal18}. First, our experiences and memories are continuous, rather than discrete---isolating a naturalistic event from the context in which it occurs can substantially change its meaning. Second, whether or not the rememberer has precisely reproduced a specific set of words in describing a given experience is nearly orthogonal to how well they were actually able to remember it. In classic (e.g., list-learning) memory studies, by contrast, the number or proportion of exact recalls is often considered to be a primary metric for assessing the quality of participants' memories. Third, one might remember the essence (or a general summary) of an experience but forget (or neglect to recount) particular low-level details. Capturing the essence of what happened is often a main goal of recounting an episodic memory to a listener, whereas the inclusion of specific low-level details is often less pertinent. + +How might we formally characterize the ``essence'' of an experience, and whether it has been recovered by the rememberer? 
And how might we distinguish an experience's overarching essence from its low-level details? One approach is to start by considering some fundamental properties of the dynamics of our experiences. Each given moment of an experience tends to derive meaning from surrounding moments, as well as from longer-range temporal associations~\citep{LernEtal11, Mann19, Mann20}. Therefore, the timecourse describing how an event unfolds is fundamental to its overall meaning. Further, this hierarchy formed by our subjective experiences at different timescales defines a context for each new moment~\citep{HowaKaha02a, HowaEtal14}, and plays an important role in how we interpret that moment and remember it later~\citep{MannEtal15, Mann20}. Our memory systems can leverage these associations to form predictions that help guide our behaviors~\citep{RangRitc12}. For example, as we navigate the world, the features of our subjective experiences tend to change gradually (e.g., the room or situation we find ourselves in at any given moment is strongly temporally autocorrelated), allowing us to form stable estimates of our current situation and behave accordingly~\citep{ZackEtal07, ZwaaRadv98}. + +Occasionally, this gradual drift of our ongoing experience is punctuated by sudden changes, or shifts (e.g., when we walk through a doorway~\citep{RadvZack17}). Prior research suggests that these sharp transitions (termed ``event boundaries'') help to discretize our experiences (and their mental representations) into ``events''~\citep{RadvZack17, BrunEtal18, HeusEtal18b, ClewDava17, EzzyDava11, DuBrDava13}. The interplay between the stable (within-event) and transient (across-event) temporal dynamics of an experience also provides a potential framework for transforming experiences into memories that distills those experiences down to their essences. For example, prior work has shown that event boundaries can influence how we learn sequences of items~\citep{HeusEtal18b, DuBrDava13}, navigate~\citep{BrunEtal18}, and remember and understand narratives~\citep{ZwaaRadv98, EzzyDava11}. This work also suggests a means of distinguishing the essence of an experience from its low-level details: The overall structure of events and event transitions reflects how the high-level experience unfolds (i.e., its essence), while subtler event-level properties reflect its low-level details. Prior research has also implicated a network of brain regions (including the hippocampus and the medial prefrontal cortex) in playing a critical role in transforming experiences into structured and consolidated memories ~\citep{TompDava17}. + +Here, we sought to examine how the temporal dynamics of a naturalistic experience were later reflected in participants' memories. We also sought to leverage the above conceptual insights into the distinctions between an experience's essence and its low-level details to build models that explicitly quantified these distinctions. We analyzed an open dataset that comprised behavioral and functional Magnetic Resonance Imaging (fMRI) data collected as participants viewed and then verbally recounted an episode of the BBC television show \textit{Sherlock}~\citep{ChenEtal17}. We developed a computational framework for characterizing the temporal dynamics of the moment-by-moment content of the episode and of participants' verbal recalls. 
Our framework uses topic modeling~\citep{BleiEtal03} to characterize the thematic conceptual (semantic) content present in each moment of the episode and recalls by projecting each moment into a word embedding space. We then use hidden Markov models~\citep{Rabi89, BaldEtal17} to discretize this evolving semantic content into events. In this way, we cast both naturalistic experiences and memories of those experiences as geometric ``trajectories'' through word embedding space that describe how they evolve over time. Under this framework, successful remembering entails verbally traversing the content trajectory of the episode, thereby reproducing the shape (essence) of the original experience. Our framework captures the episode's essence in the sequence of geometric coordinates for its events, and its low-level details by examining its within-event geometric properties. + +Comparing the overall shapes of the topic trajectories for the episode and participants' recalls reveals which aspects of the episode's essence were preserved (or lost) in the translation into memory. We also develop two metrics for assessing participants' memories for low-level details: (1) the ``precision'' with which a participant recounts details about each event, and (2) the ``distinctiveness'' of their recall for each event, relative to other events. We examine how these metrics relate to overall memory performance as judged by third-party human annotators. We also compare and contrast our general approach to studying memory for naturalistic experiences with standard metrics for assessing performance on more traditional memory tasks, such as list-learning. Last, we leverage our framework to identify networks of brain structures whose responses (as participants watched the episode) reflected the temporal dynamics of the episode and/or how participants would later recount it. \section*{Results} -To characterize the dynamic content of the \textit{Sherlock} episode and participants' subsequent recountings, we used a topic model~\citep{BleiEtal03} to discover the episode's latent themes. Topic models take as inputs a vocabulary of words to consider and a collection of text documents, and return two output matrices. The first of these is a \DIFdelbegin \textit{\DIFdel{topics matrix}} %DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{``topics matrix'' }\DIFaddend whose rows are \DIFdelbegin \textit{\DIFdel{topics}} %DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{``topics'' }\DIFaddend (or latent themes) and whose columns correspond to words in the vocabulary. The entries in the topics matrix reflect how each word in the vocabulary is weighted by each discovered topic. For example, a detective-themed topic might weight heavily on words like ``crime,'' and ``search.'' The second output is a \DIFdelbegin \textit{\DIFdel{topic proportions matrix}} %DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{``topic proportions matrix'' }\DIFaddend with one row per document and one column per topic. The topic proportions matrix describes the mixture of discovered topics reflected in each document. - -\DIFdelbegin \DIFdel{\mbox{%DIFAUXCMD -\cite{ChenEtal17} }\hspace{0pt}%DIFAUXCMD -}\DIFdelend \DIFaddbegin \DIFadd{Chen et al. (2017) }\DIFaddend collected hand-annotated information about each of \DIFdelbegin \DIFdel{1,000 }\DIFdelend \DIFaddbegin \DIFadd{1000 }\DIFaddend (manually delineated) time segments spanning the roughly 50 minute video used in their study\DIFaddbegin \DIFadd{~\mbox{%DIFAUXCMD -\cite{ChenEtal17}}\hspace{0pt}%DIFAUXCMD -}\DIFaddend . 
Each annotation included: a brief narrative description of what was happening, the location where the action took place, the names of any characters on the screen, and other similar details (for a full list of annotated features, see \textit{Methods}). We took the union of all unique words (excluding stop words, such as ``and,'' ``or,'' ``but,'' etc.) across all features from all annotations as the vocabulary for the topic model. We then concatenated the sets of words across all features contained in overlapping sliding windows of (up to) 50 annotations, and treated each window as a single document for the purpose of fitting the topic model. Next, we fit a topic model with (up to) $K = 100$ topics to this collection of documents. We found that 32 unique topics (with non-zero weights) were sufficient to describe the time-varying content of the episode (see \textit{Methods}; \DIFdelbegin \DIFdel{Figs}\DIFdelend \DIFaddbegin \DIFadd{Fig}\DIFaddend .~\ref{fig:schematic}, \DIFaddbegin \DIFadd{Supp.\ Fig.~}\DIFaddend \topics). We note that our approach is similar in some respects to Dynamic Topic Models~\citep{BleiLaff06} in that we sought to characterize how the thematic content of the episode evolved over time. However, whereas Dynamic Topic Models are designed to characterize how the properties of \DIFdelbegin \textit{\DIFdel{collections}} %DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{collections }\DIFaddend of documents change over time, our sliding window approach allows us to examine the topic dynamics within a single document (or video). Specifically, our approach yielded (via the topic proportions matrix) a single \DIFdelbegin \textit{\DIFdel{topic vector}} %DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{``topic vector'' }\DIFaddend for each sliding window of annotations transformed by the topic model. We then stretched (interpolated) the resulting windows-by-topics matrix to match the time series of the \DIFdelbegin \DIFdel{1,976 }\DIFdelend \DIFaddbegin \DIFadd{1976 }\DIFaddend fMRI volumes collected as participants viewed the episode. +To characterize the dynamic content of the \textit{Sherlock} episode and participants' subsequent recountings, we used a topic model~\citep{BleiEtal03} to discover the episode's latent themes. Topic models take as inputs a vocabulary of words to consider and a collection of text documents, and return two output matrices. The first of these is a ``topics matrix'' whose rows are ``topics'' (or latent themes) and whose columns correspond to words in the vocabulary. The entries in the topics matrix reflect how each word in the vocabulary is weighted by each discovered topic. For example, a detective-themed topic might weight heavily on words like ``crime,'' and ``search.'' The second output is a ``topic proportions matrix'' with one row per document and one column per topic. The topic proportions matrix describes the mixture of discovered topics reflected in each document. -\begin{figure} -[tp] +Chen et al. (2017) collected hand-annotated information about each of 1000 (manually delineated) time segments spanning the roughly 50 minute video used in their study~\cite{ChenEtal17}. Each annotation included: a brief narrative description of what was happening, the location where the action took place, the names of any characters on the screen, and other similar details (for a full list of annotated features, see \textit{Methods}). We took the union of all unique words (excluding stop words, such as ``and,'' ``or,'' ``but,'' etc.) 
across all features from all annotations as the vocabulary for the topic model. We then concatenated the sets of words across all features contained in overlapping sliding windows of (up to) 50 annotations, and treated each window as a single document for the purpose of fitting the topic model. Next, we fit a topic model with (up to) $K = 100$ topics to this collection of documents. We found that 32 unique topics (with non-zero weights) were sufficient to describe the time-varying content of the episode (see \textit{Methods}; Fig.~\ref{fig:schematic}, Supp.\ Fig.~\topics). We note that our approach is similar in some respects to Dynamic Topic Models~\citep{BleiLaff06} in that we sought to characterize how the thematic content of the episode evolved over time. However, whereas Dynamic Topic Models are designed to characterize how the properties of collections of documents change over time, our sliding window approach allows us to examine the topic dynamics within a single document (or video). Specifically, our approach yielded (via the topic proportions matrix) a single ``topic vector'' for each sliding window of annotations transformed by the topic model. We then stretched (interpolated) the resulting windows-by-topics matrix to match the time series of the 1976 fMRI volumes collected as participants viewed the episode. + +\begin{figure}[tp] \centering \includegraphics[width=1\textwidth]{figs/schematic} -\caption{\small \textbf{Topic weights in episode and recall content.} We used detailed, hand-generated annotations describing each manually identified time segment from the episode to fit a topic model. Three example frames from the episode (first row) are displayed, along with their descriptions from the corresponding episode annotation (second row) and an example participant's recall transcript (third row). We used the topic model (fit to the episode annotations) to estimate topic vectors for each moment of the episode and each sentence of participants' recalls. Example topic vectors are displayed in the bottom row (blue: episode annotations; green: example participant's recalls). Three topic dimensions are shown (the highest-weighted topics for each of the three example scenes, respectively), along with the 10 highest-weighted words for each topic. \DIFaddbeginFL \DIFaddFL{Supplementary }\DIFaddendFL Figure~\topics~provides a full list of the top 10 words from each of the discovered topics.} +\caption{\small \textbf{Topic weights in episode and recall content.} We used detailed, hand-generated annotations describing each manually identified time segment from the episode to fit a topic model. Three example frames from the episode (first row) are displayed, along with their descriptions from the corresponding episode annotation (second row) and an example participant's recall transcript (third row). We used the topic model (fit to the episode annotations) to estimate topic vectors for each moment of the episode and each sentence of participants' recalls. Example topic vectors are displayed in the bottom row (blue: episode annotations; green: example participant's recalls). Three topic dimensions are shown (the highest-weighted topics for each of the three example scenes, respectively), along with the 10 highest-weighted words for each topic. 
Supplementary Figure~\topics~provides a full list of the top 10 words from each of the discovered topics.} \label{fig:schematic} - \end{figure} +The 32 topics we found were heavily character-focused (i.e., the top-weighted word in each topic was nearly always a character) and could be roughly divided into themes centered around Sherlock Holmes (the titular character), John Watson (Sherlock's close confidant and assistant), supporting characters (e.g., Inspector Lestrade, Sergeant Donovan, or Sherlock's brother Mycroft), or the interactions between various groupings of these characters (Supp.\ Fig.~\topics). This likely follows from the frequency with which these terms appeared in the episode annotations. Several of the identified topics were highly similar, which we hypothesized might allow us to distinguish between subtle narrative differences if the distinctions between those overlapping topics were meaningful. The topic vectors for each timepoint were also sparse, in that only a small number of topics (typically one or two) tended to be ``active'' in any given timepoint (Fig.~\ref{fig:model}A). Further, the dynamics of the topic activations appeared to exhibit persistence (i.e., given that a topic was active in one timepoint, it was likely to be active in the following timepoint) along with occasional rapid changes (i.e., occasionally topic weights would change abruptly from one timepoint to the next). These two properties of the topic dynamics may be seen in the block diagonal structure of the timepoint-by-timepoint correlation matrix (Fig.~\ref{fig:model}B) and reflect the gradual drift and sudden shifts fundamental to the temporal dynamics of many real-world experiences, as well as television episodes. Given this observation, we adapted an approach devised by Baldassano et al. (2017)~\cite{BaldEtal17}, and used a hidden Markov model (HMM) to identify the ``event boundaries'' where the topic activations changed rapidly (i.e., the boundaries of the blocks in the temporal correlation matrix; event boundaries identified by the HMM are outlined in yellow in Fig.~\ref{fig:model}B). Part of our model fitting procedure required selecting an appropriate number of events into which the topic trajectory should be segmented. To accomplish this, we used an optimization procedure that maximized the difference between the topic weights for timepoints within an event versus timepoints across multiple events (see \textit{Methods}). We then created a stable summary of the content within each episode event by averaging the topic vectors across the timepoints spanned by each event (Fig.~\ref{fig:model}C). -The 32 topics we found were heavily character-focused (i.e., the top-weighted word in each topic was nearly always a character) and could be roughly divided into themes centered around Sherlock Holmes (the titular character), John Watson (Sherlock's close confidant and assistant), supporting characters (e.g., Inspector Lestrade, Sergeant Donovan, or Sherlock's brother Mycroft), or the interactions between various groupings of these characters (\DIFaddbegin \DIFadd{Supp.\ }\DIFaddend Fig.~\topics). This likely follows from the frequency with which these terms appeared in the episode annotations. Several of the identified topics were highly similar, which we hypothesized might allow us to distinguish between subtle narrative differences if the distinctions between those overlapping topics were meaningful. 
The topic vectors for each timepoint were also \DIFdelbegin \textit{\DIFdel{sparse}}%DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{sparse}\DIFaddend , in that only a small number of topics (typically one or two) tended to be ``active'' in any given timepoint (Fig.~\ref{fig:model}A). Further, the dynamics of the topic activations appeared to exhibit \DIFdelbegin \textit{\DIFdel{persistence}} %DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{persistence }\DIFaddend (i.e., given that a topic was active in one timepoint, it was likely to be active in the following timepoint) along with \DIFdelbegin \textit{\DIFdel{occasional rapid changes}} %DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{occasional rapid changes }\DIFaddend (i.e., occasionally topic weights would change abruptly from one timepoint to the next). These two properties of the topic dynamics may be seen in the block diagonal structure of the timepoint-by-timepoint correlation matrix (Fig.~\ref{fig:model}B) and reflect the gradual drift and sudden shifts fundamental to the temporal dynamics of many real-world experiences, as well as television episodes. Given this observation, we adapted an approach devised by \DIFaddbegin \DIFadd{Baldassano et al. (2017)~}\DIFaddend \cite{BaldEtal17}, and used a hidden Markov model (HMM) to identify the \DIFdelbegin \textit{\DIFdel{event boundaries}} %DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{``event boundaries'' }\DIFaddend where the topic activations changed rapidly (i.e., the boundaries of the blocks in the temporal correlation matrix; event boundaries identified by the HMM are outlined in yellow in Fig.~\ref{fig:model}B). Part of our model fitting procedure required selecting an appropriate number of events into which the topic trajectory should be segmented. To accomplish this, we used an optimization procedure that maximized the difference between the topic weights for timepoints within an event versus timepoints across multiple events (see \textit{Methods}). We then created a stable summary of the content within each episode event by averaging the topic vectors across the timepoints spanned by each event (Fig.~\ref{fig:model}C). - -\begin{figure} -[tp] +\begin{figure}[tp] \centering \includegraphics[width=\textwidth]{figs/eventseg} -\caption{\small \textbf{Modeling naturalistic stimuli and recalls.} All panels: darker colors indicate greater values; range: [0, 1]. \textbf{A.} Topic vectors ($K = 100$) for each of the 1976 episode timepoints. \textbf{B.} Timepoint-by-timepoint correlation matrix of the topic vectors displayed in Panel A. Event boundaries discovered by the HMM are denoted in yellow (30 events detected). \textbf{C.} Average topic vectors for each of the 30 episode events. \textbf{D.} Topic vectors for each of 265 sliding windows of sentences spoken by an example participant while recalling the episode. \textbf{E.} Timepoint-by-timepoint correlation matrix of the topic vectors displayed in Panel D. Event boundaries detected by the HMM are denoted in yellow (22 events detected). For similar plots for all participants, see \DIFaddbeginFL \DIFaddFL{Supplementary }\DIFaddendFL Figure~\corrmats. \textbf{F.} Average topic vectors for each of the 22 recall events from the example participant. \textbf{G.} Correlations between the topic vectors for every pair of episode events (Panel C) and recall events (from the example participant; Panel F). For similar plots for all participants, see \DIFaddbeginFL \DIFaddFL{Supplementary }\DIFaddendFL Figure~\matchmats. 
\textbf{H.} Average correlations between each pair of episode events and recall events (across all 17 participants). To create the figure, each recalled event was assigned to the episode event with the most correlated topic vector (yellow boxes in panels G and H).} +\caption{\small \textbf{Modeling naturalistic stimuli and recalls.} All panels: darker colors indicate greater values; range: [0, 1]. \textbf{A.} Topic vectors ($K = 100$) for each of the 1976 episode timepoints. \textbf{B.} Timepoint-by-timepoint correlation matrix of the topic vectors displayed in Panel A. Event boundaries discovered by the HMM are denoted in yellow (30 events detected). \textbf{C.} Average topic vectors for each of the 30 episode events. \textbf{D.} Topic vectors for each of 265 sliding windows of sentences spoken by an example participant while recalling the episode. \textbf{E.} Timepoint-by-timepoint correlation matrix of the topic vectors displayed in Panel D. Event boundaries detected by the HMM are denoted in yellow (22 events detected). For similar plots for all participants, see \DIFdelbeginFL \DIFdelFL{Supplementary }\DIFdelendFL \DIFaddbeginFL \DIFaddFL{Extended Data }\DIFaddendFL Figure~\corrmats. \textbf{F.} Average topic vectors for each of the 22 recall events from the example participant. \textbf{G.} Correlations between the topic vectors for every pair of episode events (Panel C) and recall events (from the example participant; Panel F). For similar plots for all participants, see \DIFdelbeginFL \DIFdelFL{Supplementary }\DIFdelendFL \DIFaddbeginFL \DIFaddFL{Extended Data }\DIFaddendFL Figure~\matchmats. \textbf{H.} Average correlations between each pair of episode events and recall events (across all 17 participants). To create the figure, each recalled event was assigned to the episode event with the most correlated topic vector (yellow boxes in panels G and H).} \label{fig:model} - \end{figure} - Given that the time-varying content of the episode could be segmented cleanly into discrete events, we wondered whether participants' recalls of the episode also displayed a similar structure. We applied the same topic model (already trained on the episode annotations) to each participant's recalls. Analogously to how we parsed the time-varying content of the episode, to obtain similar estimates for each participant's recall transcript, we treated each overlapping window of (up to) 10 sentences from their transcript as a document, and computed the most probable mix of topics reflected in each timepoint's sentences. This yielded, for each participant, a number-of-windows by number-of-topics topic proportions matrix that characterized how the topics identified in the original episode were reflected in the participant's recalls. An important feature of our approach is that it allows us to compare participants' recalls to events from the original episode, despite that different participants used widely varying language to describe the events, and that those descriptions often diverged in content, quality, and quantity from the episode annotations. This ability to match up conceptually related text that differs in specific vocabulary, detail, and length is an important benefit of projecting the episode and recalls into a shared topic space. An example topic proportions matrix from one participant's recalls is shown in Figure~\ref{fig:model}D. 
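For readers who want a concrete sense of the sliding-window topic modeling used above for both the episode annotations and the recall transcripts, the sketch below uses scikit-learn's `CountVectorizer` and `LatentDirichletAllocation` as stand-ins for the repository's actual pipeline in `code/`; the function names, the trailing-window construction, and the toy usage line are illustrative assumptions, not the implementation the paper reports.

```python
# Minimal sketch (not the repository's actual code): fit a topic model to
# overlapping sliding windows of episode annotations, yielding one topic
# vector per window, i.e. a trajectory through topic space.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer


def sliding_windows(texts, window):
    """Join each run of up to `window` consecutive texts into one document."""
    return [" ".join(texts[max(0, i - window + 1):i + 1]) for i in range(len(texts))]


def fit_topic_trajectory(annotations, window=50, n_topics=100, seed=0):
    """annotations: list of strings, one per hand-annotated time segment."""
    docs = sliding_windows(annotations, window)
    vectorizer = CountVectorizer(stop_words="english")
    counts = vectorizer.fit_transform(docs)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
    trajectory = lda.fit_transform(counts)  # shape: (n_windows, n_topics)
    return trajectory, lda, vectorizer


# Applying the already-fit model to sliding windows of one participant's recall
# sentences (e.g., window=10) gives that participant's recall topic proportions:
# recall_traj = lda.transform(vectorizer.transform(sliding_windows(sentences, 10)))
```

In this simplified form, the per-window topic proportions for the episode would still need to be interpolated to align with the 1976 fMRI volumes, as described above.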
-Although the example participant's recall topic proportions matrix has some visual similarity to the episode topic proportions matrix, the time-varying topic proportions for the example participant's recalls are not as sparse as those for the episode (compare Figs.~\ref{fig:model}A and D). Similarly, although there do appear to be periods of stability in the recall topic dynamics (i.e., most topics are active or inactive over contiguous blocks of time), the changes in topic activations that define event boundaries appear less clearly delineated in participants' recalls than in the episode's annotations. To examine these patterns in detail, we computed the timepoint-by-timepoint correlation matrix for the example participant's recall topic proportions matrix (Fig.~\ref{fig:model}E). As in the episode correlation matrix (Fig.~\ref{fig:model}B), the example participant's recall correlation matrix has a strong block diagonal structure, indicating that their recalls are discretized into separated events. We used the same HMM-based optimization procedure that we had applied to the episode's topic proportions matrix (see \textit{Methods}) to estimate an analogous set of event boundaries in the participant's recounting of the episode (outlined in yellow). We carried out this analysis on all 17 participants' recall topic proportions matrices (\DIFaddbegin \DIFadd{Supp.\ }\DIFaddend Fig.~\corrmats). +Although the example participant's recall topic proportions matrix has some visual similarity to the episode topic proportions matrix, the time-varying topic proportions for the example participant's recalls are not as sparse as those for the episode (compare Figs.~\ref{fig:model}A and D). Similarly, although there do appear to be periods of stability in the recall topic dynamics (i.e., most topics are active or inactive over contiguous blocks of time), the changes in topic activations that define event boundaries appear less clearly delineated in participants' recalls than in the episode's annotations. To examine these patterns in detail, we computed the timepoint-by-timepoint correlation matrix for the example participant's recall topic proportions matrix (Fig.~\ref{fig:model}E). As in the episode correlation matrix (Fig.~\ref{fig:model}B), the example participant's recall correlation matrix has a strong block diagonal structure, indicating that their recalls are discretized into separated events. We used the same HMM-based optimization procedure that we had applied to the episode's topic proportions matrix (see \textit{Methods}) to estimate an analogous set of event boundaries in the participant's recounting of the episode (outlined in yellow). We carried out this analysis on all 17 participants' recall topic proportions matrices (\DIFdelbegin \DIFdel{Supp.\ }\DIFdelend \DIFaddbegin \DIFadd{Extended Data }\DIFaddend Fig.~\corrmats). -Two clear patterns emerged from this set of analyses. First, although every individual participant's recalls could be segmented into discrete events (i.e., every individual participant's recall correlation matrix exhibited clear block diagonal structure; \DIFaddbegin \DIFadd{Supp.\ }\DIFaddend Fig.~\corrmats), each participant appeared to have a unique \DIFdelbegin \textit{\DIFdel{recall resolution}}%DIFAUXCMD -\DIFdel{,}\DIFdelend \DIFaddbegin \DIFadd{``recall resolution,'' }\DIFaddend reflected in the sizes of those blocks. 
While some participants' recall topic proportions segmented into just a few events (e.g., Participants P4, P5, and P7), others' segmented into many shorter-duration events (e.g., Participants P12, P13, and P17). This suggests that different participants may be recalling the episode with different levels of detail---i.e., some might recount only high-level essential plot details, whereas others might recount low-level details instead (or in addition). The second clear pattern present in every individual participant's recall correlation matrix was that, unlike in the episode correlation matrix, there were substantial off-diagonal correlations. \DIFdelbegin \DIFdel{Whereas }\DIFdelend \DIFaddbegin \DIFadd{One potential explanation for this finding is that the topic models, trained only on episode annotations, do not capture topic proportions in participants' ``held-out'' recalls as accurately. A second possibility is that, whereas }\DIFaddend each event in the original episode was (largely) separable from the others (Fig.~\ref{fig:model}B), in transforming those separable events into memory, participants appeared to be integrating across multiple events, blending elements of previously recalled and not-yet-recalled content into each newly recalled event \DIFdelbegin \DIFdel{~\mbox{%DIFAUXCMD -\citep[Figs.~\ref{fig:model}E, \corrmats; also see][]{MannEtal11, HowaEtal12, Mann19}}\hspace{0pt}%DIFAUXCMD -.}\DIFdelend \DIFaddbegin \DIFadd{(Fig.~\ref{fig:model}E, Supp.\ Fig.~\corrmats)~\mbox{%DIFAUXCMD -\citep{MannEtal11, HowaEtal12, Mann19}}\hspace{0pt}%DIFAUXCMD -. -}\DIFaddend +Two clear patterns emerged from this set of analyses. First, although every individual participant's recalls could be segmented into discrete events (i.e., every individual participant's recall correlation matrix exhibited clear block diagonal structure; \DIFdelbegin \DIFdel{Supp.\ }\DIFdelend \DIFaddbegin \DIFadd{Extended Data }\DIFaddend Fig.~\corrmats), each participant appeared to have a unique ``recall resolution,'' reflected in the sizes of those blocks. While some participants' recall topic proportions segmented into just a few events (e.g., Participants P4, P5, and P7), others' segmented into many shorter-duration events (e.g., Participants P12, P13, and P17). This suggests that different participants may be recalling the episode with different levels of detail---i.e., some might recount only high-level essential plot details, whereas others might recount low-level details instead (or in addition). The second clear pattern present in every individual participant's recall correlation matrix was that, unlike in the episode correlation matrix, there were substantial off-diagonal correlations. One potential explanation for this finding is that the topic models, trained only on episode annotations, do not capture topic proportions in participants' ``held-out'' recalls as accurately. A second possibility is that, whereas each event in the original episode was (largely) separable from the others (Fig.~\ref{fig:model}B), in transforming those separable events into memory, participants appeared to be integrating across multiple events, blending elements of previously recalled and not-yet-recalled content into each newly recalled event (Fig.~\ref{fig:model}E, \DIFdelbegin \DIFdel{Supp.\ }\DIFdelend \DIFaddbegin \DIFadd{Extended Data }\DIFaddend Fig.~\corrmats)~\citep{MannEtal11, HowaEtal12, Mann19}. 
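As a rough illustration of the HMM-based event segmentation applied above to both the episode and recall topic trajectories, the sketch below assumes BrainIAK's `EventSegment` class as one publicly available implementation of the Baldassano et al. (2017) approach; the scoring function is a simplified stand-in for the event-number optimization described in Methods, not the repository's exact procedure.

```python
# Sketch only: segment a (n_timepoints, n_topics) topic trajectory into events
# with an HMM, and score a segmentation by how much more correlated timepoints
# are within events than across events.
import numpy as np
from brainiak.eventseg.event import EventSegment


def segment_events(trajectory, n_events):
    hmm = EventSegment(n_events)
    hmm.fit(trajectory)
    # segments_[0] holds per-timepoint event probabilities; label each
    # timepoint with its most probable event
    return hmm.segments_[0].argmax(axis=1)


def within_minus_across(trajectory, labels):
    corr = np.corrcoef(trajectory)             # timepoint-by-timepoint correlations
    same = labels[:, None] == labels[None, :]  # True where two timepoints share an event
    off_diag = ~np.eye(len(labels), dtype=bool)
    return corr[same & off_diag].mean() - corr[~same].mean()


# Illustrative model selection over candidate event counts:
# best_k = max(range(2, 51),
#              key=lambda k: within_minus_across(traj, segment_events(traj, k)))
```

Averaging the rows of the trajectory within each resulting label then gives the stable per-event topic vectors described above.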
-The above results demonstrate that topic models capture the dynamic conceptual content of the episode and participants' recalls of the episode. Further, the episode and recalls exhibit event boundaries that can be identified automatically using HMMs to segment the dynamic content. Next, we asked whether some correspondence might be made between the specific content of the events the participants experienced while viewing the episode, and the events they later recalled. We labeled each recall event as matching the episode event with the most similar (i.e., most highly correlated) topic vector (\DIFdelbegin \DIFdel{Figs}\DIFdelend \DIFaddbegin \DIFadd{Fig}\DIFaddend .~\ref{fig:model}G, \DIFaddbegin \DIFadd{Supp.\ Fig.~}\DIFaddend \matchmats). This yielded a sequence of ``presented'' events from the original episode, and a (potentially differently ordered) sequence of ``recalled'' events for each participant. Analogous to classic list-learning studies, we can then examine participants' recall sequences by asking which events they tended to recall first \DIFdelbegin \DIFdel{~\mbox{%DIFAUXCMD -\citep[probability of first recall; Fig.~\ref{fig:list-learning}A;][]{AtkiShif68, PostPhil65, WelcBurn24}}\hspace{0pt}%DIFAUXCMD -; }\DIFdelend \DIFaddbegin \DIFadd{(probability of first recall~\mbox{%DIFAUXCMD -\citep{AtkiShif68, PostPhil65, WelcBurn24}}\hspace{0pt}%DIFAUXCMD -; Fig.~\ref{fig:list-learning}A); }\DIFaddend how participants most often transitioned between recalls of the events as a function of the temporal distance between them \DIFdelbegin \DIFdel{~\mbox{%DIFAUXCMD -\citep[lag-conditional response probability; Fig.~\ref{fig:list-learning}B;][]{Kaha96}}\hspace{0pt}%DIFAUXCMD -; }\DIFdelend \DIFaddbegin \DIFadd{(lag-conditional response probability~\mbox{%DIFAUXCMD -\citep{Kaha96}}\hspace{0pt}%DIFAUXCMD -; Fig.~\ref{fig:list-learning}B); }\DIFaddend and which events they were likely to remember overall \DIFdelbegin \DIFdel{~\mbox{%DIFAUXCMD -\citep[serial position recall analyses; Fig.~\ref{fig:list-learning}C;][]{Murd62a}}\hspace{0pt}%DIFAUXCMD -.}\DIFdelend \DIFaddbegin \DIFadd{(serial position recall analyses~\mbox{%DIFAUXCMD -\citep{Murd62a}}\hspace{0pt}%DIFAUXCMD -; Fig.~\ref{fig:list-learning}C). }\DIFaddend Some of the patterns we observed appeared to be similar to classic effects from the list-learning literature. For example, participants had a higher probability of initiating recall with early events (Fig.~\ref{fig:list-learning}A) and a higher probability of transitioning to neighboring events with an asymmetric forward bias (Fig.~\ref{fig:list-learning}B). However, unlike what is typically observed in list-learning studies, we did not observe patterns comparable to the primacy or recency serial position effects (Fig.~\ref{fig:list-learning}C). We hypothesized that participants might be leveraging meaningful narrative associations and references over long timescales throughout the episode. - -\DIFaddbegin \begin{figure} -[tp] +The above results demonstrate that topic models capture the dynamic conceptual content of the episode and participants' recalls of the episode. Further, the episode and recalls exhibit event boundaries that can be identified automatically using HMMs to segment the dynamic content. Next, we asked whether some correspondence might be made between the specific content of the events the participants experienced while viewing the episode, and the events they later recalled. 
We labeled each recall event as matching the episode event with the most similar (i.e., most highly correlated) topic vector (Fig.~\ref{fig:model}G, \DIFdelbegin \DIFdel{Supp.\ }\DIFdelend \DIFaddbegin \DIFadd{Extended Data }\DIFaddend Fig.~\matchmats). This yielded a sequence of ``presented'' events from the original episode, and a (potentially differently ordered) sequence of ``recalled'' events for each participant. Analogous to classic list-learning studies, we can then examine participants' recall sequences by asking which events they tended to recall first (probability of first recall~\citep{AtkiShif68, PostPhil65, WelcBurn24}; Fig.~\ref{fig:list-learning}A); how participants most often transitioned between recalls of the events as a function of the temporal distance between them (lag-conditional response probability~\citep{Kaha96}; Fig.~\ref{fig:list-learning}B); and which events they were likely to remember overall (serial position recall analyses~\citep{Murd62a}; Fig.~\ref{fig:list-learning}C). Some of the patterns we observed appeared to be similar to classic effects from the list-learning literature. For example, participants had a higher probability of initiating recall with early events (Fig.~\ref{fig:list-learning}A) and a higher probability of transitioning to neighboring events with an asymmetric forward bias (Fig.~\ref{fig:list-learning}B). However, unlike what is typically observed in list-learning studies, we did not observe patterns comparable to the primacy or recency serial position effects (Fig.~\ref{fig:list-learning}C). We hypothesized that participants might be leveraging meaningful narrative associations and references over long timescales throughout the episode. + +\begin{figure}[tp] \centering \includegraphics[width=1\textwidth]{figs/list_learning} - \caption{\small \textbf{\DIFaddFL{Naturalistic extensions of classic list-learning memory analyses.}} \textbf{\DIFaddFL{A.}} \DIFaddFL{The probability of first recall as a function of the serial position of the event in the episode. }\textbf{\DIFaddFL{B}}\DIFaddFL{. The probability of recalling each event, conditioned on having most recently recalled the event }\textit{\DIFaddFL{lag}} \DIFaddFL{events away in the episode. }\textbf{\DIFaddFL{C.}} \DIFaddFL{The proportion of participants who recalled each event, as a function of the serial position of the events in the episode. All panels: error ribbons denote the bootstrap-estimated 95\% confidence interval.}} + \caption{\small \textbf{Naturalistic extensions of classic list-learning memory analyses.} \textbf{A.} The probability of first recall as a function of the serial position of the event in the episode. \textbf{B}. The probability of recalling each event, conditioned on having most recently recalled the event \textit{lag} events away in the episode. \textbf{C.} The proportion of participants who recalled each event, as a function of the serial position of the events in the episode. All panels: error ribbons denote the bootstrap-estimated 95\% confidence interval.} \label{fig:list-learning} - \end{figure} - -\DIFaddend Clustering scores are often used by memory researchers to characterize how people organize their memories of words on a studied list~\DIFdelbegin \DIFdel{\mbox{%DIFAUXCMD -\citep[for review, see][]{PolyEtal09}}\hspace{0pt}%DIFAUXCMD -}\DIFdelend \DIFaddbegin \DIFadd{\mbox{%DIFAUXCMD -\citep{PolyEtal09}}\hspace{0pt}%DIFAUXCMD -}\DIFaddend . 
We defined analogous measures to characterize how participants organized their memories for episodic events (see \textit{Methods} for details). Temporal clustering refers to the extent to which participants group their recall responses according to encoding position. Overall, we found that sequentially viewed episode events tended to appear nearby in participants' recall event sequences (mean clustering score: 0.732, SEM: 0.033). Participants with higher temporal clustering scores tended to exhibit better overall memory for the episode, according to both \DIFaddbegin \DIFadd{Chen et al. (2017)~}\DIFaddend \cite{ChenEtal17}'s hand-counted numbers of recalled scenes from the episode (Pearson's \DIFdelbegin \DIFdel{$r(15) = 0.49,~p = 0.046$}\DIFdelend \DIFaddbegin \DIFadd{$r(15) = 0.49,~p = 0.046,~95\%~\mathrm{CI} = [0.25, 0.76]$}\DIFaddend ) and the numbers of episode events that best-matched at least one recall event (i.e., model-estimated number of events recalled; Pearson's \DIFdelbegin \DIFdel{$r(15) = 0.59,~p = 0.013$}\DIFdelend \DIFaddbegin \DIFadd{$r(15) = 0.59,~p = 0.013,~95\%~\mathrm{CI} = [0.31, 0.80]$}\DIFaddend ). Semantic clustering measures the extent to which participants cluster their recall responses according to semantic similarity\DIFaddbegin \DIFadd{~\mbox{%DIFAUXCMD -\citep{MannKaha12}}\hspace{0pt}%DIFAUXCMD -}\DIFaddend . We found that participants tended to recall semantically similar episode events together (mean clustering score: 0.650, SEM: 0.032), and that semantic clustering \DIFdelbegin \DIFdel{score was }\DIFdelend \DIFaddbegin \DIFadd{scores were }\DIFaddend also related to both hand-counted (Pearson's \DIFdelbegin \DIFdel{$r(15) = 0.65,~p = 0.005$}\DIFdelend \DIFaddbegin \DIFadd{$r(15) = 0.65,~p = 0.004,~95\%~\mathrm{CI} = [0.31, 0.85]$}\DIFaddend ) and model-estimated (Pearson's \DIFdelbegin \DIFdel{$r(15) = 0.58,~p = 0.015$}\DIFdelend \DIFaddbegin \DIFadd{$r(15) = 0.58,~p = 0.015,~95\%~\mathrm{CI} = [0.10, 0.83]$}\DIFaddend ) numbers of recalled events. - +Clustering scores are often used by memory researchers to characterize how people organize their memories of words on a studied list~\citep{PolyEtal09}. We defined analogous measures to characterize how participants organized their memories for episodic events (see \textit{Methods} for details). Temporal clustering refers to the extent to which participants group their recall responses according to encoding position. Overall, we found that sequentially viewed episode events tended to appear nearby in participants' recall event sequences (mean clustering score: 0.732, SEM: 0.033). Participants with higher temporal clustering scores tended to exhibit better overall memory for the episode, according to both Chen et al. (2017)~\cite{ChenEtal17}'s hand-counted numbers of recalled scenes from the episode (Pearson's $r(15) = 0.49,~p = 0.046,~95\%~\mathrm{CI} = [0.25, 0.76]$) and the numbers of episode events that best-matched at least one recall event (i.e., model-estimated number of events recalled; Pearson's $r(15) = 0.59,~p = 0.013,~95\%~\mathrm{CI} = [0.31, 0.80]$). Semantic clustering measures the extent to which participants cluster their recall responses according to semantic similarity~\citep{MannKaha12}. 
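A rough sketch of how such clustering scores can be computed appears below; the precise definitions are given in \textit{Methods}, whereas this version uses a standard percentile-rank formulation with hypothetical variable names.

```python
import numpy as np

def clustering_score(recall_order, distance):
    """Percentile-rank clustering factor (1 = perfectly clustered, ~0.5 = chance).

    recall_order : presented-event indices, in the order they were recalled
    distance     : (n_events, n_events) matrix in which smaller values mean
                   "closer" (e.g., |serial lag| for temporal clustering, or
                   1 - correlation between event topic vectors for semantic
                   clustering)
    """
    not_yet_recalled = set(range(distance.shape[0]))
    ranks = []
    for current, nxt in zip(recall_order[:-1], recall_order[1:]):
        not_yet_recalled.discard(current)
        others = [distance[current, e] for e in not_yet_recalled if e != nxt]
        if others:
            others = np.array(others)
            actual = distance[current, nxt]
            # fraction of still-possible transitions that would have been worse
            ranks.append((np.sum(others > actual) + 0.5 * np.sum(others == actual))
                         / len(others))
    return float(np.mean(ranks))

# usage sketch: for temporal clustering, "distance" is the absolute serial-position lag
# lags = np.abs(np.subtract.outer(np.arange(n_events), np.arange(n_events)))
# temporal_score = clustering_score(recalled_event_sequence, lags)
```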
We found that participants tended to recall semantically similar episode events together (mean clustering score: 0.650, SEM: 0.032), and that semantic clustering scores were also related to both hand-counted (Pearson's $r(15) = 0.65,~p = 0.004,~95\%~\mathrm{CI} = [0.31, 0.85]$) and model-estimated (Pearson's $r(15) = 0.58,~p = 0.015,~95\%~\mathrm{CI} = [0.10, 0.83]$) numbers of recalled events. -\begin{figure} -[tp] +\begin{figure}[tp] \centering - \DIFdelbeginFL %DIFDELCMD < \includegraphics[width=1\textwidth]{figs/list_learning} -%DIFDELCMD < %%% -%DIFDELCMD < \caption{% -{%DIFAUXCMD -%DIFDELCMD < \small %%% -\textbf{\DIFdelFL{Naturalistic extensions of classic list-learning memory analyses.}} %DIFAUXCMD -\textbf{\DIFdelFL{A.}} %DIFAUXCMD -\DIFdelFL{The probability of first recall as a function of the serial position of the event in the episode. }\textbf{\DIFdelFL{B}}%DIFAUXCMD -\DIFdelFL{. The probability of recalling each event, conditioned on having most recently recalled the event }\textit{\DIFdelFL{lag}} %DIFAUXCMD -\DIFdelFL{events away in the episode. }\textbf{\DIFdelFL{C.}} %DIFAUXCMD -\DIFdelFL{The proportion of participants who recalled each event, as a function of the serial position of the events in the episode. All panels: error ribbons denote bootstrap-estimated standard error of the mean.}} - %DIFAUXCMD -%DIFDELCMD < \label{fig:list-learning} -%DIFDELCMD < \end{figure} -%DIFDELCMD < - -%DIFDELCMD < \begin{figure}[tp] -%DIFDELCMD < \centering -%DIFDELCMD < %%% -\DIFdelendFL \includegraphics[width=1\textwidth]{figs/precision_distinctiveness} - \caption{\small \textbf{Novel content-based metrics of naturalistic memory: precision and distinctiveness.} \textbf{A.} The episode-recall correlation matrix for \DIFdelbeginFL \DIFdelFL{a representative }\DIFdelendFL \DIFaddbeginFL \DIFaddFL{an example }\DIFaddendFL participant (P17)\DIFaddbeginFL \DIFaddFL{, chosen for their large number of recall events (for analogous figures for other participants, see Supp}\DIFaddendFL .\DIFaddbeginFL \DIFaddFL{\ Fig.~\corrmats). }\DIFaddendFL The yellow boxes highlight the maximum correlation in each column. The example participant's overall precision score was computed as the average across the (Fisher $z$-transformed) correlation values in the yellow boxes. Their distinctiveness score was computed as the average (over recall events) of the $z$-scored (within column) event precisions. \textbf{B.} The \DIFaddbeginFL \DIFaddFL{across-participants }\DIFaddendFL (Pearson's) correlation between precision and hand-counted number of recalled scenes. \textbf{C.} The correlation between distinctiveness and hand-counted number of recalled scenes. \textbf{D.} The correlation between precision and the number of recalled episode events, as determined by our model. \textbf{E.} The correlation between distinctiveness and the number of recalled episode events, as determined by our model.} + \includegraphics[width=1\textwidth]{figs/precision_distinctiveness} + \caption{\small \textbf{Novel content-based metrics of naturalistic memory: precision and distinctiveness.} \textbf{A.} The episode-recall correlation matrix for an example participant (P17), chosen for their large number of recall events (for analogous figures for other participants, see \DIFdelbeginFL \DIFdelFL{Supp.\ }\DIFdelendFL \DIFaddbeginFL \DIFaddFL{Extended Data }\DIFaddendFL Fig.~\corrmats). The yellow boxes highlight the maximum correlation in each column. 
The example participant's overall precision score was computed as the average across the (Fisher $z$-transformed) correlation values in the yellow boxes. Their distinctiveness score was computed as the average (over recall events) of the $z$-scored (within column) event precisions. \textbf{B.} The across-participants (Pearson's) correlation between precision and hand-counted number of recalled scenes. \textbf{C.} The correlation between distinctiveness and hand-counted number of recalled scenes. \textbf{D.} The correlation between precision and the number of recalled episode events, as determined by our model. \textbf{E.} The correlation between distinctiveness and the number of recalled episode events, as determined by our model.} \label{fig:precision-distinctiveness} - \end{figure} +The above analyses illustrate how our framework for characterizing the dynamic conceptual content of naturalistic episodes enables us to carry out analyses that have traditionally been applied to much simpler list-learning paradigms. However, perhaps the most interesting aspects of memory for naturalistic episodes are those that have no list-learning analogs. The nuances of how one's memory for an event might capture some details, yet distort or neglect others, is central to how we use our memory systems in daily life. Yet when researchers study memory in highly simplified paradigms, those nuances are not typically observable. We next developed two novel, continuous metrics, termed ``precision'' and ``distinctiveness,'' aimed at characterizing distortions in the conceptual content of individual recall events, and the conceptual overlap between how people described different events. -The above analyses illustrate how our framework for characterizing the dynamic conceptual content of naturalistic episodes enables us to carry out analyses that have traditionally been applied to much simpler list-learning paradigms. However, perhaps the most interesting aspects of memory for naturalistic episodes are those that have no list-learning analogs. The nuances of how one's memory for an event might capture some details, yet distort or neglect others, is central to how we use our memory systems in daily life. Yet when researchers study memory in highly simplified paradigms, those nuances are not typically observable. We next developed two novel, continuous metrics, termed \DIFdelbegin \textit{\DIFdel{precision}} %DIFAUXCMD -\DIFdel{and }\textit{\DIFdel{distinctiveness}}%DIFAUXCMD -\DIFdel{,}\DIFdelend \DIFaddbegin \DIFadd{``precision'' and ``distinctiveness,'' }\DIFaddend aimed at characterizing distortions in the conceptual content of individual recall events, and the conceptual overlap between how people described different events. - -\DIFdelbegin \textit{\DIFdel{Precision}} %DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{Precision }\DIFaddend is intended to capture the ``completeness'' of recall, or how fully the presented content was recapitulated in a participant's recounting. We define a recall event's precision as the maximum correlation between the topic proportions of that recall event and any episode event (Fig.~\ref{fig:precision-distinctiveness}). In other words, given that a recall event best matches a particular episode event, more precisely recalled events overlap more strongly with the conceptual content of the original episode event. 
When a given event is assigned a blend of several topics, as is often the case (Fig.~\ref{fig:model}), a high precision score requires recapitulating the relative topic proportions during recall. +Precision is intended to capture the ``completeness'' of recall, or how fully the presented content was recapitulated in a participant's recounting. We define a recall event's precision as the maximum correlation between the topic proportions of that recall event and any episode event (Fig.~\ref{fig:precision-distinctiveness}). In other words, given that a recall event best matches a particular episode event, more precisely recalled events overlap more strongly with the conceptual content of the original episode event. When a given event is assigned a blend of several topics, as is often the case (Fig.~\ref{fig:model}), a high precision score requires recapitulating the relative topic proportions during recall. -\DIFdelbegin \textit{\DIFdel{Distinctiveness}} %DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{Distinctiveness }\DIFaddend is intended to capture the ``specificity'' of recall. In other words, distinctiveness quantifies the extent to which a given recall event reflects the most similar episode event over and above other episode events. Intuitively, distinctiveness is like a normalized variant of our precision metric. Whereas precision solely measures how much detail about an \DIFdelbegin \DIFdel{episode }\DIFdelend \DIFaddbegin \DIFadd{event }\DIFaddend was captured in someone's recall, distinctiveness penalizes details that also pertain to other episode events. We define the distinctiveness of an event's recall as its precision expressed in standard deviation units with respect to other episode events. Specifically, for a given recall event, we compute the correlation between its topic vector and that of each episode event. This yields a distribution of correlation coefficients (one per episode event). We subtract the mean and divide by the standard deviation of this distribution to $z$-score the coefficients. The maximum value in this distribution (which, by definition, belongs to the episode event that best matches the given recall event) is that recall event's distinctiveness score. In this way, recall events that match one episode event far better than all other episode events will receive a high distinctiveness score. By contrast, a recall event that matches all episode events roughly equally will receive a comparatively low distinctiveness score. +Distinctiveness is intended to capture the ``specificity'' of recall. In other words, distinctiveness quantifies the extent to which a given recall event reflects the most similar episode event over and above other episode events. Intuitively, distinctiveness is like a normalized variant of our precision metric. Whereas precision solely measures how much detail about an event was captured in someone's recall, distinctiveness penalizes details that also pertain to other episode events. We define the distinctiveness of an event's recall as its precision expressed in standard deviation units with respect to other episode events. Specifically, for a given recall event, we compute the correlation between its topic vector and that of each episode event. This yields a distribution of correlation coefficients (one per episode event). We subtract the mean and divide by the standard deviation of this distribution to $z$-score the coefficients. 
The maximum value in this distribution (which, by definition, belongs to the episode event that best matches the given recall event) is that recall event's distinctiveness score. In this way, recall events that match one episode event far better than all other episode events will receive a high distinctiveness score. By contrast, a recall event that matches all episode events roughly equally will receive a comparatively low distinctiveness score. -In addition to examining how precisely and distinctively participants recalled individual events, one may also use these metrics to summarize each participant's performance by averaging across a participant's event-wise precision or distinctiveness scores. This enables us to quantify how precisely a participant tended to recall subtle within-event details, as well as how specific (distinctive) those details were to individual events from the episode. Participants' average precision and distinctiveness scores were strongly correlated (\DIFdelbegin \DIFdel{$r(15) = 0.90, p < 0.001$}\DIFdelend \DIFaddbegin \DIFadd{$r(15) = 0.90,~p < 0.001,~95\%~\mathrm{CI} = [0.66, 0.96]$}\DIFaddend ). This indicates that participants who tended to precisely recount low-level details of episode events also tended to do so in an event-specific way (e.g., as opposed to detailing recurring themes that were present in most or all episode events; this behavior would have resulted in high precision but low distinctiveness). We found that, across participants, higher precision scores were positively correlated with the numbers of both hand-annotated scenes (\DIFdelbegin \DIFdel{$r(15) = 0.60, p = 0.010$}\DIFdelend \DIFaddbegin \DIFadd{$r(15) = 0.60,~p = 0.010,~95\%~\mathrm{CI} = [0.02, 0.83]$}\DIFaddend ) and model-estimated events (\DIFdelbegin \DIFdel{$r(15) = 0.90, p < 0.001$}\DIFdelend \DIFaddbegin \DIFadd{$r(15) = 0.90,~p < 0.001,~95\%~\mathrm{CI} = [0.54, 0.96]$}\DIFaddend ) that participants recalled. Participants' average distinctiveness scores were also \DIFaddbegin \DIFadd{marginally }\DIFaddend correlated with both the hand-annotated (\DIFdelbegin \DIFdel{$r(15) = 0.45, p = 0.068$}\DIFdelend \DIFaddbegin \DIFadd{$r(15) = 0.45,~p = 0.068,~95\%~\mathrm{CI} = [-0.21, 0.79]$}\DIFaddend ) and model-estimated (\DIFdelbegin \DIFdel{$r(15) = 0.71, p = 0.001$}\DIFdelend \DIFaddbegin \DIFadd{$r(15) = 0.71,~p = 0.001,~95\%~\mathrm{CI} = [-0.07, 0.90]$}\DIFaddend ) numbers of recalled events. +In addition to examining how precisely and distinctively participants recalled individual events, one may also use these metrics to summarize each participant's performance by averaging across a participant's event-wise precision or distinctiveness scores. This enables us to quantify how precisely a participant tended to recall subtle within-event details, as well as how specific (distinctive) those details were to individual events from the episode. Participants' average precision and distinctiveness scores were strongly correlated ($r(15) = 0.90,~p < 0.001,~95\%~\mathrm{CI} = [0.66, 0.96]$). This indicates that participants who tended to precisely recount low-level details of episode events also tended to do so in an event-specific way (e.g., as opposed to detailing recurring themes that were present in most or all episode events; this behavior would have resulted in high precision but low distinctiveness). 
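The event-wise and participant-level scores described above reduce to a few lines of array arithmetic. The sketch below is illustrative rather than the repository's exact implementation; it assumes episode and recall events are represented as rows of topic-proportion matrices, and follows the Fisher $z$-averaging described for the example participant.

```python
import numpy as np

def recall_event_scores(recall_topic_vec, episode_topics):
    """Precision and distinctiveness for a single recall event.

    recall_topic_vec : (n_topics,) topic proportions for one recall event
    episode_topics   : (n_episode_events, n_topics) topic proportions matrix
    """
    corrs = np.array([np.corrcoef(recall_topic_vec, ep)[0, 1] for ep in episode_topics])
    matched_event = int(np.argmax(corrs))           # best-matching episode event
    precision = corrs[matched_event]                # maximum correlation
    distinctiveness = ((corrs - corrs.mean()) / corrs.std())[matched_event]
    return precision, distinctiveness, matched_event

def participant_summary(recall_topics, episode_topics):
    """Average event-wise scores for one participant; precision values are
    Fisher z-transformed before averaging."""
    scores = [recall_event_scores(r, episode_topics) for r in recall_topics]
    precisions = np.array([s[0] for s in scores])
    distinctivenesses = np.array([s[1] for s in scores])
    return np.arctanh(precisions).mean(), distinctivenesses.mean()
```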
We found that, across participants, higher precision scores were positively correlated with the numbers of both \DIFdelbegin \DIFdel{hand-annotated scenes ($r(15) = 0.60,~p = 0.010,~95\%~\mathrm{CI} = [0.02, 0.83]$) and }\DIFdelend model-estimated events ($r(15) = 0.90,~p < 0.001,~95\%~\mathrm{CI} = [0.54, 0.96]$) \DIFaddbegin \DIFadd{and hand-annotated scenes ($r(15) = 0.60,~p = 0.010,~95\%~\mathrm{CI} = [0.02, 0.83]$) }\DIFaddend that participants recalled. Participants' average distinctiveness scores were also \DIFdelbegin \DIFdel{marginally correlated with both the hand-annotated ($r(15) = 0.45,~p = 0.068,~95\%~\mathrm{CI} = [-0.21, 0.79]$) and }\DIFdelend \DIFaddbegin \DIFadd{correlated with their numbers of }\DIFaddend model-estimated \DIFaddbegin \DIFadd{recalled events }\DIFaddend ($r(15) = 0.71,~p = 0.001,~95\%~\mathrm{CI} = [-0.07, 0.90]$) \DIFdelbegin \DIFdel{numbers of recalled events}\DIFdelend \DIFaddbegin \DIFadd{and marginally significantly correlated with their numbers of hand-annotated recalled scenes ($r(15) = 0.45,~p = 0.068,~95\%~\mathrm{CI} = [-0.21, 0.79]$)}\DIFaddend . - -\begin{figure} -[tp] +\begin{figure}[tp] \centering \vspace*{-1cm} \includegraphics[width=.7\textwidth]{figs/precision_distinctiveness_detail} - \caption{\small \textbf{Precision reflects the completeness of recall, whereas distinctiveness reflects recall specificity.} \textbf{A.} Recall precision by episode event.
Grey violin plots display kernel density estimates for the distribution of recall precision scores for a single episode event. Colored dots within each violin plot represent individual participants' recall precisions for the given event. \textbf{B.} Recall distinctiveness by episode event, analogous to Panel A. \textbf{C.} The set of ``Narrative Details'' episode annotations~\citep{ChenEtal17} comprising an example episode event (22) identified by the HMM. Each action or feature is highlighted in a different color. \textbf{D.} Sentences comprising the most precise (P17) and least precise (P6) participants' recalls of episode event 21. Descriptions of specific actions or features reflecting those highlighted in Panel C are highlighted in the corresponding color. The text highlighted in gray denotes a (rare) false recall. The unhighlighted text denotes correctly recalled information about other episode events. \textbf{E.} The sets of ``Narrative Details'' episode annotations~\citep{ChenEtal17} for scenes comprising episode events described by the example participants in Panel F. Each event's text is highlighted in a different color. \textbf{F.} The sentences comprising the most distinctive (P9) and least distinctive (P6) participants' recalls of episode event 21. Sections of recall describing each episode event in Panel E are highlighted with the corresponding color.} \label{fig:precision-detail} - \end{figure} - - Examining individual recalls of the same episode event can provide insights into how the above precision and distinctiveness scores may be used to characterize similarities and differences in how different people describe the same shared experience. In Figure \ref{fig:precision-detail}, we compare recalls for the same episode event from the participants with the highest (P17) and lowest (P6) precision scores. From the HMM-identified episode event boundaries, we recovered the set of annotations describing the content of a single episode event (event 21; Fig.~\ref{fig:precision-detail}C), and divided them into different color-coded sections for each action or feature described. Next, we used an analogous approach to identify the set of sentences comprising the corresponding recall event from each of the two example participants (Fig.~\ref{fig:precision-detail}D). We then colored all words describing actions and features in the transcripts shown in Panel D according to the color-coded annotations in Panel C. Visual comparison of these example recalls reveals that the more precise recall captures more of the episode event's content, and in greater detail. -Figure \ref{fig:precision-detail} also illustrates the differences between high and low distinctiveness scores. We extracted the set of sentences comprising the most distinctive recall event (P9) and least distinctive recall event (P6) corresponding to the example episode event shown in Panel C (event 21). We also extracted the annotations for all episode events whose content these participants' single recall events \DIFdelbegin \DIFdel{described}\DIFdelend \DIFaddbegin \DIFadd{touched on}\DIFaddend . We assigned each episode event a unique color (Fig.~\ref{fig:precision-detail}E), and colored each recalled sentence (Panel F) according to the episode events they best matched. Visual inspection of Panel F reveals that the most distinctive recall's content is tightly concentrated around event 21, whereas the least distinctive recall incorporates content from a much wider range of episode events.
+Figure \ref{fig:precision-detail} also illustrates the differences between high and low distinctiveness scores. We extracted the set of sentences comprising the most distinctive recall event (P9) and least distinctive recall event (P6) corresponding to the example episode event shown in Panel C (event 21). We also extracted the annotations for all episode events whose content these participants' single recall events touched on. We assigned each episode event a unique color (Fig.~\ref{fig:precision-detail}E), and colored each recalled sentence (Panel F) according to the episode events they best matched. Visual inspection of Panel F reveals that the most distinctive recall's content is tightly concentrated around event 21, whereas the least distinctive recall incorporates content from a much wider range of episode events. -The preceding analyses sought to characterize how participants' recountings of individual episode events captured the low-level details of each event. Next, we sought to characterize how participants' recountings of the full episode captured its high-level essence---i.e., the shape of the episode's trajectory through word embedding (topic) space. To visualize the essence of the episode and each participant's recall trajectory~\citep{HeusEtal18a}, we projected the topic proportions matrices for the episode and recalls onto a shared two-dimensional space using Uniform Manifold Approximation and Projection \DIFdelbegin \DIFdel{~\mbox{%DIFAUXCMD -\citep[UMAP; ][]{McInEtal18}}\hspace{0pt}%DIFAUXCMD -}\DIFdelend \DIFaddbegin \DIFadd{(UMAP)~\mbox{%DIFAUXCMD -\citep{McInEtal18}}\hspace{0pt}%DIFAUXCMD -}\DIFaddend . In this lower-dimensional space, each point represents a single episode or recall event, and the distances between the points reflect the distances between the events' associated topic vectors (Fig.~\ref{fig:trajectory}). In other words, events that are nearer to each other in this space are more semantically similar, and those that are farther apart are less so. +The preceding analyses sought to characterize how participants' recountings of individual episode events captured the low-level details of each event. Next, we sought to characterize how participants' recountings of the full episode captured its high-level essence---i.e., the shape of the episode's trajectory through word embedding (topic) space. To visualize the essence of the episode and each participant's recall trajectory~\citep{HeusEtal18a}, we projected the topic proportions matrices for the episode and recalls onto a shared two-dimensional space using Uniform Manifold Approximation and Projection (UMAP)~\citep{McInEtal18}. In this lower-dimensional space, each point represents a single episode or recall event, and the distances between the points reflect the distances between the events' associated topic vectors (Fig.~\ref{fig:trajectory}). In other words, events that are nearer to each other in this space are more semantically similar, and those that are farther apart are less so. -\begin{figure} -[tp] +\begin{figure}[tp] \centering \includegraphics[width=1\textwidth]{figs/trajectory} -\caption{\small \textbf{Trajectories through topic space capture the dynamic content of the episode and recalls.} All panels: the topic proportion matrices have been projected onto a shared two-dimensional space using UMAP. \textbf{A.} The two-dimensional topic trajectory taken by the episode of \textit{Sherlock}. 
Each dot indicates an event identified using the HMM (see \textit{Methods}); the dot colors denote the order of the events (early events are in red; later events are in blue), and the connecting lines indicate the transitions between successive events. \textbf{B.} The average two-dimensional trajectory captured by participants' recall sequences, with the same format and coloring as the trajectory in Panel A. To compute the event positions, we matched each recalled event with an event from the original episode (see \textit{Results}), and then we averaged the positions of all events with the same label. The arrows reflect the average transition direction through topic space taken by any participants whose trajectories crossed that part of topic space; blue denotes reliable agreement across participants via a Rayleigh test ($p < 0.05$, corrected). \DIFaddbeginFL \DIFaddFL{For additional detail see }\textit{\DIFaddFL{Methods}} \DIFaddFL{and Supplementary Figure~\arrows. }\DIFaddendFL \textbf{C.} The recall topic trajectories (gray) taken by each individual participant (P1--P17). The episode's trajectory is shown in black for reference. Here, events (dots) are colored by their matched episode event (Panel A).} +\caption{\small \textbf{Trajectories through topic space capture the dynamic content of the episode and recalls.} All panels: the topic proportion matrices have been projected onto a shared two-dimensional space using UMAP. \textbf{A.} The two-dimensional topic trajectory taken by the episode of \textit{Sherlock}. Each dot indicates an event identified using the HMM (see \textit{Methods}); the dot colors denote the order of the events (early events are in red; later events are in blue), and the connecting lines indicate the transitions between successive events. \textbf{B.} The average two-dimensional trajectory captured by participants' recall sequences, with the same format and coloring as the trajectory in Panel A. To compute the event positions, we matched each recalled event with an event from the original episode (see \textit{Results}), and then we averaged the positions of all events with the same label. The arrows reflect the average transition direction through topic space taken by any participants whose trajectories crossed that part of topic space; blue denotes reliable agreement across participants via a Rayleigh test ($p < 0.05$, corrected). For additional detail see \textit{Methods} and \DIFdelbeginFL \DIFdelFL{Supplementary }\DIFdelendFL \DIFaddbeginFL \DIFaddFL{Extended Data }\DIFaddendFL Figure~\arrows. \textbf{C.} The recall topic trajectories (gray) taken by each individual participant (P1--P17). The episode's trajectory is shown in black for reference. Here, events (dots) are colored by their matched episode event (Panel A).} \label{fig:trajectory} - \end{figure} +Visual inspection of the episode and recall topic trajectories reveals a striking pattern. First, the topic trajectory of the episode (which reflects its dynamic content; Fig.~\ref{fig:trajectory}A) is captured nearly perfectly by the averaged topic trajectories of participants' recalls (Fig.~\ref{fig:trajectory}B). To assess the consistency of these recall trajectories across participants, we asked: given that a participant's recall trajectory had entered a particular location in the reduced topic space, could the position of their next recalled event be predicted reliably? 
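For reference, the shared two-dimensional space referred to above can be constructed roughly as follows. This is an illustrative sketch that assumes the \texttt{umap-learn} package; the hyperparameter values shown are placeholders rather than the settings used for the published figure.

```python
import numpy as np
import umap  # umap-learn

def embed_events(episode_topics, recall_topics_per_participant, seed=0):
    """Project episode and recall event topic vectors into one shared 2-D space.

    episode_topics                : (n_episode_events, n_topics)
    recall_topics_per_participant : list of (n_recall_events, n_topics) arrays
    """
    stacked = np.vstack([episode_topics] + list(recall_topics_per_participant))
    reducer = umap.UMAP(n_components=2, metric='correlation', random_state=seed)
    embedding = reducer.fit_transform(stacked)
    episode_xy = embedding[:len(episode_topics)]
    splits = np.cumsum([len(r) for r in recall_topics_per_participant])[:-1]
    recall_xy = np.split(embedding[len(episode_topics):], splits)
    return episode_xy, recall_xy
```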
For each location in the reduced topic space, we computed the set of line segments connecting successively recalled events (across all participants) that intersected that location (see \textit{Methods}, \DIFdelbegin \DIFdel{Supp.\ }\DIFdelend \DIFaddbegin \DIFadd{Extended Data }\DIFaddend Fig.~\arrows). We then computed (for each location) the distribution of angles formed by the lines defined by those line segments and a fixed reference line (the $x$-axis). Rayleigh tests revealed the set of locations in topic space at which these across-participant distributions exhibited reliable peaks (blue arrows in Fig.~\ref{fig:trajectory}B reflect significant peaks at $p < 0.05$, corrected). We observed that the locations traversed by nearly the entire episode trajectory exhibited such peaks. In other words, participants' recalls exhibited similar trajectories to each other that also matched the trajectory of the original episode (Fig.~\ref{fig:trajectory}C). This is especially notable when considering the fact that the number of HMM-identified recall events (dots in Fig.~\ref{fig:trajectory}C) varied considerably across people, and that every participant used different words to describe what they had remembered happening in the episode. Differences in the numbers of recall events appear in participants' trajectories as differences in the sampling resolution along the trajectory. We note that this framework also provides a means of disentangling classic ``proportion recalled'' measures (i.e., the proportion of episode events described in participants' recalls) from participants' abilities to recapitulate the episode's essence (i.e., the similarity between the shapes of the original episode trajectory and that defined by each participant's recounting of the episode). -Visual inspection of the episode and recall topic trajectories reveals a striking pattern. First, the topic trajectory of the episode (which reflects its dynamic content; Fig.~\ref{fig:trajectory}A) is captured nearly perfectly by the averaged topic trajectories of participants' recalls (Fig.~\ref{fig:trajectory}B). To assess the consistency of these recall trajectories across participants, we asked: given that a participant's recall trajectory had entered a particular location in the reduced topic space, could the position of their \DIFdelbegin \textit{\DIFdel{next}} %DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{next }\DIFaddend recalled event be predicted reliably? For each location in the reduced topic space, we computed the set of line segments connecting successively recalled events (across all participants) that intersected that location (see \textit{Methods}\DIFaddbegin \DIFadd{, Supp.\ Fig.~\arrows}\DIFaddend ). We then computed (for each location) the distribution of angles formed by the lines defined by those line segments and a fixed reference line (the $x$-axis). Rayleigh tests revealed the set of locations in topic space at which these across-participant distributions exhibited reliable peaks (blue arrows in Fig.~\ref{fig:trajectory}B reflect significant peaks at $p < 0.05$, corrected). We observed that the locations traversed by nearly the entire episode trajectory exhibited such peaks. In other words, participants' recalls exhibited similar trajectories to each other that also matched the trajectory of the original episode (Fig.~\ref{fig:trajectory}C). 
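A simplified version of this consistency analysis is sketched below. The helper names are hypothetical, the segments would in practice be pooled across participants separately for each location in topic space, and the Rayleigh $p$ value uses the standard large-sample approximation rather than any particular circular-statistics package.

```python
import numpy as np

def transition_angles(recall_xy):
    """Angles (relative to the x-axis) of segments connecting successively
    recalled events in the 2-D embedding."""
    deltas = np.diff(np.asarray(recall_xy), axis=0)
    return np.arctan2(deltas[:, 1], deltas[:, 0])

def rayleigh_test(angles):
    """Rayleigh test for non-uniformity of circular data (angles in radians).

    Returns the resultant vector length R and an approximate p value
    (standard large-sample approximation)."""
    angles = np.asarray(angles)
    n = len(angles)
    R = np.abs(np.mean(np.exp(1j * angles)))
    z = n * R ** 2
    p = np.exp(-z) * (1 + (2 * z - z ** 2) / (4 * n)
                      - (24 * z - 132 * z ** 2 + 76 * z ** 3 - 9 * z ** 4) / (288 * n ** 2))
    return R, float(np.clip(p, 0, 1))
```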
This is especially notable when considering the fact that the number of HMM-identified recall events (dots in Fig.~\ref{fig:trajectory}C) varied considerably across people, and that every participant used different words to describe what they had remembered happening in the episode. Differences in the numbers of recall events appear in participants' trajectories as differences in the sampling resolution along the trajectory. We note that this framework also provides a means of disentangling classic ``proportion recalled'' measures (i.e., the proportion of episode events described in participants' recalls) from participants' abilities to recapitulate the episode's essence (i.e., the similarity between the shapes of the original episode trajectory and that defined by each participant's recounting of the episode). +In addition to enabling us to visualize the episode's high-level essence, describing the episode as a geometric trajectory also enables us to drill down to individual words and quantify how each word relates to the memorability of each event. This provides another approach to examining participants' recall for low-level details beyond the precision and distinctiveness measures we defined above. The results displayed in Figures \ref{fig:list-learning}C and \ref{fig:precision-detail}A suggest that certain events were remembered better than others. Given this, we next asked whether the events that were generally remembered precisely or imprecisely tended to reflect particular content. Because our analysis framework projects the dynamic episode content and participants' recalls into a shared space, and because the dimensions of that space represent topics (which are, in turn, sets of weights over known words in the vocabulary), we are able to recover the weighted combination of words that make up any point (i.e., topic vector) in this space. We first computed the average precision with which participants recalled each of the 30 episode events (Fig.~\ref{fig:topics}A; note that this result is analogous to a serial position curve created from our precision metric). We then computed a weighted average of the topic vectors for each episode event, where the weights reflected how precisely each event was recalled. To visualize the result, we created a ``wordle'' image~\citep{MuelEtal18} where words weighted more heavily by more precisely remembered topics appear in a larger font (Fig.~\ref{fig:topics}B, green box). Across the full episode, content that weighted heavily on topics and words central to the major foci of the episode (e.g., the names of the two main characters, ``Sherlock'' and ``John,'' and the address of a major recurring location, ``221B Baker Street'') was best remembered. An analogous analysis revealed which themes were less-precisely remembered. Here, in computing the weighted average over events' topic vectors, we weighted each event in inverse proportion to its average precision (Fig.~\ref{fig:topics}B, red box). The least precisely remembered episode content reflected information that was extraneous to the episode's essence, such as the proper names of relatively minor characters (e.g., ``Mike,'' ``Molly,'' and ``Lestrade'') and locations (e.g., ``St. Bartholomew's Hospital''). -In addition to enabling us to visualize the episode's high-level essence, describing the episode as a geometric trajectory also enables us to drill down to individual words and quantify how each word relates to the memorability of each event. 
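One way to recover the word-level content behind such an analysis is sketched below. It assumes the topic model exposes a topics-by-vocabulary weight matrix (for example, the \texttt{components\_} attribute of scikit-learn's \texttt{LatentDirichletAllocation}) and that per-event precision scores have already been computed; both assumptions go beyond what is stated here.

```python
import numpy as np

def top_words_by_precision(event_topics, event_precisions, topic_word, vocab,
                           n_words=200, invert=False):
    """Average episode-event topic vectors weighted by recall precision
    (or by 1/precision when invert=True), project the average onto the
    vocabulary, and return the n_words most heavily weighted words."""
    weights = np.asarray(event_precisions, dtype=float)
    if invert:
        weights = 1.0 / weights          # assumes strictly positive precisions
    weights = weights / weights.sum()
    mean_topic = weights @ np.asarray(event_topics)      # (n_topics,)
    word_weights = mean_topic @ np.asarray(topic_word)   # (n_vocab,)
    order = np.argsort(word_weights)[::-1][:n_words]
    return [(vocab[i], float(word_weights[i])) for i in order]
```

The resulting word-weight pairs can then be passed to any word-cloud renderer (e.g., the \texttt{wordcloud} package's \texttt{generate\_from\_frequencies}).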
This provides another approach to examining participants' recall for low-level details beyond the precision and distinctiveness measures we defined above. The results displayed in Figures \ref{fig:list-learning}C and \ref{fig:precision-detail}A suggest that certain events were remembered better than others. Given this, we next asked \DIFdelbegin \DIFdel{asked }\DIFdelend whether the events that were generally remembered precisely or imprecisely tended to reflect particular content. Because our analysis framework projects the dynamic episode content and participants' recalls into a shared space, and because the dimensions of that space represent topics (which are, in turn, sets of weights over known words in the vocabulary), we are able to recover the weighted combination of words that make up any point (i.e., topic vector) in this space. We first computed the average precision with which participants recalled each of the 30 episode events (Fig.~\ref{fig:topics}A; note that this result is analogous to a serial position curve created from our precision metric). We then computed a weighted average of the topic vectors for each episode event, where the weights reflected how precisely each event was recalled. To visualize the result, we created a ``wordle'' image~\citep{MuelEtal18} where words weighted more heavily by more \DIFdelbegin \DIFdel{precisely-remembered }\DIFdelend \DIFaddbegin \DIFadd{precisely remembered }\DIFaddend topics appear in a larger font (Fig.~\ref{fig:topics}B, green box). Across the full episode, content that weighted heavily on topics and words central to the major foci of the episode (e.g., the names of the two main characters, ``Sherlock'' and ``John,'' and the address of a major recurring location, ``221B Baker Street'') was best remembered. An analogous analysis revealed which themes were less-precisely remembered. Here\DIFaddbegin \DIFadd{, }\DIFaddend in computing the weighted average over events' topic vectors, we weighted each event in \DIFdelbegin \textit{\DIFdel{inverse}} %DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{inverse }\DIFaddend proportion to its average precision (Fig.~\ref{fig:topics}B, red box). The least precisely remembered episode content reflected information that was extraneous to the episode's essence, such as the proper names of relatively minor characters (e.g., ``Mike,'' ``Molly,'' and ``Lestrade'') and locations (e.g., ``St. Bartholomew's Hospital''). - -\begin{figure} -[tp] +\begin{figure}[tp] \centering \includegraphics[width=1\textwidth]{figs/topics} \caption{\small \textbf{Language used in the most and least precisely remembered events.} \textbf{A.} Average precision (episode event-recall event topic vector correlation) across participants for each episode event. Here we defined each episode event's precision for each participant as the correlation between its topic vector and the most-correlated recall event's topic vector from that participant. Error bars denote bootstrap-derived across-participant 95\% confidence intervals. The stars denote the three most precisely remembered events (green) and least precisely remembered events (red). \textbf{B.} Wordles comprising the top 200 highest-weighted words reflected in the weighted-average topic vector across episode events. Green: episode events were weighted by their precision (Panel A). Red: episode events were weighted by the inverse of their precision. \textbf{C.} The set of all episode and recall events is projected onto the two-dimensional space derived in Figure~\ref{fig:trajectory}. 
The dots outlined in black denote episode events (dot size is proportional to each event's average precision). The dots without black outlines denote individual recall events from each participant. All dots are colored using the same scheme as Figure~\ref{fig:trajectory}A. Wordles for several example events are displayed (green: three most precisely remembered events; red: three least precisely remembered events). Within each circular wordle, the left side displays words associated with the topic vector for the episode event, and the right side displays words associated with the (average) recall event topic vector, across all recall events matched to the given episode event.} \label{fig:topics} - \end{figure} - A similar result emerged from assessing the topic vectors for individual episode and recall events (Fig.~\ref{fig:topics}C). Here, for each of the three most and least precisely remembered episode events, we have constructed two wordles: one from the original episode event's topic vector (left) and a second from the average recall topic vector for that event (right). The three most precisely remembered events (circled in green) correspond to scenes integral to the central plot-line: a mysterious figure spying on John in a phone booth; John meeting Sherlock at Baker St.~to discuss the murders; and Sherlock laying a trap to catch the killer. Meanwhile, the least precisely remembered events (circled in red) reflect scenes that comprise minor plot points: a video of singing cartoon characters that participants viewed in an introductory clip prior to the main episode; John asking Molly about Sherlock's habit of over-analyzing people; and Sherlock noticing evidence of Anderson's and Donovan's affair. -The results this far inform us about which aspects of the dynamic content in the episode participants watched were preserved or altered in participants' memories. We next carried out a series of analyses aimed at understanding which brain structures might facilitate these preservations and transformations between the participants' shared experience of watching the episode and their subsequent memories of the episode. In the first analysis, we sought to identify brain structures that were sensitive to the dynamic unfolding of the episode's content, as characterized by its topic trajectory. We used a searchlight procedure to identify clusters of voxels whose activity patterns displayed a proximal temporal correlation structure (as participants watched the episode) matching that of the original episode's topic proportions (Fig.~\ref{fig:brainz}A; see \textit{Methods} for additional details). In a second analysis, we sought to identify brain structures whose responses (during episode viewing) reflected how each participant would later structure their \DIFdelbegin \textit{\DIFdel{recounting}} %DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{recounting }\DIFaddend of the episode. We used a searchlight procedure to identify clusters of voxels whose proximal temporal correlation matrices matched that of the topic proportions matrix for each participant's recall transcript (Figs.~\ref{fig:brainz}B; see \textit{Methods} for additional details). 
To ensure our searchlight procedure identified regions \DIFdelbegin \textit{\DIFdel{specifically}} %DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{specifically }\DIFaddend sensitive to the temporal structure of the episode or recalls (i.e., rather than those with a temporal autocorrelation length similar to that of the episode and recalls), we performed a phase shift-based permutation correction (see \textit{Methods}). As shown in Figure~\ref{fig:brainz}C, the episode-driven searchlight analysis revealed a distributed network of regions that may play a role in processing information relevant to the narrative structure of the episode. The recall-driven searchlight analysis revealed a second network of regions (Fig.~\ref{fig:brainz}D) that may facilitate a person-specific transformation of one's experience into memory. In identifying regions whose responses to ongoing experiences reflect how those experiences will be remembered later, this latter analysis extends classic \DIFdelbegin \textit{\DIFdel{subsequent memory effect analyses}}%DIFAUXCMD -\DIFdel{~\mbox{%DIFAUXCMD -\citep[e.g.,][]{PallWagn02} }\hspace{0pt}%DIFAUXCMD -}\DIFdelend \DIFaddbegin \DIFadd{``subsequent memory effect analyses''~\mbox{%DIFAUXCMD -\citep{PallWagn02} }\hspace{0pt}%DIFAUXCMD -}\DIFaddend to the domain of naturalistic experiences. +The results thus far inform us about which aspects of the dynamic content in the episode participants watched were preserved or altered in participants' memories. We next carried out a series of analyses aimed at understanding which brain structures might facilitate these preservations and transformations between the participants' shared experience of watching the episode and their subsequent memories of the episode. In the first analysis, we sought to identify brain structures that were sensitive to the dynamic unfolding of the episode's content, as characterized by its topic trajectory. We used a searchlight procedure to identify clusters of voxels whose activity patterns displayed a proximal temporal correlation structure (as participants watched the episode) matching that of the original episode's topic proportions (Fig.~\ref{fig:brainz}A; see \textit{Methods} for additional details). In a second analysis, we sought to identify brain structures whose responses (during episode viewing) reflected how each participant would later structure their recounting of the episode. We used a searchlight procedure to identify clusters of voxels whose proximal temporal correlation matrices matched that of the topic proportions matrix for each participant's recall transcript (Fig.~\ref{fig:brainz}B; see \textit{Methods} for additional details). To ensure our searchlight procedure identified regions specifically sensitive to the temporal structure of the episode or recalls (i.e., rather than those with a temporal autocorrelation length similar to that of the episode and recalls), we performed a phase shift-based permutation correction (see \textit{Methods}). As shown in Figure~\ref{fig:brainz}C, the episode-driven searchlight analysis revealed a distributed network of regions that may play a role in processing information relevant to the narrative structure of the episode. The recall-driven searchlight analysis revealed a second network of regions (Fig.~\ref{fig:brainz}D) that may facilitate a person-specific transformation of one's experience into memory.
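The core searchlight statistic can be sketched as follows. This is a simplified reconstruction with hypothetical function and parameter names (e.g., the \texttt{max\_lag} value is illustrative); it compares the proximal temporal correlation structure of a cube of voxels against that of a model (episode or time-warped recall) topic proportions matrix.

```python
import numpy as np

def proximal_indices(n_timepoints, max_lag):
    """Upper-triangle index pairs within max_lag diagonals of the main diagonal."""
    i, j = np.triu_indices(n_timepoints, k=1)
    keep = (j - i) <= max_lag
    return i[keep], j[keep]

def searchlight_fit(voxel_cube, model_topics, max_lag=10):
    """Correlation between the proximal temporal correlation structure of a
    cube of voxels (timepoints x voxels) and of a model topic proportions
    matrix (timepoints x topics) with the same number of timepoints."""
    rows, cols = proximal_indices(voxel_cube.shape[0], max_lag)
    brain_proximal = np.corrcoef(voxel_cube)[rows, cols]
    model_proximal = np.corrcoef(model_topics)[rows, cols]
    return np.corrcoef(brain_proximal, model_proximal)[0, 1]
```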
In identifying regions whose responses to ongoing experiences reflect how those experiences will be remembered later, this latter analysis extends classic ``subsequent memory effect analyses''~\citep{PallWagn02} to the domain of naturalistic experiences. \begin{figure} [tp] \centering \includegraphics[width=1\textwidth]{figs/searchlights} -\caption{\small \textbf{Brain structures that underlie the transformation of experience into memory.} \textbf{A.} We isolated the proximal diagonals from the upper triangle of the episode correlation matrix, and applied this same diagonal mask to the voxel response correlation matrix for each cube of voxels in the brain. We then searched for brain regions whose activation timeseries consistently exhibited a similar proximal correlational structure to the episode model, across participants. \textbf{B.} We used dynamic time warping \citep{BernClif94} to align each participant's recall timeseries to the TR timeseries of the episode. We then \DIFaddbeginFL \DIFaddFL{computed the temporal correlation matrix of each participant's warped recalls. Next, we }\DIFaddendFL applied the same diagonal mask used in Panel A to isolate the proximal temporal correlations and searched for brain regions whose activation timeseries for \DIFdelbeginFL \DIFdelFL{an individual }\DIFdelendFL \DIFaddbeginFL \DIFaddFL{each participant }\DIFaddendFL consistently exhibited a similar proximal correlational structure to \DIFdelbeginFL \DIFdelFL{each individual}\DIFdelendFL \DIFaddbeginFL \DIFaddFL{that participant}\DIFaddendFL 's \DIFdelbeginFL \DIFdelFL{recall}\DIFdelendFL \DIFaddbeginFL \DIFaddFL{recalls}\DIFaddendFL . \textbf{C.} We identified a network of regions sensitive to the narrative structure of participants' ongoing experience. The map shown is thresholded at $p < 0.05$, corrected. The top ten \texttt{Neurosynth} terms displayed in the panel were computed using the unthresholded map. \textbf{D}. We also identified a network of regions sensitive to how individuals would later structure the episode's content in their recalls. The map shown is thresholded at $p < 0.05$, corrected. The top ten \texttt{Neurosynth} terms displayed in the panel were computed using the unthresholded map.} +\caption{\small \textbf{Brain structures that underlie the transformation of experience into memory.} \textbf{A.} We isolated the proximal diagonals from the upper triangle of the episode correlation matrix, and applied this same diagonal mask to the voxel response correlation matrix for each cube of voxels in the brain. We then searched for brain regions whose activation timeseries consistently exhibited a similar proximal correlational structure to the episode model, across participants. \textbf{B.} We used dynamic time warping \citep{BernClif94} to align each participant's recall timeseries to the TR timeseries of the episode. We then computed the temporal correlation matrix of each participant's warped recalls. Next, we applied the same diagonal mask used in Panel A to isolate the proximal temporal correlations and searched for brain regions whose activation timeseries for each participant consistently exhibited a similar proximal correlational structure to that participant's recalls. \textbf{C.} We identified a network of regions sensitive to the narrative structure of participants' ongoing experience. The map shown is thresholded at $p < 0.05$, corrected. The top ten \texttt{Neurosynth} terms displayed in the panel were computed using the unthresholded map. \textbf{D}. 
We also identified a network of regions sensitive to how individuals would later structure the episode's content in their recalls. The map shown is thresholded at $p < 0.05$, corrected. The top ten \texttt{Neurosynth} terms displayed in the panel were computed using the unthresholded map.} \label{fig:brainz} \end{figure} - -The searchlight analyses described above yielded two distributed networks of brain regions whose activity timecourses tracked with the temporal structure of the episode (Fig.~\ref{fig:brainz}C) or participants' subsequent recalls (Fig.~\ref{fig:brainz}D). We next sought to gain greater insight into the structures and functional networks our results reflected. To accomplish this, we performed an additional, exploratory analysis using \texttt{Neurosynth}\DIFaddbegin \DIFadd{~}\DIFaddend \citep{YarkEtal11}. Given an arbitrary statistical map as input, \texttt{Neurosynth} performs a massive automated meta-analysis, returning a \DIFdelbegin \DIFdel{ranked }\DIFdelend \DIFaddbegin \DIFadd{frequency-ranked }\DIFaddend list of terms \DIFdelbegin \DIFdel{frequently }\DIFdelend used in neuroimaging papers that report similar statistical maps. We ran \texttt{Neurosynth} on the (unthresholded) permutation-corrected maps for the episode- and recall-driven searchlight analyses. The top ten terms with maximally similar meta-analysis images identified by \texttt{Neurosynth} are shown in Figure \ref{fig:brainz}. +The searchlight analyses described above yielded two distributed networks of brain regions whose activity timecourses tracked with the temporal structure of the episode (Fig.~\ref{fig:brainz}C) or participants' subsequent recalls (Fig.~\ref{fig:brainz}D). We next sought to gain greater insight into the structures and functional networks our results reflected. To accomplish this, we performed an additional, exploratory analysis using \texttt{Neurosynth}~\citep{YarkEtal11}. Given an arbitrary statistical map as input, \texttt{Neurosynth} performs a massive automated meta-analysis, returning a frequency-ranked list of terms used in neuroimaging papers that report similar statistical maps. We ran \texttt{Neurosynth} on the (unthresholded) permutation-corrected maps for the episode- and recall-driven searchlight analyses. The top ten terms with maximally similar meta-analysis images identified by \texttt{Neurosynth} are shown in Figure \ref{fig:brainz}. \section*{Discussion} \label{sec:discussion} -Explicitly modeling the dynamic content of a naturalistic stimulus and participants' memories enabled us to connect the present study of naturalistic recall with an extensive prior literature that has used list-learning paradigms to study memory~\DIFdelbegin \DIFdel{\mbox{%DIFAUXCMD -\citep[for review see][]{Kaha12}}\hspace{0pt}%DIFAUXCMD -}\DIFdelend \DIFaddbegin \DIFadd{\mbox{%DIFAUXCMD -\citep{Kaha12}}\hspace{0pt}%DIFAUXCMD -}\DIFaddend , as in Figure~\ref{fig:list-learning}. We found some similarities between how participants in the present study recounted a television episode and how participants typically recall memorized random word lists. However, our broader claim is that word lists miss out on fundamental aspects of naturalistic memory more like the sort of memory we rely on in everyday life. 
For example, there are no random word list analogs of character interactions, conceptual dependencies between temporally distant episode events, the sense of solving a mystery that pervades the \textit{Sherlock} episode, or the myriad other features of the episode that convey deep meaning and capture interest. Nevertheless, each of these properties affects how people process and engage with the episode as they are watching it, and how they remember it later. The overarching goal of the present study is to characterize how the rich dynamics of the episode affect the rich behavioral and neural dynamics of how people remember it. - -Our work casts remembering as reproducing (behaviorally and neurally) the topic trajectory, or ``shape,'' of an experience\DIFaddbegin \DIFadd{, thereby drawing implicit analogies between mentally navigating through word embedding spaces and physically navigating through spatial environments~\mbox{%DIFAUXCMD -\citep{BellEtal18, BellEtal20, ConsEtal16}}\hspace{0pt}%DIFAUXCMD -}\DIFaddend . When we characterized memory for a television episode using this framework, we found that every participant's recounting of the episode recapitulated the low spatial frequency details of the shape of its trajectory through topic space (Fig.~\ref{fig:trajectory}). We termed this narrative scaffolding the episode's \DIFdelbegin \textit{\DIFdel{essence}}%DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{essence}\DIFaddend . Where participants' behaviors varied most was in their tendencies to recount specific low-level details from each episode event. Geometrically, this appears as high spatial frequency distortions in participants' recall trajectories relative to the trajectory of the original episode (Fig.~\ref{fig:topics}). We developed metrics to characterize the precision (recovery of any and all event-level information) and distinctiveness (recovery of event-specific information). We also used word cloud visualizations to interpret the details of these event-level distortions. - -The neural analyses we carried out (Fig.~\ref{fig:brainz}) also leveraged our geometric framework for characterizing the shapes of the episode and participants' recountings. We identified one network of regions whose responses tracked with temporal correlations in the conceptual content of the episode (as quantified by topic models applied to a set of annotations about the episode). This network included orbitofrontal cortex, ventromedial prefrontal cortex, and striatum, among others. As reviewed by \DIFaddbegin \DIFadd{Ranganath and Ritchey (2012)~}\DIFaddend \cite{RangRitc12}, several of these regions are members of the \DIFdelbegin \textit{\DIFdel{anterior temporal system}}%DIFAUXCMD -\DIFdel{,}\DIFdelend \DIFaddbegin \DIFadd{``anterior temporal system,'' }\DIFaddend which has been implicated in assessing and processing the familiarity of ongoing experiences, emotions, social cognition, and reward. A second network we identified tracked with temporal correlations in the idiosyncratic conceptual content of participants' subsequent recountings of the episode. This network included occipital cortex, extrastriate cortex, fusiform gyrus, and the precuneus. 
Several of these regions are members of the \DIFdelbegin \textit{\DIFdel{posterior medial system}}%DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{``posterior medial system''}\DIFaddend ~\citep{RangRitc12}, which has been implicated in matching incoming cues about the current situation to internally maintained \DIFdelbegin \textit{\DIFdel{situation models}} %DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{``situation models'' }\DIFaddend that specify the parameters and expectations inherent to the current situation~\DIFdelbegin \DIFdel{\mbox{%DIFAUXCMD -\citep[also see][]{ZackEtal07, ZwaaRadv98}}\hspace{0pt}%DIFAUXCMD -}\DIFdelend \DIFaddbegin \DIFadd{\mbox{%DIFAUXCMD -\citep{ZackEtal07, ZwaaRadv98}}\hspace{0pt}%DIFAUXCMD -}\DIFaddend . Taken together, our results support the notion that these two (partially overlapping) networks work in coordination to make sense of our ongoing experiences, distort them in a way that links them with our prior knowledge and experiences, and encodes those distorted representations into memory for our later use. \DIFaddbegin \DIFadd{Our work also provides a potential framework for modeling and elucidating ``memory schemas''---i.e., cognitive abstractions that may be applied to multiple related experiences~\mbox{%DIFAUXCMD -\citep{GilbMarl17, BaldEtal18}}\hspace{0pt}%DIFAUXCMD -. For example, the event-level geometric scaffolding of an experience (e.g., Fig.~\ref{fig:trajectory}A) might reflect its underlying schema, and experiences that share similar schemas might have similar shapes. This could also help explain how brain structures including the ventromedial prefrontal cortex~\mbox{%DIFAUXCMD -\citep{GilbMarl17} }\hspace{0pt}%DIFAUXCMD -(Fig.~\ref{fig:brainz}) might acquire or apply schema knowledge across different experiences (i.e., by learning patterns in the schema's shape). -}\DIFaddend +Explicitly modeling the dynamic content of a naturalistic stimulus and participants' memories enabled us to connect the present study of naturalistic recall with an extensive prior literature that has used list-learning paradigms to study memory~\citep{Kaha12}, as in Figure~\ref{fig:list-learning}. We found some similarities between how participants in the present study recounted a television episode and how participants typically recall memorized random word lists. However, our broader claim is that word lists miss out on fundamental aspects of naturalistic memory more like the sort of memory we rely on in everyday life. For example, there are no random word list analogs of character interactions, conceptual dependencies between temporally distant episode events, the sense of solving a mystery that pervades the \textit{Sherlock} episode, or the myriad other features of the episode that convey deep meaning and capture interest. Nevertheless, each of these properties affects how people process and engage with the episode as they are watching it, and how they remember it later. The overarching goal of the present study is to characterize how the rich dynamics of the episode affect the rich behavioral and neural dynamics of how people remember it. + +Our work casts remembering as reproducing (behaviorally and neurally) the topic trajectory, or ``shape,'' of an experience, thereby drawing implicit analogies between mentally navigating through word embedding spaces and physically navigating through spatial environments~\citep{BellEtal18, BellEtal20, ConsEtal16}. 
When we characterized memory for a television episode using this framework, we found that every participant's recounting of the episode recapitulated the low spatial frequency details of the shape of its trajectory through topic space (Fig.~\ref{fig:trajectory}). We termed this narrative scaffolding the episode's essence. Where participants' behaviors varied most was in their tendencies to recount specific low-level details from each episode event. Geometrically, this appears as high spatial frequency distortions in participants' recall trajectories relative to the trajectory of the original episode (Fig.~\ref{fig:topics}). We developed metrics to characterize the precision (recovery of any and all event-level information) and distinctiveness (recovery of event-specific information). We also used word cloud visualizations to interpret the details of these event-level distortions. -Our general approach draws inspiration from prior work aimed at elucidating the neural and behavioral underpinnings of how we process dynamic naturalistic experiences and remember them later. Our approach to identifying neural responses to naturalistic stimuli (including experiences) entails building an explicit model of the stimulus dynamics and searching for brain regions whose responses are consistent with the model~\DIFdelbegin \DIFdel{\mbox{%DIFAUXCMD -\citep[also see][]{HuthEtal12, HuthEtal16}}\hspace{0pt}%DIFAUXCMD -. }\DIFdelend \DIFaddbegin \DIFadd{\mbox{%DIFAUXCMD -\citep{HuthEtal12, HuthEtal16}}\hspace{0pt}%DIFAUXCMD -. Building an explicit model of these dynamics also enables us to match up different people's recountings of a common shared experience, despite individual differences~\mbox{%DIFAUXCMD -\cite{GagnEtal20}}\hspace{0pt}%DIFAUXCMD -. }\DIFaddend In prior work, a series of studies from Uri Hasson's group~\citep{LernEtal11, SimoEtal16, ChenEtal17, BaldEtal17, ZadbEtal17} have presented a clever alternative approach: rather than building an explicit stimulus model, these studies instead search for brain responses to the stimulus that are reliably similar across individuals. So called \DIFdelbegin \textit{\DIFdel{inter-subject correlation}} %DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{``inter-subject correlation'' }\DIFaddend (ISC) and \DIFdelbegin \textit{\DIFdel{inter-subject functional connectivity}} %DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{``inter-subject functional connectivity'' }\DIFaddend (ISFC) analyses effectively treat other people's brain responses to the stimulus as a ``model'' of how its features change over time~\DIFdelbegin \DIFdel{\mbox{%DIFAUXCMD -\citep[also see][]{SimoChan20}}\hspace{0pt}%DIFAUXCMD -}\DIFdelend \DIFaddbegin \DIFadd{\mbox{%DIFAUXCMD -\citep{SimoChan20}}\hspace{0pt}%DIFAUXCMD -}\DIFaddend . These purely brain-driven approaches are well suited to identifying which brain structures exhibit similar stimulus-driven responses across individuals. Further, because neural response dynamics are observed data (rather than model approximations), such approaches do not require a detailed understanding of which stimulus properties or features might be driving the observed responses. However, this also means that the specific stimulus features driving those responses are typically opaque to the researcher. Our approach is complementary. By explicitly modeling the stimulus dynamics, we are able to relate specific stimulus features to behavioral and neural dynamics. 
However, when our model fails to accurately capture the stimulus dynamics that are truly driving behavioral and neural responses, our approach necessarily yields an incomplete characterization of the neural basis of the processes we are studying. - -Other recent work has used HMMs to discover latent event structure in neural responses to naturalistic stimuli~\citep{BaldEtal17}. By applying HMMs to our explicit models of stimulus and memory dynamics, we gain a more direct understanding of those state dynamics. For example, we found that although the events comprising each participant's recalls recapitulated the episode's essence, participants differed in the \DIFdelbegin \textit{\DIFdel{resolution}} %DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{resolution }\DIFaddend of their recounting of low-level details. In turn, these individual behavioral differences were reflected in differences in neural activity dynamics as participants watched the television episode. +The neural analyses we carried out (Fig.~\ref{fig:brainz}) also leveraged our geometric framework for characterizing the shapes of the episode and participants' recountings. We identified one network of regions whose responses tracked with temporal correlations in the conceptual content of the episode (as quantified by topic models applied to a set of annotations about the episode). This network included orbitofrontal cortex, ventromedial prefrontal cortex, and striatum, among others. As reviewed by Ranganath and Ritchey (2012)~\cite{RangRitc12}, several of these regions are members of the ``anterior temporal system,'' which has been implicated in assessing and processing the familiarity of ongoing experiences, emotions, social cognition, and reward. A second network we identified tracked with temporal correlations in the idiosyncratic conceptual content of participants' subsequent recountings of the episode. This network included occipital cortex, extrastriate cortex, fusiform gyrus, and the precuneus. Several of these regions are members of the ``posterior medial system''~\citep{RangRitc12}, which has been implicated in matching incoming cues about the current situation to internally maintained ``situation models'' that specify the parameters and expectations inherent to the current situation~\citep{ZackEtal07, ZwaaRadv98}. Taken together, our results support the notion that these two (partially overlapping) networks work in coordination to make sense of our ongoing experiences, distort them in a way that links them with our prior knowledge and experiences, and encode those distorted representations into memory for our later use. Our work also provides a potential framework for modeling and elucidating ``memory schemas''---i.e., cognitive abstractions that may be applied to multiple related experiences~\citep{GilbMarl17, BaldEtal18}. For example, the event-level geometric scaffolding of an experience (e.g., Fig.~\ref{fig:trajectory}A) might reflect its underlying schema, and experiences that share similar schemas might have similar shapes. This could also help explain how brain structures including the ventromedial prefrontal cortex~\citep{GilbMarl17} (Fig.~\ref{fig:brainz}) might acquire or apply schema knowledge across different experiences (i.e., by learning patterns in the schema's shape). + +Our general approach draws inspiration from prior work aimed at elucidating the neural and behavioral underpinnings of how we process dynamic naturalistic experiences and remember them later. 
Our approach to identifying neural responses to naturalistic stimuli (including experiences) entails building an explicit model of the stimulus dynamics and searching for brain regions whose responses are consistent with the model~\citep{HuthEtal12, HuthEtal16}. Building an explicit model of these dynamics also enables us to match up different people's recountings of a common shared experience, despite individual differences~\cite{GagnEtal20}. In prior work, a series of studies from Uri Hasson's group~\citep{LernEtal11, SimoEtal16, ChenEtal17, BaldEtal17, ZadbEtal17} have presented a clever alternative approach: rather than building an explicit stimulus model, these studies instead search for brain responses to the stimulus that are reliably similar across individuals. So called ``inter-subject correlation'' (ISC) and ``inter-subject functional connectivity'' (ISFC) analyses effectively treat other people's brain responses to the stimulus as a ``model'' of how its features change over time~\citep{SimoChan20}. These purely brain-driven approaches are well suited to identifying which brain structures exhibit similar stimulus-driven responses across individuals. Further, because neural response dynamics are observed data (rather than model approximations), such approaches do not require a detailed understanding of which stimulus properties or features might be driving the observed responses. However, this also means that the specific stimulus features driving those responses are typically opaque to the researcher. Our approach is complementary. By explicitly modeling the stimulus dynamics, we are able to relate specific stimulus features to behavioral and neural dynamics. However, when our model fails to accurately capture the stimulus dynamics that are truly driving behavioral and neural responses, our approach necessarily yields an incomplete characterization of the neural basis of the processes we are studying. + +Other recent work has used HMMs to discover latent event structure in neural responses to naturalistic stimuli~\citep{BaldEtal17}. By applying HMMs to our explicit models of stimulus and memory dynamics, we gain a more direct understanding of those state dynamics. For example, we found that although the events comprising each participant's recalls recapitulated the episode's essence, participants differed in the resolution of their recounting of low-level details. In turn, these individual behavioral differences were reflected in differences in neural activity dynamics as participants watched the television episode. Our approach also draws inspiration from the growing field of word embedding models. The topic models~\citep{BleiEtal03} we used to embed text from the episode annotations and participants' recall transcripts are just one of many models that have been studied in an extensive literature. The earliest approaches to word embedding, including latent -semantic analysis~\citep{LandDuma97}, used word co-occurrence statistics (i.e., how often pairs of words occur in the same documents contained in the corpus) to derive a unique feature vector for each word. The feature vectors are constructed so that words that co-occur more frequently have feature vectors that are closer (in Euclidean distance). Topic models are essentially an extension of those early models, in that they attempt to explicitly model the underlying causes of word co-occurrences by automatically identifying the set of themes or topics reflected across the documents in the corpus. 
More recent work on these types of semantic models, including word2vec~\citep{MikoEtal13a}, the Universal Sentence Encoder~\citep{CerEtal18}, \DIFaddbegin \DIFadd{and Generative Pre-trained Transformers (e.g., }\DIFaddend GPT-2~\citep{RadfEtal19} \DIFdelbegin \DIFdel{, }\DIFdelend and GTP-3~\citep{BrowEtal20}\DIFaddbegin \DIFadd{) }\DIFaddend use deep neural networks to attempt to identify the deeper conceptual representations underlying each word. Despite the growing popularity of these sophisticated deep learning-based embedding models, we chose to prioritize interpretability of the embedding dimensions (e.g., Fig.~\ref{fig:topics}) over raw performance (e.g., with respect to some predefined benchmark). Nevertheless, we note that our general framework is, in principle, robust to the specific choice of language model as well as other aspects of our computational pipeline. For example, the word embedding model, timeseries segmentation model, and the episode-recall matching function could each be customized to suit a particular question space or application. Indeed, for some questions, interpretability of the embeddings may not be a priority, and thus other text embedding approaches (including the deep learning-based models described above) may be preferable. Further work will be needed to explore the influence of particular models on our framework's predictions and performance. +semantic analysis~\citep{LandDuma97}, used word co-occurrence statistics (i.e., how often pairs of words occur in the same documents contained in the corpus) to derive a unique feature vector for each word. The feature vectors are constructed so that words that co-occur more frequently have feature vectors that are closer (in Euclidean distance). Topic models are essentially an extension of those early models, in that they attempt to explicitly model the underlying causes of word co-occurrences by automatically identifying the set of themes or topics reflected across the documents in the corpus. More recent work on these types of semantic models, including word2vec~\citep{MikoEtal13a}, the Universal Sentence Encoder~\citep{CerEtal18}, and Generative Pre-trained Transformers (e.g., GPT-2~\citep{RadfEtal19} and GPT-3~\citep{BrowEtal20}) uses deep neural networks to attempt to identify the deeper conceptual representations underlying each word. Despite the growing popularity of these sophisticated deep learning-based embedding models, we chose to prioritize interpretability of the embedding dimensions (e.g., Fig.~\ref{fig:topics}) over raw performance (e.g., with respect to some predefined benchmark). Nevertheless, we note that our general framework is, in principle, robust to the specific choice of language model as well as other aspects of our computational pipeline. For example, the word embedding model, timeseries segmentation model, and the episode-recall matching function could each be customized to suit a particular question space or application. Indeed, for some questions, interpretability of the embeddings may not be a priority, and thus other text embedding approaches (including the deep learning-based models described above) may be preferable. Further work will be needed to explore the influence of particular models on our framework's predictions and performance. -\DIFdelbegin \DIFdel{Our work has }\DIFdelend \DIFaddbegin \DIFadd{Speculatively, our work may have }\DIFaddend broad implications for how we characterize and assess memory in real-world settings, such as the classroom or physician's office. 
For example, the most commonly used classroom evaluation tools involve simply computing the proportion of correctly answered exam questions. Our work \DIFdelbegin \DIFdel{indicates }\DIFdelend \DIFaddbegin \DIFadd{suggests }\DIFaddend that this approach is only loosely related to what educators might really want to measure: how well did the students understand the key ideas presented in the course? Under this typical framework of assessment, the same exam score of 50\% could be ascribed to two very different students: one who attended to the full course but struggled to learn more than a broad overview of the material, and one who attended to only half of the course but understood the attended material perfectly. Instead, one could apply our computational framework to build explicit dynamic content models of the course material and exam questions. This approach \DIFdelbegin \DIFdel{would }\DIFdelend \DIFaddbegin \DIFadd{might }\DIFaddend provide a more nuanced and specific view into which aspects of the material students had learned well (or poorly). In clinical settings, memory measures that incorporate such explicit content models might also provide more direct evaluations of patients' memories, and of doctor-patient interactions. +Speculatively, our work may have broad implications for how we characterize and assess memory in real-world settings, such as the classroom or physician's office. For example, the most commonly used classroom evaluation tools involve simply computing the proportion of correctly answered exam questions. Our work suggests that this approach is only loosely related to what educators might really want to measure: how well did the students understand the key ideas presented in the course? Under this typical framework of assessment, the same exam score of 50\% could be ascribed to two very different students: one who attended to the full course but struggled to learn more than a broad overview of the material, and one who attended to only half of the course but understood the attended material perfectly. Instead, one could apply our computational framework to build explicit dynamic content models of the course material and exam questions. This approach might provide a more nuanced and specific view into which aspects of the material students had learned well (or poorly). In clinical settings, memory measures that incorporate such explicit content models might also provide more direct evaluations of patients' memories, and of doctor-patient interactions. \section*{Methods} \label{sec:methods} \subsection*{Paradigm and data collection} -Data were collected by \DIFdelbegin \DIFdel{\mbox{%DIFAUXCMD -\cite{ChenEtal17}}\hspace{0pt}%DIFAUXCMD -. }\DIFdelend \DIFaddbegin \DIFadd{Chen et al. (2017)~\mbox{%DIFAUXCMD -\citep{ChenEtal17}}\hspace{0pt}%DIFAUXCMD -. }\DIFaddend In brief, participants ($n=22$) viewed the first 48 minutes of ``A Study in Pink,'' the first episode of the BBC television show \textit{Sherlock}, while fMRI volumes were collected (TR = 1500~ms). Participants were pre-screened to ensure they had never seen any episode of the show before. The stimulus was divided into a 23~min (946~TR) and a 25~min (1030~TR) segment to mitigate technical issues related to the scanner. 
After finishing the clip, participants were instructed to \DIFdelbegin \DIFdel{\mbox{%DIFAUXCMD -\citep[quoting from][]{ChenEtal17} }\hspace{0pt}%DIFAUXCMD -}\DIFdelend ``describe what they recalled of the [episode] in as much detail as they could, to try to recount events in the original order they were viewed in, and to speak for at least 10 minutes if possible but that longer was better. They were told that completeness and detail were more important than temporal order, and that if at any point they realized they had missed something, to return to it. Participants were then allowed to speak for as long as they wished, and verbally indicated when they were finished (e.g., `I’m done').''\DIFaddbegin \DIFadd{~\mbox{%DIFAUXCMD -\citep{ChenEtal17} }\hspace{0pt}%DIFAUXCMD -}\DIFaddend Five participants were dropped from the original dataset due to excessive head motion (2 participants), insufficient recall length (2 participants), or falling asleep during stimulus viewing (1 participant), resulting in a final sample size of $n=17$. For additional details about the testing procedures and scanning parameters, see \DIFaddbegin \DIFadd{Chen et al. (2017)~}\DIFaddend \cite{ChenEtal17}. The testing protocol was approved by Princeton University's Institutional Review Board. - -After preprocessing the fMRI data and warping the images into a standard (3~mm$^3$ MNI) space, the voxel activations were $z$-scored (within voxel) and spatially smoothed using a 6~mm (full width at half maximum) Gaussian kernel. The fMRI data were also cropped so that all episode-viewing data were aligned across participants. This included a constant 3 TR (4.5~s) shift to account for the lag in the hemodynamic response. \DIFdelbegin \DIFdel{\mbox{%DIFAUXCMD -\citep[All of these preprocessing steps followed][where additional details may be found.]{ChenEtal17} -}\hspace{0pt}%DIFAUXCMD -}\DIFdelend \DIFaddbegin \DIFadd{All of these preprocessing steps followed Chen et al. (2017)~\mbox{%DIFAUXCMD -\citep{ChenEtal17}}\hspace{0pt}%DIFAUXCMD -, where additional details may be found. -}\DIFaddend +Data were collected by Chen et al. (2017)~\citep{ChenEtal17}. In brief, participants ($n=22$) viewed the first 48 minutes of ``A Study in Pink,'' the first episode of the BBC television show \textit{Sherlock}, while fMRI volumes were collected (TR = 1500~ms). Participants were pre-screened to ensure they had never seen any episode of the show before. The stimulus was divided into a 23~min (946~TR) and a 25~min (1030~TR) segment to mitigate technical issues related to the scanner. After finishing the clip, participants were instructed to ``describe what they recalled of the [episode] in as much detail as they could, to try to recount events in the original order they were viewed in, and to speak for at least 10 minutes if possible but that longer was better. They were told that completeness and detail were more important than temporal order, and that if at any point they realized they had missed something, to return to it. Participants were then allowed to speak for as long as they wished, and verbally indicated when they were finished (e.g., `I’m done').''~\citep{ChenEtal17} Five participants were dropped from the original dataset due to excessive head motion (2 participants), insufficient recall length (2 participants), or falling asleep during stimulus viewing (1 participant), resulting in a final sample size of $n=17$. For additional details about the testing procedures and scanning parameters, see Chen et al. (2017)~\cite{ChenEtal17}. 
The testing protocol was approved by Princeton University's Institutional Review Board. -The video stimulus was divided into \DIFdelbegin \DIFdel{1,000 }\DIFdelend \DIFaddbegin \DIFadd{1000 }\DIFaddend fine-grained ``time segments'' and annotated by an independent coder. For each of these \DIFdelbegin \DIFdel{1,000 }\DIFdelend \DIFaddbegin \DIFadd{1000 }\DIFaddend annotations, the following information was recorded: a brief narrative description of what was happening, the location where the time segment took place, whether that location was indoors or outdoors, the names of all characters on-screen, the name(s) of the character(s) in focus in the shot, the name(s) of the character(s) currently speaking, the camera angle of the shot, a transcription of any text appearing on-screen, and whether or not there was music present in the background. Each time segment was also tagged with its onset and offset time, in both seconds and TRs. +After preprocessing the fMRI data and warping the images into a standard (3~mm$^3$ MNI) space, the voxel activations were $z$-scored (within voxel) and spatially smoothed using a 6~mm (full width at half maximum) Gaussian kernel. The fMRI data were also cropped so that all episode-viewing data were aligned across participants. This included a constant 3 TR (4.5~s) shift to account for the lag in the hemodynamic response. All of these preprocessing steps followed Chen et al. (2017)~\citep{ChenEtal17}, where additional details may be found. -\DIFdelbegin \subsection*{\DIFdel{Data and code availability}} -%DIFAUXCMD -\DIFdel{The fMRI data we analyzed are available online }%DIFDELCMD < \href{http://dataspace.princeton.edu/jspui/handle/88435/dsp01nz8062179}{\underline{here}}%%% -\DIFdel{. The behavioral data and all of our analysis code may be downloaded }%DIFDELCMD < \href{https://github.com/ContextLab/sherlock-topic-model-paper}{\underline{here}}%%% -\DIFdel{. -}%DIFDELCMD < +The video stimulus was divided into 1000 fine-grained ``time segments'' and annotated by an independent coder. For each of these 1000 annotations, the following information was recorded: a brief narrative description of what was happening, the location where the time segment took place, whether that location was indoors or outdoors, the names of all characters on-screen, the name(s) of the character(s) in focus in the shot, the name(s) of the character(s) currently speaking, the camera angle of the shot, a transcription of any text appearing on-screen, and whether or not there was music present in the background. Each time segment was also tagged with its onset and offset time, in both seconds and TRs. -%DIFDELCMD < %%% -\DIFdelend \subsection*{Statistics} -All statistical tests performed in the behavioral analyses were two-sided. All statistical tests performed in the neural data analyses were two-sided, except for the permutation-based thresholding, which was one-sided. In this case, we were specifically interested in identifying voxels whose activation time series reflected the temporal structure of the episode and recall topic proportions matrices to a \DIFdelbegin \textit{\DIFdel{greater}} %DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{greater }\DIFaddend extent than that of the phase-shifted matrices. \DIFaddbegin \DIFadd{The 95\% confidence intervals we reported for each correlation were estimated by generating 10000 ``bootstrap'' distributions of correlation coefficients by sampling (with replacement) from the observed data. 
-}\DIFaddend +\subsection*{Statistics} +All statistical tests performed in the behavioral analyses were two-sided. All statistical tests performed in the neural data analyses were two-sided, except for the permutation-based thresholding, which was one-sided. In this case, we were specifically interested in identifying voxels whose activation time series reflected the temporal structure of the episode and recall topic proportions matrices to a greater extent than that of the phase-shifted matrices. The 95\% confidence intervals we reported for each correlation were estimated by generating 10000 ``bootstrap'' distributions of correlation coefficients by sampling (with replacement) from the observed data. \subsection*{Modeling the dynamic content of the episode and recall transcripts} \subsubsection*{Topic modeling} -The input to the topic model we trained to characterize the dynamic content of the episode comprised 998 hand-generated annotations of short (mean: 2.96s) time segments spanning the video clip~(\DIFdelbegin \DIFdel{\mbox{%DIFAUXCMD -\citealp{ChenEtal17} }\hspace{0pt}%DIFAUXCMD -}\DIFdelend \DIFaddbegin \DIFadd{Chen et al., 2017~\mbox{%DIFAUXCMD -\citep{ChenEtal17} }\hspace{0pt}%DIFAUXCMD -}\DIFaddend generated 1000 annotations total; we removed two annotations referring to a break between the first and second scan sessions, during which no fMRI data were collected). We concatenated the text for all of the annotated features within each segment, creating a ``bag of words'' describing its content\DIFaddbegin \DIFadd{, }\DIFaddend and performed some minor preprocessing (e.g., stemming possessive nouns and removing punctuation). We then re-organized the text descriptions into overlapping sliding windows spanning (up to) 50 annotations each. In other words, we estimated the ``context'' for each annotated segment using the text descriptions of the preceding 25 annotations, the present annotations, and the following 24 annotations. To model the context for annotations near the beginning of the episode (i.e., within 25 of the beginning or end), we created overlapping sliding windows that grew in size from one annotation to the full length. We also tapered the sliding window lengths at the end of the episode, whereby time segments within fewer than 24 annotations of the end of the episode were assigned sliding windows that extended to the end of the episode. This procedure ensured that each annotation's content was represented in the text corpus an equal number of times. - -We trained our model using these overlapping text samples with \texttt{scikit-learn} \DIFdelbegin \DIFdel{~\mbox{%DIFAUXCMD -\citep[version 0.19.1; ][]{PedrEtal11}}\hspace{0pt}%DIFAUXCMD -}\DIFdelend \DIFaddbegin \DIFadd{version 0.19.1~\mbox{%DIFAUXCMD -\citep{PedrEtal11}}\hspace{0pt}%DIFAUXCMD -}\DIFaddend , called from our high-dimensional visualization and text analysis software, \texttt{HyperTools}~\citep{HeusEtal18a}. Specifically, we used the \texttt{CountVectorizer} class to transform the text from each window into a vector of word counts (using the union of all words across all annotations as the ``vocabulary,'' excluding English stop words); this yielded a number-of-windows by number-of-words \DIFdelbegin \textit{\DIFdel{word count}} %DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{``word count'' }\DIFaddend matrix. 
We then used the \texttt{LatentDirichletAllocation} class (topics=100, method=`batch') to fit a topic model~\citep{BleiEtal03} to the word count matrix, yielding a number-of-windows (1047) by number-of-topics (100) \DIFdelbegin \textit{\DIFdel{topic proportions}} %DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{``topic proportions'' }\DIFaddend matrix. The topic proportions matrix describes the gradually evolving mix of topics (latent themes) present in each annotated time segment of the episode. Next, we transformed the topic proportions matrix to match the 1976 fMRI volume acquisition times. We assigned each topic vector to the timepoint (in seconds) midway between the beginning of the first annotation and the end of the last annotation in its corresponding sliding text window. By doing so, we warped the linear temporal distance between consecutive topic vectors to align with the inconsistent temporal distance between consecutive annotations (whose durations varied greatly). We then rescaled these timepoints to 1.5s TR units, and used linear interpolation to estimate a topic vector for each TR. This resulted in a number-of-TRs (1976) by number-of-topics (100) matrix. - -We created similar topic proportions matrices using hand-annotated transcripts of each participant's verbal recall of the episode~\DIFdelbegin \DIFdel{\mbox{%DIFAUXCMD -\citep[annotated by][]{ChenEtal17}}\hspace{0pt}%DIFAUXCMD -}\DIFdelend \DIFaddbegin \DIFadd{\mbox{%DIFAUXCMD -\citep{ChenEtal17}}\hspace{0pt}%DIFAUXCMD -}\DIFaddend . We tokenized the transcript into a list of sentences, and then re-organized the list into overlapping sliding windows spanning (up to) 10 sentences each, analogously to how we parsed the episode annotations. In turn, we transformed each window's sentences into a word count vector (using the same vocabulary as for the episode model), then used the topic model already trained on the episode scenes to compute the most probable topic proportions for each sliding window. This yielded a number-of-windows (range: 83--312) by number-of-topics (100) topic proportions matrix for each participant. These reflected the dynamic content of each participant's recalls. \DIFdelbegin \DIFdel{Note: for }\DIFdelend \DIFaddbegin \DIFadd{For }\DIFaddend details on how we selected the episode and recall window lengths and number of topics, see \textit{\DIFdelbegin \DIFdel{Supporting }\DIFdelend \DIFaddbegin \DIFadd{Supplementary }\DIFaddend Information} and \DIFaddbegin \DIFadd{Supplementary }\DIFaddend Figure~\topicopt. +The input to the topic model we trained to characterize the dynamic content of the episode comprised 998 hand-generated annotations of short (mean: 2.96s) time segments spanning the video clip~(Chen et al., 2017~\citep{ChenEtal17} generated 1000 annotations total; we removed two annotations referring to a break between the first and second scan sessions, during which no fMRI data were collected). We concatenated the text for all of the annotated features within each segment, creating a ``bag of words'' describing its content, and performed some minor preprocessing (e.g., stemming possessive nouns and removing punctuation). We then re-organized the text descriptions into overlapping sliding windows spanning (up to) 50 annotations each. In other words, we estimated the ``context'' for each annotated segment using the text descriptions of the preceding 25 annotations, the present annotations, and the following 24 annotations. 
To model the context for annotations near the beginning of the episode (i.e., within 25 of the beginning or end), we created overlapping sliding windows that grew in size from one annotation to the full length. We also tapered the sliding window lengths at the end of the episode, whereby time segments within fewer than 24 annotations of the end of the episode were assigned sliding windows that extended to the end of the episode. This procedure ensured that each annotation's content was represented in the text corpus an equal number of times. + +We trained our model using these overlapping text samples with \texttt{scikit-learn} version 0.19.1~\citep{PedrEtal11}, called from our high-dimensional visualization and text analysis software, \texttt{HyperTools}~\citep{HeusEtal18a}. Specifically, we used the \texttt{CountVectorizer} class to transform the text from each window into a vector of word counts (using the union of all words across all annotations as the ``vocabulary,'' excluding English stop words); this yielded a number-of-windows by number-of-words ``word count'' matrix. We then used the \texttt{LatentDirichletAllocation} class (topics=100, method=`batch') to fit a topic model~\citep{BleiEtal03} to the word count matrix, yielding a number-of-windows (1047) by number-of-topics (100) ``topic proportions'' matrix. The topic proportions matrix describes the gradually evolving mix of topics (latent themes) present in each annotated time segment of the episode. Next, we transformed the topic proportions matrix to match the 1976 fMRI volume acquisition times. We assigned each topic vector to the timepoint (in seconds) midway between the beginning of the first annotation and the end of the last annotation in its corresponding sliding text window. By doing so, we warped the linear temporal distance between consecutive topic vectors to align with the inconsistent temporal distance between consecutive annotations (whose durations varied greatly). We then rescaled these timepoints to 1.5s TR units, and used linear interpolation to estimate a topic vector for each TR. This resulted in a number-of-TRs (1976) by number-of-topics (100) matrix. + +We created similar topic proportions matrices using hand-annotated transcripts of each participant's verbal recall of the episode~\citep{ChenEtal17}. We tokenized the transcript into a list of sentences, and then re-organized the list into overlapping sliding windows spanning (up to) 10 sentences each, analogously to how we parsed the episode annotations. In turn, we transformed each window's sentences into a word count vector (using the same vocabulary as for the episode model), then used the topic model already trained on the episode scenes to compute the most probable topic proportions for each sliding window. This yielded a number-of-windows (range: 83--312) by number-of-topics (100) topic proportions matrix for each participant. These reflected the dynamic content of each participant's recalls. For details on how we selected the episode and recall window lengths and number of topics, see \textit{Supplementary Information} and Supplementary Figure~\topicopt. 
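As a concrete illustration of the windowing-plus-topic-model step just described, the following minimal sketch uses the same \texttt{scikit-learn} classes named above. It is not the repository's actual pipeline: the \texttt{window\_texts} and \texttt{recall\_window\_texts} lists are toy stand-ins for the real sliding windows of episode annotations and recall sentences.

```python
# Minimal sketch (illustrative only) of the topic-modeling step described above.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Toy stand-ins: one string per sliding window of episode annotations / recall sentences.
window_texts = [
    "sherlock and john examine the crime scene",
    "john follows sherlock across london in a cab",
    "sherlock deduces that the killer is a cab driver",
]
recall_window_texts = [
    "they looked at the crime scene together",
    "sherlock figured out the cab driver was the killer",
]

# Word-count ("bag of words") matrix: windows x vocabulary, excluding English stop words.
vectorizer = CountVectorizer(stop_words="english")
episode_counts = vectorizer.fit_transform(window_texts)

# Fit a 100-topic LDA model with batch variational inference, yielding a
# windows x topics "topic proportions" matrix for the episode.
lda = LatentDirichletAllocation(n_components=100, learning_method="batch", random_state=0)
episode_topics = lda.fit_transform(episode_counts)

# Recall windows reuse the same vocabulary and the already-fitted model, so the
# episode and recall trajectories share a single topic space.
recall_topics = lda.transform(vectorizer.transform(recall_window_texts))
```

In the actual analyses, the inputs would be the tapered (up to) 50-annotation episode windows and (up to) 10-sentence recall windows described above, with the recall step repeated for each participant.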
\subsubsection*{Segmenting topic proportions matrices into discrete events using hidden Markov Models} -We parsed the topic proportions matrices of the episode and participants' recalls into discrete events using hidden Markov Models \DIFdelbegin \DIFdel{~\mbox{%DIFAUXCMD -\citep[HMMs;][]{Rabi89}}\hspace{0pt}%DIFAUXCMD -}\DIFdelend \DIFaddbegin \DIFadd{(HMMs)~\mbox{%DIFAUXCMD -\citep{Rabi89}}\hspace{0pt}%DIFAUXCMD -}\DIFaddend . Given the topic proportions matrix (describing the mix of topics at each timepoint) and a number of states, $K$, an HMM recovers the set of state transitions that segments the timeseries into $K$ discrete states. Following \DIFaddbegin \DIFadd{Baldassano et al. (2017)~}\DIFaddend \cite{BaldEtal17}, we imposed an additional set of constraints on the discovered state transitions that ensured that each state was encountered exactly once (i.e., never repeated). We used the \texttt{BrainIAK} toolbox~\citep{Brainiak} to implement this segmentation. - -We used an optimization procedure to select the appropriate $K$ for each topic proportions matrix. Prior studies on narrative structure and processing have shown that we both perceive and internally represent the world around us at multiple, hierarchical timescales\DIFdelbegin \DIFdel{\mbox{%DIFAUXCMD -\citep[e.g.,][]{HassEtal08, LernEtal11, HassEtal15, ChenEtal17, BaldEtal17, BaldEtal18}}\hspace{0pt}%DIFAUXCMD -}\DIFdelend \DIFaddbegin \DIFadd{~\mbox{%DIFAUXCMD -\citep{HassEtal08, LernEtal11, HassEtal15, ChenEtal17, BaldEtal17, BaldEtal18}}\hspace{0pt}%DIFAUXCMD -}\DIFaddend . However, for the purposes of our framework, we sought to identify the single timeseries of \DIFdelbegin \DIFdel{event-representations that is emphasized }\textit{\DIFdel{most heavily}} %DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{event representations that was emphasized most heavily }\DIFaddend in the temporal structure of the episode and of each participant's recall. We quantified this as the set of $K$ states that maximized the similarity between topic vectors for timepoints comprising each state, while minimizing the similarity between topic vectors for timepoints across different states. Specifically, we computed (for each matrix) +We parsed the topic proportions matrices of the episode and participants' recalls into discrete events using hidden Markov Models (HMMs)~\citep{Rabi89}. Given the topic proportions matrix (describing the mix of topics at each timepoint) and a number of states, $K$, an HMM recovers the set of state transitions that segments the timeseries into $K$ discrete states. Following Baldassano et al. (2017)~\cite{BaldEtal17}, we imposed an additional set of constraints on the discovered state transitions that ensured that each state was encountered exactly once (i.e., never repeated). We used the \texttt{BrainIAK} toolbox~\citep{Brainiak} to implement this segmentation. + +We used an optimization procedure to select the appropriate $K$ for each topic proportions matrix. Prior studies on narrative structure and processing have shown that we both perceive and internally represent the world around us at multiple, hierarchical timescales~\citep{HassEtal08, LernEtal11, HassEtal15, ChenEtal17, BaldEtal17, BaldEtal18}. However, for the purposes of our framework, we sought to identify the single timeseries of event representations that was emphasized most heavily in the temporal structure of the episode and of each participant's recall. 
We quantified this as the set of $K$ states that maximized the similarity between topic vectors for timepoints comprising each state, while minimizing the similarity between topic vectors for timepoints across different states. Specifically, we computed (for each matrix) \[ \argmax_K \left[W_{1}(a, b)\right], \] -where $a$ was the distribution of within-state topic vector correlations, and $b$ was the distribution of across-state topic vector correlations . We computed the first Wasserstein distance ($W_{1}$\DIFdelbegin \DIFdel{; }\DIFdelend \DIFaddbegin \DIFadd{, }\DIFaddend also known as \DIFdelbegin \textit{\DIFdel{Earth mover's distance}}%DIFAUXCMD -\DIFdel{; \mbox{%DIFAUXCMD -\citealp{Dobr70, RamdEtal17}}\hspace{0pt}%DIFAUXCMD -}\DIFdelend \DIFaddbegin \DIFadd{``Earth mover's distance''~\mbox{%DIFAUXCMD -\citep{Dobr70, RamdEtal17}}\hspace{0pt}%DIFAUXCMD -}\DIFaddend ) between these distributions for a large range of possible $K$-values (range [2, 50]), and selected the $K$ that yielded the maximum value. Figure~\ref{fig:model}B displays the event boundaries returned for the episode, and \DIFaddbegin \DIFadd{Supplementary }\DIFaddend Figure~\corrmats~displays the event boundaries returned for each participant's recalls. See \DIFaddbegin \DIFadd{Supplementary }\DIFaddend Figure \kopt~for the optimization functions for the episode and recalls. After obtaining these event boundaries, we created stable estimates of the content represented in each event by averaging the topic vectors across timepoints between each pair of event boundaries. This yielded a number-of-events by number-of-topics matrix for the episode and recalls from each participant. +where $a$ was the distribution of within-state topic vector correlations, and $b$ was the distribution of across-state topic vector correlations . We computed the first Wasserstein distance ($W_{1}$, also known as ``Earth mover's distance''~\citep{Dobr70, RamdEtal17}) between these distributions for a large range of possible $K$-values (range [2, 50]), and selected the $K$ that yielded the maximum value. Figure~\ref{fig:model}B displays the event boundaries returned for the episode, and \DIFdelbegin \DIFdel{Supplementary }\DIFdelend \DIFaddbegin \DIFadd{Extended Data }\DIFaddend Figure~\corrmats~displays the event boundaries returned for each participant's recalls. See \DIFdelbegin \DIFdel{Supplementary Figure}\DIFdelend \DIFaddbegin \DIFadd{Extended Data Figure~}\DIFaddend \kopt~for the optimization functions for the episode and recalls. After obtaining these event boundaries, we created stable estimates of the content represented in each event by averaging the topic vectors across timepoints between each pair of event boundaries. This yielded a number-of-events by number-of-topics matrix for the episode and recalls from each participant. \subsubsection*{Naturalistic extensions of classic list-learning analyses} In traditional list-learning experiments, participants view a list of items (e.g., words) and then recall the items later. Our episode-recall event matching approach affords us the ability to analyze memory in a similar way. The episode and recall events can be treated analogously to studied and recalled ``items'' in a list-learning study. We can then extend classic analyses of memory performance and dynamics (originally designed for list-learning experiments) to the more naturalistic episode recall task used in this study. 
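Before turning to the list-learning analyses, here is a rough, hypothetical sketch of the segmentation and $K$-selection procedure described in the preceding subsection. It is not the authors' code: it assumes a \texttt{topics} array holding a timepoints-by-100 topic proportions matrix, and it uses \texttt{BrainIAK}'s \texttt{EventSegment} in its basic documented form together with SciPy's Wasserstein distance.

```python
# Rough sketch (our paraphrase, not the paper's code) of choosing the number of
# events K by maximizing the Wasserstein distance between within-event and
# across-event topic-vector correlations. Assumes `topics` is a
# (timepoints x 100) topic proportions matrix.
import numpy as np
from scipy.stats import wasserstein_distance
from brainiak.eventseg.event import EventSegment  # assumes BrainIAK's basic usage

def event_labels(topics, k):
    """Segment the topic timeseries into k non-repeating events; return one label per timepoint."""
    seg = EventSegment(k)
    seg.fit(topics)
    return seg.segments_[0].argmax(axis=1)

def separation_score(topics, k):
    """W1 distance between within-event and across-event temporal correlations."""
    labels = event_labels(topics, k)
    corrs = np.corrcoef(topics)              # timepoints x timepoints correlation matrix
    i, j = np.triu_indices_from(corrs, k=1)  # unique timepoint pairs (upper triangle)
    same_event = labels[i] == labels[j]
    return wasserstein_distance(corrs[i[same_event], j[same_event]],
                                corrs[i[~same_event], j[~same_event]])

# Pick the K in [2, 50] that best separates within- from across-event correlations.
best_k = max(range(2, 51), key=lambda k: separation_score(topics, k))
```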
-Perhaps the simplest and most widely used measure of memory performance is \DIFdelbegin \textit{\DIFdel{accuracy}}%DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{``accuracy''}\DIFaddend ---i.e., the proportion of studied (experienced) items (in this case, episode events) that the participant later remembered. \DIFaddbegin \DIFadd{Chen et al.~(2017)~}\DIFaddend \cite{ChenEtal17} used this method to rate each participant's memory quality by computing the proportion of (50 \DIFdelbegin \DIFdel{, }\DIFdelend manually identified) scenes mentioned in their recall. We found a strong across-participants correlation between these independent ratings and the proportion of 30 HMM-identified episode events matched to participants' recalls (Pearson's \DIFdelbegin \DIFdel{$r(15) = 0.71, p = 0.002$}\DIFdelend \DIFaddbegin \DIFadd{$r(15) = 0.71, p = 0.002,~95\%~\mathrm{CI} = [0.39, 0.88]$}\DIFaddend ). We further considered a number of more nuanced memory performance measures that are typically associated with list-learning studies. We also provide a software package, \texttt{Quail}, for carrying out these analyses~\citep{HeusEtal17b}. +Perhaps the simplest and most widely used measure of memory performance is ``accuracy''---i.e., the proportion of studied (experienced) items (in this case, episode events) that the participant later remembered. Chen et al.~(2017)~\cite{ChenEtal17} used this method to rate each participant's memory quality by computing the proportion of (50 manually identified) scenes mentioned in their recall. We found a strong across-participants correlation between these independent ratings and the proportion of 30 HMM-identified episode events matched to participants' recalls (Pearson's $r(15) = 0.71, p = 0.002,~95\%~\mathrm{CI} = [0.39, 0.88]$). We further considered a number of more nuanced memory performance measures that are typically associated with list-learning studies. We also provide a software package, \texttt{Quail}, for carrying out these analyses~\citep{HeusEtal17b}. \paragraph{Probability of first recall (PFR).} PFR curves~\citep{WelcBurn24, PostPhil65, AtkiShif68} reflect the probability that an item will be recalled first, as a function of its serial position during encoding. To carry out this analysis, we initialized a number-of-participants (17) by number-of-episode-events (30) matrix of zeros. Then, for each participant, we found the index of the episode event that was recalled first (i.e., the episode event whose topic vector was most strongly correlated with that of the first recall event) and filled in that index in the matrix with a 1. Finally, we averaged over the rows of the matrix, resulting in a 1 by 30 array representing the proportion of participants that recalled an event first, as a function of the order of the event's appearance in the episode (Fig.~\ref{fig:list-learning}A). -%DIF < ------ NOTE: ------ -%DIF < reiterate meaning of error ribbons in list-learning figure? (already noted in figure caption) -%DIF < - Paxton -\paragraph{Lag conditional probability curve (lag-CRP).} The lag-CRP curve~\citep{Kaha96} reflects the probability of recalling a given item after the just-recalled item, as a function of their relative encoding positions (\DIFdelbegin \DIFdel{or }\textit{\DIFdel{lag}}%DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{lag}\DIFaddend ). 
In other words, a lag of 1 indicates that a recalled item was presented immediately after the previously recalled item, and a lag of -3 indicates that a recalled item came 3 items before the previously recalled item. For each recall transition (following the first recall), we computed the lag between the current recall event and the next recall event, normalizing by the total number of possible transitions. This yielded a number-of-participants (17) by number-of-lags (-29 to +29; 58 lags total excluding lags of 0) matrix. We averaged over the rows of this matrix to obtain a group-averaged lag-CRP curve (Fig.~\ref{fig:list-learning}B). +\paragraph{Lag conditional probability curve (lag-CRP).} The lag-CRP curve~\citep{Kaha96} reflects the probability of recalling a given item after the just-recalled item, as a function of their relative encoding positions (lag). In other words, a lag of 1 indicates that a recalled item was presented immediately after the previously recalled item, and a lag of -3 indicates that a recalled item came 3 items before the previously recalled item. For each recall transition (following the first recall), we computed the lag between the current recall event and the next recall event, normalizing by the total number of possible transitions. This yielded a number-of-participants (17) by number-of-lags (-29 to +29; 58 lags total excluding lags of 0) matrix. We averaged over the rows of this matrix to obtain a group-averaged lag-CRP curve (Fig.~\ref{fig:list-learning}B). \paragraph{Serial position curve (SPC).} SPCs~\citep{Murd62a} reflect the proportion of participants that remember each item as a function of the item's serial position during encoding. We initialized a number-of-participants (17) by number-of-episode-events (30) matrix of zeros. Then, for each recalled event, for each participant, we found the index of the episode event that the recalled event most closely matched (via the correlation between the events' topic vectors) and entered a 1 into that position in the matrix. This resulted in a matrix whose entries indicated whether or not each event was recalled by each participant (depending on whether the corresponding entries were set to one or zero). Finally, we averaged over the rows of the matrix to yield a 1 by 30 array representing the proportion of participants that recalled each event as a function of the events' order of appearance in the episode (Fig.~\ref{fig:list-learning}C). \paragraph{Temporal clustering scores.} Temporal clustering describes a participant's tendency to organize their recall sequences by the learned items' encoding positions. For instance, if a participant recalled the episode events in the exact order they occurred (or in exact reverse order), this would yield a score of 1. If a participant recalled the events in random order, this would yield an expected score of 0.5. For each recall event transition (and separately for each participant), we sorted all not-yet-recalled events according to their absolute lag (i.e., distance away in the episode). We then computed the percentile rank of the next event the participant recalled. We averaged these percentile ranks across all of the participant's recalls to obtain a single temporal clustering score for the participant. -\paragraph{Semantic clustering scores.} Semantic clustering describes a participant's tendency to recall semantically similar presented items together in their recall sequences. Here, we used the topic vectors for each event as a proxy for its semantic content. 
Thus, the similarity between the semantic content for two events can be computed by correlating their respective topic vectors. For each recall event transition, we sorted all not-yet-recalled events according to how correlated the topic vector \DIFdelbegin \textit{\DIFdel{of the closest-matching episode event}} %DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{of the closest-matching episode event }\DIFaddend was to the topic vector of the closest-matching episode event to the just-recalled event. We then computed the percentile rank of the observed next recall. We averaged these percentile ranks across all of the participant's recalls to obtain a single semantic clustering score for the participant. +\paragraph{Semantic clustering scores.} Semantic clustering describes a participant's tendency to recall semantically similar presented items together in their recall sequences. Here, we used the topic vectors for each event as a proxy for its semantic content. Thus, the similarity between the semantic content for two events can be computed by correlating their respective topic vectors. For each recall event transition, we sorted all not-yet-recalled events according to how correlated the topic vector of the closest-matching episode event was to the topic vector of the closest-matching episode event to the just-recalled event. We then computed the percentile rank of the observed next recall. We averaged these percentile ranks across all of the participant's recalls to obtain a single semantic clustering score for the participant. \subsubsection*{Averaging correlations} In all instances where we performed statistical tests involving precision or distinctiveness scores (Fig.~\ref{fig:precision-detail}), we used the Fisher $z$-transformation~\citep{Fish25} to stabilize the variance across the distribution of correlation values prior to performing the test. Similarly, when averaging precision or distinctiveness scores, we $z$-transformed the scores prior to computing the mean, and inverse $z$-transformed the result. \subsubsection*{Visualizing the episode and recall topic trajectories} -We used the UMAP algorithm~\citep{McInEtal18} to project the 100-dimensional topic space onto a two-dimensional space for visualization (Figs.~\ref{fig:trajectory}, \ref{fig:topics}). To ensure that all of the trajectories were projected onto the \DIFdelbegin \textit{\DIFdel{same}} %DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{same }\DIFaddend lower dimensional space, we computed the low-dimensional embedding on a ``stacked'' matrix created by vertically concatenating the events-by-topics topic proportions matrices for the episode, \DIFaddbegin \DIFadd{the }\DIFaddend across-participants average \DIFdelbegin \DIFdel{recall }\DIFdelend \DIFaddbegin \DIFadd{recalls }\DIFaddend and all 17 individual participants' recalls. We then separated the rows of the result (a total-number-of-events by two matrix) back into individual matrices for the episode topic trajectory, \DIFaddbegin \DIFadd{the }\DIFaddend across-participant average recall trajectory, and the trajectories for each individual participant's recalls (Fig.~\ref{fig:trajectory}). This general approach for discovering a shared low-dimensional embedding for a collections of high-dimensional observations follows \DIFaddbegin \DIFadd{our prior work on manifold learning~}\DIFaddend \cite{HeusEtal18a}. 
+We used the UMAP algorithm~\citep{McInEtal18} to project the 100-dimensional topic space onto a two-dimensional space for visualization (Figs.~\ref{fig:trajectory}, \ref{fig:topics}). To ensure that all of the trajectories were projected onto the same lower dimensional space, we computed the low-dimensional embedding on a ``stacked'' matrix created by vertically concatenating the events-by-topics topic proportions matrices for the episode, the across-participants average recalls, and all 17 individual participants' recalls. We then separated the rows of the result (a total-number-of-events by two matrix) back into individual matrices for the episode topic trajectory, the across-participant average recall trajectory, and the trajectories for each individual participant's recalls (Fig.~\ref{fig:trajectory}). This general approach for discovering a shared low-dimensional embedding for a collection of high-dimensional observations follows our prior work on manifold learning~\cite{HeusEtal18a}.

-We optimized the manifold space for visualization based on two criteria: First, that the 2D embedding of the episode trajectory should reflect its original 100-dimensional structure as faithfully as possible. Second, that the path traversed by the embedded episode trajectory should intersect itself a minimal number of times. The first criteria helps bolster the validity of visual intuitions about relationships between sections of episode content, based on their locations in the embedding space. The second criteria was motivated by the observed low off-diagonal values in the episode trajectory's temporal correlation matrix (suggesting that the same topic-space coordinates should not be revisited; see Fig.~2A). For further details on how we created this low-dimensional embedding space, see \textit{\DIFdelbegin \DIFdel{Supporting }\DIFdelend \DIFaddbegin \DIFadd{Supplementary }\DIFaddend Information}.
+We optimized the manifold space for visualization based on two criteria: First, that the 2D embedding of the episode trajectory should reflect its original 100-dimensional structure as faithfully as possible. Second, that the path traversed by the embedded episode trajectory should intersect itself a minimal number of times. The first criterion helps bolster the validity of visual intuitions about relationships between sections of episode content, based on their locations in the embedding space. The second criterion was motivated by the observed low off-diagonal values in the episode trajectory's temporal correlation matrix (suggesting that the same topic-space coordinates should not be revisited; see Fig.~2A). For further details on how we created this low-dimensional embedding space, see \textit{Supplementary Information}.

\subsubsection*{Estimating the consistency of flow through topic space across participants}

-In Figure~\ref{fig:trajectory}B, we present an analysis aimed at characterizing locations in topic space that different participants move through in a consistent way (via their recall topic trajectories\DIFaddbegin \DIFadd{; also see Supp.\ Fig.~\arrows}\DIFaddend ). The two-dimensional topic space used in our visualizations (Fig.~\ref{fig:trajectory}) comprised a $60 \times 60$ (arbitrary units) square. We tiled this space with a $50 \times 50$ grid of evenly spaced vertices, and defined a circular area centered on each vertex whose radius was two times the distance between adjacent vertices (i.e., 2.4 units). 
For each vertex, we examined the set of line segments formed by connecting each pair successively recalled events, across all participants, that passed through this circle. We computed the distribution of angles formed by those segments and the $x$-axis, and used a Rayleigh test to determine whether the distribution of angles was reliably ``peaked'' (i.e., consistent across all transitions that passed through that local portion of topic space). To create Figure~\ref{fig:trajectory}B, we drew an arrow originating from each grid vertex, pointing in the direction of the average angle formed by the line segments that passed within 2.4 units. We set the arrow lengths to be inversely proportional to the $p$-values of the Rayleigh tests at each vertex. Specifically, for each vertex we converted all of the angles of segments that passed within 2.4 units to unit vectors, and we set the arrow lengths at each vertex proportional to the length of the (circular) mean vector. We also indicated any significant results ($p < 0.05$, corrected using the Benjamani-Hochberg procedure) by coloring the arrows in blue (darker blue denotes a lower $p$-value, i.e., a longer mean vector); all tests with $p \geq 0.05$ are displayed in gray and given a lower opacity value.
+In Figure~\ref{fig:trajectory}B, we present an analysis aimed at characterizing locations in topic space that different participants move through in a consistent way (via their recall topic trajectories; also see \DIFdelbegin \DIFdel{Supp.\ }\DIFdelend \DIFaddbegin \DIFadd{Extended Data }\DIFaddend Fig.~\arrows). The two-dimensional topic space used in our visualizations (Fig.~\ref{fig:trajectory}) comprised a $60 \times 60$ (arbitrary units) square. We tiled this space with a $50 \times 50$ grid of evenly spaced vertices, and defined a circular area centered on each vertex whose radius was two times the distance between adjacent vertices (i.e., 2.4 units). For each vertex, we examined the set of line segments formed by connecting each pair of successively recalled events, across all participants, that passed through this circle. We computed the distribution of angles formed by those segments and the $x$-axis, and used a Rayleigh test to determine whether the distribution of angles was reliably ``peaked'' (i.e., consistent across all transitions that passed through that local portion of topic space). To create Figure~\ref{fig:trajectory}B, we drew an arrow originating from each grid vertex, pointing in the direction of the average angle formed by the line segments that passed within 2.4 units. We set the arrow lengths to be inversely proportional to the $p$-values of the Rayleigh tests at each vertex. Specifically, for each vertex we converted all of the angles of segments that passed within 2.4 units to unit vectors, and we set the arrow lengths at each vertex proportional to the length of the (circular) mean vector. We also indicated any significant results ($p < 0.05$, corrected using the Benjamini-Hochberg procedure) by coloring the arrows in blue (darker blue denotes a lower $p$-value, i.e., a longer mean vector); all tests with $p \geq 0.05$ are displayed in gray and given a lower opacity value.

\subsection*{Searchlight fMRI analyses}

-In Figure~\ref{fig:brainz}, we present two analyses aimed at identifying brain regions whose responses (as participants viewed the episode) exhibited a particular temporal structure. 
We developed a searchlight analysis wherein we constructed a $5 \times 5 \times 5$ cube of voxels \DIFdelbegin \DIFdel{~\mbox{%DIFAUXCMD -\citep[following][]{ChenEtal17} }\hspace{0pt}%DIFAUXCMD -}\DIFdelend centered on each voxel in the brain\DIFaddbegin \DIFadd{~\mbox{%DIFAUXCMD -\citep{ChenEtal17}}\hspace{0pt}%DIFAUXCMD -}\DIFaddend , and for each of these cubes, computed the temporal correlation matrix of the voxel responses during episode viewing. Specifically, for each of the 1976 volumes collected during episode viewing, we correlated the activity patterns in the given cube with the activity patterns (in the same cube) collected during every other timepoint. This yielded a $1976 \times 1976$ correlation matrix for each cube. Note: participant 5's scan ended 75s early, and in \DIFaddbegin \DIFadd{Chen et al. (2017)}\DIFaddend ~\cite{ChenEtal17}'s publicly released dataset, their scan data was zero-padded to match the length of the other participants'. For our searchlight analyses, we removed this padded data (i.e., the last 50 TRs), resulting in a $1925 \times 1925$ correlation matrix for each cube in participant 5's brain. +In Figure~\ref{fig:brainz}, we present two analyses aimed at identifying brain regions whose responses (as participants viewed the episode) exhibited a particular temporal structure. We developed a searchlight analysis wherein we constructed a $5 \times 5 \times 5$ cube of voxels centered on each voxel in the brain~\citep{ChenEtal17}, and for each of these cubes, computed the temporal correlation matrix of the voxel responses during episode viewing. Specifically, for each of the 1976 volumes collected during episode viewing, we correlated the activity patterns in the given cube with the activity patterns (in the same cube) collected during every other timepoint. This yielded a $1976 \times 1976$ correlation matrix for each cube. Note: participant 5's scan ended 75s early, and in Chen et al. (2017)~\cite{ChenEtal17}'s publicly released dataset, their scan data was zero-padded to match the length of the other participants'. For our searchlight analyses, we removed this padded data (i.e., the last 50 TRs), resulting in a $1925 \times 1925$ correlation matrix for each cube in participant 5's brain. -Next, we constructed a series of ``template'' matrices. The first template reflected the timecourse of the episode's topic proportions matrix, and the others reflected the timecourse of each participant's recall topic proportions matrix. To construct the episode template, we computed the correlations between the topic proportions estimated for every pair of TRs (prior to segmenting the topic proportions matrices into discrete events; i.e., the correlation matrix shown in Figs.~\ref{fig:model}B and \ref{fig:brainz}A). We constructed similar temporal correlation matrices for each participant's recall topic proportions matrix (\DIFdelbegin \DIFdel{Figs}\DIFdelend \DIFaddbegin \DIFadd{Fig}\DIFaddend .~\ref{fig:model}D, \DIFaddbegin \DIFadd{Supp.\ Fig.~}\DIFaddend \corrmats). However, to correct for length differences and potential non-linear transformations between viewing time and recall time, we first used dynamic time warping~\citep{BernClif94} to temporally align participants' recall topic proportions matrices with the episode topic proportions matrix. An example correlation matrix before and after warping is shown in Fig.~\ref{fig:brainz}B. This yielded a $1976 \times 1976$ correlation matrix for the episode template and for each participant's recall template. 
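+As a schematic illustration of this step (a minimal sketch, not the analysis code from the repository), the temporal correlation matrix for a single searchlight cube could be computed roughly as follows; here \texttt{bold} is a placeholder name for one participant's 4D functional image (x, y, z, time) and \texttt{center} for the coordinates of the cube's central voxel:
+\begin{verbatim}
+import numpy as np
+
+def cube_corrmat(bold, center, radius=2):
+    """Timepoint-by-timepoint correlation matrix for one cube of voxels.
+
+    bold   : 4D array, shape (x, y, z, n_timepoints)
+    center : (i, j, k) index of the cube's central voxel
+    radius : half-width of the cube (radius=2 gives a 5 x 5 x 5 cube)
+    """
+    i, j, k = center
+    cube = bold[i - radius:i + radius + 1,
+                j - radius:j + radius + 1,
+                k - radius:k + radius + 1, :]
+    # rows = timepoints, columns = voxels within the cube
+    patterns = cube.reshape(-1, cube.shape[-1]).T
+    # correlate the cube's activity pattern at every pair of timepoints
+    return np.corrcoef(patterns)
+\end{verbatim}
+Repeating this computation with the cube centered on every voxel in the brain yields one such matrix per searchlight location.
+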
+Next, we constructed a series of ``template'' matrices. The first template reflected the timecourse of the episode's topic proportions matrix, and the others reflected the timecourse of each participant's recall topic proportions matrix. To construct the episode template, we computed the correlations between the topic proportions estimated for every pair of TRs (prior to segmenting the topic proportions matrices into discrete events; i.e., the correlation matrix shown in Figs.~\ref{fig:model}B and \ref{fig:brainz}A). We constructed similar temporal correlation matrices for each participant's recall topic proportions matrix (Fig.~\ref{fig:model}D, \DIFdelbegin \DIFdel{Supp.\ }\DIFdelend \DIFaddbegin \DIFadd{Extended Data }\DIFaddend Fig.~\corrmats). However, to correct for length differences and potential non-linear transformations between viewing time and recall time, we first used dynamic time warping~\citep{BernClif94} to temporally align participants' recall topic proportions matrices with the episode topic proportions matrix. An example correlation matrix before and after warping is shown in Fig.~\ref{fig:brainz}B. This yielded a $1976 \times 1976$ correlation matrix for the episode template and for each participant's recall template. -The temporal structure of the episode's content (as described by our model) is captured in the block-diagonal structure of the episode's temporal correlation matrix (e.g., Figs.~\ref{fig:model}B, \ref{fig:brainz}A), with time periods of thematic stability represented as dark blocks of varying sizes. Inspecting the episode correlation matrix suggests that the episode's semantic content is highly temporally specific (i.e., the correlations between topic vectors from distant timepoints are almost all near zero). By contrast, the activity patterns of individual (cubes of) voxels can encode relatively limited information on their own, and their activity frequently contributes to multiple separate functions \citep{FreeEtal01, SigmDeha08, CharKoec10, RishEtal13}. By nature, these two attributes give rise to similarities in activity across large timescales that may not necessarily reflect a single task. To \DIFdelbegin \DIFdel{enable a more sensitive analysis of }\DIFdelend \DIFaddbegin \DIFadd{identify }\DIFaddend brain regions whose shifts in activity patterns mirrored shifts in the semantic content of the episode or recalls, we restricted the temporal correlations we considered to the timescale of semantic information captured by our model. Specifically, we isolated the upper triangle of the episode correlation matrix and created a ``proximal correlation mask'' that included only diagonals from the upper triangle of the episode correlation matrix up to the first diagonal that contained no positive correlations. Applying this mask to the full episode correlation matrix was equivalent to excluding diagonals beyond the corner of the largest diagonal block. In other words, the timescale of temporal correlations we considered corresponded to the longest period of thematic stability in the episode, and by extension the longest period of thematic stability in participants' recalls and the longest period of stability we might expect to see in voxel activity arising from processing or encoding episode content. Figure \ref{fig:brainz} shows this proximal correlation mask applied to the temporal correlation matrices for the episode, an example participant's (warped) recall, and an example cube of voxels from our searchlight analyses. 
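+To make the warping step concrete, the sketch below uses a bare-bones dynamic time warping routine (with correlation distance between topic vectors as the local cost) to place a recall topic proportions matrix on the episode's timeline and build its ``recall template''; this illustrates the idea only, it is not the implementation used for the reported analyses, and the randomly generated \texttt{episode\_topics} and \texttt{recall\_topics} arrays are placeholders for the real topic proportions matrices:
+\begin{verbatim}
+import numpy as np
+
+def dtw_path(X, Y):
+    """Dynamic time warping alignment between the rows of X and Y, using
+    correlation distance as the local cost (slow, but simple)."""
+    Xz = (X - X.mean(1, keepdims=True)) / X.std(1, keepdims=True)
+    Yz = (Y - Y.mean(1, keepdims=True)) / Y.std(1, keepdims=True)
+    dist = 1 - (Xz @ Yz.T) / X.shape[1]  # pairwise correlation distance
+    n, m = dist.shape
+    cost = np.full((n + 1, m + 1), np.inf)
+    cost[0, 0] = 0
+    for i in range(1, n + 1):            # standard DTW recursion
+        for j in range(1, m + 1):
+            cost[i, j] = dist[i - 1, j - 1] + min(cost[i - 1, j],
+                                                  cost[i, j - 1],
+                                                  cost[i - 1, j - 1])
+    path, i, j = [], n, m                # trace the optimal path back
+    while i > 0 and j > 0:
+        path.append((i - 1, j - 1))
+        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
+        if step == 0:
+            i, j = i - 1, j - 1
+        elif step == 1:
+            i -= 1
+        else:
+            j -= 1
+    return path[::-1]
+
+# placeholders standing in for the real topic proportions matrices
+episode_topics = np.random.rand(1976, 100)  # TRs x topics
+recall_topics = np.random.rand(400, 100)    # recall windows x topics
+
+# warp the recalls onto the episode timeline, then correlate all
+# pairs of timepoints to form the participant's recall template
+path = dtw_path(episode_topics, recall_topics)
+warped = np.array([recall_topics[[j for i, j in path if i == t]].mean(axis=0)
+                   for t in range(len(episode_topics))])
+recall_template = np.corrcoef(warped)       # 1976 x 1976
+\end{verbatim}
+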
+The temporal structure of the episode's content (as described by our model) is captured in the block-diagonal structure of the episode's temporal correlation matrix (e.g., Figs.~\ref{fig:model}B, \ref{fig:brainz}A), with time periods of thematic stability represented as dark blocks of varying sizes. Inspecting the episode correlation matrix suggests that the episode's semantic content is highly temporally specific (i.e., the correlations between topic vectors from distant timepoints are almost all near zero). By contrast, the activity patterns of individual (cubes of) voxels can encode relatively limited information on their own, and their activity frequently contributes to multiple separate functions \citep{FreeEtal01, SigmDeha08, CharKoec10, RishEtal13}. By nature, these two attributes give rise to similarities in activity across large timescales that may not necessarily reflect a single task. To identify brain regions whose shifts in activity patterns mirrored shifts in the semantic content of the episode or recalls, we restricted the temporal correlations we considered to the timescale of semantic information captured by our model. Specifically, we isolated the upper triangle of the episode correlation matrix and created a ``proximal correlation mask'' that included only diagonals from the upper triangle of the episode correlation matrix up to the first diagonal that contained no positive correlations. Applying this mask to the full episode correlation matrix was equivalent to excluding diagonals beyond the corner of the largest diagonal block. In other words, the timescale of temporal correlations we considered corresponded to the longest period of thematic stability in the episode, and by extension the longest period of thematic stability in participants' recalls and the longest period of stability we might expect to see in voxel activity arising from processing or encoding episode content. Figure~\ref{fig:brainz} shows this proximal correlation mask applied to the temporal correlation matrices for the episode, an example participant's (warped) recall, and an example cube of voxels from our searchlight analyses.

To determine which (cubes of) voxel responses matched the episode template, we correlated the proximal diagonals from the upper triangle of the voxel correlation matrix for each cube with the proximal diagonals from the episode template matrix~\citep{KrieEtal08b}. This yielded, for each participant, a voxelwise map of correlation values. We then performed a one-sample $t$-test on the distribution of (Fisher $z$-transformed) correlations at each voxel, across participants. This resulted in a value for each voxel (cube), describing how reliably its timecourse followed that of the episode.

@@ -576,657 +323,594 @@ \subsection*{Searchlight fMRI analyses}

We used an analogous procedure to identify which voxels' responses reflected the recall templates. For each participant, we correlated the proximal diagonals from the upper triangle of the correlation matrix for each cube of voxels with the proximal diagonals from the upper triangle of their (time-warped) recall correlation matrix. As in the episode template analysis, this yielded a voxelwise map of correlation coefficients for each participant. However, whereas the episode analysis compared every participant's responses to the same template, here the recall templates were unique for each participant. 
As in the analysis described above, we $t$-scored the (Fisher $z$-transformed) voxelwise correlations, and used the same permutation procedure we developed for the episode responses to ensure specificity to the recall timeseries and assign significance values. To create the map in Figure~\ref{fig:brainz}D we again thresholded out any voxels whose scores were below the 95\textsuperscript{th} percentile of the permutation-derived null distribution. \subsection*{Neurosynth decoding analyses} -\texttt{Neurosynth}\DIFaddbegin \DIFadd{~\mbox{%DIFAUXCMD -\citep{YarkEtal11} }\hspace{0pt}%DIFAUXCMD -}\DIFaddend parses a massive online database of over \DIFdelbegin \DIFdel{14,000 }\DIFdelend \DIFaddbegin \DIFadd{14000 }\DIFaddend neuroimaging studies and constructs meta-analysis images for over \DIFdelbegin \DIFdel{13,000 }\DIFdelend \DIFaddbegin \DIFadd{13000 }\DIFaddend psychology- and neuroscience-related terms, based on NIfTI images accompanying studies where those terms appear at a high frequency. Given a novel image (tagged with its value type; e.g., $z$-, $t$-, $F$- or $p$-statistics), \texttt{Neurosynth} returns a list of terms whose meta-analysis images are most similar. Our permutation procedure yielded, for each of the two searchlight analyses, a voxelwise map of $z$-values. These maps describe the extent to which each voxel \DIFdelbegin \textit{\DIFdel{specifically}} %DIFAUXCMD -\DIFdelend \DIFaddbegin \DIFadd{specifically }\DIFaddend reflected the temporal structure of the episode or individuals' recalls (i.e., relative to the null distributions of phase-shifted values). We inputted the two statistical maps described above to \texttt{Neurosynth} to create a list of the 10 most representative terms for each map. +\texttt{Neurosynth}~\citep{YarkEtal11} parses a massive online database of over 14000 neuroimaging studies and constructs meta-analysis images for over 13000 psychology- and neuroscience-related terms, based on NIfTI images accompanying studies where those terms appear at a high frequency. Given a novel image (tagged with its value type; e.g., $z$-, $t$-, $F$- or $p$-statistics), \texttt{Neurosynth} returns a list of terms whose meta-analysis images are most similar. Our permutation procedure yielded, for each of the two searchlight analyses, a voxelwise map of $z$-values. These maps describe the extent to which each voxel specifically reflected the temporal structure of the episode or individuals' recalls (i.e., relative to the null distributions of phase-shifted values). We inputted the two statistical maps described above to \texttt{Neurosynth} to create a list of the 10 most representative terms for each map. 
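+
+Although the decoding itself was performed with \texttt{Neurosynth}, the underlying idea (ranking terms by the spatial similarity between a statistical map and each term's meta-analysis image) can be sketched as follows; this is a schematic example rather than the \texttt{Neurosynth} interface, and the file names and term list are hypothetical placeholders:
+\begin{verbatim}
+import numpy as np
+import nibabel as nib
+
+def decode_terms(stat_map_file, term_map_files, n_terms=10):
+    """Rank terms by how strongly a statistical map spatially correlates
+    with each term's meta-analysis image (all images are assumed to be
+    registered to the same space and resolution)."""
+    stat = nib.load(stat_map_file).get_fdata().ravel()
+    scores = {}
+    for term, fname in term_map_files.items():
+        term_map = nib.load(fname).get_fdata().ravel()
+        ok = np.isfinite(stat) & np.isfinite(term_map)  # skip masked voxels
+        scores[term] = np.corrcoef(stat[ok], term_map[ok])[0, 1]
+    # return the n_terms best-matching terms, most similar first
+    return sorted(scores, key=scores.get, reverse=True)[:n_terms]
+
+# hypothetical usage with the episode searchlight map:
+# top_terms = decode_terms('episode_zmap.nii.gz',
+#                          {'memory': 'memory_meta.nii.gz',
+#                           'language': 'language_meta.nii.gz'})
+\end{verbatim}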
+ +\section*{Data availability} +The fMRI data we analyzed are available online \DIFdelbegin %DIFDELCMD < \href{http://dataspace.princeton.edu/jspui/handle/88435/dsp01nz8062179}{\underline{here}}%%% +\DIFdel{.}\DIFdelend \DIFaddbegin \DIFadd{at: +} + +\DIFadd{https://dataspace.princeton.edu/jspui/handle/88435/dsp01nz8062179 +} -%DIF < \bibliography{../../CDL-bibliography/memlab} -\DIFdelbegin %DIFDELCMD < \bibliography{CDL-bibliography/memlab} -%DIFDELCMD < +\noindent \DIFaddend The behavioral data is available \DIFdelbegin %DIFDELCMD < \href{https://github.com/ContextLab/sherlock-topic-model-paper}{\underline{here}}%%% +\DIFdel{.}\DIFdelend \DIFaddbegin \DIFadd{at: +}\DIFaddend -%DIFDELCMD < %%% -\DIFdelend \section*{\DIFdelbegin \DIFdel{Supporting information}\DIFdelend \DIFaddbegin \DIFadd{Data availability}\DIFaddend } -\DIFdelbegin \DIFdel{Supporting information is available in the online version of the paper}\DIFdelend \DIFaddbegin \DIFadd{The fMRI data we analyzed are available online }\href{http://dataspace.princeton.edu/jspui/handle/88435/dsp01nz8062179}{\underline{here}}\DIFadd{. The behavioral data is available }\href{https://github.com/ContextLab/sherlock-topic-model-paper}{\underline{here}}\DIFaddend . +\DIFaddbegin \DIFadd{https://github.com/ContextLab/sherlock-topic-model-paper/tree/master/data/raw +} -\DIFaddbegin \section*{\DIFadd{Code availability}} -\DIFadd{All of our analysis code may be downloaded }\href{https://github.com/ContextLab/sherlock-topic-model-paper}{\underline{here}}\DIFadd{. +\DIFaddend \section*{Code availability} +All of our analysis code may be downloaded \DIFdelbegin %DIFDELCMD < \href{https://github.com/ContextLab/sherlock-topic-model-paper}{\underline{here}}%%% +\DIFdel{.}\DIFdelend \DIFaddbegin \DIFadd{from: } -%DIF > \bibliographystyle{naturemag} -%DIF > \bibliography{CDL-bibliography/memlab} +\DIFadd{https://github.com/ContextLab/sherlock-topic-model-paper +}\DIFaddend + +%\bibliographystyle{naturemag} +%\bibliography{CDL-bibliography/memlab} \begin{thebibliography}{10} -\expandafter\ifx\csname \DIFadd{url}\endcsname\relax +\expandafter\ifx\csname url\endcsname\relax \def\url#1{\texttt{#1}}\fi -\expandafter\ifx\csname \DIFadd{urlprefix}\endcsname\relax\def\urlprefix{URL }\fi +\expandafter\ifx\csname urlprefix\endcsname\relax\def\urlprefix{URL }\fi \providecommand{\bibinfo}[2]{#2} \providecommand{\eprint}[2][]{\url{#2}} \bibitem{Murd62a} \bibinfo{author}{Murdock, B.~B.} -\newblock \bibinfo{title}{The serial position effect of free recall}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Journal of Experimental Psychology}} - \textbf{\bibinfo{volume}{64}}\DIFadd{, }\bibinfo{pages}{482--488} - \DIFadd{(}\bibinfo{year}{1962}\DIFadd{). -} +\newblock \bibinfo{title}{The serial position effect of free recall}. +\newblock \emph{\bibinfo{journal}{Journal of Experimental Psychology}} + \textbf{\bibinfo{volume}{64}}, \bibinfo{pages}{482--488} + (\bibinfo{year}{1962}). \bibitem{Kaha96} \bibinfo{author}{Kahana, M.~J.} -\newblock \bibinfo{title}{Associative retrieval processes in free recall}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Memory \& Cognition}} - \textbf{\bibinfo{volume}{24}}\DIFadd{, }\bibinfo{pages}{103--109} - \DIFadd{(}\bibinfo{year}{1996}\DIFadd{). -} +\newblock \bibinfo{title}{Associative retrieval processes in free recall}. +\newblock \emph{\bibinfo{journal}{Memory \& Cognition}} + \textbf{\bibinfo{volume}{24}}, \bibinfo{pages}{103--109} + (\bibinfo{year}{1996}). 
\bibitem{Yone02} \bibinfo{author}{Yonelinas, A.~P.} \newblock \bibinfo{title}{The nature of recollection and familiarity: A review - of 30 years of research}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Journal of Memory and Language}} - \textbf{\bibinfo{volume}{46}}\DIFadd{, }\bibinfo{pages}{441--517} - \DIFadd{(}\bibinfo{year}{2002}\DIFadd{). -} + of 30 years of research}. +\newblock \emph{\bibinfo{journal}{Journal of Memory and Language}} + \textbf{\bibinfo{volume}{46}}, \bibinfo{pages}{441--517} + (\bibinfo{year}{2002}). \bibitem{Kaha12} \bibinfo{author}{Kahana, M.~J.} \newblock \emph{\bibinfo{title}{Foundations of Human Memory}} - \DIFadd{(}\bibinfo{publisher}{Oxford University Press}\DIFadd{, }\bibinfo{address}{New York, - NY}\DIFadd{, }\bibinfo{year}{2012}\DIFadd{). -} + (\bibinfo{publisher}{Oxford University Press}, \bibinfo{address}{New York, + NY}, \bibinfo{year}{2012}). \bibitem{KoriGold94} -\bibinfo{author}{Koriat, A.} \DIFadd{\& }\bibinfo{author}{Goldsmith, M.} +\bibinfo{author}{Koriat, A.} \& \bibinfo{author}{Goldsmith, M.} \newblock \bibinfo{title}{Memory in naturalistic and laboratory contexts: distinguishing accuracy-oriented and quantity-oriented approaches to memory - assessment}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Journal of Experimental Psychology: General}} - \textbf{\bibinfo{volume}{123}}\DIFadd{, }\bibinfo{pages}{297--315} - \DIFadd{(}\bibinfo{year}{1994}\DIFadd{). -} + assessment}. +\newblock \emph{\bibinfo{journal}{Journal of Experimental Psychology: General}} + \textbf{\bibinfo{volume}{123}}, \bibinfo{pages}{297--315} + (\bibinfo{year}{1994}). \bibitem{HukEtal18} -\bibinfo{author}{Huk, A.}\DIFadd{, }\bibinfo{author}{Bonnen, K.} \DIFadd{\& }\bibinfo{author}{He, +\bibinfo{author}{Huk, A.}, \bibinfo{author}{Bonnen, K.} \& \bibinfo{author}{He, B.~J.} \newblock \bibinfo{title}{Beyond trial-based paradigms: continuous behavior, - ongoing neural activity, and naturalistic stimuli}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Journal of Neuroscience}} + ongoing neural activity, and naturalistic stimuli}. +\newblock \emph{\bibinfo{journal}{Journal of Neuroscience}} \textbf{\bibinfo{volume}{10.1523/JNEUROSCI.1920-17.2018}} - \DIFadd{(}\bibinfo{year}{2018}\DIFadd{). -} + (\bibinfo{year}{2018}). \bibitem{LernEtal11} -\bibinfo{author}{Lerner, Y.}\DIFadd{, }\bibinfo{author}{Honey, C.~J.}\DIFadd{, - }\bibinfo{author}{Silbert, L.~J.} \DIFadd{\& }\bibinfo{author}{Hasson, U.} +\bibinfo{author}{Lerner, Y.}, \bibinfo{author}{Honey, C.~J.}, + \bibinfo{author}{Silbert, L.~J.} \& \bibinfo{author}{Hasson, U.} \newblock \bibinfo{title}{Topographic mapping of a hierarchy of temporal - receptive windows using a narrated story}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Journal of Neuroscience}} - \textbf{\bibinfo{volume}{31}}\DIFadd{, }\bibinfo{pages}{2906--2915} - \DIFadd{(}\bibinfo{year}{2011}\DIFadd{). -} + receptive windows using a narrated story}. +\newblock \emph{\bibinfo{journal}{Journal of Neuroscience}} + \textbf{\bibinfo{volume}{31}}, \bibinfo{pages}{2906--2915} + (\bibinfo{year}{2011}). \bibitem{Mann19} \bibinfo{author}{Manning, J.~R.} \newblock \bibinfo{title}{Episodic memory: mental time travel or a quantum 'memory wave' function?} \newblock \emph{\bibinfo{journal}{PsyArXiv}} - \textbf{\bibinfo{volume}{doi:10.31234/osf.io/6zjwb}} \DIFadd{(}\bibinfo{year}{2019}\DIFadd{). -} + \textbf{\bibinfo{volume}{doi:10.31234/osf.io/6zjwb}} (\bibinfo{year}{2019}). \bibitem{Mann20} \bibinfo{author}{Manning, J.~R.} -\newblock \bibinfo{title}{Context reinstatement}\DIFadd{. 
-}\newblock \DIFadd{In }\bibinfo{editor}{Kahana, M.~J.} \DIFadd{\& }\bibinfo{editor}{Wagner, A.~D.} - \DIFadd{(eds.) }\emph{\bibinfo{booktitle}{Handbook of Human Memory}} - \DIFadd{(}\bibinfo{publisher}{Oxford University Press}\DIFadd{, }\bibinfo{year}{2020}\DIFadd{). -} +\newblock \bibinfo{title}{Context reinstatement}. +\newblock In \bibinfo{editor}{Kahana, M.~J.} \& \bibinfo{editor}{Wagner, A.~D.} + (eds.) \emph{\bibinfo{booktitle}{Handbook of Human Memory}} + (\bibinfo{publisher}{Oxford University Press}, \bibinfo{year}{2020}). \bibitem{HowaKaha02a} -\bibinfo{author}{Howard, M.~W.} \DIFadd{\& }\bibinfo{author}{Kahana, M.~J.} -\newblock \bibinfo{title}{A distributed representation of temporal context}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Journal of Mathematical Psychology}} - \textbf{\bibinfo{volume}{46}}\DIFadd{, }\bibinfo{pages}{269--299} - \DIFadd{(}\bibinfo{year}{2002}\DIFadd{). -} +\bibinfo{author}{Howard, M.~W.} \& \bibinfo{author}{Kahana, M.~J.} +\newblock \bibinfo{title}{A distributed representation of temporal context}. +\newblock \emph{\bibinfo{journal}{Journal of Mathematical Psychology}} + \textbf{\bibinfo{volume}{46}}, \bibinfo{pages}{269--299} + (\bibinfo{year}{2002}). \bibitem{HowaEtal14} -\bibinfo{author}{Howard, M.~W.} \emph{\DIFadd{et~al.}} +\bibinfo{author}{Howard, M.~W.} \emph{et~al.} \newblock \bibinfo{title}{A unified mathematical framework for coding time, - space, and sequences in the medial temporal lobe}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Journal of Neuroscience}} - \textbf{\bibinfo{volume}{34}}\DIFadd{, }\bibinfo{pages}{4692--4707} - \DIFadd{(}\bibinfo{year}{2014}\DIFadd{). -} + space, and sequences in the medial temporal lobe}. +\newblock \emph{\bibinfo{journal}{Journal of Neuroscience}} + \textbf{\bibinfo{volume}{34}}, \bibinfo{pages}{4692--4707} + (\bibinfo{year}{2014}). \bibitem{MannEtal15} -\bibinfo{author}{Manning, J.~R.}\DIFadd{, }\bibinfo{author}{Norman, K.~A.} \DIFadd{\& - }\bibinfo{author}{Kahana, M.~J.} -\newblock \bibinfo{title}{The role of context in episodic memory}\DIFadd{. -}\newblock \DIFadd{In }\bibinfo{editor}{Gazzaniga, M.} \DIFadd{(ed.) - }\emph{\bibinfo{booktitle}{The Cognitive Neurosciences, Fifth edition}}\DIFadd{, - }\bibinfo{pages}{557--566} \DIFadd{(}\bibinfo{publisher}{{MIT} Press}\DIFadd{, - }\bibinfo{year}{2015}\DIFadd{). -} +\bibinfo{author}{Manning, J.~R.}, \bibinfo{author}{Norman, K.~A.} \& + \bibinfo{author}{Kahana, M.~J.} +\newblock \bibinfo{title}{The role of context in episodic memory}. +\newblock In \bibinfo{editor}{Gazzaniga, M.} (ed.) + \emph{\bibinfo{booktitle}{The Cognitive Neurosciences, Fifth edition}}, + \bibinfo{pages}{557--566} (\bibinfo{publisher}{{MIT} Press}, + \bibinfo{year}{2015}). \bibitem{RangRitc12} -\bibinfo{author}{Ranganath, C.} \DIFadd{\& }\bibinfo{author}{Ritchey, M.} -\newblock \bibinfo{title}{Two cortical systems for memory-guided behavior}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Nature Reviews Neuroscience}} - \textbf{\bibinfo{volume}{13}}\DIFadd{, }\bibinfo{pages}{713 -- 726} - \DIFadd{(}\bibinfo{year}{2012}\DIFadd{). -} +\bibinfo{author}{Ranganath, C.} \& \bibinfo{author}{Ritchey, M.} +\newblock \bibinfo{title}{Two cortical systems for memory-guided behavior}. +\newblock \emph{\bibinfo{journal}{Nature Reviews Neuroscience}} + \textbf{\bibinfo{volume}{13}}, \bibinfo{pages}{713 -- 726} + (\bibinfo{year}{2012}). 
\bibitem{ZackEtal07} -\bibinfo{author}{Zacks, J.~M.}\DIFadd{, }\bibinfo{author}{Speer, N.~K.}\DIFadd{, - }\bibinfo{author}{Swallow, K.~M.}\DIFadd{, }\bibinfo{author}{Braver, T.~S.} \DIFadd{\& - }\bibinfo{author}{Reynolds, J.~R.} -\newblock \bibinfo{title}{Event perception: a mind-brain perspective}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Psychological Bulletin}} - \textbf{\bibinfo{volume}{133}}\DIFadd{, }\bibinfo{pages}{273--293} - \DIFadd{(}\bibinfo{year}{2007}\DIFadd{). -} +\bibinfo{author}{Zacks, J.~M.}, \bibinfo{author}{Speer, N.~K.}, + \bibinfo{author}{Swallow, K.~M.}, \bibinfo{author}{Braver, T.~S.} \& + \bibinfo{author}{Reynolds, J.~R.} +\newblock \bibinfo{title}{Event perception: a mind-brain perspective}. +\newblock \emph{\bibinfo{journal}{Psychological Bulletin}} + \textbf{\bibinfo{volume}{133}}, \bibinfo{pages}{273--293} + (\bibinfo{year}{2007}). \bibitem{ZwaaRadv98} -\bibinfo{author}{Zwaan, R.~A.} \DIFadd{\& }\bibinfo{author}{Radvansky, G.~A.} +\bibinfo{author}{Zwaan, R.~A.} \& \bibinfo{author}{Radvansky, G.~A.} \newblock \bibinfo{title}{Situation models in language comprehension and - memory}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Psychological Bulletin}} - \textbf{\bibinfo{volume}{123}}\DIFadd{, }\bibinfo{pages}{162 -- 185} - \DIFadd{(}\bibinfo{year}{1998}\DIFadd{). -} + memory}. +\newblock \emph{\bibinfo{journal}{Psychological Bulletin}} + \textbf{\bibinfo{volume}{123}}, \bibinfo{pages}{162 -- 185} + (\bibinfo{year}{1998}). \bibitem{RadvZack17} -\bibinfo{author}{Radvansky, G.~A.} \DIFadd{\& }\bibinfo{author}{Zacks, J.~M.} -\newblock \bibinfo{title}{Event boundaries in memory and cognition}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Curr Opin Behav Sci}} - \textbf{\bibinfo{volume}{17}}\DIFadd{, }\bibinfo{pages}{133--140} - \DIFadd{(}\bibinfo{year}{2017}\DIFadd{). -} +\bibinfo{author}{Radvansky, G.~A.} \& \bibinfo{author}{Zacks, J.~M.} +\newblock \bibinfo{title}{Event boundaries in memory and cognition}. +\newblock \emph{\bibinfo{journal}{Curr Opin Behav Sci}} + \textbf{\bibinfo{volume}{17}}, \bibinfo{pages}{133--140} + (\bibinfo{year}{2017}). \bibitem{BrunEtal18} -\bibinfo{author}{Brunec, I.~K.}\DIFadd{, }\bibinfo{author}{Moscovitch, M.~M.} \DIFadd{\& - }\bibinfo{author}{Barense, M.~D.} +\bibinfo{author}{Brunec, I.~K.}, \bibinfo{author}{Moscovitch, M.~M.} \& + \bibinfo{author}{Barense, M.~D.} \newblock \bibinfo{title}{Boundaries shape cognitive representations of spaces - and events}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{{Trends in Cognitive Sciences}}} - \textbf{\bibinfo{volume}{22}}\DIFadd{, }\bibinfo{pages}{637--650} - \DIFadd{(}\bibinfo{year}{2018}\DIFadd{). -} + and events}. +\newblock \emph{\bibinfo{journal}{{Trends in Cognitive Sciences}}} + \textbf{\bibinfo{volume}{22}}, \bibinfo{pages}{637--650} + (\bibinfo{year}{2018}). \bibitem{HeusEtal18b} -\bibinfo{author}{Heusser, A.~C.}\DIFadd{, }\bibinfo{author}{Ezzyat, Y.}\DIFadd{, - }\bibinfo{author}{Shiff, I.} \DIFadd{\& }\bibinfo{author}{Davachi, L.} +\bibinfo{author}{Heusser, A.~C.}, \bibinfo{author}{Ezzyat, Y.}, + \bibinfo{author}{Shiff, I.} \& \bibinfo{author}{Davachi, L.} \newblock \bibinfo{title}{Perceptual boundaries cause mnemonic trade-offs - between local boundary processing and across-trial associative binding}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Journal of Experimental Psychology Learning, - Memory, and Cognition}} \textbf{\bibinfo{volume}{44}}\DIFadd{, - }\bibinfo{pages}{1075--1090} \DIFadd{(}\bibinfo{year}{2018}\DIFadd{). 
-} + between local boundary processing and across-trial associative binding}. +\newblock \emph{\bibinfo{journal}{Journal of Experimental Psychology Learning, + Memory, and Cognition}} \textbf{\bibinfo{volume}{44}}, + \bibinfo{pages}{1075--1090} (\bibinfo{year}{2018}). \bibitem{ClewDava17} -\bibinfo{author}{Clewett, D.} \DIFadd{\& }\bibinfo{author}{Davachi, L.} +\bibinfo{author}{Clewett, D.} \& \bibinfo{author}{Davachi, L.} \newblock \bibinfo{title}{The ebb and flow of experience determines the - temporal structure of memory}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Curr Opin Behav Sci}} - \textbf{\bibinfo{volume}{17}}\DIFadd{, }\bibinfo{pages}{186--193} - \DIFadd{(}\bibinfo{year}{2017}\DIFadd{). -} + temporal structure of memory}. +\newblock \emph{\bibinfo{journal}{Curr Opin Behav Sci}} + \textbf{\bibinfo{volume}{17}}, \bibinfo{pages}{186--193} + (\bibinfo{year}{2017}). \bibitem{EzzyDava11} -\bibinfo{author}{Ezzyat, Y.} \DIFadd{\& }\bibinfo{author}{Davachi, L.} +\bibinfo{author}{Ezzyat, Y.} \& \bibinfo{author}{Davachi, L.} \newblock \bibinfo{title}{What constitutes an episode in episodic memory?} \newblock \emph{\bibinfo{journal}{Psychological Science}} - \textbf{\bibinfo{volume}{22}}\DIFadd{, }\bibinfo{pages}{243--252} - \DIFadd{(}\bibinfo{year}{2011}\DIFadd{). -} + \textbf{\bibinfo{volume}{22}}, \bibinfo{pages}{243--252} + (\bibinfo{year}{2011}). \bibitem{DuBrDava13} -\bibinfo{author}{DuBrow, S.} \DIFadd{\& }\bibinfo{author}{Davachi, L.} +\bibinfo{author}{DuBrow, S.} \& \bibinfo{author}{Davachi, L.} \newblock \bibinfo{title}{The influence of contextual boundaries on memory for - the sequential order of events}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Journal of Experimental Psychology: General}} - \textbf{\bibinfo{volume}{142}}\DIFadd{, }\bibinfo{pages}{1277--1286} - \DIFadd{(}\bibinfo{year}{2013}\DIFadd{). -} + the sequential order of events}. +\newblock \emph{\bibinfo{journal}{Journal of Experimental Psychology: General}} + \textbf{\bibinfo{volume}{142}}, \bibinfo{pages}{1277--1286} + (\bibinfo{year}{2013}). \bibitem{TompDava17} -\bibinfo{author}{Tompary, A.} \DIFadd{\& }\bibinfo{author}{Davachi, L.} +\bibinfo{author}{Tompary, A.} \& \bibinfo{author}{Davachi, L.} \newblock \bibinfo{title}{Consolidation promotes the emergence of - representational overlap in the hippocampus and medial prefrontal cortex}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Neuron}} \textbf{\bibinfo{volume}{96}}\DIFadd{, - }\bibinfo{pages}{228--241} \DIFadd{(}\bibinfo{year}{2017}\DIFadd{). -} + representational overlap in the hippocampus and medial prefrontal cortex}. +\newblock \emph{\bibinfo{journal}{Neuron}} \textbf{\bibinfo{volume}{96}}, + \bibinfo{pages}{228--241} (\bibinfo{year}{2017}). \bibitem{ChenEtal17} -\bibinfo{author}{Chen, J.} \emph{\DIFadd{et~al.}} +\bibinfo{author}{Chen, J.} \emph{et~al.} \newblock \bibinfo{title}{Shared memories reveal shared structure in neural - activity across individuals}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Nature Neuroscience}} - \textbf{\bibinfo{volume}{20}}\DIFadd{, }\bibinfo{pages}{115} \DIFadd{(}\bibinfo{year}{2017}\DIFadd{). -} + activity across individuals}. +\newblock \emph{\bibinfo{journal}{Nature Neuroscience}} + \textbf{\bibinfo{volume}{20}}, \bibinfo{pages}{115} (\bibinfo{year}{2017}). \bibitem{BleiEtal03} -\bibinfo{author}{Blei, D.~M.}\DIFadd{, }\bibinfo{author}{Ng, A.~Y.} \DIFadd{\& - }\bibinfo{author}{Jordan, M.~I.} -\newblock \bibinfo{title}{Latent dirichlet allocation}\DIFadd{. 
-}\newblock \emph{\bibinfo{journal}{Journal of Machine Learning Research}} - \textbf{\bibinfo{volume}{3}}\DIFadd{, }\bibinfo{pages}{993 -- 1022} - \DIFadd{(}\bibinfo{year}{2003}\DIFadd{). -} +\bibinfo{author}{Blei, D.~M.}, \bibinfo{author}{Ng, A.~Y.} \& + \bibinfo{author}{Jordan, M.~I.} +\newblock \bibinfo{title}{Latent dirichlet allocation}. +\newblock \emph{\bibinfo{journal}{Journal of Machine Learning Research}} + \textbf{\bibinfo{volume}{3}}, \bibinfo{pages}{993 -- 1022} + (\bibinfo{year}{2003}). \bibitem{Rabi89} \bibinfo{author}{Rabiner, L.} \newblock \bibinfo{title}{A tutorial on {Hidden Markov Models} and selected - applications in speech recognition}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Proceedings of the IEEE}} - \textbf{\bibinfo{volume}{77}}\DIFadd{, }\bibinfo{pages}{257--286} - \DIFadd{(}\bibinfo{year}{1989}\DIFadd{). -} + applications in speech recognition}. +\newblock \emph{\bibinfo{journal}{Proceedings of the IEEE}} + \textbf{\bibinfo{volume}{77}}, \bibinfo{pages}{257--286} + (\bibinfo{year}{1989}). \bibitem{BaldEtal17} -\bibinfo{author}{Baldassano, C.} \emph{\DIFadd{et~al.}} +\bibinfo{author}{Baldassano, C.} \emph{et~al.} \newblock \bibinfo{title}{Discovering event structure in continuous narrative - perception and memory}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Neuron}} \textbf{\bibinfo{volume}{95}}\DIFadd{, - }\bibinfo{pages}{709--721} \DIFadd{(}\bibinfo{year}{2017}\DIFadd{). -} + perception and memory}. +\newblock \emph{\bibinfo{journal}{Neuron}} \textbf{\bibinfo{volume}{95}}, + \bibinfo{pages}{709--721} (\bibinfo{year}{2017}). \bibitem{BleiLaff06} -\bibinfo{author}{Blei, D.~M.} \DIFadd{\& }\bibinfo{author}{Lafferty, J.~D.} -\newblock \bibinfo{title}{Dynamic topic models}\DIFadd{. -}\newblock \DIFadd{In }\emph{\bibinfo{booktitle}{Proceedings of the 23rd International - Conference on Machine Learning}}\DIFadd{, ICML '06, }\bibinfo{pages}{113--120} - \DIFadd{(}\bibinfo{publisher}{ACM}\DIFadd{, }\bibinfo{address}{New York, NY, US}\DIFadd{, - }\bibinfo{year}{2006}\DIFadd{). -} +\bibinfo{author}{Blei, D.~M.} \& \bibinfo{author}{Lafferty, J.~D.} +\newblock \bibinfo{title}{Dynamic topic models}. +\newblock In \emph{\bibinfo{booktitle}{Proceedings of the 23rd International + Conference on Machine Learning}}, ICML '06, \bibinfo{pages}{113--120} + (\bibinfo{publisher}{ACM}, \bibinfo{address}{New York, NY, US}, + \bibinfo{year}{2006}). \bibitem{MannEtal11} -\bibinfo{author}{Manning, J.~R.}\DIFadd{, }\bibinfo{author}{Polyn, S.~M.}\DIFadd{, - }\bibinfo{author}{Baltuch, G.}\DIFadd{, }\bibinfo{author}{Litt, B.} \DIFadd{\& - }\bibinfo{author}{Kahana, M.~J.} +\bibinfo{author}{Manning, J.~R.}, \bibinfo{author}{Polyn, S.~M.}, + \bibinfo{author}{Baltuch, G.}, \bibinfo{author}{Litt, B.} \& + \bibinfo{author}{Kahana, M.~J.} \newblock \bibinfo{title}{Oscillatory patterns in temporal lobe reveal context - reinstatement during memory search}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Proceedings of the National Academy of - Sciences, USA}} \textbf{\bibinfo{volume}{108}}\DIFadd{, }\bibinfo{pages}{12893--12897} - \DIFadd{(}\bibinfo{year}{2011}\DIFadd{). -} + reinstatement during memory search}. +\newblock \emph{\bibinfo{journal}{Proceedings of the National Academy of + Sciences, USA}} \textbf{\bibinfo{volume}{108}}, \bibinfo{pages}{12893--12897} + (\bibinfo{year}{2011}). 
\bibitem{HowaEtal12} -\bibinfo{author}{Howard, M.~W.}\DIFadd{, }\bibinfo{author}{Viskontas, I.~V.}\DIFadd{, - }\bibinfo{author}{Shankar, K.~H.} \DIFadd{\& }\bibinfo{author}{Fried, I.} +\bibinfo{author}{Howard, M.~W.}, \bibinfo{author}{Viskontas, I.~V.}, + \bibinfo{author}{Shankar, K.~H.} \& \bibinfo{author}{Fried, I.} \newblock \bibinfo{title}{Ensembles of human {MTL} neurons ``jump back in - time'' in response to a repeated stimulus}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Hippocampus}} \textbf{\bibinfo{volume}{22}}\DIFadd{, - }\bibinfo{pages}{1833--1847} \DIFadd{(}\bibinfo{year}{2012}\DIFadd{). -} + time'' in response to a repeated stimulus}. +\newblock \emph{\bibinfo{journal}{Hippocampus}} \textbf{\bibinfo{volume}{22}}, + \bibinfo{pages}{1833--1847} (\bibinfo{year}{2012}). \bibitem{AtkiShif68} -\bibinfo{author}{Atkinson, R.~C.} \DIFadd{\& }\bibinfo{author}{Shiffrin, R.~M.} +\bibinfo{author}{Atkinson, R.~C.} \& \bibinfo{author}{Shiffrin, R.~M.} \newblock \bibinfo{title}{Human memory: {A} proposed system and its control - processes}\DIFadd{. -}\newblock \DIFadd{In }\bibinfo{editor}{Spence, K.~W.} \DIFadd{\& }\bibinfo{editor}{Spence, J.~T.} - \DIFadd{(eds.) }\emph{\bibinfo{booktitle}{The psychology of learning and motivation}}\DIFadd{, - vol.~}\bibinfo{volume}{2}\DIFadd{, }\bibinfo{pages}{89--105} - \DIFadd{(}\bibinfo{publisher}{Academic Press}\DIFadd{, }\bibinfo{address}{New York}\DIFadd{, - }\bibinfo{year}{1968}\DIFadd{). -} + processes}. +\newblock In \bibinfo{editor}{Spence, K.~W.} \& \bibinfo{editor}{Spence, J.~T.} + (eds.) \emph{\bibinfo{booktitle}{The psychology of learning and motivation}}, + vol.~\bibinfo{volume}{2}, \bibinfo{pages}{89--105} + (\bibinfo{publisher}{Academic Press}, \bibinfo{address}{New York}, + \bibinfo{year}{1968}). \bibitem{PostPhil65} -\bibinfo{author}{Postman, L.} \DIFadd{\& }\bibinfo{author}{Phillips, L.~W.} -\newblock \bibinfo{title}{Short-term temporal changes in free recall}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Quarterly Journal of Experimental - Psychology}} \textbf{\bibinfo{volume}{17}}\DIFadd{, }\bibinfo{pages}{132--138} - \DIFadd{(}\bibinfo{year}{1965}\DIFadd{). -} +\bibinfo{author}{Postman, L.} \& \bibinfo{author}{Phillips, L.~W.} +\newblock \bibinfo{title}{Short-term temporal changes in free recall}. +\newblock \emph{\bibinfo{journal}{Quarterly Journal of Experimental + Psychology}} \textbf{\bibinfo{volume}{17}}, \bibinfo{pages}{132--138} + (\bibinfo{year}{1965}). \bibitem{WelcBurn24} -\bibinfo{author}{Welch, G.~B.} \DIFadd{\& }\bibinfo{author}{Burnett, C.~T.} -\newblock \bibinfo{title}{Is primacy a factor in association-formation}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{American Journal of Psychology}} - \textbf{\bibinfo{volume}{35}}\DIFadd{, }\bibinfo{pages}{396--401} - \DIFadd{(}\bibinfo{year}{1924}\DIFadd{). -} +\bibinfo{author}{Welch, G.~B.} \& \bibinfo{author}{Burnett, C.~T.} +\newblock \bibinfo{title}{Is primacy a factor in association-formation}. +\newblock \emph{\bibinfo{journal}{American Journal of Psychology}} + \textbf{\bibinfo{volume}{35}}, \bibinfo{pages}{396--401} + (\bibinfo{year}{1924}). \bibitem{PolyEtal09} -\bibinfo{author}{Polyn, S.~M.}\DIFadd{, }\bibinfo{author}{Norman, K.~A.} \DIFadd{\& - }\bibinfo{author}{Kahana, M.~J.} +\bibinfo{author}{Polyn, S.~M.}, \bibinfo{author}{Norman, K.~A.} \& + \bibinfo{author}{Kahana, M.~J.} \newblock \bibinfo{title}{A context maintenance and retrieval model of - organizational processes in free recall}\DIFadd{. 
-}\newblock \emph{\bibinfo{journal}{Psychological Review}} - \textbf{\bibinfo{volume}{116}}\DIFadd{, }\bibinfo{pages}{129--156} - \DIFadd{(}\bibinfo{year}{2009}\DIFadd{). -} + organizational processes in free recall}. +\newblock \emph{\bibinfo{journal}{Psychological Review}} + \textbf{\bibinfo{volume}{116}}, \bibinfo{pages}{129--156} + (\bibinfo{year}{2009}). \bibitem{MannKaha12} -\bibinfo{author}{Manning, J.~R.} \DIFadd{\& }\bibinfo{author}{Kahana, M.~J.} +\bibinfo{author}{Manning, J.~R.} \& \bibinfo{author}{Kahana, M.~J.} \newblock \bibinfo{title}{Interpreting semantic clustering effects in free - recall}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Memory}} \textbf{\bibinfo{volume}{20}}\DIFadd{, - }\bibinfo{pages}{511--517} \DIFadd{(}\bibinfo{year}{2012}\DIFadd{). -} + recall}. +\newblock \emph{\bibinfo{journal}{Memory}} \textbf{\bibinfo{volume}{20}}, + \bibinfo{pages}{511--517} (\bibinfo{year}{2012}). \bibitem{HeusEtal18a} -\bibinfo{author}{Heusser, A.~C.}\DIFadd{, }\bibinfo{author}{Ziman, K.}\DIFadd{, - }\bibinfo{author}{Owen, L. L.~W.} \DIFadd{\& }\bibinfo{author}{Manning, J.~R.} +\bibinfo{author}{Heusser, A.~C.}, \bibinfo{author}{Ziman, K.}, + \bibinfo{author}{Owen, L. L.~W.} \& \bibinfo{author}{Manning, J.~R.} \newblock \bibinfo{title}{{HyperTools}: a {Python} toolbox for gaining - geometric insights into high-dimensional data}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{{Journal of Machine Learning Research}}} - \textbf{\bibinfo{volume}{18}}\DIFadd{, }\bibinfo{pages}{1--6} \DIFadd{(}\bibinfo{year}{2018}\DIFadd{). -} + geometric insights into high-dimensional data}. +\newblock \emph{\bibinfo{journal}{{Journal of Machine Learning Research}}} + \textbf{\bibinfo{volume}{18}}, \bibinfo{pages}{1--6} (\bibinfo{year}{2018}). \bibitem{McInEtal18} -\bibinfo{author}{McInnes, L.}\DIFadd{, }\bibinfo{author}{Healy, J.} \DIFadd{\& - }\bibinfo{author}{Melville, J.} +\bibinfo{author}{McInnes, L.}, \bibinfo{author}{Healy, J.} \& + \bibinfo{author}{Melville, J.} \newblock \bibinfo{title}{{UMAP}: Uniform manifold approximation and projection - for dimension reduction}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{arXiv}} \textbf{\bibinfo{volume}{1802}} - \DIFadd{(}\bibinfo{year}{2018}\DIFadd{). -} + for dimension reduction}. +\newblock \emph{\bibinfo{journal}{arXiv}} \textbf{\bibinfo{volume}{1802}} + (\bibinfo{year}{2018}). \bibitem{MuelEtal18} -\bibinfo{author}{Mueller, A.} \emph{\DIFadd{et~al.}} +\bibinfo{author}{Mueller, A.} \emph{et~al.} \newblock \bibinfo{title}{{WordCloud 1.5.0: a little word cloud generator in - Python}}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Zenodo}} + Python}}. +\newblock \emph{\bibinfo{journal}{Zenodo}} \textbf{\bibinfo{volume}{https://zenodo.org/record/1322068\#.W4tPKZNKh24}} - \DIFadd{(}\bibinfo{year}{2018}\DIFadd{). -} + (\bibinfo{year}{2018}). \bibitem{PallWagn02} -\bibinfo{author}{Paller, K.~A.} \DIFadd{\& }\bibinfo{author}{Wagner, A.~D.} +\bibinfo{author}{Paller, K.~A.} \& \bibinfo{author}{Wagner, A.~D.} \newblock \bibinfo{title}{Observing the transformation of experience into - memory}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Trends in Cognitive Sciences}} - \textbf{\bibinfo{volume}{6}}\DIFadd{, }\bibinfo{pages}{93--102} - \DIFadd{(}\bibinfo{year}{2002}\DIFadd{). -} + memory}. +\newblock \emph{\bibinfo{journal}{Trends in Cognitive Sciences}} + \textbf{\bibinfo{volume}{6}}, \bibinfo{pages}{93--102} + (\bibinfo{year}{2002}). 
\bibitem{YarkEtal11} -\bibinfo{author}{Yarkoni, T.}\DIFadd{, }\bibinfo{author}{Poldrack, R.~A.}\DIFadd{, - }\bibinfo{author}{Nichols, T.~E.}\DIFadd{, }\bibinfo{author}{Van~Essen, D.~C.} \DIFadd{\& - }\bibinfo{author}{Wager, T.~D.} +\bibinfo{author}{Yarkoni, T.}, \bibinfo{author}{Poldrack, R.~A.}, + \bibinfo{author}{Nichols, T.~E.}, \bibinfo{author}{Van~Essen, D.~C.} \& + \bibinfo{author}{Wager, T.~D.} \newblock \bibinfo{title}{Large-scale automated synthesis of human functional - neuroimaging data}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Nature Methods}} - \textbf{\bibinfo{volume}{8}}\DIFadd{, }\bibinfo{pages}{665} \DIFadd{(}\bibinfo{year}{2011}\DIFadd{). -} + neuroimaging data}. +\newblock \emph{\bibinfo{journal}{Nature Methods}} + \textbf{\bibinfo{volume}{8}}, \bibinfo{pages}{665} (\bibinfo{year}{2011}). \bibitem{BellEtal18} -\bibinfo{author}{Bellmund, J. L.~S.}\DIFadd{, }\bibinfo{author}{G\"{a}rdenfors, P.}\DIFadd{, - }\bibinfo{author}{Moser, E.~I.} \DIFadd{\& }\bibinfo{author}{Doeller, C.~F.} +\bibinfo{author}{Bellmund, J. L.~S.}, \bibinfo{author}{G\"{a}rdenfors, P.}, + \bibinfo{author}{Moser, E.~I.} \& \bibinfo{author}{Doeller, C.~F.} \newblock \bibinfo{title}{Navigating cognition: spatial codes for human - thinking}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Science}} \textbf{\bibinfo{volume}{362}} - \DIFadd{(}\bibinfo{year}{2018}\DIFadd{). -} + thinking}. +\newblock \emph{\bibinfo{journal}{Science}} \textbf{\bibinfo{volume}{362}} + (\bibinfo{year}{2018}). \bibitem{BellEtal20} -\bibinfo{author}{Bellmund, J. L.~S.} \emph{\DIFadd{et~al.}} +\bibinfo{author}{Bellmund, J. L.~S.} \emph{et~al.} \newblock \bibinfo{title}{Deforming the metric of cognitive maps distorts - memory}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Nature Human Behavior}} - \textbf{\bibinfo{volume}{4}}\DIFadd{, }\bibinfo{pages}{177--188} - \DIFadd{(}\bibinfo{year}{2020}\DIFadd{). -} + memory}. +\newblock \emph{\bibinfo{journal}{Nature Human Behavior}} + \textbf{\bibinfo{volume}{4}}, \bibinfo{pages}{177--188} + (\bibinfo{year}{2020}). \bibitem{ConsEtal16} -\bibinfo{author}{Constantinescu, A.~O.}\DIFadd{, }\bibinfo{author}{O'Reilly, J.~X.} \DIFadd{\& - }\bibinfo{author}{Behrens, T. E.~J.} +\bibinfo{author}{Constantinescu, A.~O.}, \bibinfo{author}{O'Reilly, J.~X.} \& + \bibinfo{author}{Behrens, T. E.~J.} \newblock \bibinfo{title}{Organizing conceptual knowledge in humans with a - gridlike code}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Science}} \textbf{\bibinfo{volume}{352}}\DIFadd{, - }\bibinfo{pages}{1464--1468} \DIFadd{(}\bibinfo{year}{2016}\DIFadd{). -} + gridlike code}. +\newblock \emph{\bibinfo{journal}{Science}} \textbf{\bibinfo{volume}{352}}, + \bibinfo{pages}{1464--1468} (\bibinfo{year}{2016}). \bibitem{GilbMarl17} -\bibinfo{author}{Gilboa, A.} \DIFadd{\& }\bibinfo{author}{Marlatte, H.} -\newblock \bibinfo{title}{Neurobiology of schemas and schema-mediated memory}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Trends Cogn Sci}} - \textbf{\bibinfo{volume}{21}}\DIFadd{, }\bibinfo{pages}{618--631} - \DIFadd{(}\bibinfo{year}{2017}\DIFadd{). -} +\bibinfo{author}{Gilboa, A.} \& \bibinfo{author}{Marlatte, H.} +\newblock \bibinfo{title}{Neurobiology of schemas and schema-mediated memory}. +\newblock \emph{\bibinfo{journal}{Trends Cogn Sci}} + \textbf{\bibinfo{volume}{21}}, \bibinfo{pages}{618--631} + (\bibinfo{year}{2017}). 
\bibitem{BaldEtal18} -\bibinfo{author}{Baldassano, C.}\DIFadd{, }\bibinfo{author}{Hasson, U.} \DIFadd{\& - }\bibinfo{author}{Norman, K.~A.} +\bibinfo{author}{Baldassano, C.}, \bibinfo{author}{Hasson, U.} \& + \bibinfo{author}{Norman, K.~A.} \newblock \bibinfo{title}{Representation of real-world event schemas during - narrative perception}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{{Journal of Neuroscience}}} - \textbf{\bibinfo{volume}{38}}\DIFadd{, }\bibinfo{pages}{9689--9699} - \DIFadd{(}\bibinfo{year}{2018}\DIFadd{). -} + narrative perception}. +\newblock \emph{\bibinfo{journal}{{Journal of Neuroscience}}} + \textbf{\bibinfo{volume}{38}}, \bibinfo{pages}{9689--9699} + (\bibinfo{year}{2018}). \bibitem{HuthEtal12} -\bibinfo{author}{Huth, A.~G.}\DIFadd{, }\bibinfo{author}{Nisimoto, S.}\DIFadd{, - }\bibinfo{author}{Vu, A.~T.} \DIFadd{\& }\bibinfo{author}{Gallant, J.~L.} +\bibinfo{author}{Huth, A.~G.}, \bibinfo{author}{Nisimoto, S.}, + \bibinfo{author}{Vu, A.~T.} \& \bibinfo{author}{Gallant, J.~L.} \newblock \bibinfo{title}{A continuous semantic space describes the representation of thousands of object and action categories across the human - brain}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Neuron}} \textbf{\bibinfo{volume}{76}}\DIFadd{, - }\bibinfo{pages}{1210--1224} \DIFadd{(}\bibinfo{year}{2012}\DIFadd{). -} + brain}. +\newblock \emph{\bibinfo{journal}{Neuron}} \textbf{\bibinfo{volume}{76}}, + \bibinfo{pages}{1210--1224} (\bibinfo{year}{2012}). \bibitem{HuthEtal16} -\bibinfo{author}{Huth, A.~G.}\DIFadd{, }\bibinfo{author}{de~Heer, W.~A.}\DIFadd{, - }\bibinfo{author}{Griffiths, T.~L.}\DIFadd{, }\bibinfo{author}{Theunissen, F.~E.} \DIFadd{\& - }\bibinfo{author}{Gallant, J.~L.} +\bibinfo{author}{Huth, A.~G.}, \bibinfo{author}{de~Heer, W.~A.}, + \bibinfo{author}{Griffiths, T.~L.}, \bibinfo{author}{Theunissen, F.~E.} \& + \bibinfo{author}{Gallant, J.~L.} \newblock \bibinfo{title}{Natural speech reveals the semantic maps that tile - human cerebral cortex}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Nature}} \textbf{\bibinfo{volume}{532}}\DIFadd{, - }\bibinfo{pages}{453--458} \DIFadd{(}\bibinfo{year}{2016}\DIFadd{). -} + human cerebral cortex}. +\newblock \emph{\bibinfo{journal}{Nature}} \textbf{\bibinfo{volume}{532}}, + \bibinfo{pages}{453--458} (\bibinfo{year}{2016}). \bibitem{GagnEtal20} -\bibinfo{author}{Gagnepain, P.} \emph{\DIFadd{et~al.}} +\bibinfo{author}{Gagnepain, P.} \emph{et~al.} \newblock \bibinfo{title}{Collective memory shapes the organization of - individual memories in the medial prefrontal cortex}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Nature Human Behavior}} - \textbf{\bibinfo{volume}{4}}\DIFadd{, }\bibinfo{pages}{189--200} - \DIFadd{(}\bibinfo{year}{2020}\DIFadd{). -} + individual memories in the medial prefrontal cortex}. +\newblock \emph{\bibinfo{journal}{Nature Human Behavior}} + \textbf{\bibinfo{volume}{4}}, \bibinfo{pages}{189--200} + (\bibinfo{year}{2020}). \bibitem{SimoEtal16} -\bibinfo{author}{Simony, E.}\DIFadd{, }\bibinfo{author}{Honey, C.~J.}\DIFadd{, - }\bibinfo{author}{Chen, J.} \DIFadd{\& }\bibinfo{author}{Hasson, U.} +\bibinfo{author}{Simony, E.}, \bibinfo{author}{Honey, C.~J.}, + \bibinfo{author}{Chen, J.} \& \bibinfo{author}{Hasson, U.} \newblock \bibinfo{title}{Dynamic reconfiguration of the default mode network - during narrative comprehension}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Nature Communications}} - \textbf{\bibinfo{volume}{7}}\DIFadd{, }\bibinfo{pages}{1--13} \DIFadd{(}\bibinfo{year}{2016}\DIFadd{). 
-} + during narrative comprehension}. +\newblock \emph{\bibinfo{journal}{Nature Communications}} + \textbf{\bibinfo{volume}{7}}, \bibinfo{pages}{1--13} (\bibinfo{year}{2016}). \bibitem{ZadbEtal17} -\bibinfo{author}{Zadbood, A.}\DIFadd{, }\bibinfo{author}{Chen, J.}\DIFadd{, - }\bibinfo{author}{Leong, Y.~C.}\DIFadd{, }\bibinfo{author}{Norman, K.~A.} \DIFadd{\& - }\bibinfo{author}{Hasson, U.} +\bibinfo{author}{Zadbood, A.}, \bibinfo{author}{Chen, J.}, + \bibinfo{author}{Leong, Y.~C.}, \bibinfo{author}{Norman, K.~A.} \& + \bibinfo{author}{Hasson, U.} \newblock \bibinfo{title}{How we transmit memories to other brains: - Constructing shared neural representations via communication}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Cereb Cortex}} \textbf{\bibinfo{volume}{27}}\DIFadd{, - }\bibinfo{pages}{4988--5000} \DIFadd{(}\bibinfo{year}{2017}\DIFadd{). -} + Constructing shared neural representations via communication}. +\newblock \emph{\bibinfo{journal}{Cereb Cortex}} \textbf{\bibinfo{volume}{27}}, + \bibinfo{pages}{4988--5000} (\bibinfo{year}{2017}). \bibitem{SimoChan20} -\bibinfo{author}{Simony, E.} \DIFadd{\& }\bibinfo{author}{Chang, C.} +\bibinfo{author}{Simony, E.} \& \bibinfo{author}{Chang, C.} \newblock \bibinfo{title}{Analysis of stimulus-induced brain dynamics during - naturalistic paradigms}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{{N}euro{I}mage}} - \textbf{\bibinfo{volume}{216}}\DIFadd{, }\bibinfo{pages}{116461} - \DIFadd{(}\bibinfo{year}{2020}\DIFadd{). -} + naturalistic paradigms}. +\newblock \emph{\bibinfo{journal}{{N}euro{I}mage}} + \textbf{\bibinfo{volume}{216}}, \bibinfo{pages}{116461} + (\bibinfo{year}{2020}). \bibitem{LandDuma97} -\bibinfo{author}{Landauer, T.~K.} \DIFadd{\& }\bibinfo{author}{Dumais, S.~T.} +\bibinfo{author}{Landauer, T.~K.} \& \bibinfo{author}{Dumais, S.~T.} \newblock \bibinfo{title}{A solution to {P}lato's problem: the latent semantic - analysis theory of acquisition, induction, and representation of knowledge}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Psychological Review}} - \textbf{\bibinfo{volume}{104}}\DIFadd{, }\bibinfo{pages}{211--240} - \DIFadd{(}\bibinfo{year}{1997}\DIFadd{). -} + analysis theory of acquisition, induction, and representation of knowledge}. +\newblock \emph{\bibinfo{journal}{Psychological Review}} + \textbf{\bibinfo{volume}{104}}, \bibinfo{pages}{211--240} + (\bibinfo{year}{1997}). \bibitem{MikoEtal13a} -\bibinfo{author}{Mikolov, T.}\DIFadd{, }\bibinfo{author}{Chen, K.}\DIFadd{, - }\bibinfo{author}{Corrado, G.} \DIFadd{\& }\bibinfo{author}{Dean, J.} +\bibinfo{author}{Mikolov, T.}, \bibinfo{author}{Chen, K.}, + \bibinfo{author}{Corrado, G.} \& \bibinfo{author}{Dean, J.} \newblock \bibinfo{title}{Efficient estimation of word representations in - vector space}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{{arXiv}}} - \textbf{\bibinfo{volume}{1301.3781}} \DIFadd{(}\bibinfo{year}{2013}\DIFadd{). -} + vector space}. +\newblock \emph{\bibinfo{journal}{{arXiv}}} + \textbf{\bibinfo{volume}{1301.3781}} (\bibinfo{year}{2013}). \bibitem{CerEtal18} -\bibinfo{author}{Cer, D.} \emph{\DIFadd{et~al.}} -\newblock \bibinfo{title}{Universal sentence encoder}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{{arXiv}}} - \textbf{\bibinfo{volume}{1803.11175}} \DIFadd{(}\bibinfo{year}{2018}\DIFadd{). -} +\bibinfo{author}{Cer, D.} \emph{et~al.} +\newblock \bibinfo{title}{Universal sentence encoder}. +\newblock \emph{\bibinfo{journal}{{arXiv}}} + \textbf{\bibinfo{volume}{1803.11175}} (\bibinfo{year}{2018}). 
\bibitem{RadfEtal19} -\bibinfo{author}{Radford, A.} \emph{\DIFadd{et~al.}} -\newblock \bibinfo{title}{Language models are unsupervised multitask learners}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{OpenAI Blog}} \textbf{\bibinfo{volume}{1}} - \DIFadd{(}\bibinfo{year}{2019}\DIFadd{). -} +\bibinfo{author}{Radford, A.} \emph{et~al.} +\newblock \bibinfo{title}{Language models are unsupervised multitask learners}. +\newblock \emph{\bibinfo{journal}{OpenAI Blog}} \textbf{\bibinfo{volume}{1}} + (\bibinfo{year}{2019}). \bibitem{BrowEtal20} -\bibinfo{author}{Brown, T.~B.} \emph{\DIFadd{et~al.}} -\newblock \bibinfo{title}{Language models are few-shot learners}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{{arXiv}}} - \textbf{\bibinfo{volume}{2005.14165}} \DIFadd{(}\bibinfo{year}{2020}\DIFadd{). -} +\bibinfo{author}{Brown, T.~B.} \emph{et~al.} +\newblock \bibinfo{title}{Language models are few-shot learners}. +\newblock \emph{\bibinfo{journal}{{arXiv}}} + \textbf{\bibinfo{volume}{2005.14165}} (\bibinfo{year}{2020}). \bibitem{PedrEtal11} -\bibinfo{author}{Pedregosa, F.} \emph{\DIFadd{et~al.}} -\newblock \bibinfo{title}{Scikit-learn: Machine learning in {P}ython}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Journal of Machine Learning Research}} - \textbf{\bibinfo{volume}{12}}\DIFadd{, }\bibinfo{pages}{2825--2830} - \DIFadd{(}\bibinfo{year}{2011}\DIFadd{). -} +\bibinfo{author}{Pedregosa, F.} \emph{et~al.} +\newblock \bibinfo{title}{Scikit-learn: Machine learning in {P}ython}. +\newblock \emph{\bibinfo{journal}{Journal of Machine Learning Research}} + \textbf{\bibinfo{volume}{12}}, \bibinfo{pages}{2825--2830} + (\bibinfo{year}{2011}). \bibitem{Brainiak} -\bibinfo{author}{Capota, M.} \emph{\DIFadd{et~al.}} -\newblock \bibinfo{title}{Brain imaging analysis kit} \DIFadd{(}\bibinfo{year}{2017}\DIFadd{). -}\newblock \urlprefix\url{https://doi.org/10.5281/zenodo.59780}\DIFadd{. -} +\bibinfo{author}{Capota, M.} \emph{et~al.} +\newblock \bibinfo{title}{Brain imaging analysis kit} (\bibinfo{year}{2017}). +\newblock \urlprefix\url{https://doi.org/10.5281/zenodo.59780}. \bibitem{HassEtal08} -\bibinfo{author}{Hasson, U.}\DIFadd{, }\bibinfo{author}{Yang, E.}\DIFadd{, - }\bibinfo{author}{Vallines, I.}\DIFadd{, }\bibinfo{author}{Heeger, D.~J.} \DIFadd{\& - }\bibinfo{author}{Rubin, N.} +\bibinfo{author}{Hasson, U.}, \bibinfo{author}{Yang, E.}, + \bibinfo{author}{Vallines, I.}, \bibinfo{author}{Heeger, D.~J.} \& + \bibinfo{author}{Rubin, N.} \newblock \bibinfo{title}{A hierarchy of temporal receptive windows in human - cortex}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Journal of Neuroscience}} - \textbf{\bibinfo{volume}{28}}\DIFadd{, }\bibinfo{pages}{2539--2550} - \DIFadd{(}\bibinfo{year}{2008}\DIFadd{). -} + cortex}. +\newblock \emph{\bibinfo{journal}{Journal of Neuroscience}} + \textbf{\bibinfo{volume}{28}}, \bibinfo{pages}{2539--2550} + (\bibinfo{year}{2008}). \bibitem{HassEtal15} -\bibinfo{author}{Hasson, U.}\DIFadd{, }\bibinfo{author}{Chen, J.} \DIFadd{\& - }\bibinfo{author}{Honey, C.~J.} +\bibinfo{author}{Hasson, U.}, \bibinfo{author}{Chen, J.} \& + \bibinfo{author}{Honey, C.~J.} \newblock \bibinfo{title}{Hierarchical process memory: memory as an integral - component of information processing}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Trends in Cognitive Science}} - \textbf{\bibinfo{volume}{19}}\DIFadd{, }\bibinfo{pages}{304--315} - \DIFadd{(}\bibinfo{year}{2015}\DIFadd{). -} + component of information processing}. 
+\newblock \emph{\bibinfo{journal}{Trends in Cognitive Science}} + \textbf{\bibinfo{volume}{19}}, \bibinfo{pages}{304--315} + (\bibinfo{year}{2015}). \bibitem{Dobr70} \bibinfo{author}{Dobrushin, R.~L.} \newblock \bibinfo{title}{Prescribing a system of random variables by - conditional distributions}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Theory of Probability \& Its Applications}} - \textbf{\bibinfo{volume}{15}}\DIFadd{, }\bibinfo{pages}{458--486} - \DIFadd{(}\bibinfo{year}{1970}\DIFadd{). -} + conditional distributions}. +\newblock \emph{\bibinfo{journal}{Theory of Probability \& Its Applications}} + \textbf{\bibinfo{volume}{15}}, \bibinfo{pages}{458--486} + (\bibinfo{year}{1970}). \bibitem{RamdEtal17} -\bibinfo{author}{Ramdas, A.}\DIFadd{, }\bibinfo{author}{Trillos, N.} \DIFadd{\& - }\bibinfo{author}{Cuturi, M.} +\bibinfo{author}{Ramdas, A.}, \bibinfo{author}{Trillos, N.} \& + \bibinfo{author}{Cuturi, M.} \newblock \bibinfo{title}{On wasserstein two-sample testing and related - families of nonparametric tests}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Entropy}} \textbf{\bibinfo{volume}{19}}\DIFadd{, - }\bibinfo{pages}{47} \DIFadd{(}\bibinfo{year}{2017}\DIFadd{). -} + families of nonparametric tests}. +\newblock \emph{\bibinfo{journal}{Entropy}} \textbf{\bibinfo{volume}{19}}, + \bibinfo{pages}{47} (\bibinfo{year}{2017}). \bibitem{HeusEtal17b} -\bibinfo{author}{Heusser, A.~C.}\DIFadd{, }\bibinfo{author}{Fitzpatrick, P.~C.}\DIFadd{, - }\bibinfo{author}{Field, C.~E.}\DIFadd{, }\bibinfo{author}{Ziman, K.} \DIFadd{\& - }\bibinfo{author}{Manning, J.~R.} +\bibinfo{author}{Heusser, A.~C.}, \bibinfo{author}{Fitzpatrick, P.~C.}, + \bibinfo{author}{Field, C.~E.}, \bibinfo{author}{Ziman, K.} \& + \bibinfo{author}{Manning, J.~R.} \newblock \bibinfo{title}{Quail: a {Python} toolbox for analyzing and plotting - free recall data}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{The Journal of Open Source Software}} - \textbf{\bibinfo{volume}{10.21105/joss.00424}} \DIFadd{(}\bibinfo{year}{2017}\DIFadd{). -} + free recall data}. +\newblock \emph{\bibinfo{journal}{The Journal of Open Source Software}} + \textbf{\bibinfo{volume}{10.21105/joss.00424}} (\bibinfo{year}{2017}). \bibitem{Fish25} \bibinfo{author}{Fisher, R.~A.} \newblock \emph{\bibinfo{title}{Statistical Methods for Research Workers}} - \DIFadd{(}\bibinfo{publisher}{Oliver and Boyd}\DIFadd{, }\bibinfo{year}{1925}\DIFadd{). -} + (\bibinfo{publisher}{Oliver and Boyd}, \bibinfo{year}{1925}). \bibitem{BernClif94} -\bibinfo{author}{Berndt, D.~J.} \DIFadd{\& }\bibinfo{author}{Clifford, J.} +\bibinfo{author}{Berndt, D.~J.} \& \bibinfo{author}{Clifford, J.} \newblock \bibinfo{title}{Using dynamic time warping to find patterns in time - series}\DIFadd{. -}\newblock \DIFadd{In }\emph{\bibinfo{booktitle}{{KDD workshop}}}\DIFadd{, - vol.~}\bibinfo{volume}{10}\DIFadd{, }\bibinfo{pages}{359--370} \DIFadd{(}\bibinfo{year}{1994}\DIFadd{). -} + series}. +\newblock In \emph{\bibinfo{booktitle}{{KDD workshop}}}, + vol.~\bibinfo{volume}{10}, \bibinfo{pages}{359--370} (\bibinfo{year}{1994}). \bibitem{FreeEtal01} -\bibinfo{author}{Freedman, D.}\DIFadd{, }\bibinfo{author}{Riesenhuber, M.}\DIFadd{, - }\bibinfo{author}{Poggio, T.} \DIFadd{\& }\bibinfo{author}{Miller, E.} +\bibinfo{author}{Freedman, D.}, \bibinfo{author}{Riesenhuber, M.}, + \bibinfo{author}{Poggio, T.} \& \bibinfo{author}{Miller, E.} \newblock \bibinfo{title}{Categorical representation of visual stimuli in the - primate prefrontal cortex}\DIFadd{. 
-}\newblock \emph{\bibinfo{journal}{Science}} \textbf{\bibinfo{volume}{291}}\DIFadd{, - }\bibinfo{pages}{312--316} \DIFadd{(}\bibinfo{year}{2001}\DIFadd{). -} + primate prefrontal cortex}. +\newblock \emph{\bibinfo{journal}{Science}} \textbf{\bibinfo{volume}{291}}, + \bibinfo{pages}{312--316} (\bibinfo{year}{2001}). \bibitem{SigmDeha08} -\bibinfo{author}{Sigman, M.} \DIFadd{\& }\bibinfo{author}{Dehaene, S.} +\bibinfo{author}{Sigman, M.} \& \bibinfo{author}{Dehaene, S.} \newblock \bibinfo{title}{Brain mechanisms of serial and parallel processing - during dual-task performance}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Journal of Neuroscience}} - \textbf{\bibinfo{volume}{28}}\DIFadd{, }\bibinfo{pages}{7585--7589} - \DIFadd{(}\bibinfo{year}{2008}\DIFadd{). -} + during dual-task performance}. +\newblock \emph{\bibinfo{journal}{Journal of Neuroscience}} + \textbf{\bibinfo{volume}{28}}, \bibinfo{pages}{7585--7589} + (\bibinfo{year}{2008}). \bibitem{CharKoec10} -\bibinfo{author}{Charron, S.} \DIFadd{\& }\bibinfo{author}{Koechlin, E.} +\bibinfo{author}{Charron, S.} \& \bibinfo{author}{Koechlin, E.} \newblock \bibinfo{title}{Divided representations of current goals in the human - frontal lobes}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Science}} \textbf{\bibinfo{volume}{328}}\DIFadd{, - }\bibinfo{pages}{360--363} \DIFadd{(}\bibinfo{year}{2010}\DIFadd{). -} + frontal lobes}. +\newblock \emph{\bibinfo{journal}{Science}} \textbf{\bibinfo{volume}{328}}, + \bibinfo{pages}{360--363} (\bibinfo{year}{2010}). \bibitem{RishEtal13} -\bibinfo{author}{Rishel, C.~A.}\DIFadd{, }\bibinfo{author}{Huang, G.} \DIFadd{\& - }\bibinfo{author}{Freedman, D.~J.} +\bibinfo{author}{Rishel, C.~A.}, \bibinfo{author}{Huang, G.} \& + \bibinfo{author}{Freedman, D.~J.} \newblock \bibinfo{title}{Independent category and spatial encoding in parietal - cortex}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Neuron}} \textbf{\bibinfo{volume}{77}}\DIFadd{, - }\bibinfo{pages}{969--979} \DIFadd{(}\bibinfo{year}{2013}\DIFadd{). -} + cortex}. +\newblock \emph{\bibinfo{journal}{Neuron}} \textbf{\bibinfo{volume}{77}}, + \bibinfo{pages}{969--979} (\bibinfo{year}{2013}). \bibitem{KrieEtal08b} -\bibinfo{author}{Kriegeskorte, N.}\DIFadd{, }\bibinfo{author}{Mur, M.} \DIFadd{\& - }\bibinfo{author}{Bandettini, P.} +\bibinfo{author}{Kriegeskorte, N.}, \bibinfo{author}{Mur, M.} \& + \bibinfo{author}{Bandettini, P.} \newblock \bibinfo{title}{Representational similarity analysis -- connecting - the branches of systems neuroscience}\DIFadd{. -}\newblock \emph{\bibinfo{journal}{Frontiers in Systems Neuroscience}} - \textbf{\bibinfo{volume}{2}}\DIFadd{, }\bibinfo{pages}{1 -- 28} - \DIFadd{(}\bibinfo{year}{2008}\DIFadd{). -} + the branches of systems neuroscience}. +\newblock \emph{\bibinfo{journal}{Frontiers in Systems Neuroscience}} + \textbf{\bibinfo{volume}{2}}, \bibinfo{pages}{1 -- 28} + (\bibinfo{year}{2008}). \end{thebibliography} -\DIFaddend \section*{Acknowledgements} -We thank Luke Chang, Janice Chen, Chris Honey, \DIFaddbegin \DIFadd{Caroline Lee, }\DIFaddend Lucy Owen, Emily Whitaker, \DIFaddbegin \DIFadd{Xinming Xu, }\DIFaddend and Kirsten Ziman for feedback and scientific discussions. We also thank Janice Chen, Yuan Chang Leong, \DIFaddbegin \DIFadd{Chris Honey, Chung Yong, }\DIFaddend Kenneth Norman, and Uri Hasson for sharing the data used in our study. Our work was supported in part by NSF EPSCoR Award Number 1632738. 
The content is solely the responsibility of the authors and does not necessarily represent the official views of our supporting organizations. \DIFaddbegin \DIFadd{The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript. -}\DIFaddend +\section*{Acknowledgements} +We thank Luke Chang, Janice Chen, Chris Honey, Caroline Lee, Lucy Owen, Emily Whitaker, Xinming Xu, and Kirsten Ziman for feedback and scientific discussions. We also thank Janice Chen, Yuan Chang Leong, Chris Honey, Chung Yong, Kenneth Norman, and Uri Hasson for sharing the data used in our study. Our work was supported in part by NSF EPSCoR Award Number 1632738. The content is solely the responsibility of the authors and does not necessarily represent the official views of our supporting organizations. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript. \section*{Author contributions} Conceptualization: A.C.H. and J.R.M.; Methodology: A.C.H., P.C.F. and J.R.M.; Software: A.C.H., P.C.F. and J.R.M.; Analysis: A.C.H., P.C.F. and J.R.M.; Writing, Reviewing, and Editing: A.C.H., P.C.F. and J.R.M.; Supervision: J.R.M. -\section*{\DIFdelbegin \DIFdel{Author information}\DIFdelend \DIFaddbegin \DIFadd{Competing interests}\DIFaddend } -The authors declare no competing \DIFdelbegin \DIFdel{financial interests. Correspondence and requests for materials should be addressed to J.R.M. (jeremy.r.manning@dartmouth.edu)}\DIFdelend \DIFaddbegin \DIFadd{interests}\DIFaddend . - +\section*{Competing interests} +The authors declare no competing interests. \end{document} diff --git a/paper/figs/eventseg.pdf b/paper/figs/eventseg.pdf index 93a2ef1..d4b5a25 100755 Binary files a/paper/figs/eventseg.pdf and b/paper/figs/eventseg.pdf differ diff --git a/paper/figs/precision_distinctiveness.pdf b/paper/figs/precision_distinctiveness.pdf index b12b6f4..7055aa0 100644 Binary files a/paper/figs/precision_distinctiveness.pdf and b/paper/figs/precision_distinctiveness.pdf differ diff --git a/paper/figs/searchlights.pdf b/paper/figs/searchlights.pdf index e427abd..9683440 100644 Binary files a/paper/figs/searchlights.pdf and b/paper/figs/searchlights.pdf differ diff --git a/paper/main.pdf b/paper/main.pdf index 2c01360..d55c9a8 100644 Binary files a/paper/main.pdf and b/paper/main.pdf differ diff --git a/paper/main.tex b/paper/main.tex index 556f3cd..07ee0d3 100644 --- a/paper/main.tex +++ b/paper/main.tex @@ -11,7 +11,7 @@ \usepackage{hyperref} \usepackage{lineno} \usepackage{placeins} -\usepackage[nofiglist, fighead]{endfloat} +\usepackage[nofiglist, nomarkers, fighead]{endfloat} \renewcommand{\includegraphics}[2][]{} \newcommand{\argmax}{\mathop{\mathrm{argmax}}\limits} @@ -19,10 +19,12 @@ \newcommand{\topicopt}{1} \newcommand{\topics}{2} \newcommand{\featureimportance}{3} -\newcommand{\corrmats}{5} -\newcommand{\matchmats}{6} -\newcommand{\kopt}{7} -\newcommand{\arrows}{4} + +\newcommand{\arrows}{1} +\newcommand{\corrmats}{2} +\newcommand{\matchmats}{3} +\newcommand{\kopt}{4} + \doublespacing \linenumbers @@ -31,6 +33,8 @@ \author{Andrew C. Heusser\textsuperscript{1, 2, \textdagger}, Paxton C. Fitzpatrick\textsuperscript{1, \textdagger}, and Jeremy R. 
Manning\textsuperscript{1, *}\\\textsuperscript{1}Department of Psychological and Brain Sciences\\Dartmouth College, Hanover, NH 03755, USA\\\textsuperscript{2}Akili Interactive Labs\\Boston, MA 02110, USA\\\textsuperscript{\textdagger}Denotes equal contribution\\\textsuperscript{*}Corresponding author: Jeremy.R.Manning@Dartmouth.edu} +\date{} + \begin{document} \begin{titlepage} \maketitle @@ -38,7 +42,8 @@ \end{titlepage} \begin{abstract} -The mental contexts in which we interpret experiences are often person-specific, even when the experiences themselves are shared. We developed a geometric framework for mathematically characterizing the subjective conceptual content of dynamic naturalistic experiences. We model experiences and memories as ``trajectories'' through word embedding spaces whose coordinates reflect the universe of thoughts under consideration. Memory encoding can then be modeled as geometrically preserving or distorting the ``shape'' of the original experience. We applied our approach to data collected as participants watched and verbally recounted a television episode while undergoing functional neuroimaging. Participants’ recountings all preserved coarse spatial properties (essential narrative elements), but not fine spatial scale (low-level) details, of the episode’s trajectory. We also identified networks of brain structures sensitive to these trajectory shapes. Our work provides insights into how we preserve and distort our ongoing experiences when we encode them into episodic memories. +How do we preserve and distort our ongoing experiences when encoding them into episodic memories? +The mental contexts in which we interpret experiences are often person-specific, even when the experiences themselves are shared. We developed a geometric framework for mathematically characterizing the subjective conceptual content of dynamic naturalistic experiences. We model experiences and memories as ``trajectories'' through word embedding spaces whose coordinates reflect the universe of thoughts under consideration. Memory encoding can then be modeled as geometrically preserving or distorting the ``shape'' of the original experience. We applied our approach to data collected as participants watched and verbally recounted a television episode while undergoing functional neuroimaging. Participants’ recountings all preserved coarse spatial properties (essential narrative elements), but not fine spatial scale (low-level) details, of the episode’s trajectory. We also identified networks of brain structures sensitive to these trajectory shapes. \end{abstract} @@ -71,17 +76,17 @@ \section*{Results} \begin{figure}[tp] \centering \includegraphics[width=\textwidth]{figs/eventseg} -\caption{\small \textbf{Modeling naturalistic stimuli and recalls.} All panels: darker colors indicate greater values; range: [0, 1]. \textbf{A.} Topic vectors ($K = 100$) for each of the 1976 episode timepoints. \textbf{B.} Timepoint-by-timepoint correlation matrix of the topic vectors displayed in Panel A. Event boundaries discovered by the HMM are denoted in yellow (30 events detected). \textbf{C.} Average topic vectors for each of the 30 episode events. \textbf{D.} Topic vectors for each of 265 sliding windows of sentences spoken by an example participant while recalling the episode. \textbf{E.} Timepoint-by-timepoint correlation matrix of the topic vectors displayed in Panel D. Event boundaries detected by the HMM are denoted in yellow (22 events detected). 
For similar plots for all participants, see Supplementary Figure~\corrmats. \textbf{F.} Average topic vectors for each of the 22 recall events from the example participant. \textbf{G.} Correlations between the topic vectors for every pair of episode events (Panel C) and recall events (from the example participant; Panel F). For similar plots for all participants, see Supplementary Figure~\matchmats. \textbf{H.} Average correlations between each pair of episode events and recall events (across all 17 participants). To create the figure, each recalled event was assigned to the episode event with the most correlated topic vector (yellow boxes in panels G and H).} +\caption{\small \textbf{Modeling naturalistic stimuli and recalls.} All panels: darker colors indicate greater values; range: [0, 1]. \textbf{A.} Topic vectors ($K = 100$) for each of the 1976 episode timepoints. \textbf{B.} Timepoint-by-timepoint correlation matrix of the topic vectors displayed in Panel A. Event boundaries discovered by the HMM are denoted in yellow (30 events detected). \textbf{C.} Average topic vectors for each of the 30 episode events. \textbf{D.} Topic vectors for each of 265 sliding windows of sentences spoken by an example participant while recalling the episode. \textbf{E.} Timepoint-by-timepoint correlation matrix of the topic vectors displayed in Panel D. Event boundaries detected by the HMM are denoted in yellow (22 events detected). For similar plots for all participants, see Extended Data Figure~\corrmats. \textbf{F.} Average topic vectors for each of the 22 recall events from the example participant. \textbf{G.} Correlations between the topic vectors for every pair of episode events (Panel C) and recall events (from the example participant; Panel F). For similar plots for all participants, see Extended Data Figure~\matchmats. \textbf{H.} Average correlations between each pair of episode events and recall events (across all 17 participants). To create the figure, each recalled event was assigned to the episode event with the most correlated topic vector (yellow boxes in panels G and H).} \label{fig:model} \end{figure} Given that the time-varying content of the episode could be segmented cleanly into discrete events, we wondered whether participants' recalls of the episode also displayed a similar structure. We applied the same topic model (already trained on the episode annotations) to each participant's recalls. Analogously to how we parsed the time-varying content of the episode, to obtain similar estimates for each participant's recall transcript, we treated each overlapping window of (up to) 10 sentences from their transcript as a document, and computed the most probable mix of topics reflected in each timepoint's sentences. This yielded, for each participant, a number-of-windows by number-of-topics topic proportions matrix that characterized how the topics identified in the original episode were reflected in the participant's recalls. An important feature of our approach is that it allows us to compare participants' recalls to events from the original episode, despite that different participants used widely varying language to describe the events, and that those descriptions often diverged in content, quality, and quantity from the episode annotations. This ability to match up conceptually related text that differs in specific vocabulary, detail, and length is an important benefit of projecting the episode and recalls into a shared topic space. 
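To make the windowing procedure concrete, here is a minimal Python sketch of the transform described above. It is an illustration rather than the authors' implementation: `episode_annotations`, `recall_sentences`, and the scikit-learn settings are placeholders, and the handling of the first few (shorter-than-10-sentence) windows is an assumption.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def fit_topic_model(episode_annotations, n_topics=100):
    # Fit the topic model on the episode annotations only (as described above);
    # the vectorizer settings here are illustrative defaults, not the authors' choices.
    vectorizer = CountVectorizer(stop_words='english')
    counts = vectorizer.fit_transform(episode_annotations)
    lda = LatentDirichletAllocation(n_components=n_topics).fit(counts)
    return vectorizer, lda

def sliding_window_topics(recall_sentences, vectorizer, lda, width=10):
    # Treat each overlapping window of (up to) `width` sentences as a document,
    # then infer its topic mixture with the already-trained model.
    windows = [' '.join(recall_sentences[max(0, i - width + 1):i + 1])
               for i in range(len(recall_sentences))]
    counts = vectorizer.transform(windows)   # reuse the episode vocabulary
    return lda.transform(counts)             # windows-by-topics proportions matrix
```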
An example topic proportions matrix from one participant's recalls is shown in Figure~\ref{fig:model}D. -Although the example participant's recall topic proportions matrix has some visual similarity to the episode topic proportions matrix, the time-varying topic proportions for the example participant's recalls are not as sparse as those for the episode (compare Figs.~\ref{fig:model}A and D). Similarly, although there do appear to be periods of stability in the recall topic dynamics (i.e., most topics are active or inactive over contiguous blocks of time), the changes in topic activations that define event boundaries appear less clearly delineated in participants' recalls than in the episode's annotations. To examine these patterns in detail, we computed the timepoint-by-timepoint correlation matrix for the example participant's recall topic proportions matrix (Fig.~\ref{fig:model}E). As in the episode correlation matrix (Fig.~\ref{fig:model}B), the example participant's recall correlation matrix has a strong block diagonal structure, indicating that their recalls are discretized into separated events. We used the same HMM-based optimization procedure that we had applied to the episode's topic proportions matrix (see \textit{Methods}) to estimate an analogous set of event boundaries in the participant's recounting of the episode (outlined in yellow). We carried out this analysis on all 17 participants' recall topic proportions matrices (Supp.\ Fig.~\corrmats). +Although the example participant's recall topic proportions matrix has some visual similarity to the episode topic proportions matrix, the time-varying topic proportions for the example participant's recalls are not as sparse as those for the episode (compare Figs.~\ref{fig:model}A and D). Similarly, although there do appear to be periods of stability in the recall topic dynamics (i.e., most topics are active or inactive over contiguous blocks of time), the changes in topic activations that define event boundaries appear less clearly delineated in participants' recalls than in the episode's annotations. To examine these patterns in detail, we computed the timepoint-by-timepoint correlation matrix for the example participant's recall topic proportions matrix (Fig.~\ref{fig:model}E). As in the episode correlation matrix (Fig.~\ref{fig:model}B), the example participant's recall correlation matrix has a strong block diagonal structure, indicating that their recalls are discretized into separated events. We used the same HMM-based optimization procedure that we had applied to the episode's topic proportions matrix (see \textit{Methods}) to estimate an analogous set of event boundaries in the participant's recounting of the episode (outlined in yellow). We carried out this analysis on all 17 participants' recall topic proportions matrices (Extended Data Fig.~\corrmats). -Two clear patterns emerged from this set of analyses. First, although every individual participant's recalls could be segmented into discrete events (i.e., every individual participant's recall correlation matrix exhibited clear block diagonal structure; Supp.\ Fig.~\corrmats), each participant appeared to have a unique ``recall resolution,'' reflected in the sizes of those blocks. While some participants' recall topic proportions segmented into just a few events (e.g., Participants P4, P5, and P7), others' segmented into many shorter-duration events (e.g., Participants P12, P13, and P17). 
This suggests that different participants may be recalling the episode with different levels of detail---i.e., some might recount only high-level essential plot details, whereas others might recount low-level details instead (or in addition). The second clear pattern present in every individual participant's recall correlation matrix was that, unlike in the episode correlation matrix, there were substantial off-diagonal correlations. One potential explanation for this finding is that the topic models, trained only on episode annotations, do not capture topic proportions in participants' ``held-out'' recalls as accurately. A second possibility is that, whereas each event in the original episode was (largely) separable from the others (Fig.~\ref{fig:model}B), in transforming those separable events into memory, participants appeared to be integrating across multiple events, blending elements of previously recalled and not-yet-recalled content into each newly recalled event (Fig.~\ref{fig:model}E, Supp.\ Fig.~\corrmats)~\citep{MannEtal11, HowaEtal12, Mann19}. +Two clear patterns emerged from this set of analyses. First, although every individual participant's recalls could be segmented into discrete events (i.e., every individual participant's recall correlation matrix exhibited clear block diagonal structure; Extended Data Fig.~\corrmats), each participant appeared to have a unique ``recall resolution,'' reflected in the sizes of those blocks. While some participants' recall topic proportions segmented into just a few events (e.g., Participants P4, P5, and P7), others' segmented into many shorter-duration events (e.g., Participants P12, P13, and P17). This suggests that different participants may be recalling the episode with different levels of detail---i.e., some might recount only high-level essential plot details, whereas others might recount low-level details instead (or in addition). The second clear pattern present in every individual participant's recall correlation matrix was that, unlike in the episode correlation matrix, there were substantial off-diagonal correlations. One potential explanation for this finding is that the topic models, trained only on episode annotations, do not capture topic proportions in participants' ``held-out'' recalls as accurately. A second possibility is that, whereas each event in the original episode was (largely) separable from the others (Fig.~\ref{fig:model}B), in transforming those separable events into memory, participants appeared to be integrating across multiple events, blending elements of previously recalled and not-yet-recalled content into each newly recalled event (Fig.~\ref{fig:model}E, Extended Data Fig.~\corrmats)~\citep{MannEtal11, HowaEtal12, Mann19}. -The above results demonstrate that topic models capture the dynamic conceptual content of the episode and participants' recalls of the episode. Further, the episode and recalls exhibit event boundaries that can be identified automatically using HMMs to segment the dynamic content. Next, we asked whether some correspondence might be made between the specific content of the events the participants experienced while viewing the episode, and the events they later recalled. We labeled each recall event as matching the episode event with the most similar (i.e., most highly correlated) topic vector (Fig.~\ref{fig:model}G, Supp.\ Fig.~\matchmats). 
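A hedged sketch of this matching step (not the authors' code): `episode_events` and `recall_events` stand for the event-by-topic matrices described above, and the similarities are Pearson correlations between topic vectors.

```python
import numpy as np

def match_recall_to_episode(episode_events, recall_events):
    # episode_events: (n_episode_events, n_topics); recall_events: (n_recall_events, n_topics).
    # Correlate every recall event's topic vector with every episode event's topic vector,
    # then label each recall event with its best-matching (most correlated) episode event.
    n_ep = episode_events.shape[0]
    corr = np.corrcoef(episode_events, recall_events)[:n_ep, n_ep:]  # episode x recall
    return corr, corr.argmax(axis=0)   # correlation matrix, plus one episode label per recall event
```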
This yielded a sequence of ``presented'' events from the original episode, and a (potentially differently ordered) sequence of ``recalled'' events for each participant. Analogous to classic list-learning studies, we can then examine participants' recall sequences by asking which events they tended to recall first (probability of first recall~\citep{AtkiShif68, PostPhil65, WelcBurn24}; Fig.~\ref{fig:list-learning}A); how participants most often transitioned between recalls of the events as a function of the temporal distance between them (lag-conditional response probability~\citep{Kaha96}; Fig.~\ref{fig:list-learning}B); and which events they were likely to remember overall (serial position recall analyses~\citep{Murd62a}; Fig.~\ref{fig:list-learning}C). Some of the patterns we observed appeared to be similar to classic effects from the list-learning literature. For example, participants had a higher probability of initiating recall with early events (Fig.~\ref{fig:list-learning}A) and a higher probability of transitioning to neighboring events with an asymmetric forward bias (Fig.~\ref{fig:list-learning}B). However, unlike what is typically observed in list-learning studies, we did not observe patterns comparable to the primacy or recency serial position effects (Fig.~\ref{fig:list-learning}C). We hypothesized that participants might be leveraging meaningful narrative associations and references over long timescales throughout the episode. +The above results demonstrate that topic models capture the dynamic conceptual content of the episode and participants' recalls of the episode. Further, the episode and recalls exhibit event boundaries that can be identified automatically using HMMs to segment the dynamic content. Next, we asked whether some correspondence might be made between the specific content of the events the participants experienced while viewing the episode, and the events they later recalled. We labeled each recall event as matching the episode event with the most similar (i.e., most highly correlated) topic vector (Fig.~\ref{fig:model}G, Extended Data Fig.~\matchmats). This yielded a sequence of ``presented'' events from the original episode, and a (potentially differently ordered) sequence of ``recalled'' events for each participant. Analogous to classic list-learning studies, we can then examine participants' recall sequences by asking which events they tended to recall first (probability of first recall~\citep{AtkiShif68, PostPhil65, WelcBurn24}; Fig.~\ref{fig:list-learning}A); how participants most often transitioned between recalls of the events as a function of the temporal distance between them (lag-conditional response probability~\citep{Kaha96}; Fig.~\ref{fig:list-learning}B); and which events they were likely to remember overall (serial position recall analyses~\citep{Murd62a}; Fig.~\ref{fig:list-learning}C). Some of the patterns we observed appeared to be similar to classic effects from the list-learning literature. For example, participants had a higher probability of initiating recall with early events (Fig.~\ref{fig:list-learning}A) and a higher probability of transitioning to neighboring events with an asymmetric forward bias (Fig.~\ref{fig:list-learning}B). However, unlike what is typically observed in list-learning studies, we did not observe patterns comparable to the primacy or recency serial position effects (Fig.~\ref{fig:list-learning}C). 
We hypothesized that participants might be leveraging meaningful narrative associations and references over long timescales throughout the episode. \begin{figure}[tp] \centering @@ -97,7 +102,7 @@ \section*{Results} \begin{figure}[tp] \centering \includegraphics[width=1\textwidth]{figs/precision_distinctiveness} - \caption{\small \textbf{Novel content-based metrics of naturalistic memory: precision and distinctiveness.} \textbf{A.} The episode-recall correlation matrix for an example participant (P17), chosen for their large number of recall events (for analogous figures for other participants, see Supp.\ Fig.~\corrmats). The yellow boxes highlight the maximum correlation in each column. The example participant's overall precision score was computed as the average across the (Fisher $z$-transformed) correlation values in the yellow boxes. Their distinctiveness score was computed as the average (over recall events) of the $z$-scored (within column) event precisions. \textbf{B.} The across-participants (Pearson's) correlation between precision and hand-counted number of recalled scenes. \textbf{C.} The correlation between distinctiveness and hand-counted number of recalled scenes. \textbf{D.} The correlation between precision and the number of recalled episode events, as determined by our model. \textbf{E.} The correlation between distinctiveness and the number of recalled episode events, as determined by our model.} + \caption{\small \textbf{Novel content-based metrics of naturalistic memory: precision and distinctiveness.} \textbf{A.} The episode-recall correlation matrix for an example participant (P17), chosen for their large number of recall events (for analogous figures for other participants, see Extended Data Fig.~\corrmats). The yellow boxes highlight the maximum correlation in each column. The example participant's overall precision score was computed as the average across the (Fisher $z$-transformed) correlation values in the yellow boxes. Their distinctiveness score was computed as the average (over recall events) of the $z$-scored (within column) event precisions. \textbf{B.} The across-participants (Pearson's) correlation between precision and hand-counted number of recalled scenes. \textbf{C.} The correlation between distinctiveness and hand-counted number of recalled scenes. \textbf{D.} The correlation between precision and the number of recalled episode events, as determined by our model. \textbf{E.} The correlation between distinctiveness and the number of recalled episode events, as determined by our model.} \label{fig:precision-distinctiveness} \end{figure} @@ -107,7 +112,7 @@ \section*{Results} Distinctiveness is intended to capture the ``specificity'' of recall. In other words, distinctiveness quantifies the extent to which a given recall event reflects the most similar episode event over and above other episode events. Intuitively, distinctiveness is like a normalized variant of our precision metric. Whereas precision solely measures how much detail about an event was captured in someone's recall, distinctiveness penalizes details that also pertain to other episode events. We define the distinctiveness of an event's recall as its precision expressed in standard deviation units with respect to other episode events. Specifically, for a given recall event, we compute the correlation between its topic vector and that of each episode event. This yields a distribution of correlation coefficients (one per episode event). 
We subtract the mean and divide by the standard deviation of this distribution to $z$-score the coefficients. The maximum value in this distribution (which, by definition, belongs to the episode event that best matches the given recall event) is that recall event's distinctiveness score. In this way, recall events that match one episode event far better than all other episode events will receive a high distinctiveness score. By contrast, a recall event that matches all episode events roughly equally will receive a comparatively low distinctiveness score. -In addition to examining how precisely and distinctively participants recalled individual events, one may also use these metrics to summarize each participant's performance by averaging across a participant's event-wise precision or distinctiveness scores. This enables us to quantify how precisely a participant tended to recall subtle within-event details, as well as how specific (distinctive) those details were to individual events from the episode. Participants' average precision and distinctiveness scores were strongly correlated ($r(15) = 0.90,~p < 0.001,~95\%~\mathrm{CI} = [0.66, 0.96]$). This indicates that participants who tended to precisely recount low-level details of episode events also tended to do so in an event-specific way (e.g., as opposed to detailing recurring themes that were present in most or all episode events; this behavior would have resulted in high precision but low distinctiveness). We found that, across participants, higher precision scores were positively correlated with the numbers of both hand-annotated scenes ($r(15) = 0.60,~p = 0.010,~95\%~\mathrm{CI} = [0.02, 0.83]$) and model-estimated events ($r(15) = 0.90,~p < 0.001,~95\%~\mathrm{CI} = [0.54, 0.96]$) that participants recalled. Participants' average distinctiveness scores were also marginally correlated with both the hand-annotated ($r(15) = 0.45,~p = 0.068,~95\%~\mathrm{CI} = [-0.21, 0.79]$) and model-estimated ($r(15) = 0.71,~p = 0.001,~95\%~\mathrm{CI} = [-0.07, 0.90]$) numbers of recalled events. +In addition to examining how precisely and distinctively participants recalled individual events, one may also use these metrics to summarize each participant's performance by averaging across a participant's event-wise precision or distinctiveness scores. This enables us to quantify how precisely a participant tended to recall subtle within-event details, as well as how specific (distinctive) those details were to individual events from the episode. Participants' average precision and distinctiveness scores were strongly correlated ($r(15) = 0.90,~p < 0.001,~95\%~\mathrm{CI} = [0.66, 0.96]$). This indicates that participants who tended to precisely recount low-level details of episode events also tended to do so in an event-specific way (e.g., as opposed to detailing recurring themes that were present in most or all episode events; this behavior would have resulted in high precision but low distinctiveness). We found that, across participants, higher precision scores were positively correlated with the numbers of both model-estimated events ($r(15) = 0.90,~p < 0.001,~95\%~\mathrm{CI} = [0.54, 0.96]$) and hand-annotated scenes ($r(15) = 0.60,~p = 0.010,~95\%~\mathrm{CI} = [0.02, 0.83]$) that participants recalled. 
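For concreteness, the event-wise and participant-level precision and distinctiveness scores can be sketched as follows. This follows the figure caption above (columns index recall events) and is an assumption-laden illustration, not the authors' implementation; in particular, the correlation-matrix orientation and the use of the population standard deviation are assumptions.

```python
import numpy as np

def precision_and_distinctiveness(corr):
    # corr: episode-by-recall event correlation matrix (rows: episode events,
    # columns: recall events), e.g. derived as in the matching sketch above.
    event_precision = corr.max(axis=0)                 # best match per recall event ("yellow boxes")
    precision = np.arctanh(event_precision).mean()     # mean of Fisher z-transformed maxima
    z = (corr - corr.mean(axis=0)) / corr.std(axis=0)  # z-score within each column
    distinctiveness = z.max(axis=0).mean()             # mean over recall events
    return precision, distinctiveness
```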
Participants' average distinctiveness scores were also correlated with their numbers of model-estimated recalled events ($r(15) = 0.71,~p = 0.001,~95\%~\mathrm{CI} = [-0.07, 0.90]$) and marginally significantly correlated with their numbers of hand-annotated recalled scenes ($r(15) = 0.45,~p = 0.068,~95\%~\mathrm{CI} = [-0.21, 0.79]$). \begin{figure}[tp] \centering @@ -128,11 +133,11 @@ \section*{Results} \begin{figure}[tp] \centering \includegraphics[width=1\textwidth]{figs/trajectory} -\caption{\small \textbf{Trajectories through topic space capture the dynamic content of the episode and recalls.} All panels: the topic proportion matrices have been projected onto a shared two-dimensional space using UMAP. \textbf{A.} The two-dimensional topic trajectory taken by the episode of \textit{Sherlock}. Each dot indicates an event identified using the HMM (see \textit{Methods}); the dot colors denote the order of the events (early events are in red; later events are in blue), and the connecting lines indicate the transitions between successive events. \textbf{B.} The average two-dimensional trajectory captured by participants' recall sequences, with the same format and coloring as the trajectory in Panel A. To compute the event positions, we matched each recalled event with an event from the original episode (see \textit{Results}), and then we averaged the positions of all events with the same label. The arrows reflect the average transition direction through topic space taken by any participants whose trajectories crossed that part of topic space; blue denotes reliable agreement across participants via a Rayleigh test ($p < 0.05$, corrected). For additional detail see \textit{Methods} and Supplementary Figure~\arrows. \textbf{C.} The recall topic trajectories (gray) taken by each individual participant (P1--P17). The episode's trajectory is shown in black for reference. Here, events (dots) are colored by their matched episode event (Panel A).} +\caption{\small \textbf{Trajectories through topic space capture the dynamic content of the episode and recalls.} All panels: the topic proportion matrices have been projected onto a shared two-dimensional space using UMAP. \textbf{A.} The two-dimensional topic trajectory taken by the episode of \textit{Sherlock}. Each dot indicates an event identified using the HMM (see \textit{Methods}); the dot colors denote the order of the events (early events are in red; later events are in blue), and the connecting lines indicate the transitions between successive events. \textbf{B.} The average two-dimensional trajectory captured by participants' recall sequences, with the same format and coloring as the trajectory in Panel A. To compute the event positions, we matched each recalled event with an event from the original episode (see \textit{Results}), and then we averaged the positions of all events with the same label. The arrows reflect the average transition direction through topic space taken by any participants whose trajectories crossed that part of topic space; blue denotes reliable agreement across participants via a Rayleigh test ($p < 0.05$, corrected). For additional detail see \textit{Methods} and Extended Data Figure~\arrows. \textbf{C.} The recall topic trajectories (gray) taken by each individual participant (P1--P17). The episode's trajectory is shown in black for reference. Here, events (dots) are colored by their matched episode event (Panel A).} \label{fig:trajectory} \end{figure} -Visual inspection of the episode and recall topic trajectories reveals a striking pattern.
First, the topic trajectory of the episode (which reflects its dynamic content; Fig.~\ref{fig:trajectory}A) is captured nearly perfectly by the averaged topic trajectories of participants' recalls (Fig.~\ref{fig:trajectory}B). To assess the consistency of these recall trajectories across participants, we asked: given that a participant's recall trajectory had entered a particular location in the reduced topic space, could the position of their next recalled event be predicted reliably? For each location in the reduced topic space, we computed the set of line segments connecting successively recalled events (across all participants) that intersected that location (see \textit{Methods}, Supp.\ Fig.~\arrows). We then computed (for each location) the distribution of angles formed by the lines defined by those line segments and a fixed reference line (the $x$-axis). Rayleigh tests revealed the set of locations in topic space at which these across-participant distributions exhibited reliable peaks (blue arrows in Fig.~\ref{fig:trajectory}B reflect significant peaks at $p < 0.05$, corrected). We observed that the locations traversed by nearly the entire episode trajectory exhibited such peaks. In other words, participants' recalls exhibited similar trajectories to each other that also matched the trajectory of the original episode (Fig.~\ref{fig:trajectory}C). This is especially notable when considering the fact that the number of HMM-identified recall events (dots in Fig.~\ref{fig:trajectory}C) varied considerably across people, and that every participant used different words to describe what they had remembered happening in the episode. Differences in the numbers of recall events appear in participants' trajectories as differences in the sampling resolution along the trajectory. We note that this framework also provides a means of disentangling classic ``proportion recalled'' measures (i.e., the proportion of episode events described in participants' recalls) from participants' abilities to recapitulate the episode's essence (i.e., the similarity between the shapes of the original episode trajectory and that defined by each participant's recounting of the episode). +Visual inspection of the episode and recall topic trajectories reveals a striking pattern. First, the topic trajectory of the episode (which reflects its dynamic content; Fig.~\ref{fig:trajectory}A) is captured nearly perfectly by the averaged topic trajectories of participants' recalls (Fig.~\ref{fig:trajectory}B). To assess the consistency of these recall trajectories across participants, we asked: given that a participant's recall trajectory had entered a particular location in the reduced topic space, could the position of their next recalled event be predicted reliably? For each location in the reduced topic space, we computed the set of line segments connecting successively recalled events (across all participants) that intersected that location (see \textit{Methods}, Extended Data Fig.~\arrows). We then computed (for each location) the distribution of angles formed by the lines defined by those line segments and a fixed reference line (the $x$-axis). Rayleigh tests revealed the set of locations in topic space at which these across-participant distributions exhibited reliable peaks (blue arrows in Fig.~\ref{fig:trajectory}B reflect significant peaks at $p < 0.05$, corrected). We observed that the locations traversed by nearly the entire episode trajectory exhibited such peaks. 
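The per-location consistency test can be sketched as follows. This is a standard Rayleigh-test approximation offered purely as an illustration; the authors' exact implementation and multiple-comparisons correction are described in the Methods.

```python
import numpy as np

def rayleigh(angles):
    # angles: transition directions (radians) of all recall-trajectory segments
    # passing through one location in the 2-D topic space.
    n = len(angles)
    mean_vec = np.exp(1j * np.asarray(angles)).mean()   # circular mean as a unit-vector average
    r_bar = np.abs(mean_vec)                             # mean resultant length (arrow length)
    z = n * r_bar ** 2
    p = np.exp(-z)        # first-order approximation to the Rayleigh p-value
    return r_bar, np.angle(mean_vec), p                  # length, mean direction, p-value
```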
In other words, participants' recalls exhibited similar trajectories to each other that also matched the trajectory of the original episode (Fig.~\ref{fig:trajectory}C). This is especially notable when considering the fact that the number of HMM-identified recall events (dots in Fig.~\ref{fig:trajectory}C) varied considerably across people, and that every participant used different words to describe what they had remembered happening in the episode. Differences in the numbers of recall events appear in participants' trajectories as differences in the sampling resolution along the trajectory. We note that this framework also provides a means of disentangling classic ``proportion recalled'' measures (i.e., the proportion of episode events described in participants' recalls) from participants' abilities to recapitulate the episode's essence (i.e., the similarity between the shapes of the original episode trajectory and that defined by each participant's recounting of the episode). In addition to enabling us to visualize the episode's high-level essence, describing the episode as a geometric trajectory also enables us to drill down to individual words and quantify how each word relates to the memorability of each event. This provides another approach to examining participants' recall for low-level details beyond the precision and distinctiveness measures we defined above. The results displayed in Figures \ref{fig:list-learning}C and \ref{fig:precision-detail}A suggest that certain events were remembered better than others. Given this, we next asked whether the events that were generally remembered precisely or imprecisely tended to reflect particular content. Because our analysis framework projects the dynamic episode content and participants' recalls into a shared space, and because the dimensions of that space represent topics (which are, in turn, sets of weights over known words in the vocabulary), we are able to recover the weighted combination of words that make up any point (i.e., topic vector) in this space. We first computed the average precision with which participants recalled each of the 30 episode events (Fig.~\ref{fig:topics}A; note that this result is analogous to a serial position curve created from our precision metric). We then computed a weighted average of the topic vectors for each episode event, where the weights reflected how precisely each event was recalled. To visualize the result, we created a ``wordle'' image~\citep{MuelEtal18} where words weighted more heavily by more precisely remembered topics appear in a larger font (Fig.~\ref{fig:topics}B, green box). Across the full episode, content that weighted heavily on topics and words central to the major foci of the episode (e.g., the names of the two main characters, ``Sherlock'' and ``John,'' and the address of a major recurring location, ``221B Baker Street'') was best remembered. An analogous analysis revealed which themes were less-precisely remembered. Here, in computing the weighted average over events' topic vectors, we weighted each event in inverse proportion to its average precision (Fig.~\ref{fig:topics}B, red box). The least precisely remembered episode content reflected information that was extraneous to the episode's essence, such as the proper names of relatively minor characters (e.g., ``Mike,'' ``Molly,'' and ``Lestrade'') and locations (e.g., ``St. Bartholomew's Hospital''). 
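As a sketch of how the word-level weights behind these images can be computed (an illustration under assumptions, not the authors' code: `topic_word` stands for the fitted topic model's topic-by-word weight matrix, and using the reciprocal of precision for the "least precisely remembered" weighting is a guess at "inverse proportion"):

```python
import numpy as np

def wordle_weights(event_topics, event_precision, topic_word, least_precise=False):
    # event_topics: (n_events, n_topics) episode event topic vectors
    # event_precision: (n_events,) average precision with which each event was recalled
    # topic_word: (n_topics, n_words) word weights per topic (e.g., a row-normalized
    #             LatentDirichletAllocation components_ matrix)
    w = 1.0 / event_precision if least_precise else event_precision
    w = w / w.sum()                       # normalized weights over events
    weighted_topics = w @ event_topics    # precision-weighted average topic vector
    return weighted_topics @ topic_word   # one weight per vocabulary word (font size)
```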
@@ -204,7 +209,7 @@ \subsubsection*{Segmenting topic proportions matrices into discrete events using \[ \argmax_K \left[W_{1}(a, b)\right], \] -where $a$ was the distribution of within-state topic vector correlations, and $b$ was the distribution of across-state topic vector correlations . We computed the first Wasserstein distance ($W_{1}$, also known as ``Earth mover's distance''~\citep{Dobr70, RamdEtal17}) between these distributions for a large range of possible $K$-values (range [2, 50]), and selected the $K$ that yielded the maximum value. Figure~\ref{fig:model}B displays the event boundaries returned for the episode, and Supplementary Figure~\corrmats~displays the event boundaries returned for each participant's recalls. See Supplementary Figure \kopt~for the optimization functions for the episode and recalls. After obtaining these event boundaries, we created stable estimates of the content represented in each event by averaging the topic vectors across timepoints between each pair of event boundaries. This yielded a number-of-events by number-of-topics matrix for the episode and recalls from each participant. +where $a$ was the distribution of within-state topic vector correlations, and $b$ was the distribution of across-state topic vector correlations. We computed the first Wasserstein distance ($W_{1}$, also known as ``Earth mover's distance''~\citep{Dobr70, RamdEtal17}) between these distributions for a large range of possible $K$-values (range [2, 50]), and selected the $K$ that yielded the maximum value. Figure~\ref{fig:model}B displays the event boundaries returned for the episode, and Extended Data Figure~\corrmats~displays the event boundaries returned for each participant's recalls. See Extended Data Figure~\kopt~for the optimization functions for the episode and recalls. After obtaining these event boundaries, we created stable estimates of the content represented in each event by averaging the topic vectors across timepoints between each pair of event boundaries. This yielded a number-of-events by number-of-topics matrix for the episode and recalls from each participant. \subsubsection*{Naturalistic extensions of classic list-learning analyses} In traditional list-learning experiments, participants view a list of items (e.g., words) and then recall the items later. Our episode-recall event matching approach affords us the ability to analyze memory in a similar way. The episode and recall events can be treated analogously to studied and recalled ``items'' in a list-learning study. We can then extend classic analyses of memory performance and dynamics (originally designed for list-learning experiments) to the more naturalistic episode recall task used in this study. @@ -230,12 +235,12 @@ \subsubsection*{Visualizing the episode and recall topic trajectories} We optimized the manifold space for visualization based on two criteria: First, that the 2D embedding of the episode trajectory should reflect its original 100-dimensional structure as faithfully as possible. Second, that the path traversed by the embedded episode trajectory should intersect itself a minimal number of times. The first criterion helps bolster the validity of visual intuitions about relationships between sections of episode content, based on their locations in the embedding space. The second criterion was motivated by the observed low off-diagonal values in the episode trajectory's temporal correlation matrix (suggesting that the same topic-space coordinates should not be revisited; see Fig.~2A).
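To make these two criteria concrete, here is a hedged sketch of how candidate embeddings might be scored (it assumes the umap-learn package; the distance-preservation measure, the intersection test, and the search over random seeds are illustrative choices, not necessarily the authors'):

```python
import numpy as np
import umap
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def count_self_intersections(path):
    # Count crossings between non-adjacent segments of the 2-D trajectory
    # (a general-position test; collinear touches are ignored).
    def crosses(p, q, r, s):
        o = lambda a, b, c: np.sign((b[0] - a[0]) * (c[1] - a[1])
                                    - (b[1] - a[1]) * (c[0] - a[0]))
        return o(p, q, r) != o(p, q, s) and o(r, s, p) != o(r, s, q)
    segments = list(zip(path[:-1], path[1:]))
    return sum(crosses(*segments[i], *segments[j])
               for i in range(len(segments))
               for j in range(i + 2, len(segments)))

def score_embedding(event_topics, seed):
    # event_topics: (n_events, 100) episode event topic vectors (placeholder name)
    embedding = umap.UMAP(n_components=2, random_state=seed).fit_transform(event_topics)
    rho, _ = spearmanr(pdist(event_topics), pdist(embedding))  # criterion 1: faithfulness
    crossings = count_self_intersections(embedding)            # criterion 2: self-intersections
    return embedding, rho, crossings
```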
For further details on how we created this low-dimensional embedding space, see \textit{Supplementary Information}. \subsubsection*{Estimating the consistency of flow through topic space across participants} -In Figure~\ref{fig:trajectory}B, we present an analysis aimed at characterizing locations in topic space that different participants move through in a consistent way (via their recall topic trajectories; also see Supp.\ Fig.~\arrows). The two-dimensional topic space used in our visualizations (Fig.~\ref{fig:trajectory}) comprised a $60 \times 60$ (arbitrary units) square. We tiled this space with a $50 \times 50$ grid of evenly spaced vertices, and defined a circular area centered on each vertex whose radius was two times the distance between adjacent vertices (i.e., 2.4 units). For each vertex, we examined the set of line segments formed by connecting each pair successively recalled events, across all participants, that passed through this circle. We computed the distribution of angles formed by those segments and the $x$-axis, and used a Rayleigh test to determine whether the distribution of angles was reliably ``peaked'' (i.e., consistent across all transitions that passed through that local portion of topic space). To create Figure~\ref{fig:trajectory}B, we drew an arrow originating from each grid vertex, pointing in the direction of the average angle formed by the line segments that passed within 2.4 units. We set the arrow lengths to be inversely proportional to the $p$-values of the Rayleigh tests at each vertex. Specifically, for each vertex we converted all of the angles of segments that passed within 2.4 units to unit vectors, and we set the arrow lengths at each vertex proportional to the length of the (circular) mean vector. We also indicated any significant results ($p < 0.05$, corrected using the Benjamani-Hochberg procedure) by coloring the arrows in blue (darker blue denotes a lower $p$-value, i.e., a longer mean vector); all tests with $p \geq 0.05$ are displayed in gray and given a lower opacity value. +In Figure~\ref{fig:trajectory}B, we present an analysis aimed at characterizing locations in topic space that different participants move through in a consistent way (via their recall topic trajectories; also see Extended Data Fig.~\arrows). The two-dimensional topic space used in our visualizations (Fig.~\ref{fig:trajectory}) comprised a $60 \times 60$ (arbitrary units) square. We tiled this space with a $50 \times 50$ grid of evenly spaced vertices, and defined a circular area centered on each vertex whose radius was two times the distance between adjacent vertices (i.e., 2.4 units). For each vertex, we examined the set of line segments formed by connecting each pair of successively recalled events, across all participants, that passed through this circle. We computed the distribution of angles formed by those segments and the $x$-axis, and used a Rayleigh test to determine whether the distribution of angles was reliably ``peaked'' (i.e., consistent across all transitions that passed through that local portion of topic space). To create Figure~\ref{fig:trajectory}B, we drew an arrow originating from each grid vertex, pointing in the direction of the average angle formed by the line segments that passed within 2.4 units. We set the arrow lengths to be inversely proportional to the $p$-values of the Rayleigh tests at each vertex.
Specifically, for each vertex we converted all of the angles of segments that passed within 2.4 units to unit vectors, and we set the arrow lengths at each vertex proportional to the length of the (circular) mean vector. We also indicated any significant results ($p < 0.05$, corrected using the Benjamini-Hochberg procedure) by coloring the arrows in blue (darker blue denotes a lower $p$-value, i.e., a longer mean vector); all tests with $p \geq 0.05$ are displayed in gray and given a lower opacity value. \subsection*{Searchlight fMRI analyses} In Figure~\ref{fig:brainz}, we present two analyses aimed at identifying brain regions whose responses (as participants viewed the episode) exhibited a particular temporal structure. We developed a searchlight analysis wherein we constructed a $5 \times 5 \times 5$ cube of voxels centered on each voxel in the brain~\citep{ChenEtal17}, and for each of these cubes, computed the temporal correlation matrix of the voxel responses during episode viewing. Specifically, for each of the 1976 volumes collected during episode viewing, we correlated the activity patterns in the given cube with the activity patterns (in the same cube) collected during every other timepoint. This yielded a $1976 \times 1976$ correlation matrix for each cube. Note: participant 5's scan ended 75s early, and in Chen et al. (2017)~\cite{ChenEtal17}'s publicly released dataset, their scan data was zero-padded to match the length of the other participants'. For our searchlight analyses, we removed this padded data (i.e., the last 50 TRs), resulting in a $1925 \times 1925$ correlation matrix for each cube in participant 5's brain. -Next, we constructed a series of ``template'' matrices. The first template reflected the timecourse of the episode's topic proportions matrix, and the others reflected the timecourse of each participant's recall topic proportions matrix. To construct the episode template, we computed the correlations between the topic proportions estimated for every pair of TRs (prior to segmenting the topic proportions matrices into discrete events; i.e., the correlation matrix shown in Figs.~\ref{fig:model}B and \ref{fig:brainz}A). We constructed similar temporal correlation matrices for each participant's recall topic proportions matrix (Fig.~\ref{fig:model}D, Supp.\ Fig.~\corrmats). However, to correct for length differences and potential non-linear transformations between viewing time and recall time, we first used dynamic time warping~\citep{BernClif94} to temporally align participants' recall topic proportions matrices with the episode topic proportions matrix. An example correlation matrix before and after warping is shown in Fig.~\ref{fig:brainz}B. This yielded a $1976 \times 1976$ correlation matrix for the episode template and for each participant's recall template. +Next, we constructed a series of ``template'' matrices. The first template reflected the timecourse of the episode's topic proportions matrix, and the others reflected the timecourse of each participant's recall topic proportions matrix. To construct the episode template, we computed the correlations between the topic proportions estimated for every pair of TRs (prior to segmenting the topic proportions matrices into discrete events; i.e., the correlation matrix shown in Figs.~\ref{fig:model}B and \ref{fig:brainz}A). We constructed similar temporal correlation matrices for each participant's recall topic proportions matrix (Fig.~\ref{fig:model}D, Extended Data Fig.~\corrmats).
However, to correct for length differences and potential non-linear transformations between viewing time and recall time, we first used dynamic time warping~\citep{BernClif94} to temporally align participants' recall topic proportions matrices with the episode topic proportions matrix. An example correlation matrix before and after warping is shown in Fig.~\ref{fig:brainz}B. This yielded a $1976 \times 1976$ correlation matrix for the episode template and for each participant's recall template. The temporal structure of the episode's content (as described by our model) is captured in the block-diagonal structure of the episode's temporal correlation matrix (e.g., Figs.~\ref{fig:model}B, \ref{fig:brainz}A), with time periods of thematic stability represented as dark blocks of varying sizes. Inspecting the episode correlation matrix suggests that the episode's semantic content is highly temporally specific (i.e., the correlations between topic vectors from distant timepoints are almost all near zero). By contrast, the activity patterns of individual (cubes of) voxels can encode relatively limited information on their own, and their activity frequently contributes to multiple separate functions \citep{FreeEtal01, SigmDeha08, CharKoec10, RishEtal13}. By nature, these two attributes give rise to similarities in activity across large timescales that may not necessarily reflect a single task. To identify brain regions whose shifts in activity patterns mirrored shifts in the semantic content of the episode or recalls, we restricted the temporal correlations we considered to the timescale of semantic information captured by our model. Specifically, we isolated the upper triangle of the episode correlation matrix and created a ``proximal correlation mask'' that included only diagonals from the upper triangle of the episode correlation matrix up to the first diagonal that contained no positive correlations. Applying this mask to the full episode correlation matrix was equivalent to excluding diagonals beyond the corner of the largest diagonal block. In other words, the timescale of temporal correlations we considered corresponded to the longest period of thematic stability in the episode, and by extension the longest period of thematic stability in participants' recalls and the longest period of stability we might expect to see in voxel activity arising from processing or encoding episode content. Figure \ref{fig:brainz} shows this proximal correlation mask applied to the temporal correlation matrices for the episode, an example participant's (warped) recall, and an example cube of voxels from our searchlight analyses. @@ -249,10 +254,18 @@ \subsection*{Neurosynth decoding analyses} \texttt{Neurosynth}~\citep{YarkEtal11} parses a massive online database of over 14000 neuroimaging studies and constructs meta-analysis images for over 13000 psychology- and neuroscience-related terms, based on NIfTI images accompanying studies where those terms appear at a high frequency. Given a novel image (tagged with its value type; e.g., $z$-, $t$-, $F$- or $p$-statistics), \texttt{Neurosynth} returns a list of terms whose meta-analysis images are most similar. Our permutation procedure yielded, for each of the two searchlight analyses, a voxelwise map of $z$-values. These maps describe the extent to which each voxel specifically reflected the temporal structure of the episode or individuals' recalls (i.e., relative to the null distributions of phase-shifted values). 
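The searchlight comparison described in this section can be sketched in a few lines of code. The snippet below is an illustrative reimplementation under explicit assumptions (not the authors' code): the proximal mask keeps off-diagonals up to the first one containing no positive correlations, the cube-template comparison is assumed to be a Pearson correlation over the masked entries, and a random circular shift of the cube's timeseries stands in for the phase-shifting procedure used to build the null distribution; all names are hypothetical.

```python
import numpy as np

def proximal_mask(template_corr):
    """Upper-triangle mask keeping only off-diagonals up to the first
    diagonal of the template correlation matrix with no positive values."""
    n = template_corr.shape[0]
    cutoff = n - 1
    for k in range(1, n):
        if not (np.diagonal(template_corr, offset=k) > 0).any():
            cutoff = k
            break
    mask = np.zeros((n, n), dtype=bool)
    for k in range(1, cutoff):
        idx = np.arange(n - k)
        mask[idx, idx + k] = True
    return mask

def masked_similarity(cube_timeseries, template_corr, mask):
    """Correlate a voxel cube's timepoint-by-timepoint correlation matrix
    with a template, restricted to the masked (proximal) entries."""
    cube_corr = np.corrcoef(cube_timeseries)  # (timepoints x voxels) -> timepoints x timepoints
    return np.corrcoef(cube_corr[mask], template_corr[mask])[0, 1]

def shifted_null_z(cube_timeseries, template_corr, mask, n_shifts=100, seed=0):
    """z-score the observed similarity against similarities computed after
    circularly shifting the cube's timeseries (a simple stand-in for the
    phase-shifted null distributions described in the text)."""
    rng = np.random.default_rng(seed)
    observed = masked_similarity(cube_timeseries, template_corr, mask)
    n_timepoints = cube_timeseries.shape[0]
    null = np.array([masked_similarity(np.roll(cube_timeseries,
                                                rng.integers(1, n_timepoints),
                                                axis=0),
                                        template_corr, mask)
                     for _ in range(n_shifts)])
    return (observed - null.mean()) / null.std()
```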
We inputted the two statistical maps described above to \texttt{Neurosynth} to create a list of the 10 most representative terms for each map. \section*{Data availability} -The fMRI data we analyzed are available online \href{http://dataspace.princeton.edu/jspui/handle/88435/dsp01nz8062179}{\underline{here}}. The behavioral data is available \href{https://github.com/ContextLab/sherlock-topic-model-paper}{\underline{here}}. +The fMRI data we analyzed are available online at: + +https://dataspace.princeton.edu/jspui/handle/88435/dsp01nz8062179 + +\noindent The behavioral data is available at: + +https://github.com/ContextLab/sherlock-topic-model-paper/tree/master/data/raw \section*{Code availability} -All of our analysis code may be downloaded \href{https://github.com/ContextLab/sherlock-topic-model-paper}{\underline{here}}. +All of our analysis code may be downloaded from: + +https://github.com/ContextLab/sherlock-topic-model-paper %\bibliographystyle{naturemag} %\bibliography{CDL-bibliography/memlab} @@ -819,5 +832,4 @@ \section*{Author contributions} \section*{Competing interests} The authors declare no competing interests. - \end{document} diff --git a/paper/old.tex b/paper/old.tex index a978077..556f3cd 100644 --- a/paper/old.tex +++ b/paper/old.tex @@ -3,7 +3,7 @@ \usepackage[english]{babel} \usepackage[font=small,labelfont=bf]{caption} \usepackage{geometry} -\usepackage[sort]{natbib} +\usepackage[sort&compress, numbers, super]{natbib} \usepackage{pxfonts} \usepackage{graphicx} \usepackage{newfloat} @@ -11,24 +11,25 @@ \usepackage{hyperref} \usepackage{lineno} \usepackage{placeins} +\usepackage[nofiglist, fighead]{endfloat} +\renewcommand{\includegraphics}[2][]{} \newcommand{\argmax}{\mathop{\mathrm{argmax}}\limits} -\newcommand{\topicopt}{S1} -\newcommand{\topics}{S2} -\newcommand{\featureimportance}{S3} -\newcommand{\corrmats}{S4} -\newcommand{\matchmats}{S5} -\newcommand{\kopt}{S6} +\newcommand{\topicopt}{1} +\newcommand{\topics}{2} +\newcommand{\featureimportance}{3} +\newcommand{\corrmats}{5} +\newcommand{\matchmats}{6} +\newcommand{\kopt}{7} +\newcommand{\arrows}{4} \doublespacing \linenumbers -\title{Geometric models reveal behavioral and neural signatures of transforming naturalistic experiences into episodic memories} +\title{Geometric models reveal behavioral and neural signatures of transforming experiences into memories} -\author{Andrew C. Heusser\textsuperscript{1, 2, \textdagger}, Paxton C. Fitzpatrick\textsuperscript{1, \textdagger}, and Jeremy R. Manning\textsuperscript{1, *}\\\textsuperscript{1}Department of Psychological and Brain Sciences\\Dartmouth College, Hanover, NH 03755, USA\\\textsuperscript{2}Akili Interactive Labs\\Boston, MA 02110\\\textsuperscript{\textdagger}Denotes equal contribution\\\textsuperscript{*}Corresponding author: Jeremy.R.Manning@Dartmouth.edu} - -\bibliographystyle{apa} +\author{Andrew C. Heusser\textsuperscript{1, 2, \textdagger}, Paxton C. Fitzpatrick\textsuperscript{1, \textdagger}, and Jeremy R. 
Manning\textsuperscript{1, *}\\\textsuperscript{1}Department of Psychological and Brain Sciences\\Dartmouth College, Hanover, NH 03755, USA\\\textsuperscript{2}Akili Interactive Labs\\Boston, MA 02110, USA\\\textsuperscript{\textdagger}Denotes equal contribution\\\textsuperscript{*}Corresponding author: Jeremy.R.Manning@Dartmouth.edu} \begin{document} \begin{titlepage} @@ -37,101 +38,103 @@ \end{titlepage} \begin{abstract} -The mental contexts in which we interpret experiences are often person-specific, even when the experiences themselves are shared. We developed a geometric framework for mathematically characterizing the subjective conceptual content of dynamic naturalistic experiences. We model experiences and memories as \textit{trajectories} through word embedding spaces whose coordinates reflect the universe of thoughts under consideration. Memory encoding can then be modeled as geometrically preserving or distorting the \textit{shape} of the original experience. We applied our approach to data collected as participants watched and verbally recounted a television episode while undergoing functional neuroimaging. Participants’ recountings all preserved coarse spatial properties (essential narrative elements), but not fine spatial scale (low-level) details, of the episode’s trajectory. We also identified networks of brain structures sensitive to these trajectory shapes. Our work provides insights into how we preserve and distort our ongoing experiences when we encode them into episodic memories. +The mental contexts in which we interpret experiences are often person-specific, even when the experiences themselves are shared. We developed a geometric framework for mathematically characterizing the subjective conceptual content of dynamic naturalistic experiences. We model experiences and memories as ``trajectories'' through word embedding spaces whose coordinates reflect the universe of thoughts under consideration. Memory encoding can then be modeled as geometrically preserving or distorting the ``shape'' of the original experience. We applied our approach to data collected as participants watched and verbally recounted a television episode while undergoing functional neuroimaging. Participants’ recountings all preserved coarse spatial properties (essential narrative elements), but not fine spatial scale (low-level) details, of the episode’s trajectory. We also identified networks of brain structures sensitive to these trajectory shapes. Our work provides insights into how we preserve and distort our ongoing experiences when we encode them into episodic memories. \end{abstract} \section*{Introduction} -What does it mean to \textit{remember} something? In traditional episodic memory experiments \citep[e.g., list-learning or trial-based experiments;][]{Murd62a, Kaha96}, remembering is often cast as a discrete, binary operation: each studied item may be separated from the rest of one's experience and labeled as having been either recalled or forgotten. More nuanced studies might incorporate self-reported confidence measures as a proxy for memory strength, or ask participants to discriminate between recollecting the (contextual) details of an experience and having a general feeling of familiarity~\citep{Yone02}. Using well-controlled, trial-based experimental designs, the field has amassed a wealth of information regarding human episodic memory~\citep[for review see][]{Kaha12}. 
However, there are fundamental properties of the external world and our memories that trial-based experiments are not well suited to capture~\citep[for review, also see][]{KoriGold94, HukEtal18}. First, our experiences and memories are continuous, rather than discrete---isolating a naturalistic event from the context in which it occurs can substantially change its meaning. Second, whether or not the rememberer has precisely reproduced a specific set of words in describing a given experience is nearly orthogonal to how well they were actually able to remember it. In classic (e.g., list-learning) memory studies, by contrast, the number or proportion of \textit{exact} recalls is often considered to be a primary metric for assessing the quality of participants' memories. Third, one might remember the essence (or a general summary) of an experience but forget (or neglect to recount) particular low-level details. Capturing the essence of what happened is often a main goal of recounting an episodic memory to a listener, whereas the inclusion of specific low-level details is often less pertinent. +What does it mean to remember something? In traditional episodic memory experiments (e.g., list-learning or trial-based experiments\citep{Murd62a, Kaha96}), remembering is often cast as a discrete, binary operation: each studied item may be separated from the rest of one's experience and labeled as having been either recalled or forgotten. More nuanced studies might incorporate self-reported confidence measures as a proxy for memory strength, or ask participants to discriminate between recollecting the (contextual) details of an experience and having a general feeling of familiarity~\citep{Yone02}. Using well-controlled, trial-based experimental designs, the field has amassed a wealth of information regarding human episodic memory~\citep{Kaha12}. However, there are fundamental properties of the external world and our memories that trial-based experiments are not well suited to capture~\citep{KoriGold94, HukEtal18}. First, our experiences and memories are continuous, rather than discrete---isolating a naturalistic event from the context in which it occurs can substantially change its meaning. Second, whether or not the rememberer has precisely reproduced a specific set of words in describing a given experience is nearly orthogonal to how well they were actually able to remember it. In classic (e.g., list-learning) memory studies, by contrast, the number or proportion of exact recalls is often considered to be a primary metric for assessing the quality of participants' memories. Third, one might remember the essence (or a general summary) of an experience but forget (or neglect to recount) particular low-level details. Capturing the essence of what happened is often a main goal of recounting an episodic memory to a listener, whereas the inclusion of specific low-level details is often less pertinent. -How might we formally characterize the \textit{essence} of an experience, and whether it has been recovered by the rememberer? And how might we distinguish an experience's overarching essence from its low-level details? One approach is to start by considering some fundamental properties of the dynamics of our experiences. Each given moment of an experience tends to derive meaning from surrounding moments, as well as from longer-range temporal associations~\citep{LernEtal11, Mann19, Mann20}. Therefore, the timecourse describing how an event unfolds is fundamental to its overall meaning. 
Further, this hierarchy formed by our subjective experiences at different timescales defines a \textit{context} for each new moment~\citep[e.g.,][]{HowaKaha02a, HowaEtal14}, and plays an important role in how we interpret that moment and remember it later~\citep[for review see][]{MannEtal15, Mann20}. Our memory systems can leverage these associations to form predictions that help guide our behaviors~\citep{RangRitc12}. For example, as we navigate the world, the features of our subjective experiences tend to change gradually (e.g., the room or situation we find ourselves in at any given moment is strongly temporally autocorrelated), allowing us to form stable estimates of our current situation and behave accordingly~\citep{ZackEtal07, ZwaaRadv98}. +How might we formally characterize the ``essence'' of an experience, and whether it has been recovered by the rememberer? And how might we distinguish an experience's overarching essence from its low-level details? One approach is to start by considering some fundamental properties of the dynamics of our experiences. Each given moment of an experience tends to derive meaning from surrounding moments, as well as from longer-range temporal associations~\citep{LernEtal11, Mann19, Mann20}. Therefore, the timecourse describing how an event unfolds is fundamental to its overall meaning. Further, this hierarchy formed by our subjective experiences at different timescales defines a context for each new moment~\citep{HowaKaha02a, HowaEtal14}, and plays an important role in how we interpret that moment and remember it later~\citep{MannEtal15, Mann20}. Our memory systems can leverage these associations to form predictions that help guide our behaviors~\citep{RangRitc12}. For example, as we navigate the world, the features of our subjective experiences tend to change gradually (e.g., the room or situation we find ourselves in at any given moment is strongly temporally autocorrelated), allowing us to form stable estimates of our current situation and behave accordingly~\citep{ZackEtal07, ZwaaRadv98}. -Occasionally, this gradual drift of our ongoing experience is punctuated by sudden changes, or shifts~\citep[e.g., when we walk through a doorway;][]{RadvZack17}. Prior research suggests that these sharp transitions (termed \textit{event boundaries}) help to discretize our experiences (and their mental representations) into \textit{events}~\citep{RadvZack17, BrunEtal18, HeusEtal18b, ClewDava17, EzzyDava11, DuBrDava13}. The interplay between the stable (within-event) and transient (across-event) temporal dynamics of an experience also provides a potential framework for transforming experiences into memories that distills those experiences down to their essences. For example, prior work has shown that event boundaries can influence how we learn sequences of items~\citep{HeusEtal18b, DuBrDava13}, navigate~\citep{BrunEtal18}, and remember and understand narratives~\citep{ZwaaRadv98, EzzyDava11}. This work also suggests a means of distinguishing the essence of an experience from its low-level details: The overall structure of events and event transitions reflects how the high-level experience unfolds (i.e., its essence), while subtler event-level properties reflect its low-level details. Prior research has also implicated a network of brain regions (including the hippocampus and the medial prefrontal cortex) in playing a critical role in transforming experiences into structured and consolidated memories ~\citep{TompDava17}. 
+Occasionally, this gradual drift of our ongoing experience is punctuated by sudden changes, or shifts (e.g., when we walk through a doorway~\citep{RadvZack17}). Prior research suggests that these sharp transitions (termed ``event boundaries'') help to discretize our experiences (and their mental representations) into ``events''~\citep{RadvZack17, BrunEtal18, HeusEtal18b, ClewDava17, EzzyDava11, DuBrDava13}. The interplay between the stable (within-event) and transient (across-event) temporal dynamics of an experience also provides a potential framework for transforming experiences into memories that distills those experiences down to their essences. For example, prior work has shown that event boundaries can influence how we learn sequences of items~\citep{HeusEtal18b, DuBrDava13}, navigate~\citep{BrunEtal18}, and remember and understand narratives~\citep{ZwaaRadv98, EzzyDava11}. This work also suggests a means of distinguishing the essence of an experience from its low-level details: The overall structure of events and event transitions reflects how the high-level experience unfolds (i.e., its essence), while subtler event-level properties reflect its low-level details. Prior research has also implicated a network of brain regions (including the hippocampus and the medial prefrontal cortex) in playing a critical role in transforming experiences into structured and consolidated memories ~\citep{TompDava17}. -Here, we sought to examine how the temporal dynamics of a naturalistic experience were later reflected in participants' memories. We also sought to leverage the above conceptual insights into the distinctions between an experience's essence and its low-level details to build models that explicitly quantified these distinctions. We analyzed an open dataset that comprised behavioral and functional Magnetic Resonance Imaging (fMRI) data collected as participants viewed and then verbally recounted an episode of the BBC television show \textit{Sherlock}~\citep{ChenEtal17}. We developed a computational framework for characterizing the temporal dynamics of the moment-by-moment content of the episode and of participants' verbal recalls. Our framework uses topic modeling~\citep{BleiEtal03} to characterize the thematic conceptual (semantic) content present in each moment of the episode and recalls by projecting each moment into a word embedding space. We then use hidden Markov models~\citep{Rabi89, BaldEtal17} to discretize this evolving semantic content into events. In this way, we cast both naturalistic experiences and memories of those experiences as geometric \textit{trajectories} through word embedding space that describe how they evolve over time. Under this framework, successful remembering entails verbally traversing the content trajectory of the episode, thereby reproducing the shape (essence) of the original experience. Our framework captures the episode's essence in the sequence of geometric coordinates for its events, and its low-level details by examining its within-event geometric properties. +Here, we sought to examine how the temporal dynamics of a naturalistic experience were later reflected in participants' memories. We also sought to leverage the above conceptual insights into the distinctions between an experience's essence and its low-level details to build models that explicitly quantified these distinctions. 
We analyzed an open dataset that comprised behavioral and functional Magnetic Resonance Imaging (fMRI) data collected as participants viewed and then verbally recounted an episode of the BBC television show \textit{Sherlock}~\citep{ChenEtal17}. We developed a computational framework for characterizing the temporal dynamics of the moment-by-moment content of the episode and of participants' verbal recalls. Our framework uses topic modeling~\citep{BleiEtal03} to characterize the thematic conceptual (semantic) content present in each moment of the episode and recalls by projecting each moment into a word embedding space. We then use hidden Markov models~\citep{Rabi89, BaldEtal17} to discretize this evolving semantic content into events. In this way, we cast both naturalistic experiences and memories of those experiences as geometric ``trajectories'' through word embedding space that describe how they evolve over time. Under this framework, successful remembering entails verbally traversing the content trajectory of the episode, thereby reproducing the shape (essence) of the original experience. Our framework captures the episode's essence in the sequence of geometric coordinates for its events, and its low-level details by examining its within-event geometric properties. -Comparing the overall shapes of the topic trajectories for the episode and participants' recalls reveals which aspects of the episode's essence were preserved (or lost) in the translation into memory. We also develop two metrics for assessing participants' memories for low-level details: (1) the \textit{precision} with which a participant recounts details about each event, and (2) the \textit{distinctiveness} of their recall for each event, relative to other events. We examine how these metrics relate to overall memory performance as judged by third-party human annotators. We also compare and contrast our general approach to studying memory for naturalistic experiences with standard metrics for assessing performance on more traditional memory tasks, such as list-learning. Last, we leverage our framework to identify networks of brain structures whose responses (as participants watched the episode) reflected the temporal dynamics of the episode and/or how participants would later recount it. +Comparing the overall shapes of the topic trajectories for the episode and participants' recalls reveals which aspects of the episode's essence were preserved (or lost) in the translation into memory. We also develop two metrics for assessing participants' memories for low-level details: (1) the ``precision'' with which a participant recounts details about each event, and (2) the ``distinctiveness'' of their recall for each event, relative to other events. We examine how these metrics relate to overall memory performance as judged by third-party human annotators. We also compare and contrast our general approach to studying memory for naturalistic experiences with standard metrics for assessing performance on more traditional memory tasks, such as list-learning. Last, we leverage our framework to identify networks of brain structures whose responses (as participants watched the episode) reflected the temporal dynamics of the episode and/or how participants would later recount it. \section*{Results} -To characterize the dynamic content of the \textit{Sherlock} episode and participants' subsequent recountings, we used a topic model~\citep{BleiEtal03} to discover the episode's latent themes. 
Topic models take as inputs a vocabulary of words to consider and a collection of text documents, and return two output matrices. The first of these is a \textit{topics matrix} whose rows are \textit{topics} (or latent themes) and whose columns correspond to words in the vocabulary. The entries in the topics matrix reflect how each word in the vocabulary is weighted by each discovered topic. For example, a detective-themed topic might weight heavily on words like ``crime,'' and ``search.'' The second output is a \textit{topic proportions matrix} with one row per document and one column per topic. The topic proportions matrix describes the mixture of discovered topics reflected in each document. +To characterize the dynamic content of the \textit{Sherlock} episode and participants' subsequent recountings, we used a topic model~\citep{BleiEtal03} to discover the episode's latent themes. Topic models take as inputs a vocabulary of words to consider and a collection of text documents, and return two output matrices. The first of these is a ``topics matrix'' whose rows are ``topics'' (or latent themes) and whose columns correspond to words in the vocabulary. The entries in the topics matrix reflect how each word in the vocabulary is weighted by each discovered topic. For example, a detective-themed topic might weight heavily on words like ``crime,'' and ``search.'' The second output is a ``topic proportions matrix'' with one row per document and one column per topic. The topic proportions matrix describes the mixture of discovered topics reflected in each document. -\cite{ChenEtal17} collected hand-annotated information about each of 1,000 (manually delineated) time segments spanning the roughly 50 minute video used in their study. Each annotation included: a brief narrative description of what was happening, the location where the action took place, the names of any characters on the screen, and other similar details (for a full list of annotated features, see \textit{Methods}). We took the union of all unique words (excluding stop words, such as ``and,'' ``or,'' ``but,'' etc.) across all features from all annotations as the vocabulary for the topic model. We then concatenated the sets of words across all features contained in overlapping sliding windows of (up to) 50 annotations, and treated each window as a single document for the purpose of fitting the topic model. Next, we fit a topic model with (up to) $K = 100$ topics to this collection of documents. We found that 32 unique topics (with non-zero weights) were sufficient to describe the time-varying content of the episode (see \textit{Methods}; Figs.~\ref{fig:schematic}, \topics). We note that our approach is similar in some respects to Dynamic Topic Models~\citep{BleiLaff06} in that we sought to characterize how the thematic content of the episode evolved over time. However, whereas Dynamic Topic Models are designed to characterize how the properties of \textit{collections} of documents change over time, our sliding window approach allows us to examine the topic dynamics within a single document (or video). Specifically, our approach yielded (via the topic proportions matrix) a single \textit{topic vector} for each sliding window of annotations transformed by the topic model. We then stretched (interpolated) the resulting windows-by-topics matrix to match the time series of the 1,976 fMRI volumes collected as participants viewed the episode. +Chen et al. 
(2017) collected hand-annotated information about each of 1000 (manually delineated) time segments spanning the roughly 50 minute video used in their study~\cite{ChenEtal17}. Each annotation included: a brief narrative description of what was happening, the location where the action took place, the names of any characters on the screen, and other similar details (for a full list of annotated features, see \textit{Methods}). We took the union of all unique words (excluding stop words, such as ``and,'' ``or,'' ``but,'' etc.) across all features from all annotations as the vocabulary for the topic model. We then concatenated the sets of words across all features contained in overlapping sliding windows of (up to) 50 annotations, and treated each window as a single document for the purpose of fitting the topic model. Next, we fit a topic model with (up to) $K = 100$ topics to this collection of documents. We found that 32 unique topics (with non-zero weights) were sufficient to describe the time-varying content of the episode (see \textit{Methods}; Fig.~\ref{fig:schematic}, Supp.\ Fig.~\topics). We note that our approach is similar in some respects to Dynamic Topic Models~\citep{BleiLaff06} in that we sought to characterize how the thematic content of the episode evolved over time. However, whereas Dynamic Topic Models are designed to characterize how the properties of collections of documents change over time, our sliding window approach allows us to examine the topic dynamics within a single document (or video). Specifically, our approach yielded (via the topic proportions matrix) a single ``topic vector'' for each sliding window of annotations transformed by the topic model. We then stretched (interpolated) the resulting windows-by-topics matrix to match the time series of the 1976 fMRI volumes collected as participants viewed the episode. \begin{figure}[tp] \centering \includegraphics[width=1\textwidth]{figs/schematic} -\caption{\small \textbf{Topic weights in episode and recall content.} We used detailed, hand-generated annotations describing each manually identified time segment from the episode to fit a topic model. Three example frames from the episode (first row) are displayed, along with their descriptions from the corresponding episode annotation (second row) and an example participant's recall transcript (third row). We used the topic model (fit to the episode annotations) to estimate topic vectors for each moment of the episode and each sentence of participants' recalls. Example topic vectors are displayed in the bottom row (blue: episode annotations; green: example participant's recalls). Three topic dimensions are shown (the highest-weighted topics for each of the three example scenes, respectively), along with the 10 highest-weighted words for each topic. Figure~\topics~provides a full list of the top 10 words from each of the discovered topics.} +\caption{\small \textbf{Topic weights in episode and recall content.} We used detailed, hand-generated annotations describing each manually identified time segment from the episode to fit a topic model. Three example frames from the episode (first row) are displayed, along with their descriptions from the corresponding episode annotation (second row) and an example participant's recall transcript (third row). We used the topic model (fit to the episode annotations) to estimate topic vectors for each moment of the episode and each sentence of participants' recalls. 
Example topic vectors are displayed in the bottom row (blue: episode annotations; green: example participant's recalls). Three topic dimensions are shown (the highest-weighted topics for each of the three example scenes, respectively), along with the 10 highest-weighted words for each topic. Supplementary Figure~\topics~provides a full list of the top 10 words from each of the discovered topics.} \label{fig:schematic} \end{figure} -The 32 topics we found were heavily character-focused (i.e., the top-weighted word in each topic was nearly always a character) and could be roughly divided into themes centered around Sherlock Holmes (the titular character), John Watson (Sherlock's close confidant and assistant), supporting characters (e.g., Inspector Lestrade, Sergeant Donovan, or Sherlock's brother Mycroft), or the interactions between various groupings of these characters (Fig.~\topics). This likely follows from the frequency with which these terms appeared in the episode annotations. Several of the identified topics were highly similar, which we hypothesized might allow us to distinguish between subtle narrative differences if the distinctions between those overlapping topics were meaningful. The topic vectors for each timepoint were also \textit{sparse}, in that only a small number of topics (typically one or two) tended to be ``active'' in any given timepoint (Fig.~\ref{fig:model}A). Further, the dynamics of the topic activations appeared to exhibit \textit{persistence} (i.e., given that a topic was active in one timepoint, it was likely to be active in the following timepoint) along with \textit{occasional rapid changes} (i.e., occasionally topic weights would change abruptly from one timepoint to the next). These two properties of the topic dynamics may be seen in the block diagonal structure of the timepoint-by-timepoint correlation matrix (Fig.~\ref{fig:model}B) and reflect the gradual drift and sudden shifts fundamental to the temporal dynamics of many real-world experiences, as well as television episodes. Given this observation, we adapted an approach devised by \cite{BaldEtal17}, and used a hidden Markov model (HMM) to identify the \textit{event boundaries} where the topic activations changed rapidly (i.e., the boundaries of the blocks in the temporal correlation matrix; event boundaries identified by the HMM are outlined in yellow in Fig.~\ref{fig:model}B). Part of our model fitting procedure required selecting an appropriate number of events into which the topic trajectory should be segmented. To accomplish this, we used an optimization procedure that maximized the difference between the topic weights for timepoints within an event versus timepoints across multiple events (see \textit{Methods}). We then created a stable summary of the content within each episode event by averaging the topic vectors across the timepoints spanned by each event (Fig.~\ref{fig:model}C). +The 32 topics we found were heavily character-focused (i.e., the top-weighted word in each topic was nearly always a character) and could be roughly divided into themes centered around Sherlock Holmes (the titular character), John Watson (Sherlock's close confidant and assistant), supporting characters (e.g., Inspector Lestrade, Sergeant Donovan, or Sherlock's brother Mycroft), or the interactions between various groupings of these characters (Supp.\ Fig.~\topics). This likely follows from the frequency with which these terms appeared in the episode annotations. 
Several of the identified topics were highly similar, which we hypothesized might allow us to distinguish between subtle narrative differences if the distinctions between those overlapping topics were meaningful. The topic vectors for each timepoint were also sparse, in that only a small number of topics (typically one or two) tended to be ``active'' in any given timepoint (Fig.~\ref{fig:model}A). Further, the dynamics of the topic activations appeared to exhibit persistence (i.e., given that a topic was active in one timepoint, it was likely to be active in the following timepoint) along with occasional rapid changes (i.e., occasionally topic weights would change abruptly from one timepoint to the next). These two properties of the topic dynamics may be seen in the block diagonal structure of the timepoint-by-timepoint correlation matrix (Fig.~\ref{fig:model}B) and reflect the gradual drift and sudden shifts fundamental to the temporal dynamics of many real-world experiences, as well as television episodes. Given this observation, we adapted an approach devised by Baldassano et al. (2017)~\cite{BaldEtal17}, and used a hidden Markov model (HMM) to identify the ``event boundaries'' where the topic activations changed rapidly (i.e., the boundaries of the blocks in the temporal correlation matrix; event boundaries identified by the HMM are outlined in yellow in Fig.~\ref{fig:model}B). Part of our model fitting procedure required selecting an appropriate number of events into which the topic trajectory should be segmented. To accomplish this, we used an optimization procedure that maximized the difference between the topic weights for timepoints within an event versus timepoints across multiple events (see \textit{Methods}). We then created a stable summary of the content within each episode event by averaging the topic vectors across the timepoints spanned by each event (Fig.~\ref{fig:model}C). \begin{figure}[tp] \centering \includegraphics[width=\textwidth]{figs/eventseg} -\caption{\small \textbf{Modeling naturalistic stimuli and recalls.} All panels: darker colors indicate greater values; range: [0, 1]. \textbf{A.} Topic vectors ($K = 100$) for each of the 1976 episode timepoints. \textbf{B.} Timepoint-by-timepoint correlation matrix of the topic vectors displayed in Panel A. Event boundaries discovered by the HMM are denoted in yellow (30 events detected). \textbf{C.} Average topic vectors for each of the 30 episode events. \textbf{D.} Topic vectors for each of 265 sliding windows of sentences spoken by an example participant while recalling the episode. \textbf{E.} Timepoint-by-timepoint correlation matrix of the topic vectors displayed in Panel D. Event boundaries detected by the HMM are denoted in yellow (22 events detected). For similar plots for all participants, see Figure~\corrmats. \textbf{F.} Average topic vectors for each of the 22 recall events from the example participant. \textbf{G.} Correlations between the topic vectors for every pair of episode events (Panel C) and recall events (from the example participant; Panel F). For similar plots for all participants, see Figure~\matchmats. \textbf{H.} Average correlations between each pair of episode events and recall events (across all 17 participants). To create the figure, each recalled event was assigned to the episode event with the most correlated topic vector (yellow boxes in panels G and H).} +\caption{\small \textbf{Modeling naturalistic stimuli and recalls.} All panels: darker colors indicate greater values; range: [0, 1]. 
\textbf{A.} Topic vectors ($K = 100$) for each of the 1976 episode timepoints. \textbf{B.} Timepoint-by-timepoint correlation matrix of the topic vectors displayed in Panel A. Event boundaries discovered by the HMM are denoted in yellow (30 events detected). \textbf{C.} Average topic vectors for each of the 30 episode events. \textbf{D.} Topic vectors for each of 265 sliding windows of sentences spoken by an example participant while recalling the episode. \textbf{E.} Timepoint-by-timepoint correlation matrix of the topic vectors displayed in Panel D. Event boundaries detected by the HMM are denoted in yellow (22 events detected). For similar plots for all participants, see Supplementary Figure~\corrmats. \textbf{F.} Average topic vectors for each of the 22 recall events from the example participant. \textbf{G.} Correlations between the topic vectors for every pair of episode events (Panel C) and recall events (from the example participant; Panel F). For similar plots for all participants, see Supplementary Figure~\matchmats. \textbf{H.} Average correlations between each pair of episode events and recall events (across all 17 participants). To create the figure, each recalled event was assigned to the episode event with the most correlated topic vector (yellow boxes in panels G and H).} \label{fig:model} \end{figure} Given that the time-varying content of the episode could be segmented cleanly into discrete events, we wondered whether participants' recalls of the episode also displayed a similar structure. We applied the same topic model (already trained on the episode annotations) to each participant's recalls. Analogously to how we parsed the time-varying content of the episode, to obtain similar estimates for each participant's recall transcript, we treated each overlapping window of (up to) 10 sentences from their transcript as a document, and computed the most probable mix of topics reflected in each timepoint's sentences. This yielded, for each participant, a number-of-windows by number-of-topics topic proportions matrix that characterized how the topics identified in the original episode were reflected in the participant's recalls. An important feature of our approach is that it allows us to compare participants' recalls to events from the original episode, despite that different participants used widely varying language to describe the events, and that those descriptions often diverged in content, quality, and quantity from the episode annotations. This ability to match up conceptually related text that differs in specific vocabulary, detail, and length is an important benefit of projecting the episode and recalls into a shared topic space. An example topic proportions matrix from one participant's recalls is shown in Figure~\ref{fig:model}D. -Although the example participant's recall topic proportions matrix has some visual similarity to the episode topic proportions matrix, the time-varying topic proportions for the example participant's recalls are not as sparse as those for the episode (compare Figs.~\ref{fig:model}A and D). Similarly, although there do appear to be periods of stability in the recall topic dynamics (i.e., most topics are active or inactive over contiguous blocks of time), the changes in topic activations that define event boundaries appear less clearly delineated in participants' recalls than in the episode's annotations. 
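To make the recall-transformation step concrete, the sketch below shows one way to obtain topic proportions matrices for sliding windows of annotations and recall sentences. It is a simplified stand-in for the procedure described above and in \textit{Methods}, not the authors' code: scikit-learn's LatentDirichletAllocation is used as a generic topic model, the toy input strings are hypothetical, and preprocessing details (vocabulary construction, window handling, and interpolation to the fMRI timeline) are omitted.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

def sliding_windows(texts, width):
    """Join overlapping windows of up to `width` consecutive text segments."""
    return [' '.join(texts[max(0, end - width):end]) for end in range(1, len(texts) + 1)]

# Hypothetical inputs: episode annotations and one participant's recall sentences.
episode_annotations = ['Sherlock examines the body at the crime scene',
                       'John and Sherlock discuss the case at the flat',
                       'Lestrade asks Sherlock for help with the investigation']
recall_sentences = ['they looked at the body together',
                    'the detective asked him for help']

# Fit the topic model on sliding windows of episode annotations (up to 50 per window).
vectorizer = CountVectorizer(stop_words='english')
episode_counts = vectorizer.fit_transform(sliding_windows(episode_annotations, width=50))
topic_model = LatentDirichletAllocation(n_components=100, random_state=0)
episode_topic_props = topic_model.fit_transform(episode_counts)   # windows x topics

# Transform (without refitting) sliding windows of recall sentences (up to 10 per window).
recall_counts = vectorizer.transform(sliding_windows(recall_sentences, width=10))
recall_topic_props = topic_model.transform(recall_counts)         # windows x topics
```

Because the same fitted model and vocabulary are applied to both sources, the episode and recall windows land in a shared topic space even when their wording differs.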
To examine these patterns in detail, we computed the timepoint-by-timepoint correlation matrix for the example participant's recall topic proportions matrix (Fig.~\ref{fig:model}E). As in the episode correlation matrix (Fig.~\ref{fig:model}B), the example participant's recall correlation matrix has a strong block diagonal structure, indicating that their recalls are discretized into separated events. We used the same HMM-based optimization procedure that we had applied to the episode's topic proportions matrix (see \textit{Methods}) to estimate an analogous set of event boundaries in the participant's recounting of the episode (outlined in yellow). We carried out this analysis on all 17 participants' recall topic proportions matrices (Fig.~\corrmats). - -Two clear patterns emerged from this set of analyses. First, although every individual participant's recalls could be segmented into discrete events (i.e., every individual participant's recall correlation matrix exhibited clear block diagonal structure; Fig.~\corrmats), each participant appeared to have a unique \textit{recall resolution}, reflected in the sizes of those blocks. While some participants' recall topic proportions segmented into just a few events (e.g., Participants P4, P5, and P7), others' segmented into many shorter-duration events (e.g., Participants P12, P13, and P17). This suggests that different participants may be recalling the episode with different levels of detail---i.e., some might recount only high-level essential plot details, whereas others might recount low-level details instead (or in addition). The second clear pattern present in every individual participant's recall correlation matrix was that, unlike in the episode correlation matrix, there were substantial off-diagonal correlations. Whereas each event in the original episode was (largely) separable from the others (Fig.~\ref{fig:model}B), in transforming those separable events into memory, participants appeared to be integrating across multiple events, blending elements of previously recalled and not-yet-recalled content into each newly recalled event~\citep[Figs.~\ref{fig:model}E, \corrmats; also see][]{MannEtal11, HowaEtal12, Mann19}. +Although the example participant's recall topic proportions matrix has some visual similarity to the episode topic proportions matrix, the time-varying topic proportions for the example participant's recalls are not as sparse as those for the episode (compare Figs.~\ref{fig:model}A and D). Similarly, although there do appear to be periods of stability in the recall topic dynamics (i.e., most topics are active or inactive over contiguous blocks of time), the changes in topic activations that define event boundaries appear less clearly delineated in participants' recalls than in the episode's annotations. To examine these patterns in detail, we computed the timepoint-by-timepoint correlation matrix for the example participant's recall topic proportions matrix (Fig.~\ref{fig:model}E). As in the episode correlation matrix (Fig.~\ref{fig:model}B), the example participant's recall correlation matrix has a strong block diagonal structure, indicating that their recalls are discretized into separated events. We used the same HMM-based optimization procedure that we had applied to the episode's topic proportions matrix (see \textit{Methods}) to estimate an analogous set of event boundaries in the participant's recounting of the episode (outlined in yellow). 
We carried out this analysis on all 17 participants' recall topic proportions matrices (Supp.\ Fig.~\corrmats). -The above results demonstrate that topic models capture the dynamic conceptual content of the episode and participants' recalls of the episode. Further, the episode and recalls exhibit event boundaries that can be identified automatically using HMMs to segment the dynamic content. Next, we asked whether some correspondence might be made between the specific content of the events the participants experienced while viewing the episode, and the events they later recalled. We labeled each recall event as matching the episode event with the most similar (i.e., most highly correlated) topic vector (Figs.~\ref{fig:model}G, \matchmats). This yielded a sequence of ``presented'' events from the original episode, and a (potentially differently ordered) sequence of ``recalled'' events for each participant. Analogous to classic list-learning studies, we can then examine participants' recall sequences by asking which events they tended to recall first~\citep[probability of first recall; Fig.~\ref{fig:list-learning}A;][]{AtkiShif68, PostPhil65, WelcBurn24}; how participants most often transitioned between recalls of the events as a function of the temporal distance between them~\citep[lag-conditional response probability; Fig.~\ref{fig:list-learning}B;][]{Kaha96}; and which events they were likely to remember overall~\citep[serial position recall analyses; Fig.~\ref{fig:list-learning}C;][]{Murd62a}. Some of the patterns we observed appeared to be similar to classic effects from the list-learning literature. For example, participants had a higher probability of initiating recall with early events (Fig.~\ref{fig:list-learning}A) and a higher probability of transitioning to neighboring events with an asymmetric forward bias (Fig.~\ref{fig:list-learning}B). However, unlike what is typically observed in list-learning studies, we did not observe patterns comparable to the primacy or recency serial position effects (Fig.~\ref{fig:list-learning}C). We hypothesized that participants might be leveraging meaningful narrative associations and references over long timescales throughout the episode. +Two clear patterns emerged from this set of analyses. First, although every individual participant's recalls could be segmented into discrete events (i.e., every individual participant's recall correlation matrix exhibited clear block diagonal structure; Supp.\ Fig.~\corrmats), each participant appeared to have a unique ``recall resolution,'' reflected in the sizes of those blocks. While some participants' recall topic proportions segmented into just a few events (e.g., Participants P4, P5, and P7), others' segmented into many shorter-duration events (e.g., Participants P12, P13, and P17). This suggests that different participants may be recalling the episode with different levels of detail---i.e., some might recount only high-level essential plot details, whereas others might recount low-level details instead (or in addition). The second clear pattern present in every individual participant's recall correlation matrix was that, unlike in the episode correlation matrix, there were substantial off-diagonal correlations. One potential explanation for this finding is that the topic models, trained only on episode annotations, do not capture topic proportions in participants' ``held-out'' recalls as accurately. 
A second possibility is that, whereas each event in the original episode was (largely) separable from the others (Fig.~\ref{fig:model}B), in transforming those separable events into memory, participants appeared to be integrating across multiple events, blending elements of previously recalled and not-yet-recalled content into each newly recalled event (Fig.~\ref{fig:model}E, Supp.\ Fig.~\corrmats)~\citep{MannEtal11, HowaEtal12, Mann19}. -Clustering scores are often used by memory researchers to characterize how people organize their memories of words on a studied list~\citep[for review, see][]{PolyEtal09}. We defined analogous measures to characterize how participants organized their memories for episodic events (see \textit{Methods} for details). Temporal clustering refers to the extent to which participants group their recall responses according to encoding position. Overall, we found that sequentially viewed episode events tended to appear nearby in participants' recall event sequences (mean clustering score: 0.732, SEM: 0.033). Participants with higher temporal clustering scores tended to exhibit better overall memory for the episode, according to both \cite{ChenEtal17}'s hand-counted numbers of recalled scenes from the episode (Pearson's $r(15) = 0.49,~p = 0.046$) and the numbers of episode events that best-matched at least one recall event (i.e., model-estimated number of events recalled; Pearson's $r(15) = 0.59,~p = 0.013$). Semantic clustering measures the extent to which participants cluster their recall responses according to semantic similarity. We found that participants tended to recall semantically similar episode events together (mean clustering score: 0.650, SEM: 0.032), and that semantic clustering score was also related to both hand-counted (Pearson's $r(15) = 0.65,~p = 0.005$) and model-estimated (Pearson's $r(15) = 0.58,~p = 0.015$) numbers of recalled events. +The above results demonstrate that topic models capture the dynamic conceptual content of the episode and participants' recalls of the episode. Further, the episode and recalls exhibit event boundaries that can be identified automatically using HMMs to segment the dynamic content. Next, we asked whether some correspondence might be made between the specific content of the events the participants experienced while viewing the episode, and the events they later recalled. We labeled each recall event as matching the episode event with the most similar (i.e., most highly correlated) topic vector (Fig.~\ref{fig:model}G, Supp.\ Fig.~\matchmats). This yielded a sequence of ``presented'' events from the original episode, and a (potentially differently ordered) sequence of ``recalled'' events for each participant. Analogous to classic list-learning studies, we can then examine participants' recall sequences by asking which events they tended to recall first (probability of first recall~\citep{AtkiShif68, PostPhil65, WelcBurn24}; Fig.~\ref{fig:list-learning}A); how participants most often transitioned between recalls of the events as a function of the temporal distance between them (lag-conditional response probability~\citep{Kaha96}; Fig.~\ref{fig:list-learning}B); and which events they were likely to remember overall (serial position recall analyses~\citep{Murd62a}; Fig.~\ref{fig:list-learning}C). Some of the patterns we observed appeared to be similar to classic effects from the list-learning literature. 
For example, participants had a higher probability of initiating recall with early events (Fig.~\ref{fig:list-learning}A) and a higher probability of transitioning to neighboring events with an asymmetric forward bias (Fig.~\ref{fig:list-learning}B). However, unlike what is typically observed in list-learning studies, we did not observe patterns comparable to the primacy or recency serial position effects (Fig.~\ref{fig:list-learning}C). We hypothesized that participants might be leveraging meaningful narrative associations and references over long timescales throughout the episode. \begin{figure}[tp] \centering \includegraphics[width=1\textwidth]{figs/list_learning} - \caption{\small \textbf{Naturalistic extensions of classic list-learning memory analyses.} \textbf{A.} The probability of first recall as a function of the serial position of the event in the episode. \textbf{B}. The probability of recalling each event, conditioned on having most recently recalled the event \textit{lag} events away in the episode. \textbf{C.} The proportion of participants who recalled each event, as a function of the serial position of the events in the episode. All panels: error ribbons denote bootstrap-estimated standard error of the mean.} + \caption{\small \textbf{Naturalistic extensions of classic list-learning memory analyses.} \textbf{A.} The probability of first recall as a function of the serial position of the event in the episode. \textbf{B}. The probability of recalling each event, conditioned on having most recently recalled the event \textit{lag} events away in the episode. \textbf{C.} The proportion of participants who recalled each event, as a function of the serial position of the events in the episode. All panels: error ribbons denote the bootstrap-estimated 95\% confidence interval.} \label{fig:list-learning} \end{figure} +Clustering scores are often used by memory researchers to characterize how people organize their memories of words on a studied list~\citep{PolyEtal09}. We defined analogous measures to characterize how participants organized their memories for episodic events (see \textit{Methods} for details). Temporal clustering refers to the extent to which participants group their recall responses according to encoding position. Overall, we found that sequentially viewed episode events tended to appear nearby in participants' recall event sequences (mean clustering score: 0.732, SEM: 0.033). Participants with higher temporal clustering scores tended to exhibit better overall memory for the episode, according to both Chen et al. (2017)~\cite{ChenEtal17}'s hand-counted numbers of recalled scenes from the episode (Pearson's $r(15) = 0.49,~p = 0.046,~95\%~\mathrm{CI} = [0.25, 0.76]$) and the numbers of episode events that best-matched at least one recall event (i.e., model-estimated number of events recalled; Pearson's $r(15) = 0.59,~p = 0.013,~95\%~\mathrm{CI} = [0.31, 0.80]$). Semantic clustering measures the extent to which participants cluster their recall responses according to semantic similarity~\citep{MannKaha12}. We found that participants tended to recall semantically similar episode events together (mean clustering score: 0.650, SEM: 0.032), and that semantic clustering scores were also related to both hand-counted (Pearson's $r(15) = 0.65,~p = 0.004,~95\%~\mathrm{CI} = [0.31, 0.85]$) and model-estimated (Pearson's $r(15) = 0.58,~p = 0.015,~95\%~\mathrm{CI} = [0.10, 0.83]$) numbers of recalled events. 
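As an illustration of how the matched event sequences support these list-learning-style analyses, the sketch below computes an event-level lag-conditional response probability curve (cf. Fig.~\ref{fig:list-learning}B). It is a minimal reimplementation under assumptions (not the authors' code): repeated recalls of the same event are skipped, only not-yet-recalled events are counted as possible transitions, and variable names are hypothetical.

```python
import numpy as np

def lag_crp(recall_orders, n_events):
    """Event-level lag-conditional response probability.

    recall_orders : list of per-participant lists giving, in recall order, the
                    index (0-based) of the episode event each recalled event
                    was matched to
    n_events      : number of events in the episode
    """
    lags = np.arange(-(n_events - 1), n_events)
    actual = np.zeros(lags.size)
    possible = np.zeros(lags.size)
    offset = n_events - 1                        # array index of lag 0

    for order in recall_orders:
        already_recalled = set()
        for prev, curr in zip(order[:-1], order[1:]):
            already_recalled.add(prev)
            if curr in already_recalled:         # assumed: skip repeated recalls
                continue
            actual[(curr - prev) + offset] += 1
            for candidate in range(n_events):    # every not-yet-recalled event
                if candidate not in already_recalled:
                    possible[(candidate - prev) + offset] += 1

    with np.errstate(divide='ignore', invalid='ignore'):
        return lags, np.where(possible > 0, actual / possible, np.nan)

# Example: two hypothetical participants' matched event sequences for a 30-event episode.
lags, crp = lag_crp([[0, 1, 2, 5, 6], [2, 3, 1, 4]], n_events=30)
```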
+ + + \begin{figure}[tp] \centering \includegraphics[width=1\textwidth]{figs/precision_distinctiveness} - \caption{\small \textbf{Novel content-based metrics of naturalistic memory: precision and distinctiveness.} \textbf{A.} The episode-recall correlation matrix for a representative participant (P17). The yellow boxes highlight the maximum correlation in each column. The example participant's overall precision score was computed as the average across the (Fisher $z$-transformed) correlation values in the yellow boxes. Their distinctiveness score was computed as the average (over recall events) of the $z$-scored (within column) event precisions. \textbf{B.} The (Pearson's) correlation between precision and hand-counted number of recalled scenes. \textbf{C.} The correlation between distinctiveness and hand-counted number of recalled scenes. \textbf{D.} The correlation between precision and the number of recalled episode events, as determined by our model. \textbf{E.} The correlation between distinctiveness and the number of recalled episode events, as determined by our model.} + \caption{\small \textbf{Novel content-based metrics of naturalistic memory: precision and distinctiveness.} \textbf{A.} The episode-recall correlation matrix for an example participant (P17), chosen for their large number of recall events (for analogous figures for other participants, see Supp.\ Fig.~\corrmats). The yellow boxes highlight the maximum correlation in each column. The example participant's overall precision score was computed as the average across the (Fisher $z$-transformed) correlation values in the yellow boxes. Their distinctiveness score was computed as the average (over recall events) of the $z$-scored (within column) event precisions. \textbf{B.} The across-participants (Pearson's) correlation between precision and hand-counted number of recalled scenes. \textbf{C.} The correlation between distinctiveness and hand-counted number of recalled scenes. \textbf{D.} The correlation between precision and the number of recalled episode events, as determined by our model. \textbf{E.} The correlation between distinctiveness and the number of recalled episode events, as determined by our model.} \label{fig:precision-distinctiveness} \end{figure} -The above analyses illustrate how our framework for characterizing the dynamic conceptual content of naturalistic episodes enables us to carry out analyses that have traditionally been applied to much simpler list-learning paradigms. However, perhaps the most interesting aspects of memory for naturalistic episodes are those that have no list-learning analogs. The nuances of how one's memory for an event might capture some details, yet distort or neglect others, is central to how we use our memory systems in daily life. Yet when researchers study memory in highly simplified paradigms, those nuances are not typically observable. We next developed two novel, continuous metrics, termed \textit{precision} and \textit{distinctiveness}, aimed at characterizing distortions in the conceptual content of individual recall events, and the conceptual overlap between how people described different events. +The above analyses illustrate how our framework for characterizing the dynamic conceptual content of naturalistic episodes enables us to carry out analyses that have traditionally been applied to much simpler list-learning paradigms. However, perhaps the most interesting aspects of memory for naturalistic episodes are those that have no list-learning analogs. 
The nuances of how one's memory for an event might capture some details, yet distort or neglect others, is central to how we use our memory systems in daily life. Yet when researchers study memory in highly simplified paradigms, those nuances are not typically observable. We next developed two novel, continuous metrics, termed ``precision'' and ``distinctiveness,'' aimed at characterizing distortions in the conceptual content of individual recall events, and the conceptual overlap between how people described different events. -\textit{Precision} is intended to capture the ``completeness'' of recall, or how fully the presented content was recapitulated in a participant's recounting. We define a recall event's precision as the maximum correlation between the topic proportions of that recall event and any episode event (Fig.~\ref{fig:precision-distinctiveness}). In other words, given that a recall event best matches a particular episode event, more precisely recalled events overlap more strongly with the conceptual content of the original episode event. When a given event is assigned a blend of several topics, as is often the case (Fig.~\ref{fig:model}), a high precision score requires recapitulating the relative topic proportions during recall. +Precision is intended to capture the ``completeness'' of recall, or how fully the presented content was recapitulated in a participant's recounting. We define a recall event's precision as the maximum correlation between the topic proportions of that recall event and any episode event (Fig.~\ref{fig:precision-distinctiveness}). In other words, given that a recall event best matches a particular episode event, more precisely recalled events overlap more strongly with the conceptual content of the original episode event. When a given event is assigned a blend of several topics, as is often the case (Fig.~\ref{fig:model}), a high precision score requires recapitulating the relative topic proportions during recall. -\textit{Distinctiveness} is intended to capture the ``specificity'' of recall. In other words, distinctiveness quantifies the extent to which a given recall event reflects the most similar episode event over and above other episode events. Intuitively, distinctiveness is like a normalized variant of our precision metric. Whereas precision solely measures how much detail about an episode was captured in someone's recall, distinctiveness penalizes details that also pertain to other episode events. We define the distinctiveness of an event's recall as its precision expressed in standard deviation units with respect to other episode events. Specifically, for a given recall event, we compute the correlation between its topic vector and that of each episode event. This yields a distribution of correlation coefficients (one per episode event). We subtract the mean and divide by the standard deviation of this distribution to $z$-score the coefficients. The maximum value in this distribution (which, by definition, belongs to the episode event that best matches the given recall event) is that recall event's distinctiveness score. In this way, recall events that match one episode event far better than all other episode events will receive a high distinctiveness score. By contrast, a recall event that matches all episode events roughly equally will receive a comparatively low distinctiveness score. +Distinctiveness is intended to capture the ``specificity'' of recall. 
In other words, distinctiveness quantifies the extent to which a given recall event reflects the most similar episode event over and above other episode events. Intuitively, distinctiveness is like a normalized variant of our precision metric. Whereas precision solely measures how much detail about an event was captured in someone's recall, distinctiveness penalizes details that also pertain to other episode events. We define the distinctiveness of an event's recall as its precision expressed in standard deviation units with respect to other episode events. Specifically, for a given recall event, we compute the correlation between its topic vector and that of each episode event. This yields a distribution of correlation coefficients (one per episode event). We subtract the mean and divide by the standard deviation of this distribution to $z$-score the coefficients. The maximum value in this distribution (which, by definition, belongs to the episode event that best matches the given recall event) is that recall event's distinctiveness score. In this way, recall events that match one episode event far better than all other episode events will receive a high distinctiveness score. By contrast, a recall event that matches all episode events roughly equally will receive a comparatively low distinctiveness score. -In addition to examining how precisely and distinctively participants recalled individual events, one may also use these metrics to summarize each participant's performance by averaging across a participant's event-wise precision or distinctiveness scores. This enables us to quantify how precisely a participant tended to recall subtle within-event details, as well as how specific (distinctive) those details were to individual events from the episode. Participants' average precision and distinctiveness scores were strongly correlated ($r(15) = 0.90, p < 0.001$). This indicates that participants who tended to precisely recount low-level details of episode events also tended to do so in an event-specific way (e.g., as opposed to detailing recurring themes that were present in most or all episode events; this behavior would have resulted in high precision but low distinctiveness). We found that, across participants, higher precision scores were positively correlated with the numbers of both hand-annotated scenes ($r(15) = 0.60, p = 0.010$) and model-estimated events ($r(15) = 0.90, p < 0.001$) that participants recalled. Participants' average distinctiveness scores were also correlated with both the hand-annotated ($r(15) = 0.45, p = 0.068$) and model-estimated ($r(15) = 0.71, p = 0.001$) numbers of recalled events. +In addition to examining how precisely and distinctively participants recalled individual events, one may also use these metrics to summarize each participant's performance by averaging across a participant's event-wise precision or distinctiveness scores. This enables us to quantify how precisely a participant tended to recall subtle within-event details, as well as how specific (distinctive) those details were to individual events from the episode. Participants' average precision and distinctiveness scores were strongly correlated ($r(15) = 0.90,~p < 0.001,~95\%~\mathrm{CI} = [0.66, 0.96]$). 
This indicates that participants who tended to precisely recount low-level details of episode events also tended to do so in an event-specific way (e.g., as opposed to detailing recurring themes that were present in most or all episode events; this behavior would have resulted in high precision but low distinctiveness). We found that, across participants, higher precision scores were positively correlated with the numbers of both hand-annotated scenes ($r(15) = 0.60,~p = 0.010,~95\%~\mathrm{CI} = [0.02, 0.83]$) and model-estimated events ($r(15) = 0.90,~p < 0.001,~95\%~\mathrm{CI} = [0.54, 0.96]$) that participants recalled. Participants' average distinctiveness scores were also marginally correlated with both the hand-annotated ($r(15) = 0.45,~p = 0.068,~95\%~\mathrm{CI} = [-0.21, 0.79]$) and model-estimated ($r(15) = 0.71,~p = 0.001,~95\%~\mathrm{CI} = [-0.07, 0.90]$) numbers of recalled events. \begin{figure}[tp] \centering \vspace*{-1cm} \includegraphics[width=.7\textwidth]{figs/precision_distinctiveness_detail} - \caption{\small \textbf{Precision reflects the completeness of recall, whereas distinctiveness reflects recall specificity.} \textbf{A.} Recall precision by episode event. Grey violin plots display kernel density estimates for the distribution of recall precision scores for a single episode event. Colored dots within each violin plot represent individual participants' recall precision for the given event. \textbf{B.} Recall distinctiveness by episode event, analogous to Panel A. \textbf{C.} The set of ``Narrative Details'' episode annotations (generated by \citealp{ChenEtal17}) comprising an example episode event (22) identified by the HMM. Each action or feature is highlighted in a different color. \textbf{D.} Sentences comprising the most precise (P17) and least precise (P6) participants' recalls of episode event 21. Descriptions of specific actions or features reflecting those highlighted in Panel B are highlighted in the corresponding color. The text highlighted in gray denotes a (rare) false recall. The unhighlighted text denotes correctly recalled information about other episode events. \textbf{E.} The sets of ``Narrative Details'' episode annotations (generated by \citealp{ChenEtal17}) for scenes comprising episode events described by the example participants in Panel F. Each event's text is highlighted in a different color. \textbf{F.} The sentences comprising the most distinctive (P9) and least distinctive (P6) participants' recalls of episode event 21. Sections of recall describing each each episode event in Panel E are highlighted with the corresponding color.} + \caption{\small \textbf{Precision reflects the completeness of recall, whereas distinctiveness reflects recall specificity.} \textbf{A.} Recall precision by episode event. Grey violin plots display kernel density estimates for the distribution of recall precision scores for a single episode event. Colored dots within each violin plot represent individual participants' recall precisions for the given event. \textbf{B.} Recall distinctiveness by episode event, analogous to Panel A. \textbf{C.} The set of ``Narrative Details'' episode annotations~\citep{ChenEtal17} comprising an example episode event (22) identified by the HMM. Each action or feature is highlighted in a different color. \textbf{D.} Sentences comprising the most precise (P17) and least precise (P6) participants' recalls of episode event 21. 
Descriptions of specific actions or features reflecting those highlighted in Panel C are highlighted in the corresponding color. The text highlighted in gray denotes a (rare) false recall. The unhighlighted text denotes correctly recalled information about other episode events. \textbf{E.} The sets of ``Narrative Details'' episode annotations~\citep{ChenEtal17} for scenes comprising episode events described by the example participants in Panel F. Each event's text is highlighted in a different color. \textbf{F.} The sentences comprising the most distinctive (P9) and least distinctive (P6) participants' recalls of episode event 21. Sections of recall describing each episode event in Panel E are highlighted with the corresponding color.} \label{fig:precision-detail} \end{figure} Examining individual recalls of the same episode event can provide insights into how the above precision and distinctiveness scores may be used to characterize similarities and differences in how different people describe the same shared experience. In Figure \ref{fig:precision-detail}, we compare recalls for the same episode event from the participants with the highest (P17) and lowest (P6) precision scores. From the HMM-identified episode event boundaries, we recovered the set of annotations describing the content of a single episode event (event 21; Fig.~\ref{fig:precision-detail}C), and divided them into different color-coded sections for each action or feature described. Next, we used an analogous approach to identify the set of sentences comprising the corresponding recall event from each of the two example participants (Fig.~\ref{fig:precision-detail}D). We then colored all words describing actions and features in the transcripts shown in Panel D according to the color-coded annotations in Panel C. Visual comparison of these example recalls reveals that the more precise recall captures more of the episode event's content, and in greater detail. -Figure \ref{fig:precision-detail} also illustrates the differences between high and low distinctiveness scores. We extracted the set of sentences comprising the most distinctive recall event (P9) and least distinctive recall event (P6) corresponding to the example episode event shown in Panel C (event 21). We also extracted the annotations for all episode events whose content these participants' single recall events described. We assigned each episode event a unique color (Fig.~\ref{fig:precision-detail}E), and colored each recalled sentence (Panel F) according to the episode events they best matched. +Figure \ref{fig:precision-detail} also illustrates the differences between high and low distinctiveness scores. We extracted the set of sentences comprising the most distinctive recall event (P9) and least distinctive recall event (P6) corresponding to the example episode event shown in Panel C (event 21). We also extracted the annotations for all episode events whose content these participants' single recall events touched on. We assigned each episode event a unique color (Fig.~\ref{fig:precision-detail}E), and colored each recalled sentence (Panel F) according to the episode events they best matched. 
Visual inspection of Panel F reveals that the most distinctive recall's content is tightly concentrated around event 21, whereas the least distinctive recall incorporates content from a much wider range of episode events. -The preceding analyses sought to characterize how participants' recountings of individual episode events captured the low-level details of each event. Next, we sought to characterize how participants' recountings of the full episode captured its high-level essence---i.e., the shape of the episode's trajectory through word embedding (topic) space. To visualize the essence of the episode and each participant's recall trajectory~\citep{HeusEtal18a}, we projected the topic proportions matrices for the episode and recalls onto a shared two-dimensional space using Uniform Manifold Approximation and Projection~\citep[UMAP; ][]{McInEtal18}. In this lower-dimensional space, each point represents a single episode or recall event, and the distances between the points reflect the distances between the events' associated topic vectors (Fig.~\ref{fig:trajectory}). In other words, events that are nearer to each other in this space are more semantically similar, and those that are farther apart are less so. +The preceding analyses sought to characterize how participants' recountings of individual episode events captured the low-level details of each event. Next, we sought to characterize how participants' recountings of the full episode captured its high-level essence---i.e., the shape of the episode's trajectory through word embedding (topic) space. To visualize the essence of the episode and each participant's recall trajectory~\citep{HeusEtal18a}, we projected the topic proportions matrices for the episode and recalls onto a shared two-dimensional space using Uniform Manifold Approximation and Projection (UMAP)~\citep{McInEtal18}. In this lower-dimensional space, each point represents a single episode or recall event, and the distances between the points reflect the distances between the events' associated topic vectors (Fig.~\ref{fig:trajectory}). In other words, events that are nearer to each other in this space are more semantically similar, and those that are farther apart are less so. \begin{figure}[tp] \centering \includegraphics[width=1\textwidth]{figs/trajectory} -\caption{\small \textbf{Trajectories through topic space capture the dynamic content of the episode and recalls.} All panels: the topic proportion matrices have been projected onto a shared two-dimensional space using UMAP. \textbf{A.} The two-dimensional topic trajectory taken by the episode of \textit{Sherlock}. Each dot indicates an event identified using the HMM (see \textit{Methods}); the dot colors denote the order of the events (early events are in red; later events are in blue), and the connecting lines indicate the transitions between successive events. \textbf{B.} The average two-dimensional trajectory captured by participants' recall sequences, with the same format and coloring as the trajectory in Panel A. To compute the event positions, we matched each recalled event with an event from the original episode (see \textit{Results}), and then we averaged the positions of all events with the same label. The arrows reflect the average transition direction through topic space taken by any participants whose trajectories crossed that part of topic space; blue denotes reliable agreement across participants via a Rayleigh test ($p < 0.05$, corrected). 
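+As a rough sketch of this projection step (not the exact pipeline used to produce the published figure; the input matrices are hypothetical, and in practice the UMAP hyperparameters, such as the number of neighbors and minimum distance, may need to be tuned to yield an interpretable layout), the episode and recall topic matrices can be stacked and embedded into a single shared two-dimensional space as follows.
\begin{verbatim}
import numpy as np
import umap  # provided by the umap-learn package

def embed_trajectories(episode_topics, recall_topics_list, seed=42):
    # Stack the episode's topic matrix with every participant's recall topic
    # matrix so that all events are projected into one shared 2-D space.
    stacked = np.vstack([episode_topics] + list(recall_topics_list))
    xy = umap.UMAP(n_components=2, random_state=seed).fit_transform(stacked)
    # Split the embedding back into the episode trajectory and one recall
    # trajectory per participant.
    n_episode = episode_topics.shape[0]
    episode_xy, recall_xy, start = xy[:n_episode], [], n_episode
    for recalls in recall_topics_list:
        recall_xy.append(xy[start:start + recalls.shape[0]])
        start += recalls.shape[0]
    return episode_xy, recall_xy
\end{verbatim}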
\textbf{C.} The recall topic trajectories (gray) taken by each individual participant (P1--P17). The episode's trajectory is shown in black for reference. Here, events (dots) are colored by their matched episode event (Panel A).} +\caption{\small \textbf{Trajectories through topic space capture the dynamic content of the episode and recalls.} All panels: the topic proportion matrices have been projected onto a shared two-dimensional space using UMAP. \textbf{A.} The two-dimensional topic trajectory taken by the episode of \textit{Sherlock}. Each dot indicates an event identified using the HMM (see \textit{Methods}); the dot colors denote the order of the events (early events are in red; later events are in blue), and the connecting lines indicate the transitions between successive events. \textbf{B.} The average two-dimensional trajectory captured by participants' recall sequences, with the same format and coloring as the trajectory in Panel A. To compute the event positions, we matched each recalled event with an event from the original episode (see \textit{Results}), and then we averaged the positions of all events with the same label. The arrows reflect the average transition direction through topic space taken by any participants whose trajectories crossed that part of topic space; blue denotes reliable agreement across participants via a Rayleigh test ($p < 0.05$, corrected). For additional detail see \textit{Methods} and Supplementary Figure~\arrows. \textbf{C.} The recall topic trajectories (gray) taken by each individual participant (P1--P17). The episode's trajectory is shown in black for reference. Here, events (dots) are colored by their matched episode event (Panel A).} \label{fig:trajectory} \end{figure} -Visual inspection of the episode and recall topic trajectories reveals a striking pattern. First, the topic trajectory of the episode (which reflects its dynamic content; Fig.~\ref{fig:trajectory}A) is captured nearly perfectly by the averaged topic trajectories of participants' recalls (Fig.~\ref{fig:trajectory}B). To assess the consistency of these recall trajectories across participants, we asked: given that a participant's recall trajectory had entered a particular location in the reduced topic space, could the position of their \textit{next} recalled event be predicted reliably? For each location in the reduced topic space, we computed the set of line segments connecting successively recalled events (across all participants) that intersected that location (see \textit{Methods}). We then computed (for each location) the distribution of angles formed by the lines defined by those line segments and a fixed reference line (the $x$-axis). Rayleigh tests revealed the set of locations in topic space at which these across-participant distributions exhibited reliable peaks (blue arrows in Fig.~\ref{fig:trajectory}B reflect significant peaks at $p < 0.05$, corrected). We observed that the locations traversed by nearly the entire episode trajectory exhibited such peaks. In other words, participants' recalls exhibited similar trajectories to each other that also matched the trajectory of the original episode (Fig.~\ref{fig:trajectory}C). This is especially notable when considering the fact that the number of HMM-identified recall events (dots in Fig.~\ref{fig:trajectory}C) varied considerably across people, and that every participant used different words to describe what they had remembered happening in the episode. 
Differences in the numbers of recall events appear in participants' trajectories as differences in the sampling resolution along the trajectory. We note that this framework also provides a means of disentangling classic ``proportion recalled'' measures (i.e., the proportion of episode events described in participants' recalls) from participants' abilities to recapitulate the episode's essence (i.e., the similarity between the shapes of the original episode trajectory and that defined by each participant's recounting of the episode). +Visual inspection of the episode and recall topic trajectories reveals a striking pattern. First, the topic trajectory of the episode (which reflects its dynamic content; Fig.~\ref{fig:trajectory}A) is captured nearly perfectly by the averaged topic trajectories of participants' recalls (Fig.~\ref{fig:trajectory}B). To assess the consistency of these recall trajectories across participants, we asked: given that a participant's recall trajectory had entered a particular location in the reduced topic space, could the position of their next recalled event be predicted reliably? For each location in the reduced topic space, we computed the set of line segments connecting successively recalled events (across all participants) that intersected that location (see \textit{Methods}, Supp.\ Fig.~\arrows). We then computed (for each location) the distribution of angles formed by the lines defined by those line segments and a fixed reference line (the $x$-axis). Rayleigh tests revealed the set of locations in topic space at which these across-participant distributions exhibited reliable peaks (blue arrows in Fig.~\ref{fig:trajectory}B reflect significant peaks at $p < 0.05$, corrected). We observed that the locations traversed by nearly the entire episode trajectory exhibited such peaks. In other words, participants' recalls exhibited similar trajectories to each other that also matched the trajectory of the original episode (Fig.~\ref{fig:trajectory}C). This is especially notable when considering the fact that the number of HMM-identified recall events (dots in Fig.~\ref{fig:trajectory}C) varied considerably across people, and that every participant used different words to describe what they had remembered happening in the episode. Differences in the numbers of recall events appear in participants' trajectories as differences in the sampling resolution along the trajectory. We note that this framework also provides a means of disentangling classic ``proportion recalled'' measures (i.e., the proportion of episode events described in participants' recalls) from participants' abilities to recapitulate the episode's essence (i.e., the similarity between the shapes of the original episode trajectory and that defined by each participant's recounting of the episode). -In addition to enabling us to visualize the episode's high-level essence, describing the episode as a geometric trajectory also enables us to drill down to individual words and quantify how each word relates to the memorability of each event. This provides another approach to examining participants' recall for low-level details beyond the precision and distinctiveness measures we defined above. The results displayed in Figures \ref{fig:list-learning}C and \ref{fig:precision-detail}A suggest that certain events were remembered better than others. Given this, we next asked asked whether the events that were generally remembered precisely or imprecisely tended to reflect particular content. 
Because our analysis framework projects the dynamic episode content and participants' recalls into a shared space, and because the dimensions of that space represent topics (which are, in turn, sets of weights over known words in the vocabulary), we are able to recover the weighted combination of words that make up any point (i.e., topic vector) in this space. We first computed the average precision with which participants recalled each of the 30 episode events (Fig.~\ref{fig:topics}A; note that this result is analogous to a serial position curve created from our precision metric). We then computed a weighted average of the topic vectors for each episode event, where the weights reflected how precisely each event was recalled. To visualize the result, we created a ``wordle'' image~\citep{MuelEtal18} where words weighted more heavily by more precisely-remembered topics appear in a larger font (Fig.~\ref{fig:topics}B, green box). Across the full episode, content that weighted heavily on topics and words central to the major foci of the episode (e.g., the names of the two main characters, ``Sherlock'' and ``John,'' and the address of a major recurring location, ``221B Baker Street'') was best remembered. An analogous analysis revealed which themes were less-precisely remembered. Here in computing the weighted average over events' topic vectors, we weighted each event in \textit{inverse} proportion to its average precision (Fig.~\ref{fig:topics}B, red box). The least precisely remembered episode content reflected information that was extraneous to the episode's essence, such as the proper names of relatively minor characters (e.g., ``Mike,'' ``Molly,'' and ``Lestrade'') and locations (e.g., ``St. Bartholomew's Hospital''). +In addition to enabling us to visualize the episode's high-level essence, describing the episode as a geometric trajectory also enables us to drill down to individual words and quantify how each word relates to the memorability of each event. This provides another approach to examining participants' recall for low-level details beyond the precision and distinctiveness measures we defined above. The results displayed in Figures \ref{fig:list-learning}C and \ref{fig:precision-detail}A suggest that certain events were remembered better than others. Given this, we next asked whether the events that were generally remembered precisely or imprecisely tended to reflect particular content. Because our analysis framework projects the dynamic episode content and participants' recalls into a shared space, and because the dimensions of that space represent topics (which are, in turn, sets of weights over known words in the vocabulary), we are able to recover the weighted combination of words that make up any point (i.e., topic vector) in this space. We first computed the average precision with which participants recalled each of the 30 episode events (Fig.~\ref{fig:topics}A; note that this result is analogous to a serial position curve created from our precision metric). We then computed a weighted average of the topic vectors for each episode event, where the weights reflected how precisely each event was recalled. To visualize the result, we created a ``wordle'' image~\citep{MuelEtal18} where words weighted more heavily by more precisely remembered topics appear in a larger font (Fig.~\ref{fig:topics}B, green box). 
Across the full episode, content that weighted heavily on topics and words central to the major foci of the episode (e.g., the names of the two main characters, ``Sherlock'' and ``John,'' and the address of a major recurring location, ``221B Baker Street'') was best remembered. An analogous analysis revealed which themes were less-precisely remembered. Here, in computing the weighted average over events' topic vectors, we weighted each event in inverse proportion to its average precision (Fig.~\ref{fig:topics}B, red box). The least precisely remembered episode content reflected information that was extraneous to the episode's essence, such as the proper names of relatively minor characters (e.g., ``Mike,'' ``Molly,'' and ``Lestrade'') and locations (e.g., ``St. Bartholomew's Hospital''). \begin{figure}[tp] \centering @@ -142,105 +145,99 @@ \section*{Results} A similar result emerged from assessing the topic vectors for individual episode and recall events (Fig.~\ref{fig:topics}C). Here, for each of the three most and least precisely remembered episode events, we have constructed two wordles: one from the original episode event's topic vector (left) and a second from the average recall topic vector for that event (right). The three most precisely remembered events (circled in green) correspond to scenes integral to the central plot-line: a mysterious figure spying on John in a phone booth; John meeting Sherlock at Baker St.~to discuss the murders; and Sherlock laying a trap to catch the killer. Meanwhile, the least precisely remembered events (circled in red) reflect scenes that comprise minor plot points: a video of singing cartoon characters that participants viewed in an introductory clip prior to the main episode; John asking Molly about Sherlock's habit of over-analyzing people; and Sherlock noticing evidence of Anderson's and Donovan's affair. -The results this far inform us about which aspects of the dynamic content in the episode participants watched were preserved or altered in participants' memories. We next carried out a series of analyses aimed at understanding which brain structures might facilitate these preservations and transformations between the participants' shared experience of watching the episode and their subsequent memories of the episode. In the first analysis, we sought to identify brain structures that were sensitive to the dynamic unfolding of the episode's content, as characterized by its topic trajectory. We used a searchlight procedure to identify clusters of voxels whose activity patterns displayed a proximal temporal correlation structure (as participants watched the episode) matching that of the original episode's topic proportions (Fig.~\ref{fig:brainz}A; see \textit{Methods} for additional details). In a second analysis, we sought to identify brain structures whose responses (during episode viewing) reflected how each participant would later structure their \textit{recounting} of the episode. We used a searchlight procedure to identify clusters of voxels whose proximal temporal correlation matrices matched that of the topic proportions matrix for each participant's recall transcript (Figs.~\ref{fig:brainz}B; see \textit{Methods} for additional details). 
To ensure our searchlight procedure identified regions \textit{specifically} sensitive to the temporal structure of the episode or recalls (i.e., rather than those with a temporal autocorrelation length similar to that of the episode and recalls), we performed a phase shift-based permutation correction (see \textit{Methods}). As shown in Figure~\ref{fig:brainz}C, the episode-driven searchlight analysis revealed a distributed network of regions that may play a role in processing information relevant to the narrative structure of the episode. The recall-driven searchlight analysis revealed a second network of regions (Fig.~\ref{fig:brainz}D) that may facilitate a person-specific transformation of one's experience into memory. In identifying regions whose responses to ongoing experiences reflect how those experiences will be remembered later, this latter analysis extends classic \textit{subsequent memory effect analyses}~\citep[e.g.,][]{PallWagn02} to the domain of naturalistic experiences. +The results thus far inform us about which aspects of the dynamic content in the episode that participants watched were preserved or altered in participants' memories. We next carried out a series of analyses aimed at understanding which brain structures might facilitate these preservations and transformations between the participants' shared experience of watching the episode and their subsequent memories of the episode. In the first analysis, we sought to identify brain structures that were sensitive to the dynamic unfolding of the episode's content, as characterized by its topic trajectory. We used a searchlight procedure to identify clusters of voxels whose activity patterns displayed a proximal temporal correlation structure (as participants watched the episode) matching that of the original episode's topic proportions (Fig.~\ref{fig:brainz}A; see \textit{Methods} for additional details). In a second analysis, we sought to identify brain structures whose responses (during episode viewing) reflected how each participant would later structure their recounting of the episode. We used a searchlight procedure to identify clusters of voxels whose proximal temporal correlation matrices matched that of the topic proportions matrix for each participant's recall transcript (Fig.~\ref{fig:brainz}B; see \textit{Methods} for additional details). To ensure our searchlight procedure identified regions specifically sensitive to the temporal structure of the episode or recalls (i.e., rather than those with a temporal autocorrelation length similar to that of the episode and recalls), we performed a phase shift-based permutation correction (see \textit{Methods}). As shown in Figure~\ref{fig:brainz}C, the episode-driven searchlight analysis revealed a distributed network of regions that may play a role in processing information relevant to the narrative structure of the episode. The recall-driven searchlight analysis revealed a second network of regions (Fig.~\ref{fig:brainz}D) that may facilitate a person-specific transformation of one's experience into memory. In identifying regions whose responses to ongoing experiences reflect how those experiences will be remembered later, this latter analysis extends classic ``subsequent memory effect analyses''~\citep{PallWagn02} to the domain of naturalistic experiences. 
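+The core comparison inside each searchlight can be sketched as follows (an illustration only: it omits the dynamic time warping used to align recalls to the episode's timeline and the phase shift-based permutation correction, and the number of retained diagonals is a placeholder rather than the value used in the analyses).
\begin{verbatim}
import numpy as np

def proximal_indices(n_timepoints, n_diagonals=10):
    # Indices of the first n_diagonals diagonals above the main diagonal of a
    # timepoint-by-timepoint correlation matrix (the "proximal" correlations).
    rows, cols = np.triu_indices(n_timepoints, k=1)
    keep = (cols - rows) <= n_diagonals
    return rows[keep], cols[keep]

def proximal_similarity(model_timeseries, voxel_timeseries, n_diagonals=10):
    # Compare the proximal temporal correlation structure of a model (e.g., a
    # timepoint-by-topic matrix for the episode or for a participant's warped
    # recalls) with that of one searchlight cube's timepoint-by-voxel activity.
    # Both inputs must contain the same number of timepoints.
    rows, cols = proximal_indices(model_timeseries.shape[0], n_diagonals)
    model_corr = np.corrcoef(model_timeseries)  # timepoint-by-timepoint
    voxel_corr = np.corrcoef(voxel_timeseries)  # timepoint-by-timepoint
    return np.corrcoef(model_corr[rows, cols], voxel_corr[rows, cols])[0, 1]
\end{verbatim}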
\begin{figure}[tp] \centering \includegraphics[width=1\textwidth]{figs/searchlights} -\caption{\small \textbf{Brain structures that underlie the transformation of experience into memory.} \textbf{A.} We isolated the proximal diagonals from the upper triangle of the episode correlation matrix, and applied this same diagonal mask to the voxel response correlation matrix for each cube of voxels in the brain. We then searched for brain regions whose activation timeseries consistently exhibited a similar proximal correlational structure to the episode model, across participants. \textbf{B.} We used dynamic time warping \citep{BernClif94} to align each participant's recall timeseries to the TR timeseries of the episode. We then applied the same diagonal mask used in Panel A to isolate the proximal temporal correlations and searched for brain regions whose activation timeseries for an individual consistently exhibited a similar proximal correlational structure to each individual's recall. \textbf{C.} We identified a network of regions sensitive to the narrative structure of participants' ongoing experience. The map shown is thresholded at $p < 0.05$, corrected. The top ten \texttt{Neurosynth} terms displayed in the panel were computed using the unthresholded map. \textbf{D}. We also identified a network of regions sensitive to how individuals would later structure the episode's content in their recalls. The map shown is thresholded at $p < 0.05$, corrected. The top ten \texttt{Neurosynth} terms displayed in the panel were computed using the unthresholded map.} +\caption{\small \textbf{Brain structures that underlie the transformation of experience into memory.} \textbf{A.} We isolated the proximal diagonals from the upper triangle of the episode correlation matrix, and applied this same diagonal mask to the voxel response correlation matrix for each cube of voxels in the brain. We then searched for brain regions whose activation timeseries consistently exhibited a similar proximal correlational structure to the episode model, across participants. \textbf{B.} We used dynamic time warping \citep{BernClif94} to align each participant's recall timeseries to the TR timeseries of the episode. We then computed the temporal correlation matrix of each participant's warped recalls. Next, we applied the same diagonal mask used in Panel A to isolate the proximal temporal correlations and searched for brain regions whose activation timeseries for each participant consistently exhibited a similar proximal correlational structure to that participant's recalls. \textbf{C.} We identified a network of regions sensitive to the narrative structure of participants' ongoing experience. The map shown is thresholded at $p < 0.05$, corrected. The top ten \texttt{Neurosynth} terms displayed in the panel were computed using the unthresholded map. \textbf{D}. We also identified a network of regions sensitive to how individuals would later structure the episode's content in their recalls. The map shown is thresholded at $p < 0.05$, corrected. The top ten \texttt{Neurosynth} terms displayed in the panel were computed using the unthresholded map.} \label{fig:brainz} \end{figure} -The searchlight analyses described above yielded two distributed networks of brain regions whose activity timecourses tracked with the temporal structure of the episode (Fig.~\ref{fig:brainz}C) or participants' subsequent recalls (Fig.~\ref{fig:brainz}D). 
We next sought to gain greater insight into the structures and functional networks our results reflected. To accomplish this, we performed an additional, exploratory analysis using \texttt{Neurosynth} \citep{YarkEtal11}. Given an arbitrary statistical map as input, \texttt{Neurosynth} performs a massive automated meta-analysis, returning a ranked list of terms frequently used in neuroimaging papers that report similar statistical maps. We ran \texttt{Neurosynth} on the (unthresholded) permutation-corrected maps for the episode- and recall-driven searchlight analyses. The top ten terms with maximally similar meta-analysis images identified by \texttt{Neurosynth} are shown in Figure \ref{fig:brainz}. +The searchlight analyses described above yielded two distributed networks of brain regions whose activity timecourses tracked with the temporal structure of the episode (Fig.~\ref{fig:brainz}C) or participants' subsequent recalls (Fig.~\ref{fig:brainz}D). We next sought to gain greater insight into the structures and functional networks our results reflected. To accomplish this, we performed an additional, exploratory analysis using \texttt{Neurosynth}~\citep{YarkEtal11}. Given an arbitrary statistical map as input, \texttt{Neurosynth} performs a massive automated meta-analysis, returning a frequency-ranked list of terms used in neuroimaging papers that report similar statistical maps. We ran \texttt{Neurosynth} on the (unthresholded) permutation-corrected maps for the episode- and recall-driven searchlight analyses. The top ten terms with maximally similar meta-analysis images identified by \texttt{Neurosynth} are shown in Figure \ref{fig:brainz}. \section*{Discussion} \label{sec:discussion} -Explicitly modeling the dynamic content of a naturalistic stimulus and participants' memories enabled us to connect the present study of naturalistic recall with an extensive prior literature that has used list-learning paradigms to study memory~\citep[for review see][]{Kaha12}, as in Figure~\ref{fig:list-learning}. We found some similarities between how participants in the present study recounted a television episode and how participants typically recall memorized random word lists. However, our broader claim is that word lists miss out on fundamental aspects of naturalistic memory more like the sort of memory we rely on in everyday life. For example, there are no random word list analogs of character interactions, conceptual dependencies between temporally distant episode events, the sense of solving a mystery that pervades the \textit{Sherlock} episode, or the myriad other features of the episode that convey deep meaning and capture interest. Nevertheless, each of these properties affects how people process and engage with the episode as they are watching it, and how they remember it later. The overarching goal of the present study is to characterize how the rich dynamics of the episode affect the rich behavioral and neural dynamics of how people remember it. +Explicitly modeling the dynamic content of a naturalistic stimulus and participants' memories enabled us to connect the present study of naturalistic recall with an extensive prior literature that has used list-learning paradigms to study memory~\citep{Kaha12}, as in Figure~\ref{fig:list-learning}. We found some similarities between how participants in the present study recounted a television episode and how participants typically recall memorized random word lists. 
However, our broader claim is that word lists miss out on fundamental aspects of naturalistic memory, which is more like the sort of memory we rely on in everyday life. For example, there are no random word list analogs of character interactions, conceptual dependencies between temporally distant episode events, the sense of solving a mystery that pervades the \textit{Sherlock} episode, or the myriad other features of the episode that convey deep meaning and capture interest. Nevertheless, each of these properties affects how people process and engage with the episode as they are watching it, and how they remember it later. The overarching goal of the present study is to characterize how the rich dynamics of the episode affect the rich behavioral and neural dynamics of how people remember it. -Our work casts remembering as reproducing (behaviorally and neurally) the topic trajectory, or ``shape,'' of an experience. When we characterized memory for a television episode using this framework, we found that every participant's recounting of the episode recapitulated the low spatial frequency details of the shape of its trajectory through topic space (Fig.~\ref{fig:trajectory}). We termed this narrative scaffolding the episode's \textit{essence}. Where participants' behaviors varied most was in their tendencies to recount specific low-level details from each episode event. Geometrically, this appears as high spatial frequency distortions in participants' recall trajectories relative to the trajectory of the original episode (Fig.~\ref{fig:topics}). We developed metrics to characterize the precision (recovery of any and all event-level information) and distinctiveness (recovery of event-specific information). We also used word cloud visualizations to interpret the details of these event-level distortions. +Our work casts remembering as reproducing (behaviorally and neurally) the topic trajectory, or ``shape,'' of an experience, thereby drawing implicit analogies between mentally navigating through word embedding spaces and physically navigating through spatial environments~\citep{BellEtal18, BellEtal20, ConsEtal16}. When we characterized memory for a television episode using this framework, we found that every participant's recounting of the episode recapitulated the low spatial frequency details of the shape of its trajectory through topic space (Fig.~\ref{fig:trajectory}). We termed this narrative scaffolding the episode's essence. Where participants' behaviors varied most was in their tendencies to recount specific low-level details from each episode event. Geometrically, this appears as high spatial frequency distortions in participants' recall trajectories relative to the trajectory of the original episode (Fig.~\ref{fig:topics}). We developed metrics to characterize the precision (recovery of any and all event-level information) and distinctiveness (recovery of event-specific information). We also used word cloud visualizations to interpret the details of these event-level distortions. -The neural analyses we carried out (Fig.~\ref{fig:brainz}) also leveraged our geometric framework for characterizing the shapes of the episode and participants' recountings. We identified one network of regions whose responses tracked with temporal correlations in the conceptual content of the episode (as quantified by topic models applied to a set of annotations about the episode). This network included orbitofrontal cortex, ventromedial prefrontal cortex, and striatum, among others. 
As reviewed by \cite{RangRitc12}, several of these regions are members of the \textit{anterior temporal system}, which has been implicated in assessing and processing the familiarity of ongoing experiences, emotions, social cognition, and reward. A second network we identified tracked with temporal correlations in the idiosyncratic conceptual content of participants' subsequent recountings of the episode. This network included occipital cortex, extrastriate cortex, fusiform gyrus, and the precuneus. Several of these regions are members of the \textit{posterior medial system}~\citep{RangRitc12}, which has been implicated in matching incoming cues about the current situation to internally maintained \textit{situation models} that specify the parameters and expectations inherent to the current situation~\citep[also see][]{ZackEtal07, ZwaaRadv98}. Taken together, our results support the notion that these two (partially overlapping) networks work in coordination to make sense of our ongoing experiences, distort them in a way that links them with our prior knowledge and experiences, and encodes those distorted representations into memory for our later use. +The neural analyses we carried out (Fig.~\ref{fig:brainz}) also leveraged our geometric framework for characterizing the shapes of the episode and participants' recountings. We identified one network of regions whose responses tracked with temporal correlations in the conceptual content of the episode (as quantified by topic models applied to a set of annotations about the episode). This network included orbitofrontal cortex, ventromedial prefrontal cortex, and striatum, among others. As reviewed by Ranganath and Ritchey (2012)~\cite{RangRitc12}, several of these regions are members of the ``anterior temporal system,'' which has been implicated in assessing and processing the familiarity of ongoing experiences, emotions, social cognition, and reward. A second network we identified tracked with temporal correlations in the idiosyncratic conceptual content of participants' subsequent recountings of the episode. This network included occipital cortex, extrastriate cortex, fusiform gyrus, and the precuneus. Several of these regions are members of the ``posterior medial system''~\citep{RangRitc12}, which has been implicated in matching incoming cues about the current situation to internally maintained ``situation models'' that specify the parameters and expectations inherent to the current situation~\citep{ZackEtal07, ZwaaRadv98}. Taken together, our results support the notion that these two (partially overlapping) networks work in coordination to make sense of our ongoing experiences, distort them in a way that links them with our prior knowledge and experiences, and encode those distorted representations into memory for our later use. Our work also provides a potential framework for modeling and elucidating ``memory schemas''---i.e., cognitive abstractions that may be applied to multiple related experiences~\citep{GilbMarl17, BaldEtal18}. For example, the event-level geometric scaffolding of an experience (e.g., Fig.~\ref{fig:trajectory}A) might reflect its underlying schema, and experiences that share similar schemas might have similar shapes. This could also help explain how brain structures including the ventromedial prefrontal cortex~\citep{GilbMarl17} (Fig.~\ref{fig:brainz}) might acquire or apply schema knowledge across different experiences (i.e., by learning patterns in the schema's shape). 
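+One speculative way to operationalize this idea (not an analysis reported here; the resampling granularity and the use of a Procrustes alignment are illustrative choices) would be to quantify the similarity between two experiences' trajectory shapes after resampling them to a common number of points.
\begin{verbatim}
import numpy as np
from scipy.spatial import procrustes

def resample_trajectory(trajectory, n_points=30):
    # Linearly resample an (events x dimensions) trajectory to n_points so
    # that trajectories with different numbers of events can be compared.
    old = np.linspace(0.0, 1.0, trajectory.shape[0])
    new = np.linspace(0.0, 1.0, n_points)
    return np.column_stack([np.interp(new, old, trajectory[:, dim])
                            for dim in range(trajectory.shape[1])])

def shape_disparity(trajectory_a, trajectory_b, n_points=30):
    # Procrustes disparity after resampling: lower values indicate more
    # similar trajectory shapes (and, on this account, more similar schemas).
    a = resample_trajectory(trajectory_a, n_points)
    b = resample_trajectory(trajectory_b, n_points)
    _, _, disparity = procrustes(a, b)
    return disparity
\end{verbatim}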
-Our general approach draws inspiration from prior work aimed at elucidating the neural and behavioral underpinnings of how we process dynamic naturalistic experiences and remember them later. Our approach to identifying neural responses to naturalistic stimuli (including experiences) entails building an explicit model of the stimulus dynamics and searching for brain regions whose responses are consistent with the model~\citep[also see][]{HuthEtal12, HuthEtal16}. In prior work, a series of studies from Uri Hasson's group~\citep{LernEtal11, SimoEtal16, ChenEtal17, BaldEtal17, ZadbEtal17} have presented a clever alternative approach: rather than building an explicit stimulus model, these studies instead search for brain responses to the stimulus that are reliably similar across individuals. So called \textit{inter-subject correlation} (ISC) and \textit{inter-subject functional connectivity} (ISFC) analyses effectively treat other people's brain responses to the stimulus as a ``model'' of how its features change over time~\citep[also see][]{SimoChan20}. These purely brain-driven approaches are well suited to identifying which brain structures exhibit similar stimulus-driven responses across individuals. Further, because neural response dynamics are observed data (rather than model approximations), such approaches do not require a detailed understanding of which stimulus properties or features might be driving the observed responses. However, this also means that the specific stimulus features driving those responses are typically opaque to the researcher. Our approach is complementary. By explicitly modeling the stimulus dynamics, we are able to relate specific stimulus features to behavioral and neural dynamics. However, when our model fails to accurately capture the stimulus dynamics that are truly driving behavioral and neural responses, our approach necessarily yields an incomplete characterization of the neural basis of the processes we are studying. +Our general approach draws inspiration from prior work aimed at elucidating the neural and behavioral underpinnings of how we process dynamic naturalistic experiences and remember them later. Our approach to identifying neural responses to naturalistic stimuli (including experiences) entails building an explicit model of the stimulus dynamics and searching for brain regions whose responses are consistent with the model~\citep{HuthEtal12, HuthEtal16}. Building an explicit model of these dynamics also enables us to match up different people's recountings of a common shared experience, despite individual differences~\cite{GagnEtal20}. In prior work, a series of studies from Uri Hasson's group~\citep{LernEtal11, SimoEtal16, ChenEtal17, BaldEtal17, ZadbEtal17} have presented a clever alternative approach: rather than building an explicit stimulus model, these studies instead search for brain responses to the stimulus that are reliably similar across individuals. So called ``inter-subject correlation'' (ISC) and ``inter-subject functional connectivity'' (ISFC) analyses effectively treat other people's brain responses to the stimulus as a ``model'' of how its features change over time~\citep{SimoChan20}. These purely brain-driven approaches are well suited to identifying which brain structures exhibit similar stimulus-driven responses across individuals. 
Further, because neural response dynamics are observed data (rather than model approximations), such approaches do not require a detailed understanding of which stimulus properties or features might be driving the observed responses. However, this also means that the specific stimulus features driving those responses are typically opaque to the researcher. Our approach is complementary. By explicitly modeling the stimulus dynamics, we are able to relate specific stimulus features to behavioral and neural dynamics. However, when our model fails to accurately capture the stimulus dynamics that are truly driving behavioral and neural responses, our approach necessarily yields an incomplete characterization of the neural basis of the processes we are studying. -Other recent work has used HMMs to discover latent event structure in neural responses to naturalistic stimuli~\citep{BaldEtal17}. By applying HMMs to our explicit models of stimulus and memory dynamics, we gain a more direct understanding of those state dynamics. For example, we found that although the events comprising each participant's recalls recapitulated the episode's essence, participants differed in the \textit{resolution} of their recounting of low-level details. In turn, these individual behavioral differences were reflected in differences in neural activity dynamics as participants watched the television episode. +Other recent work has used HMMs to discover latent event structure in neural responses to naturalistic stimuli~\citep{BaldEtal17}. By applying HMMs to our explicit models of stimulus and memory dynamics, we gain a more direct understanding of those state dynamics. For example, we found that although the events comprising each participant's recalls recapitulated the episode's essence, participants differed in the resolution of their recounting of low-level details. In turn, these individual behavioral differences were reflected in differences in neural activity dynamics as participants watched the television episode. Our approach also draws inspiration from the growing field of word embedding models. The topic models~\citep{BleiEtal03} we used to embed text from the episode annotations and participants' recall transcripts are just one of many models that have been studied in an extensive literature. The earliest approaches to word embedding, including latent -semantic analysis~\citep{LandDuma97}, used word co-occurrence statistics (i.e., how often pairs of words occur in the same documents contained in the corpus) to derive a unique feature vector for each word. The feature vectors are constructed so that words that co-occur more frequently have feature vectors that are closer (in Euclidean distance). Topic models are essentially an extension of those early models, in that they attempt to explicitly model the underlying causes of word co-occurrences by automatically identifying the set of themes or topics reflected across the documents in the corpus. More recent work on these types of semantic models, including word2vec~\citep{MikoEtal13a}, the Universal Sentence Encoder~\citep{CerEtal18}, GPT-2~\citep{RadfEtal19}, and GTP-3~\citep{BrowEtal20} use deep neural networks to attempt to identify the deeper conceptual representations underlying each word. Despite the growing popularity of these sophisticated deep learning-based embedding models, we chose to prioritize interpretability of the embedding dimensions (e.g., Fig.~\ref{fig:topics}) over raw performance (e.g., with respect to some predefined benchmark). 
Nevertheless, we note that our general framework is, in principle, robust to the specific choice of language model as well as other aspects of our computational pipeline. For example, the word embedding model, timeseries segmentation model, and the episode-recall matching function could each be customized to suit a particular question space or application. Indeed, for some questions, interpretability of the embeddings may not be a priority, and thus other text embedding approaches (including the deep learning-based models described above) may be preferable. Further work will be needed to explore the influence of particular models on our framework's predictions and performance. +semantic analysis~\citep{LandDuma97}, used word co-occurrence statistics (i.e., how often pairs of words occur in the same documents contained in the corpus) to derive a unique feature vector for each word. The feature vectors are constructed so that words that co-occur more frequently have feature vectors that are closer (in Euclidean distance). Topic models are essentially an extension of those early models, in that they attempt to explicitly model the underlying causes of word co-occurrences by automatically identifying the set of themes or topics reflected across the documents in the corpus. More recent work on these types of semantic models, including word2vec~\citep{MikoEtal13a}, the Universal Sentence Encoder~\citep{CerEtal18}, and Generative Pre-trained Transformers (e.g., GPT-2~\citep{RadfEtal19} and GPT-3~\citep{BrowEtal20}) uses deep neural networks to attempt to identify the deeper conceptual representations underlying each word. Despite the growing popularity of these sophisticated deep learning-based embedding models, we chose to prioritize interpretability of the embedding dimensions (e.g., Fig.~\ref{fig:topics}) over raw performance (e.g., with respect to some predefined benchmark). Nevertheless, we note that our general framework is, in principle, robust to the specific choice of language model as well as other aspects of our computational pipeline. For example, the word embedding model, timeseries segmentation model, and the episode-recall matching function could each be customized to suit a particular question space or application. Indeed, for some questions, interpretability of the embeddings may not be a priority, and thus other text embedding approaches (including the deep learning-based models described above) may be preferable. Further work will be needed to explore the influence of particular models on our framework's predictions and performance. -Our work has broad implications for how we characterize and assess memory in real-world settings, such as the classroom or physician's office. For example, the most commonly used classroom evaluation tools involve simply computing the proportion of correctly answered exam questions. Our work indicates that this approach is only loosely related to what educators might really want to measure: how well did the students understand the key ideas presented in the course? Under this typical framework of assessment, the same exam score of 50\% could be ascribed to two very different students: one who attended to the full course but struggled to learn more than a broad overview of the material, and one who attended to only half of the course but understood the attended material perfectly. Instead, one could apply our computational framework to build explicit dynamic content models of the course material and exam questions. 
This approach would provide a more nuanced and specific view into which aspects of the material students had learned well (or poorly). In clinical settings, memory measures that incorporate such explicit content models might also provide more direct evaluations of patients' memories, and of doctor-patient interactions. +Speculatively, our work may have broad implications for how we characterize and assess memory in real-world settings, such as the classroom or physician's office. For example, the most commonly used classroom evaluation tools involve simply computing the proportion of correctly answered exam questions. Our work suggests that this approach is only loosely related to what educators might really want to measure: how well did the students understand the key ideas presented in the course? Under this typical framework of assessment, the same exam score of 50\% could be ascribed to two very different students: one who attended to the full course but struggled to learn more than a broad overview of the material, and one who attended to only half of the course but understood the attended material perfectly. Instead, one could apply our computational framework to build explicit dynamic content models of the course material and exam questions. This approach might provide a more nuanced and specific view into which aspects of the material students had learned well (or poorly). In clinical settings, memory measures that incorporate such explicit content models might also provide more direct evaluations of patients' memories, and of doctor-patient interactions. \section*{Methods} \label{sec:methods} \subsection*{Paradigm and data collection} -Data were collected by \cite{ChenEtal17}. In brief, participants ($n=22$) viewed the first 48 minutes of ``A Study in Pink,'' the first episode of the BBC television show \textit{Sherlock}, while fMRI volumes were collected (TR = 1500~ms). Participants were pre-screened to ensure they had never seen any episode of the show before. The stimulus was divided into a 23~min (946~TR) and a 25~min (1030~TR) segment to mitigate technical issues related to the scanner. After finishing the clip, participants were instructed to \citep[quoting from][]{ChenEtal17} ``describe what they recalled of the [episode] in as much detail as they could, to try to recount events in the original order they were viewed in, and to speak for at least 10 minutes if possible but that longer was better. They were told that completeness and detail were more important than temporal order, and that if at any point they realized they had missed something, to return to it. Participants were then allowed to speak for as long as they wished, and verbally indicated when they were finished (e.g., `I’m done').'' Five participants were dropped from the original dataset due to excessive head motion (2 participants), insufficient recall length (2 participants), or falling asleep during stimulus viewing (1 participant), resulting in a final sample size of $n=17$. For additional details about the testing procedures and scanning parameters, see \cite{ChenEtal17}. The testing protocol was approved by Princeton University's Institutional Review Board. - -After preprocessing the fMRI data and warping the images into a standard (3~mm$^3$ MNI) space, the voxel activations were $z$-scored (within voxel) and spatially smoothed using a 6~mm (full width at half maximum) Gaussian kernel. The fMRI data were also cropped so that all episode-viewing data were aligned across participants. 
This included a constant 3 TR (4.5~s) shift to account for the lag in the hemodynamic response. \citep[All of these preprocessing steps followed][where additional details may be found.]{ChenEtal17} +Data were collected by Chen et al. (2017)~\citep{ChenEtal17}. In brief, participants ($n=22$) viewed the first 48 minutes of ``A Study in Pink,'' the first episode of the BBC television show \textit{Sherlock}, while fMRI volumes were collected (TR = 1500~ms). Participants were pre-screened to ensure they had never seen any episode of the show before. The stimulus was divided into a 23~min (946~TR) and a 25~min (1030~TR) segment to mitigate technical issues related to the scanner. After finishing the clip, participants were instructed to ``describe what they recalled of the [episode] in as much detail as they could, to try to recount events in the original order they were viewed in, and to speak for at least 10 minutes if possible but that longer was better. They were told that completeness and detail were more important than temporal order, and that if at any point they realized they had missed something, to return to it. Participants were then allowed to speak for as long as they wished, and verbally indicated when they were finished (e.g., `I’m done').''~\citep{ChenEtal17} Five participants were dropped from the original dataset due to excessive head motion (2 participants), insufficient recall length (2 participants), or falling asleep during stimulus viewing (1 participant), resulting in a final sample size of $n=17$. For additional details about the testing procedures and scanning parameters, see Chen et al. (2017)~\cite{ChenEtal17}. The testing protocol was approved by Princeton University's Institutional Review Board. -The video stimulus was divided into 1,000 fine-grained ``time segments'' and annotated by an independent coder. For each of these 1,000 annotations, the following information was recorded: a brief narrative description of what was happening, the location where the time segment took place, whether that location was indoors or outdoors, the names of all characters on-screen, the name(s) of the character(s) in focus in the shot, the name(s) of the character(s) currently speaking, the camera angle of the shot, a transcription of any text appearing on-screen, and whether or not there was music present in the background. Each time segment was also tagged with its onset and offset time, in both seconds and TRs. +After preprocessing the fMRI data and warping the images into a standard (3~mm$^3$ MNI) space, the voxel activations were $z$-scored (within voxel) and spatially smoothed using a 6~mm (full width at half maximum) Gaussian kernel. The fMRI data were also cropped so that all episode-viewing data were aligned across participants. This included a constant 3 TR (4.5~s) shift to account for the lag in the hemodynamic response. All of these preprocessing steps followed Chen et al. (2017)~\citep{ChenEtal17}, where additional details may be found. -\subsection*{Data and code availability} -The fMRI data we analyzed are available online \href{http://dataspace.princeton.edu/jspui/handle/88435/dsp01nz8062179}{\underline{here}}. The behavioral data and all of our analysis code may be downloaded \href{https://github.com/ContextLab/sherlock-topic-model-paper}{\underline{here}}. +The video stimulus was divided into 1000 fine-grained ``time segments'' and annotated by an independent coder. 
For each of these 1000 annotations, the following information was recorded: a brief narrative description of what was happening, the location where the time segment took place, whether that location was indoors or outdoors, the names of all characters on-screen, the name(s) of the character(s) in focus in the shot, the name(s) of the character(s) currently speaking, the camera angle of the shot, a transcription of any text appearing on-screen, and whether or not there was music present in the background. Each time segment was also tagged with its onset and offset time, in both seconds and TRs. \subsection*{Statistics} -All statistical tests performed in the behavioral analyses were two-sided. All statistical tests performed in the neural data analyses were two-sided, except for the permutation-based thresholding, which was one-sided. In this case, we were specifically interested in identifying voxels whose activation time series reflected the temporal structure of the episode and recall topic proportions matrices to a \textit{greater} extent than that of the phase-shifted matrices. +All statistical tests performed in the behavioral analyses were two-sided. All statistical tests performed in the neural data analyses were two-sided, except for the permutation-based thresholding, which was one-sided. In this case, we were specifically interested in identifying voxels whose activation time series reflected the temporal structure of the episode and recall topic proportions matrices to a greater extent than that of the phase-shifted matrices. The 95\% confidence intervals we reported for each correlation were estimated by generating 10000 ``bootstrap'' distributions of correlation coefficients by sampling (with replacement) from the observed data. \subsection*{Modeling the dynamic content of the episode and recall transcripts} \subsubsection*{Topic modeling} -The input to the topic model we trained to characterize the dynamic content of the episode comprised 998 hand-generated annotations of short (mean: 2.96s) time segments spanning the video clip~(\citealp{ChenEtal17} generated 1000 annotations total; we removed two annotations referring to a break between the first and second scan sessions, during which no fMRI data were collected). We concatenated the text for all of the annotated features within each segment, creating a ``bag of words'' describing its content and performed some minor preprocessing (e.g., stemming possessive nouns and removing punctuation). We then re-organized the text descriptions into overlapping sliding windows spanning (up to) 50 annotations each. In other words, we estimated the ``context'' for each annotated segment using the text descriptions of the preceding 25 annotations, the present annotations, and the following 24 annotations. To model the context for annotations near the beginning of the episode (i.e., within 25 of the beginning or end), we created overlapping sliding windows that grew in size from one annotation to the full length. We also tapered the sliding window lengths at the end of the episode, whereby time segments within fewer than 24 annotations of the end of the episode were assigned sliding windows that extended to the end of the episode. This procedure ensured that each annotation's content was represented in the text corpus an equal number of times. 
+The input to the topic model we trained to characterize the dynamic content of the episode comprised 998 hand-generated annotations of short (mean: 2.96s) time segments spanning the video clip~(Chen et al., 2017~\citep{ChenEtal17} generated 1000 annotations total; we removed two annotations referring to a break between the first and second scan sessions, during which no fMRI data were collected). We concatenated the text for all of the annotated features within each segment, creating a ``bag of words'' describing its content, and performed some minor preprocessing (e.g., stemming possessive nouns and removing punctuation). We then re-organized the text descriptions into overlapping sliding windows spanning (up to) 50 annotations each. In other words, we estimated the ``context'' for each annotated segment using the text descriptions of the preceding 25 annotations, the present annotations, and the following 24 annotations. To model the context for annotations near the beginning of the episode (i.e., within 25 of the beginning or end), we created overlapping sliding windows that grew in size from one annotation to the full length. We also tapered the sliding window lengths at the end of the episode, whereby time segments within fewer than 24 annotations of the end of the episode were assigned sliding windows that extended to the end of the episode. This procedure ensured that each annotation's content was represented in the text corpus an equal number of times. -We trained our model using these overlapping text samples with \texttt{scikit-learn}~\citep[version 0.19.1; ][]{PedrEtal11}, called from our high-dimensional visualization and text analysis software, \texttt{HyperTools}~\citep{HeusEtal18a}. Specifically, we used the \texttt{CountVectorizer} class to transform the text from each window into a vector of word counts (using the union of all words across all annotations as the ``vocabulary,'' excluding English stop words); this yielded a number-of-windows by number-of-words \textit{word count} matrix. We then used the \texttt{LatentDirichletAllocation} class (topics=100, method=`batch') to fit a topic model~\citep{BleiEtal03} to the word count matrix, yielding a number-of-windows (1047) by number-of-topics (100) \textit{topic proportions} matrix. The topic proportions matrix describes the gradually evolving mix of topics (latent themes) present in each annotated time segment of the episode. Next, we transformed the topic proportions matrix to match the 1976 fMRI volume acquisition times. We assigned each topic vector to the timepoint (in seconds) midway between the beginning of the first annotation and the end of the last annotation in its corresponding sliding text window. By doing so, we warped the linear temporal distance between consecutive topic vectors to align with the inconsistent temporal distance between consecutive annotations (whose durations varied greatly). We then rescaled these timepoints to 1.5s TR units, and used linear interpolation to estimate a topic vector for each TR. This resulted in a number-of-TRs (1976) by number-of-topics (100) matrix. +We trained our model using these overlapping text samples with \texttt{scikit-learn} version 0.19.1~\citep{PedrEtal11}, called from our high-dimensional visualization and text analysis software, \texttt{HyperTools}~\citep{HeusEtal18a}. 
Specifically, we used the \texttt{CountVectorizer} class to transform the text from each window into a vector of word counts (using the union of all words across all annotations as the ``vocabulary,'' excluding English stop words); this yielded a number-of-windows by number-of-words ``word count'' matrix. We then used the \texttt{LatentDirichletAllocation} class (topics=100, method=`batch') to fit a topic model~\citep{BleiEtal03} to the word count matrix, yielding a number-of-windows (1047) by number-of-topics (100) ``topic proportions'' matrix. The topic proportions matrix describes the gradually evolving mix of topics (latent themes) present in each annotated time segment of the episode. Next, we transformed the topic proportions matrix to match the 1976 fMRI volume acquisition times. We assigned each topic vector to the timepoint (in seconds) midway between the beginning of the first annotation and the end of the last annotation in its corresponding sliding text window. By doing so, we warped the linear temporal distance between consecutive topic vectors to align with the inconsistent temporal distance between consecutive annotations (whose durations varied greatly). We then rescaled these timepoints to 1.5s TR units, and used linear interpolation to estimate a topic vector for each TR. This resulted in a number-of-TRs (1976) by number-of-topics (100) matrix. -We created similar topic proportions matrices using hand-annotated transcripts of each participant's verbal recall of the episode~\citep[annotated by][]{ChenEtal17}. We tokenized the transcript into a list of sentences, and then re-organized the list into overlapping sliding windows spanning (up to) 10 sentences each, analogously to how we parsed the episode annotations. In turn, we transformed each window's sentences into a word count vector (using the same vocabulary as for the episode model), then used the topic model already trained on the episode scenes to compute the most probable topic proportions for each sliding window. This yielded a number-of-windows (range: 83--312) by number-of-topics (100) topic proportions matrix for each participant. These reflected the dynamic content of each participant's recalls. Note: for details on how we selected the episode and recall window lengths and number of topics, see \textit{Supporting Information} and Figure~\topicopt. +We created similar topic proportions matrices using hand-annotated transcripts of each participant's verbal recall of the episode~\citep{ChenEtal17}. We tokenized the transcript into a list of sentences, and then re-organized the list into overlapping sliding windows spanning (up to) 10 sentences each, analogously to how we parsed the episode annotations. In turn, we transformed each window's sentences into a word count vector (using the same vocabulary as for the episode model), then used the topic model already trained on the episode scenes to compute the most probable topic proportions for each sliding window. This yielded a number-of-windows (range: 83--312) by number-of-topics (100) topic proportions matrix for each participant. These reflected the dynamic content of each participant's recalls. For details on how we selected the episode and recall window lengths and number of topics, see \textit{Supplementary Information} and Supplementary Figure~\topicopt. 
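For concreteness, the listing below gives a minimal Python sketch of this step. It is illustrative rather than an excerpt of our released analysis code: window construction and text preprocessing are simplified, the helper names (\texttt{fit\_episode\_topic\_model} and \texttt{resample\_to\_trs}) are ours, and the keyword arguments follow current \texttt{scikit-learn} conventions (\texttt{n\_components}, \texttt{learning\_method}).
\begin{verbatim}
# Illustrative sketch only; see the released analysis code for the full
# sliding-window construction and preprocessing pipeline.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

def fit_episode_topic_model(windows, n_topics=100):
    """windows: list of strings, one 'bag of words' per sliding window."""
    vectorizer = CountVectorizer(stop_words='english')
    word_counts = vectorizer.fit_transform(windows)    # windows x vocabulary
    lda = LatentDirichletAllocation(n_components=n_topics,
                                    learning_method='batch')
    topic_props = lda.fit_transform(word_counts)       # windows x topics
    return vectorizer, lda, topic_props

def resample_to_trs(topic_props, window_times_s, n_trs, tr_s=1.5):
    """Linearly interpolate window-level topic vectors onto TR times."""
    tr_times = np.arange(n_trs) * tr_s
    return np.column_stack([np.interp(tr_times, window_times_s, col)
                            for col in topic_props.T])
\end{verbatim}
The recall topic proportions matrices can be obtained analogously by vectorizing each participant's sliding windows with the same vocabulary and passing the resulting word counts through the fitted model's \texttt{transform} method.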
\subsubsection*{Segmenting topic proportions matrices into discrete events using hidden Markov Models} -We parsed the topic proportions matrices of the episode and participants' recalls into discrete events using hidden Markov Models~\citep[HMMs;][]{Rabi89}. Given the topic proportions matrix (describing the mix of topics at each timepoint) and a number of states, $K$, an HMM recovers the set of state transitions that segments the timeseries into $K$ discrete states. Following \cite{BaldEtal17}, we imposed an additional set of constraints on the discovered state transitions that ensured that each state was encountered exactly once (i.e., never repeated). We used the \texttt{BrainIAK} toolbox~\citep{Brainiak} to implement this segmentation. +We parsed the topic proportions matrices of the episode and participants' recalls into discrete events using hidden Markov Models (HMMs)~\citep{Rabi89}. Given the topic proportions matrix (describing the mix of topics at each timepoint) and a number of states, $K$, an HMM recovers the set of state transitions that segments the timeseries into $K$ discrete states. Following Baldassano et al. (2017)~\cite{BaldEtal17}, we imposed an additional set of constraints on the discovered state transitions that ensured that each state was encountered exactly once (i.e., never repeated). We used the \texttt{BrainIAK} toolbox~\citep{Brainiak} to implement this segmentation. -We used an optimization procedure to select the appropriate $K$ for each topic proportions matrix. Prior studies on narrative structure and processing have shown that we both perceive and internally represent the world around us at multiple, hierarchical timescales \citep[e.g.,][]{HassEtal08, LernEtal11, HassEtal15, ChenEtal17, BaldEtal17, BaldEtal18}. However, for the purposes of our framework, we sought to identify the single timeseries of event-representations that is emphasized \textit{most heavily} in the temporal structure of the episode and of each participant's recall. We quantified this as the set of $K$ states that maximized the similarity between topic vectors for timepoints comprising each state, while minimizing the similarity between topic vectors for timepoints across different states. Specifically, we computed (for each matrix) +We used an optimization procedure to select the appropriate $K$ for each topic proportions matrix. Prior studies on narrative structure and processing have shown that we both perceive and internally represent the world around us at multiple, hierarchical timescales~\citep{HassEtal08, LernEtal11, HassEtal15, ChenEtal17, BaldEtal17, BaldEtal18}. However, for the purposes of our framework, we sought to identify the single timeseries of event representations that was emphasized most heavily in the temporal structure of the episode and of each participant's recall. We quantified this as the set of $K$ states that maximized the similarity between topic vectors for timepoints comprising each state, while minimizing the similarity between topic vectors for timepoints across different states. Specifically, we computed (for each matrix) \[ \argmax_K \left[W_{1}(a, b)\right], \] -where $a$ was the distribution of within-state topic vector correlations, and $b$ was the distribution of across-state topic vector correlations . 
We computed the first Wasserstein distance ($W_{1}$; also known as \textit{Earth mover's distance}; \citealp{Dobr70, RamdEtal17}) between these distributions for a large range of possible $K$-values (range [2, 50]), and selected the $K$ that yielded the maximum value. Figure~\ref{fig:model}B displays the event boundaries returned for the episode, and Figure~\corrmats~displays the event boundaries returned for each participant's recalls. See Figure \kopt~for the optimization functions for the episode and recalls. After obtaining these event boundaries, we created stable estimates of the content represented in each event by averaging the topic vectors across timepoints between each pair of event boundaries. This yielded a number-of-events by number-of-topics matrix for the episode and recalls from each participant. +where $a$ was the distribution of within-state topic vector correlations, and $b$ was the distribution of across-state topic vector correlations. We computed the first Wasserstein distance ($W_{1}$, also known as ``Earth mover's distance''~\citep{Dobr70, RamdEtal17}) between these distributions for a large range of possible $K$-values (range [2, 50]), and selected the $K$ that yielded the maximum value. Figure~\ref{fig:model}B displays the event boundaries returned for the episode, and Supplementary Figure~\corrmats~displays the event boundaries returned for each participant's recalls. See Supplementary Figure~\kopt~for the optimization functions for the episode and recalls. After obtaining these event boundaries, we created stable estimates of the content represented in each event by averaging the topic vectors across timepoints between each pair of event boundaries. This yielded a number-of-events by number-of-topics matrix for the episode and recalls from each participant. \subsubsection*{Naturalistic extensions of classic list-learning analyses} In traditional list-learning experiments, participants view a list of items (e.g., words) and then recall the items later. Our episode-recall event matching approach affords us the ability to analyze memory in a similar way. The episode and recall events can be treated analogously to studied and recalled ``items'' in a list-learning study. We can then extend classic analyses of memory performance and dynamics (originally designed for list-learning experiments) to the more naturalistic episode recall task used in this study. -Perhaps the simplest and most widely used measure of memory performance is \textit{accuracy}---i.e., the proportion of studied (experienced) items (in this case, episode events) that the participant later remembered. \cite{ChenEtal17} used this method to rate each participant's memory quality by computing the proportion of (50, manually identified) scenes mentioned in their recall. We found a strong across-participants correlation between these independent ratings and the proportion of 30 HMM-identified episode events matched to participants' recalls (Pearson's $r(15) = 0.71, p = 0.002$). We further considered a number of more nuanced memory performance measures that are typically associated with list-learning studies. We also provide a software package, \texttt{Quail}, for carrying out these analyses~\citep{HeusEtal17b}. +Perhaps the simplest and most widely used measure of memory performance is ``accuracy''---i.e., the proportion of studied (experienced) items (in this case, episode events) that the participant later remembered. 
Chen et al.~(2017)~\cite{ChenEtal17} used this method to rate each participant's memory quality by computing the proportion of (50 manually identified) scenes mentioned in their recall. We found a strong across-participants correlation between these independent ratings and the proportion of 30 HMM-identified episode events matched to participants' recalls (Pearson's $r(15) = 0.71, p = 0.002,~95\%~\mathrm{CI} = [0.39, 0.88]$). We further considered a number of more nuanced memory performance measures that are typically associated with list-learning studies. We also provide a software package, \texttt{Quail}, for carrying out these analyses~\citep{HeusEtal17b}. \paragraph{Probability of first recall (PFR).} PFR curves~\citep{WelcBurn24, PostPhil65, AtkiShif68} reflect the probability that an item will be recalled first, as a function of its serial position during encoding. To carry out this analysis, we initialized a number-of-participants (17) by number-of-episode-events (30) matrix of zeros. Then, for each participant, we found the index of the episode event that was recalled first (i.e., the episode event whose topic vector was most strongly correlated with that of the first recall event) and filled in that index in the matrix with a 1. Finally, we averaged over the rows of the matrix, resulting in a 1 by 30 array representing the proportion of participants that recalled an event first, as a function of the order of the event's appearance in the episode (Fig.~\ref{fig:list-learning}A). -% ------ NOTE: ------ -% reiterate meaning of error ribbons in list-learning figure? (already noted in figure caption) -% - Paxton -\paragraph{Lag conditional probability curve (lag-CRP).} The lag-CRP curve~\citep{Kaha96} reflects the probability of recalling a given item after the just-recalled item, as a function of their relative encoding positions (or \textit{lag}). In other words, a lag of 1 indicates that a recalled item was presented immediately after the previously recalled item, and a lag of -3 indicates that a recalled item came 3 items before the previously recalled item. For each recall transition (following the first recall), we computed the lag between the current recall event and the next recall event, normalizing by the total number of possible transitions. This yielded a number-of-participants (17) by number-of-lags (-29 to +29; 58 lags total excluding lags of 0) matrix. We averaged over the rows of this matrix to obtain a group-averaged lag-CRP curve (Fig.~\ref{fig:list-learning}B). +\paragraph{Lag conditional probability curve (lag-CRP).} The lag-CRP curve~\citep{Kaha96} reflects the probability of recalling a given item after the just-recalled item, as a function of their relative encoding positions (lag). In other words, a lag of 1 indicates that a recalled item was presented immediately after the previously recalled item, and a lag of -3 indicates that a recalled item came 3 items before the previously recalled item. For each recall transition (following the first recall), we computed the lag between the current recall event and the next recall event, normalizing by the total number of possible transitions. This yielded a number-of-participants (17) by number-of-lags (-29 to +29; 58 lags total excluding lags of 0) matrix. We averaged over the rows of this matrix to obtain a group-averaged lag-CRP curve (Fig.~\ref{fig:list-learning}B). 
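As an illustration, a simplified version of this computation is sketched below; it is a stand-in for the \texttt{Quail}-based analysis (the function name and input format are ours) and assumes that each episode event is matched to at most one recall event.
\begin{verbatim}
# Illustrative lag-CRP sketch.  recall_orders maps each participant to the
# episode-event indices (0..n_events-1) matched to their successive recalls.
import numpy as np

def lag_crp(recall_orders, n_events=30):
    lags = np.arange(-(n_events - 1), n_events)          # -29 ... +29
    curves = np.full((len(recall_orders), len(lags)), np.nan)
    for p, recalls in enumerate(recall_orders):
        actual = np.zeros(len(lags))
        possible = np.zeros(len(lags))
        recalled = {recalls[0]}
        for prev, nxt in zip(recalls[:-1], recalls[1:]):
            for cand in range(n_events):         # transitions that could occur
                if cand not in recalled:
                    possible[cand - prev + n_events - 1] += 1
            actual[nxt - prev + n_events - 1] += 1   # transition that did occur
            recalled.add(nxt)
        with np.errstate(invalid='ignore', divide='ignore'):
            curves[p] = actual / possible            # lag 0 is left as NaN
    return lags, np.nanmean(curves, axis=0)          # group-averaged curve
\end{verbatim}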
\paragraph{Serial position curve (SPC).} SPCs~\citep{Murd62a} reflect the proportion of participants that remember each item as a function of the item's serial position during encoding. We initialized a number-of-participants (17) by number-of-episode-events (30) matrix of zeros. Then, for each recalled event, for each participant, we found the index of the episode event that the recalled event most closely matched (via the correlation between the events' topic vectors) and entered a 1 into that position in the matrix. This resulted in a matrix whose entries indicated whether or not each event was recalled by each participant (depending on whether the corresponding entries were set to one or zero). Finally, we averaged over the rows of the matrix to yield a 1 by 30 array representing the proportion of participants that recalled each event as a function of the events' order of appearance in the episode (Fig.~\ref{fig:list-learning}C). \paragraph{Temporal clustering scores.} Temporal clustering describes a participant's tendency to organize their recall sequences by the learned items' encoding positions. For instance, if a participant recalled the episode events in the exact order they occurred (or in exact reverse order), this would yield a score of 1. If a participant recalled the events in random order, this would yield an expected score of 0.5. For each recall event transition (and separately for each participant), we sorted all not-yet-recalled events according to their absolute lag (i.e., distance away in the episode). We then computed the percentile rank of the next event the participant recalled. We averaged these percentile ranks across all of the participant's recalls to obtain a single temporal clustering score for the participant. -\paragraph{Semantic clustering scores.} Semantic clustering describes a participant's tendency to recall semantically similar presented items together in their recall sequences. Here, we used the topic vectors for each event as a proxy for its semantic content. Thus, the similarity between the semantic content for two events can be computed by correlating their respective topic vectors. For each recall event transition, we sorted all not-yet-recalled events according to how correlated the topic vector \textit{of the closest-matching episode event} was to the topic vector of the closest-matching episode event to the just-recalled event. We then computed the percentile rank of the observed next recall. We averaged these percentile ranks across all of the participant's recalls to obtain a single semantic clustering score for the participant. +\paragraph{Semantic clustering scores.} Semantic clustering describes a participant's tendency to recall semantically similar presented items together in their recall sequences. Here, we used the topic vectors for each event as a proxy for its semantic content. Thus, the similarity between the semantic content for two events can be computed by correlating their respective topic vectors. For each recall event transition, we sorted all not-yet-recalled events according to how correlated the topic vector of the closest-matching episode event was to the topic vector of the closest-matching episode event to the just-recalled event. We then computed the percentile rank of the observed next recall. We averaged these percentile ranks across all of the participant's recalls to obtain a single semantic clustering score for the participant. 
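Both clustering scores rely on the same percentile-rank logic, illustrated by the sketch below (the helper name and its inputs are ours, and it simplifies the \texttt{Quail}-based analysis): supplying a matrix of absolute lags between episode events yields the temporal score, whereas supplying one minus the correlations between the events' topic vectors yields the semantic score.
\begin{verbatim}
# Illustrative percentile-rank clustering sketch.  distances[i, j] holds the
# dissimilarity between episode events i and j (absolute lag for the temporal
# score; 1 - topic-vector correlation for the semantic score).  Assumes each
# episode event is recalled at most once.
import numpy as np

def clustering_score(recalls, distances):
    n_events = distances.shape[0]
    remaining = set(range(n_events)) - {recalls[0]}
    ranks = []
    for prev, nxt in zip(recalls[:-1], recalls[1:]):
        cands = np.array(sorted(remaining))
        d = distances[prev, cands]
        # percentile rank of the event actually recalled next, relative to
        # all not-yet-recalled events (ties split evenly)
        worse = np.sum(d > distances[prev, nxt])
        ties = np.sum(d == distances[prev, nxt]) - 1
        ranks.append((worse + 0.5 * ties) / max(len(cands) - 1, 1))
        remaining.discard(nxt)
    return float(np.mean(ranks))
\end{verbatim}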
\subsubsection*{Averaging correlations} In all instances where we performed statistical tests involving precision or distinctiveness scores (Fig.~\ref{fig:precision-detail}), we used the Fisher $z$-transformation~\citep{Fish25} to stabilize the variance across the distribution of correlation values prior to performing the test. Similarly, when averaging precision or distinctiveness scores, we $z$-transformed the scores prior to computing the mean, and inverse $z$-transformed the result. \subsubsection*{Visualizing the episode and recall topic trajectories} -We used the UMAP algorithm~\citep{McInEtal18} to project the 100-dimensional topic space onto a two-dimensional space for visualization (Figs.~\ref{fig:trajectory}, \ref{fig:topics}). To ensure that all of the trajectories were projected onto the \textit{same} lower dimensional space, we computed the low-dimensional embedding on a ``stacked'' matrix created by vertically concatenating the events-by-topics topic proportions matrices for the episode, across-participants average recall and all 17 individual participants' recalls. We then separated the rows of the result (a total-number-of-events by two matrix) back into individual matrices for the episode topic trajectory, across-participant average recall trajectory, and the trajectories for each individual participant's recalls (Fig.~\ref{fig:trajectory}). This general approach for discovering a shared low-dimensional embedding for a collections of high-dimensional observations follows \cite{HeusEtal18a}. +We used the UMAP algorithm~\citep{McInEtal18} to project the 100-dimensional topic space onto a two-dimensional space for visualization (Figs.~\ref{fig:trajectory}, \ref{fig:topics}). To ensure that all of the trajectories were projected onto the same lower dimensional space, we computed the low-dimensional embedding on a ``stacked'' matrix created by vertically concatenating the events-by-topics topic proportions matrices for the episode, the across-participants average recalls, and all 17 individual participants' recalls. We then separated the rows of the result (a total-number-of-events by two matrix) back into individual matrices for the episode topic trajectory, the across-participant average recall trajectory, and the trajectories for each individual participant's recalls (Fig.~\ref{fig:trajectory}). This general approach for discovering a shared low-dimensional embedding for a collection of high-dimensional observations follows our prior work on manifold learning~\cite{HeusEtal18a}. -We optimized the manifold space for visualization based on two criteria: First, that the 2D embedding of the episode trajectory should reflect its original 100-dimensional structure as faithfully as possible. Second, that the path traversed by the embedded episode trajectory should intersect itself a minimal number of times. The first criteria helps bolster the validity of visual intuitions about relationships between sections of episode content, based on their locations in the embedding space. The second criteria was motivated by the observed low off-diagonal values in the episode trajectory's temporal correlation matrix (suggesting that the same topic-space coordinates should not be revisited; see Fig.~2A). For further details on how we created this low-dimensional embedding space, see \textit{Supporting Information}. 
+We optimized the manifold space for visualization based on two criteria: First, that the 2D embedding of the episode trajectory should reflect its original 100-dimensional structure as faithfully as possible. Second, that the path traversed by the embedded episode trajectory should intersect itself a minimal number of times. The first criterion helps bolster the validity of visual intuitions about relationships between sections of episode content, based on their locations in the embedding space. The second criterion was motivated by the observed low off-diagonal values in the episode trajectory's temporal correlation matrix (suggesting that the same topic-space coordinates should not be revisited; see Fig.~2A). For further details on how we created this low-dimensional embedding space, see \textit{Supplementary Information}. \subsubsection*{Estimating the consistency of flow through topic space across participants} -In Figure~\ref{fig:trajectory}B, we present an analysis aimed at characterizing locations in topic space that different participants move through in a consistent way (via their recall topic trajectories). The two-dimensional topic space used in our visualizations (Fig.~\ref{fig:trajectory}) comprised a $60 \times 60$ (arbitrary units) square. We tiled this space with a $50 \times 50$ grid of evenly spaced vertices, and defined a circular area centered on each vertex whose radius was two times the distance between adjacent vertices (i.e., 2.4 units). For each vertex, we examined the set of line segments formed by connecting each pair successively recalled events, across all participants, that passed through this circle. We computed the distribution of angles formed by those segments and the $x$-axis, and used a Rayleigh test to determine whether the distribution of angles was reliably ``peaked'' (i.e., consistent across all transitions that passed through that local portion of topic space). To create Figure~\ref{fig:trajectory}B, we drew an arrow originating from each grid vertex, pointing in the direction of the average angle formed by the line segments that passed within 2.4 units. We set the arrow lengths to be inversely proportional to the $p$-values of the Rayleigh tests at each vertex. Specifically, for each vertex we converted all of the angles of segments that passed within 2.4 units to unit vectors, and we set the arrow lengths at each vertex proportional to the length of the (circular) mean vector. We also indicated any significant results ($p < 0.05$, corrected using the Benjamani-Hochberg procedure) by coloring the arrows in blue (darker blue denotes a lower $p$-value, i.e., a longer mean vector); all tests with $p \geq 0.05$ are displayed in gray and given a lower opacity value. +In Figure~\ref{fig:trajectory}B, we present an analysis aimed at characterizing locations in topic space that different participants move through in a consistent way (via their recall topic trajectories; also see Supp.\ Fig.~\arrows). The two-dimensional topic space used in our visualizations (Fig.~\ref{fig:trajectory}) comprised a $60 \times 60$ (arbitrary units) square. We tiled this space with a $50 \times 50$ grid of evenly spaced vertices, and defined a circular area centered on each vertex whose radius was two times the distance between adjacent vertices (i.e., 2.4 units). For each vertex, we examined the set of line segments formed by connecting each pair of successively recalled events, across all participants, that passed through this circle. 
We computed the distribution of angles formed by those segments and the $x$-axis, and used a Rayleigh test to determine whether the distribution of angles was reliably ``peaked'' (i.e., consistent across all transitions that passed through that local portion of topic space). To create Figure~\ref{fig:trajectory}B, we drew an arrow originating from each grid vertex, pointing in the direction of the average angle formed by the line segments that passed within 2.4 units. We set the arrow lengths to be inversely proportional to the $p$-values of the Rayleigh tests at each vertex. Specifically, for each vertex we converted all of the angles of segments that passed within 2.4 units to unit vectors, and we set the arrow lengths at each vertex proportional to the length of the (circular) mean vector. We also indicated any significant results ($p < 0.05$, corrected using the Benjamini-Hochberg procedure) by coloring the arrows in blue (darker blue denotes a lower $p$-value, i.e., a longer mean vector); all tests with $p \geq 0.05$ are displayed in gray and given a lower opacity value. \subsection*{Searchlight fMRI analyses} -In Figure~\ref{fig:brainz}, we present two analyses aimed at identifying brain regions whose responses (as participants viewed the episode) exhibited a particular temporal structure. We developed a searchlight analysis wherein we constructed a $5 \times 5 \times 5$ cube of voxels~\citep[following][]{ChenEtal17} centered on each voxel in the brain, and for each of these cubes, computed the temporal correlation matrix of the voxel responses during episode viewing. Specifically, for each of the 1976 volumes collected during episode viewing, we correlated the activity patterns in the given cube with the activity patterns (in the same cube) collected during every other timepoint. This yielded a $1976 \times 1976$ correlation matrix for each cube. Note: participant 5's scan ended 75s early, and in~\cite{ChenEtal17}'s publicly released dataset, their scan data was zero-padded to match the length of the other participants'. For our searchlight analyses, we removed this padded data (i.e., the last 50 TRs), resulting in a $1925 \times 1925$ correlation matrix for each cube in participant 5's brain. +In Figure~\ref{fig:brainz}, we present two analyses aimed at identifying brain regions whose responses (as participants viewed the episode) exhibited a particular temporal structure. We developed a searchlight analysis wherein we constructed a $5 \times 5 \times 5$ cube of voxels centered on each voxel in the brain~\citep{ChenEtal17}, and for each of these cubes, computed the temporal correlation matrix of the voxel responses during episode viewing. Specifically, for each of the 1976 volumes collected during episode viewing, we correlated the activity patterns in the given cube with the activity patterns (in the same cube) collected during every other timepoint. This yielded a $1976 \times 1976$ correlation matrix for each cube. Note: participant 5's scan ended 75s early, and in the publicly released dataset from Chen et al. (2017)~\cite{ChenEtal17}, their scan data were zero-padded to match the length of the other participants'. For our searchlight analyses, we removed this padded data (i.e., the last 50 TRs), resulting in a $1925 \times 1925$ correlation matrix for each cube in participant 5's brain. -Next, we constructed a series of ``template'' matrices. 
The first template reflected the timecourse of the episode's topic proportions matrix, and the others reflected the timecourse of each participant's recall topic proportions matrix. To construct the episode template, we computed the correlations between the topic proportions estimated for every pair of TRs (prior to segmenting the topic proportions matrices into discrete events; i.e., the correlation matrix shown in Figs.~\ref{fig:model}B and \ref{fig:brainz}A). We constructed similar temporal correlation matrices for each participant's recall topic proportions matrix (Figs.~\ref{fig:model}D, \corrmats). However, to correct for length differences and potential non-linear transformations between viewing time and recall time, we first used dynamic time warping~\citep{BernClif94} to temporally align participants' recall topic proportions matrices with the episode topic proportions matrix. An example correlation matrix before and after warping is shown in Fig.~\ref{fig:brainz}B. This yielded a $1976 \times 1976$ correlation matrix for the episode template and for each participant's recall template. +Next, we constructed a series of ``template'' matrices. The first template reflected the timecourse of the episode's topic proportions matrix, and the others reflected the timecourse of each participant's recall topic proportions matrix. To construct the episode template, we computed the correlations between the topic proportions estimated for every pair of TRs (prior to segmenting the topic proportions matrices into discrete events; i.e., the correlation matrix shown in Figs.~\ref{fig:model}B and \ref{fig:brainz}A). We constructed similar temporal correlation matrices for each participant's recall topic proportions matrix (Fig.~\ref{fig:model}D, Supp.\ Fig.~\corrmats). However, to correct for length differences and potential non-linear transformations between viewing time and recall time, we first used dynamic time warping~\citep{BernClif94} to temporally align participants' recall topic proportions matrices with the episode topic proportions matrix. An example correlation matrix before and after warping is shown in Fig.~\ref{fig:brainz}B. This yielded a $1976 \times 1976$ correlation matrix for the episode template and for each participant's recall template. -The temporal structure of the episode's content (as described by our model) is captured in the block-diagonal structure of the episode's temporal correlation matrix (e.g., Figs.~\ref{fig:model}B, \ref{fig:brainz}A), with time periods of thematic stability represented as dark blocks of varying sizes. Inspecting the episode correlation matrix suggests that the episode's semantic content is highly temporally specific (i.e., the correlations between topic vectors from distant timepoints are almost all near zero). By contrast, the activity patterns of individual (cubes of) voxels can encode relatively limited information on their own, and their activity frequently contributes to multiple separate functions \citep{FreeEtal01, SigmDeha08, CharKoec10, RishEtal13}. By nature, these two attributes give rise to similarities in activity across large timescales that may not necessarily reflect a single task. To enable a more sensitive analysis of brain regions whose shifts in activity patterns mirrored shifts in the semantic content of the episode or recalls, we restricted the temporal correlations we considered to the timescale of semantic information captured by our model. 
Specifically, we isolated the upper triangle of the episode correlation matrix and created a ``proximal correlation mask'' that included only diagonals from the upper triangle of the episode correlation matrix up to the first diagonal that contained no positive correlations. Applying this mask to the full episode correlation matrix was equivalent to excluding diagonals beyond the corner of the largest diagonal block. In other words, the timescale of temporal correlations we considered corresponded to the longest period of thematic stability in the episode, and by extension the longest period of thematic stability in participants' recalls and the longest period of stability we might expect to see in voxel activity arising from processing or encoding episode content. Figure \ref{fig:brainz} shows this proximal correlation mask applied to the temporal correlation matrices for the episode, an example participant's (warped) recall, and an example cube of voxels from our searchlight analyses. +The temporal structure of the episode's content (as described by our model) is captured in the block-diagonal structure of the episode's temporal correlation matrix (e.g., Figs.~\ref{fig:model}B, \ref{fig:brainz}A), with time periods of thematic stability represented as dark blocks of varying sizes. Inspecting the episode correlation matrix suggests that the episode's semantic content is highly temporally specific (i.e., the correlations between topic vectors from distant timepoints are almost all near zero). By contrast, the activity patterns of individual (cubes of) voxels can encode relatively limited information on their own, and their activity frequently contributes to multiple separate functions \citep{FreeEtal01, SigmDeha08, CharKoec10, RishEtal13}. By nature, these two attributes give rise to similarities in activity across large timescales that may not necessarily reflect a single task. To identify brain regions whose shifts in activity patterns mirrored shifts in the semantic content of the episode or recalls, we restricted the temporal correlations we considered to the timescale of semantic information captured by our model. Specifically, we isolated the upper triangle of the episode correlation matrix and created a ``proximal correlation mask'' that included only diagonals from the upper triangle of the episode correlation matrix up to the first diagonal that contained no positive correlations. Applying this mask to the full episode correlation matrix was equivalent to excluding diagonals beyond the corner of the largest diagonal block. In other words, the timescale of temporal correlations we considered corresponded to the longest period of thematic stability in the episode, and by extension the longest period of thematic stability in participants' recalls and the longest period of stability we might expect to see in voxel activity arising from processing or encoding episode content. Figure \ref{fig:brainz} shows this proximal correlation mask applied to the temporal correlation matrices for the episode, an example participant's (warped) recall, and an example cube of voxels from our searchlight analyses. To determine which (cubes of) voxel responses matched the episode template, we correlated the proximal diagonals from the upper triangle of the voxel correlation matrix for each cube with the proximal diagonals from the episode template matrix~\citep{KrieEtal08b}. This yielded, for each participant, a voxelwise map of correlation values. 
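To make this diagonal-matching step concrete, the sketch below shows one way to derive the width of the proximal correlation mask from the episode template and to compare a searchlight cube's correlation matrix against a template. It is illustrative only (the helper names are ours); the group-level statistics described next operate on the resulting per-cube correlation values.
\begin{verbatim}
# Illustrative sketch of the proximal-diagonal comparison; searchlight cube
# extraction and the per-voxel group statistics are omitted.
import numpy as np
from scipy.stats import pearsonr

def n_proximal_diagonals(episode_corrmat):
    """Number of off-diagonals to keep: all diagonals preceding the first
    one that contains no positive correlations in the episode matrix."""
    k = 1
    while (k < episode_corrmat.shape[0]
           and np.any(np.diag(episode_corrmat, k) > 0)):
        k += 1
    return k - 1

def proximal_diagonals(corrmat, n_diags):
    """Concatenate off-diagonals 1 through n_diags of the upper triangle."""
    return np.concatenate([np.diag(corrmat, k) for k in range(1, n_diags + 1)])

# e.g.: r, _ = pearsonr(proximal_diagonals(cube_corrmat, n_diags),
#                       proximal_diagonals(episode_template, n_diags))
\end{verbatim}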
We then performed a one-sample $t$-test on the distribution of (Fisher $z$-transformed) correlations at each voxel, across participants. This resulted in a value for each voxel (cube), describing how reliably its timecourse followed that of the episode. @@ -249,23 +246,578 @@ \subsection*{Searchlight fMRI analyses} We used an analogous procedure to identify which voxels' responses reflected the recall templates. For each participant, we correlated the proximal diagonals from the upper triangle of the correlation matrix for each cube of voxels with the proximal diagonals from the upper triangle of their (time-warped) recall correlation matrix. As in the episode template analysis, this yielded a voxelwise map of correlation coefficients for each participant. However, whereas the episode analysis compared every participant's responses to the same template, here the recall templates were unique for each participant. As in the analysis described above, we $t$-scored the (Fisher $z$-transformed) voxelwise correlations, and used the same permutation procedure we developed for the episode responses to ensure specificity to the recall timeseries and assign significance values. To create the map in Figure~\ref{fig:brainz}D we again thresholded out any voxels whose scores were below the 95\textsuperscript{th} percentile of the permutation-derived null distribution. \subsection*{Neurosynth decoding analyses} -\texttt{Neurosynth} parses a massive online database of over 14,000 neuroimaging studies and constructs meta-analysis images for over 13,000 psychology- and neuroscience-related terms, based on NIfTI images accompanying studies where those terms appear at a high frequency. Given a novel image (tagged with its value type; e.g., $z$-, $t$-, $F$- or $p$-statistics), \texttt{Neurosynth} returns a list of terms whose meta-analysis images are most similar. Our permutation procedure yielded, for each of the two searchlight analyses, a voxelwise map of $z$-values. These maps describe the extent to which each voxel \textit{specifically} reflected the temporal structure of the episode or individuals' recalls (i.e., relative to the null distributions of phase-shifted values). We inputted the two statistical maps described above to \texttt{Neurosynth} to create a list of the 10 most representative terms for each map. - +\texttt{Neurosynth}~\citep{YarkEtal11} parses a massive online database of over 14000 neuroimaging studies and constructs meta-analysis images for over 13000 psychology- and neuroscience-related terms, based on NIfTI images accompanying studies where those terms appear at a high frequency. Given a novel image (tagged with its value type; e.g., $z$-, $t$-, $F$- or $p$-statistics), \texttt{Neurosynth} returns a list of terms whose meta-analysis images are most similar. Our permutation procedure yielded, for each of the two searchlight analyses, a voxelwise map of $z$-values. These maps describe the extent to which each voxel specifically reflected the temporal structure of the episode or individuals' recalls (i.e., relative to the null distributions of phase-shifted values). We inputted the two statistical maps described above to \texttt{Neurosynth} to create a list of the 10 most representative terms for each map. + +\section*{Data availability} +The fMRI data we analyzed are available online \href{http://dataspace.princeton.edu/jspui/handle/88435/dsp01nz8062179}{\underline{here}}. 
The behavioral data is available \href{https://github.com/ContextLab/sherlock-topic-model-paper}{\underline{here}}. + +\section*{Code availability} +All of our analysis code may be downloaded \href{https://github.com/ContextLab/sherlock-topic-model-paper}{\underline{here}}. + +%\bibliographystyle{naturemag} +%\bibliography{CDL-bibliography/memlab} + +\begin{thebibliography}{10} +\expandafter\ifx\csname url\endcsname\relax + \def\url#1{\texttt{#1}}\fi +\expandafter\ifx\csname urlprefix\endcsname\relax\def\urlprefix{URL }\fi +\providecommand{\bibinfo}[2]{#2} +\providecommand{\eprint}[2][]{\url{#2}} + +\bibitem{Murd62a} +\bibinfo{author}{Murdock, B.~B.} +\newblock \bibinfo{title}{The serial position effect of free recall}. +\newblock \emph{\bibinfo{journal}{Journal of Experimental Psychology}} + \textbf{\bibinfo{volume}{64}}, \bibinfo{pages}{482--488} + (\bibinfo{year}{1962}). + +\bibitem{Kaha96} +\bibinfo{author}{Kahana, M.~J.} +\newblock \bibinfo{title}{Associative retrieval processes in free recall}. +\newblock \emph{\bibinfo{journal}{Memory \& Cognition}} + \textbf{\bibinfo{volume}{24}}, \bibinfo{pages}{103--109} + (\bibinfo{year}{1996}). + +\bibitem{Yone02} +\bibinfo{author}{Yonelinas, A.~P.} +\newblock \bibinfo{title}{The nature of recollection and familiarity: A review + of 30 years of research}. +\newblock \emph{\bibinfo{journal}{Journal of Memory and Language}} + \textbf{\bibinfo{volume}{46}}, \bibinfo{pages}{441--517} + (\bibinfo{year}{2002}). + +\bibitem{Kaha12} +\bibinfo{author}{Kahana, M.~J.} +\newblock \emph{\bibinfo{title}{Foundations of Human Memory}} + (\bibinfo{publisher}{Oxford University Press}, \bibinfo{address}{New York, + NY}, \bibinfo{year}{2012}). + +\bibitem{KoriGold94} +\bibinfo{author}{Koriat, A.} \& \bibinfo{author}{Goldsmith, M.} +\newblock \bibinfo{title}{Memory in naturalistic and laboratory contexts: + distinguishing accuracy-oriented and quantity-oriented approaches to memory + assessment}. +\newblock \emph{\bibinfo{journal}{Journal of Experimental Psychology: General}} + \textbf{\bibinfo{volume}{123}}, \bibinfo{pages}{297--315} + (\bibinfo{year}{1994}). + +\bibitem{HukEtal18} +\bibinfo{author}{Huk, A.}, \bibinfo{author}{Bonnen, K.} \& \bibinfo{author}{He, + B.~J.} +\newblock \bibinfo{title}{Beyond trial-based paradigms: continuous behavior, + ongoing neural activity, and naturalistic stimuli}. +\newblock \emph{\bibinfo{journal}{Journal of Neuroscience}} + \textbf{\bibinfo{volume}{10.1523/JNEUROSCI.1920-17.2018}} + (\bibinfo{year}{2018}). + +\bibitem{LernEtal11} +\bibinfo{author}{Lerner, Y.}, \bibinfo{author}{Honey, C.~J.}, + \bibinfo{author}{Silbert, L.~J.} \& \bibinfo{author}{Hasson, U.} +\newblock \bibinfo{title}{Topographic mapping of a hierarchy of temporal + receptive windows using a narrated story}. +\newblock \emph{\bibinfo{journal}{Journal of Neuroscience}} + \textbf{\bibinfo{volume}{31}}, \bibinfo{pages}{2906--2915} + (\bibinfo{year}{2011}). + +\bibitem{Mann19} +\bibinfo{author}{Manning, J.~R.} +\newblock \bibinfo{title}{Episodic memory: mental time travel or a quantum + 'memory wave' function?} +\newblock \emph{\bibinfo{journal}{PsyArXiv}} + \textbf{\bibinfo{volume}{doi:10.31234/osf.io/6zjwb}} (\bibinfo{year}{2019}). + +\bibitem{Mann20} +\bibinfo{author}{Manning, J.~R.} +\newblock \bibinfo{title}{Context reinstatement}. +\newblock In \bibinfo{editor}{Kahana, M.~J.} \& \bibinfo{editor}{Wagner, A.~D.} + (eds.) \emph{\bibinfo{booktitle}{Handbook of Human Memory}} + (\bibinfo{publisher}{Oxford University Press}, \bibinfo{year}{2020}). 
+ +\bibitem{HowaKaha02a} +\bibinfo{author}{Howard, M.~W.} \& \bibinfo{author}{Kahana, M.~J.} +\newblock \bibinfo{title}{A distributed representation of temporal context}. +\newblock \emph{\bibinfo{journal}{Journal of Mathematical Psychology}} + \textbf{\bibinfo{volume}{46}}, \bibinfo{pages}{269--299} + (\bibinfo{year}{2002}). + +\bibitem{HowaEtal14} +\bibinfo{author}{Howard, M.~W.} \emph{et~al.} +\newblock \bibinfo{title}{A unified mathematical framework for coding time, + space, and sequences in the medial temporal lobe}. +\newblock \emph{\bibinfo{journal}{Journal of Neuroscience}} + \textbf{\bibinfo{volume}{34}}, \bibinfo{pages}{4692--4707} + (\bibinfo{year}{2014}). + +\bibitem{MannEtal15} +\bibinfo{author}{Manning, J.~R.}, \bibinfo{author}{Norman, K.~A.} \& + \bibinfo{author}{Kahana, M.~J.} +\newblock \bibinfo{title}{The role of context in episodic memory}. +\newblock In \bibinfo{editor}{Gazzaniga, M.} (ed.) + \emph{\bibinfo{booktitle}{The Cognitive Neurosciences, Fifth edition}}, + \bibinfo{pages}{557--566} (\bibinfo{publisher}{{MIT} Press}, + \bibinfo{year}{2015}). + +\bibitem{RangRitc12} +\bibinfo{author}{Ranganath, C.} \& \bibinfo{author}{Ritchey, M.} +\newblock \bibinfo{title}{Two cortical systems for memory-guided behavior}. +\newblock \emph{\bibinfo{journal}{Nature Reviews Neuroscience}} + \textbf{\bibinfo{volume}{13}}, \bibinfo{pages}{713 -- 726} + (\bibinfo{year}{2012}). + +\bibitem{ZackEtal07} +\bibinfo{author}{Zacks, J.~M.}, \bibinfo{author}{Speer, N.~K.}, + \bibinfo{author}{Swallow, K.~M.}, \bibinfo{author}{Braver, T.~S.} \& + \bibinfo{author}{Reynolds, J.~R.} +\newblock \bibinfo{title}{Event perception: a mind-brain perspective}. +\newblock \emph{\bibinfo{journal}{Psychological Bulletin}} + \textbf{\bibinfo{volume}{133}}, \bibinfo{pages}{273--293} + (\bibinfo{year}{2007}). + +\bibitem{ZwaaRadv98} +\bibinfo{author}{Zwaan, R.~A.} \& \bibinfo{author}{Radvansky, G.~A.} +\newblock \bibinfo{title}{Situation models in language comprehension and + memory}. +\newblock \emph{\bibinfo{journal}{Psychological Bulletin}} + \textbf{\bibinfo{volume}{123}}, \bibinfo{pages}{162 -- 185} + (\bibinfo{year}{1998}). + +\bibitem{RadvZack17} +\bibinfo{author}{Radvansky, G.~A.} \& \bibinfo{author}{Zacks, J.~M.} +\newblock \bibinfo{title}{Event boundaries in memory and cognition}. +\newblock \emph{\bibinfo{journal}{Curr Opin Behav Sci}} + \textbf{\bibinfo{volume}{17}}, \bibinfo{pages}{133--140} + (\bibinfo{year}{2017}). + +\bibitem{BrunEtal18} +\bibinfo{author}{Brunec, I.~K.}, \bibinfo{author}{Moscovitch, M.~M.} \& + \bibinfo{author}{Barense, M.~D.} +\newblock \bibinfo{title}{Boundaries shape cognitive representations of spaces + and events}. +\newblock \emph{\bibinfo{journal}{{Trends in Cognitive Sciences}}} + \textbf{\bibinfo{volume}{22}}, \bibinfo{pages}{637--650} + (\bibinfo{year}{2018}). + +\bibitem{HeusEtal18b} +\bibinfo{author}{Heusser, A.~C.}, \bibinfo{author}{Ezzyat, Y.}, + \bibinfo{author}{Shiff, I.} \& \bibinfo{author}{Davachi, L.} +\newblock \bibinfo{title}{Perceptual boundaries cause mnemonic trade-offs + between local boundary processing and across-trial associative binding}. +\newblock \emph{\bibinfo{journal}{Journal of Experimental Psychology Learning, + Memory, and Cognition}} \textbf{\bibinfo{volume}{44}}, + \bibinfo{pages}{1075--1090} (\bibinfo{year}{2018}). + +\bibitem{ClewDava17} +\bibinfo{author}{Clewett, D.} \& \bibinfo{author}{Davachi, L.} +\newblock \bibinfo{title}{The ebb and flow of experience determines the + temporal structure of memory}. 
+\newblock \emph{\bibinfo{journal}{Curr Opin Behav Sci}} + \textbf{\bibinfo{volume}{17}}, \bibinfo{pages}{186--193} + (\bibinfo{year}{2017}). + +\bibitem{EzzyDava11} +\bibinfo{author}{Ezzyat, Y.} \& \bibinfo{author}{Davachi, L.} +\newblock \bibinfo{title}{What constitutes an episode in episodic memory?} +\newblock \emph{\bibinfo{journal}{Psychological Science}} + \textbf{\bibinfo{volume}{22}}, \bibinfo{pages}{243--252} + (\bibinfo{year}{2011}). + +\bibitem{DuBrDava13} +\bibinfo{author}{DuBrow, S.} \& \bibinfo{author}{Davachi, L.} +\newblock \bibinfo{title}{The influence of contextual boundaries on memory for + the sequential order of events}. +\newblock \emph{\bibinfo{journal}{Journal of Experimental Psychology: General}} + \textbf{\bibinfo{volume}{142}}, \bibinfo{pages}{1277--1286} + (\bibinfo{year}{2013}). + +\bibitem{TompDava17} +\bibinfo{author}{Tompary, A.} \& \bibinfo{author}{Davachi, L.} +\newblock \bibinfo{title}{Consolidation promotes the emergence of + representational overlap in the hippocampus and medial prefrontal cortex}. +\newblock \emph{\bibinfo{journal}{Neuron}} \textbf{\bibinfo{volume}{96}}, + \bibinfo{pages}{228--241} (\bibinfo{year}{2017}). + +\bibitem{ChenEtal17} +\bibinfo{author}{Chen, J.} \emph{et~al.} +\newblock \bibinfo{title}{Shared memories reveal shared structure in neural + activity across individuals}. +\newblock \emph{\bibinfo{journal}{Nature Neuroscience}} + \textbf{\bibinfo{volume}{20}}, \bibinfo{pages}{115} (\bibinfo{year}{2017}). + +\bibitem{BleiEtal03} +\bibinfo{author}{Blei, D.~M.}, \bibinfo{author}{Ng, A.~Y.} \& + \bibinfo{author}{Jordan, M.~I.} +\newblock \bibinfo{title}{Latent dirichlet allocation}. +\newblock \emph{\bibinfo{journal}{Journal of Machine Learning Research}} + \textbf{\bibinfo{volume}{3}}, \bibinfo{pages}{993 -- 1022} + (\bibinfo{year}{2003}). + +\bibitem{Rabi89} +\bibinfo{author}{Rabiner, L.} +\newblock \bibinfo{title}{A tutorial on {Hidden Markov Models} and selected + applications in speech recognition}. +\newblock \emph{\bibinfo{journal}{Proceedings of the IEEE}} + \textbf{\bibinfo{volume}{77}}, \bibinfo{pages}{257--286} + (\bibinfo{year}{1989}). + +\bibitem{BaldEtal17} +\bibinfo{author}{Baldassano, C.} \emph{et~al.} +\newblock \bibinfo{title}{Discovering event structure in continuous narrative + perception and memory}. +\newblock \emph{\bibinfo{journal}{Neuron}} \textbf{\bibinfo{volume}{95}}, + \bibinfo{pages}{709--721} (\bibinfo{year}{2017}). + +\bibitem{BleiLaff06} +\bibinfo{author}{Blei, D.~M.} \& \bibinfo{author}{Lafferty, J.~D.} +\newblock \bibinfo{title}{Dynamic topic models}. +\newblock In \emph{\bibinfo{booktitle}{Proceedings of the 23rd International + Conference on Machine Learning}}, ICML '06, \bibinfo{pages}{113--120} + (\bibinfo{publisher}{ACM}, \bibinfo{address}{New York, NY, US}, + \bibinfo{year}{2006}). + +\bibitem{MannEtal11} +\bibinfo{author}{Manning, J.~R.}, \bibinfo{author}{Polyn, S.~M.}, + \bibinfo{author}{Baltuch, G.}, \bibinfo{author}{Litt, B.} \& + \bibinfo{author}{Kahana, M.~J.} +\newblock \bibinfo{title}{Oscillatory patterns in temporal lobe reveal context + reinstatement during memory search}. +\newblock \emph{\bibinfo{journal}{Proceedings of the National Academy of + Sciences, USA}} \textbf{\bibinfo{volume}{108}}, \bibinfo{pages}{12893--12897} + (\bibinfo{year}{2011}). 
+ +\bibitem{HowaEtal12} +\bibinfo{author}{Howard, M.~W.}, \bibinfo{author}{Viskontas, I.~V.}, + \bibinfo{author}{Shankar, K.~H.} \& \bibinfo{author}{Fried, I.} +\newblock \bibinfo{title}{Ensembles of human {MTL} neurons ``jump back in + time'' in response to a repeated stimulus}. +\newblock \emph{\bibinfo{journal}{Hippocampus}} \textbf{\bibinfo{volume}{22}}, + \bibinfo{pages}{1833--1847} (\bibinfo{year}{2012}). + +\bibitem{AtkiShif68} +\bibinfo{author}{Atkinson, R.~C.} \& \bibinfo{author}{Shiffrin, R.~M.} +\newblock \bibinfo{title}{Human memory: {A} proposed system and its control + processes}. +\newblock In \bibinfo{editor}{Spence, K.~W.} \& \bibinfo{editor}{Spence, J.~T.} + (eds.) \emph{\bibinfo{booktitle}{The psychology of learning and motivation}}, + vol.~\bibinfo{volume}{2}, \bibinfo{pages}{89--105} + (\bibinfo{publisher}{Academic Press}, \bibinfo{address}{New York}, + \bibinfo{year}{1968}). + +\bibitem{PostPhil65} +\bibinfo{author}{Postman, L.} \& \bibinfo{author}{Phillips, L.~W.} +\newblock \bibinfo{title}{Short-term temporal changes in free recall}. +\newblock \emph{\bibinfo{journal}{Quarterly Journal of Experimental + Psychology}} \textbf{\bibinfo{volume}{17}}, \bibinfo{pages}{132--138} + (\bibinfo{year}{1965}). + +\bibitem{WelcBurn24} +\bibinfo{author}{Welch, G.~B.} \& \bibinfo{author}{Burnett, C.~T.} +\newblock \bibinfo{title}{Is primacy a factor in association-formation}. +\newblock \emph{\bibinfo{journal}{American Journal of Psychology}} + \textbf{\bibinfo{volume}{35}}, \bibinfo{pages}{396--401} + (\bibinfo{year}{1924}). + +\bibitem{PolyEtal09} +\bibinfo{author}{Polyn, S.~M.}, \bibinfo{author}{Norman, K.~A.} \& + \bibinfo{author}{Kahana, M.~J.} +\newblock \bibinfo{title}{A context maintenance and retrieval model of + organizational processes in free recall}. +\newblock \emph{\bibinfo{journal}{Psychological Review}} + \textbf{\bibinfo{volume}{116}}, \bibinfo{pages}{129--156} + (\bibinfo{year}{2009}). + +\bibitem{MannKaha12} +\bibinfo{author}{Manning, J.~R.} \& \bibinfo{author}{Kahana, M.~J.} +\newblock \bibinfo{title}{Interpreting semantic clustering effects in free + recall}. +\newblock \emph{\bibinfo{journal}{Memory}} \textbf{\bibinfo{volume}{20}}, + \bibinfo{pages}{511--517} (\bibinfo{year}{2012}). + +\bibitem{HeusEtal18a} +\bibinfo{author}{Heusser, A.~C.}, \bibinfo{author}{Ziman, K.}, + \bibinfo{author}{Owen, L. L.~W.} \& \bibinfo{author}{Manning, J.~R.} +\newblock \bibinfo{title}{{HyperTools}: a {Python} toolbox for gaining + geometric insights into high-dimensional data}. +\newblock \emph{\bibinfo{journal}{{Journal of Machine Learning Research}}} + \textbf{\bibinfo{volume}{18}}, \bibinfo{pages}{1--6} (\bibinfo{year}{2018}). + +\bibitem{McInEtal18} +\bibinfo{author}{McInnes, L.}, \bibinfo{author}{Healy, J.} \& + \bibinfo{author}{Melville, J.} +\newblock \bibinfo{title}{{UMAP}: Uniform manifold approximation and projection + for dimension reduction}. +\newblock \emph{\bibinfo{journal}{arXiv}} \textbf{\bibinfo{volume}{1802}} + (\bibinfo{year}{2018}). + +\bibitem{MuelEtal18} +\bibinfo{author}{Mueller, A.} \emph{et~al.} +\newblock \bibinfo{title}{{WordCloud 1.5.0: a little word cloud generator in + Python}}. +\newblock \emph{\bibinfo{journal}{Zenodo}} + \textbf{\bibinfo{volume}{https://zenodo.org/record/1322068\#.W4tPKZNKh24}} + (\bibinfo{year}{2018}). + +\bibitem{PallWagn02} +\bibinfo{author}{Paller, K.~A.} \& \bibinfo{author}{Wagner, A.~D.} +\newblock \bibinfo{title}{Observing the transformation of experience into + memory}. 
+\newblock \emph{\bibinfo{journal}{Trends in Cognitive Sciences}} +  \textbf{\bibinfo{volume}{6}}, \bibinfo{pages}{93--102} +  (\bibinfo{year}{2002}). + +\bibitem{YarkEtal11} +\bibinfo{author}{Yarkoni, T.}, \bibinfo{author}{Poldrack, R.~A.}, +  \bibinfo{author}{Nichols, T.~E.}, \bibinfo{author}{Van~Essen, D.~C.} \& +  \bibinfo{author}{Wager, T.~D.} +\newblock \bibinfo{title}{Large-scale automated synthesis of human functional +  neuroimaging data}. +\newblock \emph{\bibinfo{journal}{Nature Methods}} +  \textbf{\bibinfo{volume}{8}}, \bibinfo{pages}{665} (\bibinfo{year}{2011}). + +\bibitem{BellEtal18} +\bibinfo{author}{Bellmund, J. L.~S.}, \bibinfo{author}{G\"{a}rdenfors, P.}, +  \bibinfo{author}{Moser, E.~I.} \& \bibinfo{author}{Doeller, C.~F.} +\newblock \bibinfo{title}{Navigating cognition: spatial codes for human +  thinking}. +\newblock \emph{\bibinfo{journal}{Science}} \textbf{\bibinfo{volume}{362}} +  (\bibinfo{year}{2018}). + +\bibitem{BellEtal20} +\bibinfo{author}{Bellmund, J. L.~S.} \emph{et~al.} +\newblock \bibinfo{title}{Deforming the metric of cognitive maps distorts +  memory}. +\newblock \emph{\bibinfo{journal}{Nature Human Behaviour}} +  \textbf{\bibinfo{volume}{4}}, \bibinfo{pages}{177--188} +  (\bibinfo{year}{2020}). + +\bibitem{ConsEtal16} +\bibinfo{author}{Constantinescu, A.~O.}, \bibinfo{author}{O'Reilly, J.~X.} \& +  \bibinfo{author}{Behrens, T. E.~J.} +\newblock \bibinfo{title}{Organizing conceptual knowledge in humans with a +  gridlike code}. +\newblock \emph{\bibinfo{journal}{Science}} \textbf{\bibinfo{volume}{352}}, +  \bibinfo{pages}{1464--1468} (\bibinfo{year}{2016}). + +\bibitem{GilbMarl17} +\bibinfo{author}{Gilboa, A.} \& \bibinfo{author}{Marlatte, H.} +\newblock \bibinfo{title}{Neurobiology of schemas and schema-mediated memory}. +\newblock \emph{\bibinfo{journal}{Trends in Cognitive Sciences}} +  \textbf{\bibinfo{volume}{21}}, \bibinfo{pages}{618--631} +  (\bibinfo{year}{2017}). + +\bibitem{BaldEtal18} +\bibinfo{author}{Baldassano, C.}, \bibinfo{author}{Hasson, U.} \& +  \bibinfo{author}{Norman, K.~A.} +\newblock \bibinfo{title}{Representation of real-world event schemas during +  narrative perception}. +\newblock \emph{\bibinfo{journal}{{Journal of Neuroscience}}} +  \textbf{\bibinfo{volume}{38}}, \bibinfo{pages}{9689--9699} +  (\bibinfo{year}{2018}). + +\bibitem{HuthEtal12} +\bibinfo{author}{Huth, A.~G.}, \bibinfo{author}{Nishimoto, S.}, +  \bibinfo{author}{Vu, A.~T.} \& \bibinfo{author}{Gallant, J.~L.} +\newblock \bibinfo{title}{A continuous semantic space describes the +  representation of thousands of object and action categories across the human +  brain}. +\newblock \emph{\bibinfo{journal}{Neuron}} \textbf{\bibinfo{volume}{76}}, +  \bibinfo{pages}{1210--1224} (\bibinfo{year}{2012}). + +\bibitem{HuthEtal16} +\bibinfo{author}{Huth, A.~G.}, \bibinfo{author}{de~Heer, W.~A.}, +  \bibinfo{author}{Griffiths, T.~L.}, \bibinfo{author}{Theunissen, F.~E.} \& +  \bibinfo{author}{Gallant, J.~L.} +\newblock \bibinfo{title}{Natural speech reveals the semantic maps that tile +  human cerebral cortex}. +\newblock \emph{\bibinfo{journal}{Nature}} \textbf{\bibinfo{volume}{532}}, +  \bibinfo{pages}{453--458} (\bibinfo{year}{2016}). + +\bibitem{GagnEtal20} +\bibinfo{author}{Gagnepain, P.} \emph{et~al.} +\newblock \bibinfo{title}{Collective memory shapes the organization of +  individual memories in the medial prefrontal cortex}. +\newblock \emph{\bibinfo{journal}{Nature Human Behaviour}} +  \textbf{\bibinfo{volume}{4}}, \bibinfo{pages}{189--200} +  (\bibinfo{year}{2020}).
+ +\bibitem{SimoEtal16} +\bibinfo{author}{Simony, E.}, \bibinfo{author}{Honey, C.~J.}, + \bibinfo{author}{Chen, J.} \& \bibinfo{author}{Hasson, U.} +\newblock \bibinfo{title}{Dynamic reconfiguration of the default mode network + during narrative comprehension}. +\newblock \emph{\bibinfo{journal}{Nature Communications}} + \textbf{\bibinfo{volume}{7}}, \bibinfo{pages}{1--13} (\bibinfo{year}{2016}). + +\bibitem{ZadbEtal17} +\bibinfo{author}{Zadbood, A.}, \bibinfo{author}{Chen, J.}, + \bibinfo{author}{Leong, Y.~C.}, \bibinfo{author}{Norman, K.~A.} \& + \bibinfo{author}{Hasson, U.} +\newblock \bibinfo{title}{How we transmit memories to other brains: + Constructing shared neural representations via communication}. +\newblock \emph{\bibinfo{journal}{Cereb Cortex}} \textbf{\bibinfo{volume}{27}}, + \bibinfo{pages}{4988--5000} (\bibinfo{year}{2017}). + +\bibitem{SimoChan20} +\bibinfo{author}{Simony, E.} \& \bibinfo{author}{Chang, C.} +\newblock \bibinfo{title}{Analysis of stimulus-induced brain dynamics during + naturalistic paradigms}. +\newblock \emph{\bibinfo{journal}{{N}euro{I}mage}} + \textbf{\bibinfo{volume}{216}}, \bibinfo{pages}{116461} + (\bibinfo{year}{2020}). + +\bibitem{LandDuma97} +\bibinfo{author}{Landauer, T.~K.} \& \bibinfo{author}{Dumais, S.~T.} +\newblock \bibinfo{title}{A solution to {P}lato's problem: the latent semantic + analysis theory of acquisition, induction, and representation of knowledge}. +\newblock \emph{\bibinfo{journal}{Psychological Review}} + \textbf{\bibinfo{volume}{104}}, \bibinfo{pages}{211--240} + (\bibinfo{year}{1997}). + +\bibitem{MikoEtal13a} +\bibinfo{author}{Mikolov, T.}, \bibinfo{author}{Chen, K.}, + \bibinfo{author}{Corrado, G.} \& \bibinfo{author}{Dean, J.} +\newblock \bibinfo{title}{Efficient estimation of word representations in + vector space}. +\newblock \emph{\bibinfo{journal}{{arXiv}}} + \textbf{\bibinfo{volume}{1301.3781}} (\bibinfo{year}{2013}). + +\bibitem{CerEtal18} +\bibinfo{author}{Cer, D.} \emph{et~al.} +\newblock \bibinfo{title}{Universal sentence encoder}. +\newblock \emph{\bibinfo{journal}{{arXiv}}} + \textbf{\bibinfo{volume}{1803.11175}} (\bibinfo{year}{2018}). + +\bibitem{RadfEtal19} +\bibinfo{author}{Radford, A.} \emph{et~al.} +\newblock \bibinfo{title}{Language models are unsupervised multitask learners}. +\newblock \emph{\bibinfo{journal}{OpenAI Blog}} \textbf{\bibinfo{volume}{1}} + (\bibinfo{year}{2019}). + +\bibitem{BrowEtal20} +\bibinfo{author}{Brown, T.~B.} \emph{et~al.} +\newblock \bibinfo{title}{Language models are few-shot learners}. +\newblock \emph{\bibinfo{journal}{{arXiv}}} + \textbf{\bibinfo{volume}{2005.14165}} (\bibinfo{year}{2020}). + +\bibitem{PedrEtal11} +\bibinfo{author}{Pedregosa, F.} \emph{et~al.} +\newblock \bibinfo{title}{Scikit-learn: Machine learning in {P}ython}. +\newblock \emph{\bibinfo{journal}{Journal of Machine Learning Research}} + \textbf{\bibinfo{volume}{12}}, \bibinfo{pages}{2825--2830} + (\bibinfo{year}{2011}). + +\bibitem{Brainiak} +\bibinfo{author}{Capota, M.} \emph{et~al.} +\newblock \bibinfo{title}{Brain imaging analysis kit} (\bibinfo{year}{2017}). +\newblock \urlprefix\url{https://doi.org/10.5281/zenodo.59780}. + +\bibitem{HassEtal08} +\bibinfo{author}{Hasson, U.}, \bibinfo{author}{Yang, E.}, + \bibinfo{author}{Vallines, I.}, \bibinfo{author}{Heeger, D.~J.} \& + \bibinfo{author}{Rubin, N.} +\newblock \bibinfo{title}{A hierarchy of temporal receptive windows in human + cortex}. 
+\newblock \emph{\bibinfo{journal}{Journal of Neuroscience}} + \textbf{\bibinfo{volume}{28}}, \bibinfo{pages}{2539--2550} + (\bibinfo{year}{2008}). + +\bibitem{HassEtal15} +\bibinfo{author}{Hasson, U.}, \bibinfo{author}{Chen, J.} \& + \bibinfo{author}{Honey, C.~J.} +\newblock \bibinfo{title}{Hierarchical process memory: memory as an integral + component of information processing}. +\newblock \emph{\bibinfo{journal}{Trends in Cognitive Science}} + \textbf{\bibinfo{volume}{19}}, \bibinfo{pages}{304--315} + (\bibinfo{year}{2015}). + +\bibitem{Dobr70} +\bibinfo{author}{Dobrushin, R.~L.} +\newblock \bibinfo{title}{Prescribing a system of random variables by + conditional distributions}. +\newblock \emph{\bibinfo{journal}{Theory of Probability \& Its Applications}} + \textbf{\bibinfo{volume}{15}}, \bibinfo{pages}{458--486} + (\bibinfo{year}{1970}). + +\bibitem{RamdEtal17} +\bibinfo{author}{Ramdas, A.}, \bibinfo{author}{Trillos, N.} \& + \bibinfo{author}{Cuturi, M.} +\newblock \bibinfo{title}{On wasserstein two-sample testing and related + families of nonparametric tests}. +\newblock \emph{\bibinfo{journal}{Entropy}} \textbf{\bibinfo{volume}{19}}, + \bibinfo{pages}{47} (\bibinfo{year}{2017}). + +\bibitem{HeusEtal17b} +\bibinfo{author}{Heusser, A.~C.}, \bibinfo{author}{Fitzpatrick, P.~C.}, + \bibinfo{author}{Field, C.~E.}, \bibinfo{author}{Ziman, K.} \& + \bibinfo{author}{Manning, J.~R.} +\newblock \bibinfo{title}{Quail: a {Python} toolbox for analyzing and plotting + free recall data}. +\newblock \emph{\bibinfo{journal}{The Journal of Open Source Software}} + \textbf{\bibinfo{volume}{10.21105/joss.00424}} (\bibinfo{year}{2017}). + +\bibitem{Fish25} +\bibinfo{author}{Fisher, R.~A.} +\newblock \emph{\bibinfo{title}{Statistical Methods for Research Workers}} + (\bibinfo{publisher}{Oliver and Boyd}, \bibinfo{year}{1925}). + +\bibitem{BernClif94} +\bibinfo{author}{Berndt, D.~J.} \& \bibinfo{author}{Clifford, J.} +\newblock \bibinfo{title}{Using dynamic time warping to find patterns in time + series}. +\newblock In \emph{\bibinfo{booktitle}{{KDD workshop}}}, + vol.~\bibinfo{volume}{10}, \bibinfo{pages}{359--370} (\bibinfo{year}{1994}). + +\bibitem{FreeEtal01} +\bibinfo{author}{Freedman, D.}, \bibinfo{author}{Riesenhuber, M.}, + \bibinfo{author}{Poggio, T.} \& \bibinfo{author}{Miller, E.} +\newblock \bibinfo{title}{Categorical representation of visual stimuli in the + primate prefrontal cortex}. +\newblock \emph{\bibinfo{journal}{Science}} \textbf{\bibinfo{volume}{291}}, + \bibinfo{pages}{312--316} (\bibinfo{year}{2001}). + +\bibitem{SigmDeha08} +\bibinfo{author}{Sigman, M.} \& \bibinfo{author}{Dehaene, S.} +\newblock \bibinfo{title}{Brain mechanisms of serial and parallel processing + during dual-task performance}. +\newblock \emph{\bibinfo{journal}{Journal of Neuroscience}} + \textbf{\bibinfo{volume}{28}}, \bibinfo{pages}{7585--7589} + (\bibinfo{year}{2008}). + +\bibitem{CharKoec10} +\bibinfo{author}{Charron, S.} \& \bibinfo{author}{Koechlin, E.} +\newblock \bibinfo{title}{Divided representations of current goals in the human + frontal lobes}. +\newblock \emph{\bibinfo{journal}{Science}} \textbf{\bibinfo{volume}{328}}, + \bibinfo{pages}{360--363} (\bibinfo{year}{2010}). + +\bibitem{RishEtal13} +\bibinfo{author}{Rishel, C.~A.}, \bibinfo{author}{Huang, G.} \& + \bibinfo{author}{Freedman, D.~J.} +\newblock \bibinfo{title}{Independent category and spatial encoding in parietal + cortex}. 
+\newblock \emph{\bibinfo{journal}{Neuron}} \textbf{\bibinfo{volume}{77}}, + \bibinfo{pages}{969--979} (\bibinfo{year}{2013}). + +\bibitem{KrieEtal08b} +\bibinfo{author}{Kriegeskorte, N.}, \bibinfo{author}{Mur, M.} \& + \bibinfo{author}{Bandettini, P.} +\newblock \bibinfo{title}{Representational similarity analysis -- connecting + the branches of systems neuroscience}. +\newblock \emph{\bibinfo{journal}{Frontiers in Systems Neuroscience}} + \textbf{\bibinfo{volume}{2}}, \bibinfo{pages}{1 -- 28} + (\bibinfo{year}{2008}). + +\end{thebibliography} -% \bibliography{../../CDL-bibliography/memlab} -\bibliography{CDL-bibliography/memlab} -\section*{Supporting information} -Supporting information is available in the online version of the paper. \section*{Acknowledgements} -We thank Luke Chang, Janice Chen, Chris Honey, Lucy Owen, Emily Whitaker, and Kirsten Ziman for feedback and scientific discussions. We also thank Janice Chen, Yuan Chang Leong, Kenneth Norman, and Uri Hasson for sharing the data used in our study. Our work was supported in part by NSF EPSCoR Award Number 1632738. The content is solely the responsibility of the authors and does not necessarily represent the official views of our supporting organizations. +We thank Luke Chang, Janice Chen, Chris Honey, Caroline Lee, Lucy Owen, Emily Whitaker, Xinming Xu, and Kirsten Ziman for feedback and scientific discussions. We also thank Janice Chen, Yuan Chang Leong, Chris Honey, Chung Yong, Kenneth Norman, and Uri Hasson for sharing the data used in our study. Our work was supported in part by NSF EPSCoR Award Number 1632738. The content is solely the responsibility of the authors and does not necessarily represent the official views of our supporting organizations. The funders had no role in study design, data collection and analysis, decision to publish or preparation of the manuscript. \section*{Author contributions} Conceptualization: A.C.H. and J.R.M.; Methodology: A.C.H., P.C.F. and J.R.M.; Software: A.C.H., P.C.F. and J.R.M.; Analysis: A.C.H., P.C.F. and J.R.M.; Writing, Reviewing, and Editing: A.C.H., P.C.F. and J.R.M.; Supervision: J.R.M. -\section*{Author information} -The authors declare no competing financial interests. Correspondence and requests for materials should be addressed to J.R.M. (jeremy.r.manning@dartmouth.edu). +\section*{Competing interests} +The authors declare no competing interests. \end{document} diff --git a/paper/supplementary_information.pdf b/paper/supplementary_information.pdf index b3b212f..8588f93 100644 Binary files a/paper/supplementary_information.pdf and b/paper/supplementary_information.pdf differ diff --git a/paper/supplementary_information.tex b/paper/supplementary_information.tex index 68f1205..91f9692 100644 --- a/paper/supplementary_information.tex +++ b/paper/supplementary_information.tex @@ -59,7 +59,7 @@ \subsection*{Optimizing topic model parameters} \[ \argmax_{\omega, \rho, K} \left[\mathrm{corr}\left(\mathrm{corr}\left(\mu\left(\omega, \rho, K\right), \nu\left(\omega, \rho, K\right)\right), \theta\right)\right], \] -where $\mathrm{corr}(\mu, \nu)$ is the per-participant correlation between the episode ($\mu$) and recall ($\nu$) topic proportions matrices, and $\theta$ is the per-participant hand-counted number of recalled scenes. We searched over a grid of pre-specified values for each of these parameters; the resulting correlations are displayed in Figure~\ref{fig:paramsearch}. The optimal parameters were $\omega = 50$, $\rho = 10$, and $K = 100$. 
In our current paper we made a number of improvements to how we preprocessed text and fit topic models (see \textit{Methods}), but we carried the same optimal parameters forward from \cite{HeusMann18} without performing any additional optimization. +where $\mathrm{corr}(\mu, \nu)$ is the per-participant correlation between the episode ($\mu$) and recall ($\nu$) topic proportions matrices, and $\theta$ is the per-participant hand-counted number of recalled scenes. We searched over a grid of pre-specified values for each of these parameters; the resulting correlations are displayed in Supplementary Figure~\ref{fig:paramsearch}. The optimal parameters were $\omega = 50$, $\rho = 10$, and $K = 100$. In our current paper we made a number of improvements to how we preprocessed text and fit topic models (see \textit{Methods}), but we carried the same optimal parameters forward from \cite{HeusMann18} without performing any additional optimization. \begin{figure}[b] @@ -69,7 +69,7 @@ \subsection*{Optimizing topic model parameters} \label{fig:paramsearch} \end{figure} -The optimized model converged on 32 unique topics that were assigned non-zero weights. We provide a list of the top ten highest-weighted words from each topic in Figure~\ref{fig:topics}. +The optimized model converged on 32 unique topics that were assigned non-zero weights. We provide a list of the top ten highest-weighted words from each topic in Supplementary Figure~\ref{fig:topics}. \begin{figure}[p] \centering @@ -80,7 +80,7 @@ \subsection*{Optimizing topic model parameters} %\FloatBarrier \subsection*{Feature importance analyses} -To determine the contribution of each feature to the temporal structure of the episode topic proportions matrix, we conducted a ``leave one out'' analysis. Specifically, we compared the original episode's topic proportions matrix (created using all hand-annotated features from the 998 manually identified scenes spanning the \textit{Sherlock} episode; see \textit{Methods} for a full list of features) with topic proportions matrices created using all but one type of feature. For each impoverished topic proportions matrix, we computed the timepoint-by-timepoint correlation matrix, and correlated the proximal diagonals from the upper triangle with those from the temporal correlation matrix of the feature-complete matrix (for details on how we isolated proximal temporal correlations, see \textit{Methods}). Observing a lower correlation between an impoverished matrix (with a particular feature removed) and the feature-complete matrix would suggest that the held-out feature contributed more prominently to the full episode topic proportion matrix's temporal structure. We found that hand-annotated narrative details played the greatest roll in determining the temporal structure of the episode, whereas the name of the character(s) in focus for each shot contributed least (Fig.~\ref{fig:feature-importance}A). +To determine the contribution of each feature to the temporal structure of the episode topic proportions matrix, we conducted a ``leave one out'' analysis. Specifically, we compared the original episode's topic proportions matrix (created using all hand-annotated features from the 998 manually identified scenes spanning the \textit{Sherlock} episode; see \textit{Methods} for a full list of features) with topic proportions matrices created using all but one type of feature. 
For each impoverished topic proportions matrix, we computed the timepoint-by-timepoint correlation matrix, and correlated the proximal diagonals from the upper triangle with those from the temporal correlation matrix of the feature-complete matrix (for details on how we isolated proximal temporal correlations, see \textit{Methods}). Observing a lower correlation between an impoverished matrix (with a particular feature removed) and the feature-complete matrix would suggest that the held-out feature contributed more prominently to the full episode topic proportions matrix's temporal structure. We found that hand-annotated narrative details played the greatest role in determining the temporal structure of the episode, whereas the name of the character(s) in focus for each shot contributed least (Supp. Fig.~\ref{fig:feature-importance}A). \begin{figure}[] \centering @@ -89,11 +89,19 @@ \subsection*{Feature importance analyses} \label{fig:feature-importance} \end{figure} -Next, we sought to determine which annotated features contributed aspects of the episode's temporal structure that were preserved in participants' later recalls. Specifically, we computed the timepoint-by-timepoint correlation matrix of the episode's topic proportions matrix, and correlated the proximal diagonals from its upper triangle with those from the timepoint-by-timepoint correlation matrices for each participant's recall topic proportions matrices (stretched via linear interpolation to have the same number of timepoints as the episode topic proportions matrix). This yielded a single correlation coefficient for each participant. We then repeated this analysis with each annotated feature held out in turn. Observing a lower correlation between the episode and recall topic proportions matrices (constructed in the absence of a given feature) would indicate that participants utilize changes in that feature's content to discriminate between sections of the episode when organizing their recalls.
We found that hand-annotated narrative details were the most heavily utilized feature, whereas changes in the text present on-screen, the indoor/outdoor distinction between shots, the camera angle, the names of the various characters on screen, and the shot's location tended not to impact participants' recall structures (i.e., removing those features resulted in a \textit{greater} episode-recall correlation than including them; Supp. Fig.~\ref{fig:feature-importance}B). -We also wondered how the different types of features might relate. For example, knowing which character is in focus during a given scene may also provide information about which character is speaking. We computed topic proportions matrices from the annotations for each individual feature, in turn, and (using the same technique as in the above analyses) compared the proximal temporal correlation structure of each single-feature topic proportions matrix to each other, as well as to that of the full episode. This provided additional confirmation that the full episode's temporal structure was largely driven by narrative details. We also found that character-driven features (characters on screen, characters speaking, and characters in focus) were strongly correlated. Other details, such as the presence or absence of music, led to very different topic proportions matrices (Fig.~\ref{fig:feature-importance}C). +We also wondered how the different types of features might relate. For example, knowing which character is in focus during a given scene may also provide information about which character is speaking. We computed topic proportions matrices from the annotations for each individual feature, in turn, and (using the same technique as in the above analyses) compared the proximal temporal correlation structure of each single-feature topic proportions matrix to each other, as well as to that of the full episode. This provided additional confirmation that the full episode's temporal structure was largely driven by narrative details. We also found that character-driven features (characters on screen, characters speaking, and characters in focus) were strongly correlated. Other details, such as the presence or absence of music, led to very different topic proportions matrices (Supp. Fig.~\ref{fig:feature-importance}C). %\FloatBarrier +\subsubsection*{Binary features} +Two categories of annotations, ``Music presence'' and ``Indoor vs outdoor,'' comprise a single word that can each take on one of two possible values (music: ``yes'' or ``no''; indoor vs outdoor: ``indoor'' or ``outdoor''). Participants are unlikely to use the words ``yes'' or ``no'' to specifically refer to the presence or absence of music when they recount the episode. In fact, the word ``no'' is filtered out at the start of our analysis pipeline (see \textit{Methods}), since it appears so frequently throughout the annotations and recalls that it cannot provide reliable event-specific information. + +In contrast, the word ``yes'' is a comparatively rare word (both in the episode annotations and participants’ recall transcripts). The rarity of the word ``yes'' in the annotations makes it a potentially informative feature for the topic models. For example, the presence of music often co-occurs with the presence of specific characters, actions, locations, and plot elements. In this way, the ``Music presence'' annotations provide a (likely relatively weak) signal for when these associated themes appear. 
Further, when participants reference those themes in their recalls through their use of other words associated with those themes (e.g., even if they don’t specifically use the word ``yes''), our modeling framework will still ``match'' references to music-related themes (i.e., semantic features or topics in the episode that co-occur with the presence or absence of music) with other words associated with those semantic features or topics. + + +The ``Indoor vs outdoor'' annotations are treated similarly. Again, although participants may not refer to a particular scene as having taken place indoors versus outdoors, they may refer to other themes that co-occur with these annotations. In general, the notion that annotations and recalls do not need to use the same words (or use them in the same ways) is a central feature of our modeling framework. We think solving the ``matching problem'' (i.e., labeling specific things participants say with specific events they experienced) in a way that is robust to wording differences is an important advance in studying naturalistic memory behaviors. + + \subsection*{Creating a low-dimensional embedding space} Figures~\trajectories~and~\wordles C in the main text display two-dimensional projections of the 100-dimensional topic trajectories for the episode (Figs.~\trajectories A,~\wordles C), average recall (Fig.~\trajectories B), and each individual's recall (Figs.~\trajectories C,~\wordles C). We created these embeddings using the Uniform Manifold Approximation and Projection algorithm \citep[UMAP;][]{McInEtal18} called from our high-dimensional visualization and text analysis software, \texttt{HyperTools}~\citep{HeusEtal18a}. An advantage of the UMAP algorithm over comparable manifold learning techniques (e.g., $t$-SNE) is that UMAP explicitly attempts to preserve the global structure of the data \citep{McInEtal18,BechEtal19} by constructing a space where distance on the manifold is standard Euclidean distance, with respect to the global coordinate system. This was important in our use case, as we wanted to visualize both the evolving structure of the episode and the spatial relationships between presented and recalled content. @@ -110,46 +118,16 @@ \subsection*{Creating a low-dimensional embedding space} \] $\xi$ is the episode's trajectory through the manifold space; $\chi$ is the original, 100-dimensional episode trajectory; $\Upsilon$ is a function that computes a condensed matrix of pairwise distances between event vectors (computed using correlation distance in the original topic space and Euclidean distance in the manifold space); $\mathrm{corr}\left(\Upsilon\left(\xi\right), \Upsilon\left(\chi\right)\right)$ is the correlation between the sets of pairwise distances, and $\Gamma$ is the number of instances in which lines drawn between consecutive episode event embeddings intersected each other. The sets of hyperparameter values we searched over comprised: $K \in \{10x \in \mathbb{Z}~|~10 < x < 22\} \cup \{161\}$ (a range roughly centered on half the total number of events, 161, in order to balance representations of local and global structure); $\gamma \in \{1, 3, 5, 7, 9\}$; $\delta \in \{0.1, 0.3, 0.5, 0.7, 0.9\}$; $\tau \in \{x \in \mathbb{Z}~|~0 < x < 101\}$; and $\varphi \in {S\choose3},~\mathrm{where}~S=\{\mathrm{episode~events,~average~recall~events, individual~recall~events}\}$.
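To make the scoring of candidate embeddings easier to follow, the sketch below (a minimal illustration, not the optimization scripts in \texttt{code/scripts/embedding}) embeds a single topic trajectory and evaluates the two ingredients described above: the correlation between pairwise distances in the original topic space and in the embedding, and the number of crossings between consecutive-event line segments. The mapping of $K$, $\gamma$, $\delta$, and $\tau$ onto UMAP's \texttt{n\_neighbors}, \texttt{spread}, \texttt{min\_dist}, and random seed is an assumption made for illustration only, and the sketch ignores the concatenation order $\varphi$.

```python
# Illustrative sketch only: `chi` (an events-by-topics matrix) and the mapping of
# K, gamma, delta, and tau onto UMAP arguments are assumptions for demonstration;
# the analyses reported here were run with the scripts in code/scripts/embedding.
import numpy as np
import umap  # umap-learn
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr


def count_crossings(points):
    """Count intersections between non-adjacent segments of a 2D trajectory."""
    def crossed(p1, p2, p3, p4):
        def side(a, b, c):
            return (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
        return (side(p1, p2, p3) * side(p1, p2, p4) < 0
                and side(p3, p4, p1) * side(p3, p4, p2) < 0)
    segments = list(zip(points[:-1], points[1:]))
    return sum(crossed(*segments[i], *segments[j])
               for i in range(len(segments))
               for j in range(i + 2, len(segments)))


def score_candidate(chi, n_neighbors, spread, min_dist, seed):
    """Embed a topic trajectory with UMAP, then score the embedding by (1) how well
    pairwise distances are preserved and (2) how many trajectory segments cross."""
    xi = umap.UMAP(n_components=2, n_neighbors=n_neighbors, spread=spread,
                   min_dist=min_dist, random_state=seed).fit_transform(chi)
    dist_corr, _ = pearsonr(pdist(chi, metric='correlation'),  # topic space
                            pdist(xi))                          # embedding space
    return dist_corr, count_crossings(xi)


# Toy example: score one candidate embedding of a 161-event, 100-topic trajectory.
rng = np.random.default_rng(0)
toy_chi = rng.random((161, 100))
print(score_candidate(toy_chi, n_neighbors=120, spread=7, min_dist=0.7, seed=41))
```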
The optimal parameters (that yielded $\Phi=0$) were $K=170$, $\gamma=7$, $\delta=0.7$, $\tau=41$, with the order of sequence $\varphi$ as the average recall events, episode events, and individuals' recall events, vertically concatenated, in order. +\subsection*{Additional comments on the searchlight analyses (Fig.~\brains)} +The final ``stage'' shown for the episode template pipeline (Fig. \brains A) displays the correlations between pairs of episode topic vectors at nearby timepoints, while the final stage shown for the recall template pipeline (Fig. \brains B) displays the correlations between pairs of recall topic vectors at nearby timepoints. The ``proximal correlation matrices'' we mention in the figure and main text denote that we have taken temporal autocorrelation matrices and masked out everything above the $n$th diagonal (where $n$ is the duration, in TRs, of the longest episode event). When we say ``correlations at nearby timepoints,'' we are referring to the unmasked parts of these proximal correlation matrices. -% \FloatBarrier -\newpage -%\begin{centering} -% \vspace*{\fill} - \section*{Supplementary figures} - -\begin{figure}[tp] -\centering -\includegraphics[width=0.6\textwidth]{figs/topic_space_flow} -\caption{\small \textbf{Methods detail for recall trajectory analysis displayed in Figure~\trajectories B} \textbf{A.} This panel replicates Figure~\trajectories B in the \textit{Main text}, but with two additions. First, individual participants' recall trajectories are displayed (faintly) as light gray lines. Second, three locations on the trajectory have been highlighted (blue, yellow, and red circles). \textbf{B.} These zoomed-in views of the locations highlighted in Panel A show the average trajectory (black) and individual participants' trajectories (gray lines) that intersect the given region of topic space. \textbf{C.} For each circular region of topic space tiling the 2D embedding plane displayed in Panel A, we compute the distribution of angles formed between each participant's trajectory segment (i.e., the point at which the trajectory enters and exists the region of topic space) and the $x$-axis. The distributions of angles for these three example regions are displayed in the colored rose plots. We use Rayleigh tests to assign an arrow direction, length, and color for that region of topic space. Arrows displayed in color are significant at the $p < 0.05$ level (corrected). The arrow directions are oriented according to the circular means of each distribution, and the arrow lengths are proportional to the lengths of those mean vectors. The example regions have been oriented from left to right in decreasing order of consistency across participants.} -\label{fig:arrows-methods} -\end{figure} -\vspace*{\fill} - -\newpage - \subsection*{Participant-level figures referenced in the main text} - %\vspace*{\fill} - %\end{centering} +To compute proximal correlation matrices for the episode, or for a given searchlight’s activity patterns, we simply correlate the topic vectors (or voxel responses) from every pair of timepoints, and then mask out anything beyond the $n$th diagonal. In this way, the RSA analysis that we designed to identify searchlights whose responses show a similar (proximal) correlation structure to the episode’s topics (Fig. \brains C) does not require any further temporal alignment, since both the topic timeseries and voxel responses are computed at the same timepoints. -\begin{figure}[p!] 
-\centering -\includegraphics[width=\textwidth]{figs/corrmats} -\caption{\small \textbf{Recall temporal correlation matrices and event segmentation fits.} Each panel is in the same format as Figure~\topicprops E in the main text. The yellow boxes indicate HMM-identified event boundaries.} -\label{fig:corrmats} -\end{figure} +However, computing the proximal correlation matrix for participants’ recalls requires an additional ``warping'' step to bring the behavioral data (during recall) and neural data (while viewing the episode) into temporal alignment. For example, different participants take different amounts of time to recount the episode, and their transcripts have different lengths. These differences occur at the ``episode'' level (i.e., with respect to total recall duration and/or transcript length) and also at the ``event'' level (i.e., a given participant's recounting of a particular event may be more or less detailed, and take more or less time, than another participant's recounting of the same event). The purpose of the warping step is to temporally stretch or compress the topic timecourse of each participant's recounting so that it is temporally aligned with the voxel responses recorded as the participants were watching the episode. That warping step is what we are highlighting in the two rightmost matrices in Figure~\brains B. -\begin{figure}[p!] -\centering -\includegraphics[width=\textwidth]{figs/matchmats} -\caption{\small \textbf{Episode-recall event correlation matrices.} Each panel is in the same format as Figure~2G in the main text. The yellow boxes mark the matched episode event for each recall event (i.e., the maximum correlation in each column).} -\label{fig:matchmats} -\end{figure} +The transition between the 1st and 2nd stages of the pipeline shown in Figure~\brains B shows the effect of using dynamic time warping to temporally align an example participant’s recall topic proportions matrix with the episode’s topic proportions matrix. The rightmost matrix (stage 1) shows the correlation between the topic vectors for each unwarped recall timepoint (i.e., sliding text window) and each episode timepoint (TR). The middle matrix (stage 2) shows the correlation between the topic vectors for each warped recall timepoint (i.e., TR) and each episode timepoint (TR). In both matrices, the row (episode timepoint) matched to each column (recall timepoint) by the dynamic time warping algorithm is indicated in yellow. We chose to visualize this step by correlating the example recall trajectory with the episode (rather than with itself, as in the other matrices) before versus after warping. We felt this would help to illustrate how the diagonal ``straightens'' as a result of the non-linear ``stretching'' the algorithm performs to align the two timeseries. An important feature of the warping algorithm is that it is a monotonic transformation: it does not change the relative order of the recalled timepoints; it only stretches or compresses different parts of the recall topic proportions matrix while preserving its temporal order. -\begin{figure}[p!] -\centering -\includegraphics[width=.92\textwidth]{figs/k_optimization} -\caption{\small \textbf{Episode and recall topic proportions matrix \textit{K}-optimization functions.} We selected the optimal $K$-value for the episode and each recall topic proportions matrix using the formula described in \textit{Methods}.
This computation resulted in a curve for each matrix, describing the Wasserstein distance between the distributions of within-event and across-event topic vector correlations, as a function of $K$.} -\label{fig:k_optimization} -\end{figure} +The 3rd (leftmost) stage shown in Fig.~\brains B then displays the proximal correlation matrix for the example participant’s recalls. In other words, we computed the correlation matrix for that participant’s warped recall topic proportions (i.e., after they have been temporally aligned to the episode), and then masked out everything beyond the $n$th diagonal. We label axes corresponding to the warped recall trajectory as ``Warped recall time (TR)'' rather than ``Viewing time (TR)'' to help differentiate them from axes corresponding to the episode trajectory, as well as to maintain consistency with how we labeled video-recall and recall-recall correlation matrices in other figures. \newpage \renewcommand{\refname}{Supplementary references}
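For readers who prefer code to prose, the sketch below (hypothetical variable names; not the scripts in \texttt{code/scripts/searchlights}) shows one way to construct a proximal correlation matrix from a timepoints-by-features matrix and to compare a template's and a searchlight's proximal correlation structures. In the analyses reported above, the resulting per-participant correlations were additionally Fisher $z$-transformed, $t$-scored across participants, and compared against phase-shifted null distributions.

```python
# Hypothetical illustration of the proximal correlation comparison described above;
# variable names are made up and this is not the repository's searchlight code.
import numpy as np


def proximal_correlations(timeseries, n):
    """Return the upper-triangle correlations lying within n TRs of the diagonal.

    `timeseries` is a timepoints-by-features matrix (e.g., TRs x topics for the
    episode template, or TRs x voxels for one searchlight cube); `n` is the
    masking width (the duration, in TRs, of the longest episode event)."""
    corrmat = np.corrcoef(timeseries)                # timepoint-by-timepoint correlations
    rows, cols = np.triu_indices_from(corrmat, k=1)  # upper triangle, excluding diagonal
    proximal = (cols - rows) <= n                    # keep only near-diagonal entries
    return corrmat[rows[proximal], cols[proximal]]


def template_similarity(template, searchlight, n):
    """Correlate a searchlight's proximal correlation structure with a template's."""
    return np.corrcoef(proximal_correlations(template, n),
                       proximal_correlations(searchlight, n))[0, 1]


# Toy example: 100 "TRs", a 50-topic template, one 125-voxel searchlight, n = 8 TRs.
rng = np.random.default_rng(0)
episode_topics = rng.random((100, 50))
cube_responses = rng.random((100, 125))
r = template_similarity(episode_topics, cube_responses, n=8)
z = np.arctanh(r)  # Fisher z-transform applied before the across-participant t-test
print(r, z)
```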