What neural mechanism explains the tendency to visually attend to the whole scene before attending to details?
I have the intuition that human vision first attends to large-scale objects and then to small-scale details. Is there any mechanism in the visual cortex that would explain this phenomenon? Is there a resolution-refinement process when we look at scenes?


First, it is not only your intuition - there are many experimental results showing that we first perceive the gist of a scene (for example, is it outdoors or indoors?), then its major parts (was there an animal, or a human figure in it?), and then more and more details (is that figure male or female? what is her expression?) [1] [2]. Note, however, that this is not exactly related to the size of the object, but more to its perceived importance or relevance. (See also this great video about change blindness, which exemplifies that.)

Reverse Hierarchy Theory [3] proposes a mechanism for this: activation in the network flows mostly "bottom up", but conscious perception starts at a higher level and then actively (through attention) accesses "lower level" details as they are needed. Or, in their words:

Classically, the visual system was seen as a hierarchy of cortical areas and cell types. Neurons of low-level areas (V1, V2) receive visual input and represent simple features such as lines or edges of specific orientation and location. Their outputs are integrated and processed by successive cortical levels (V3, V4, medial-temporal area MT), which gradually generalize over spatial parameters and specialize to represent global features. Finally, further levels (inferotemporal area IT, prefrontal area PF, etc.) integrate their outputs to represent abstract forms, objects, and categories. The function of feedback connections was unknown. Reverse Hierarchy Theory proposes that the above forward hierarchy acts implicitly, with explicit perception beginning at high-level cortex, representing the gist of the scene on the basis of a first-order approximate integration of low-level input. Later, explicit perception returns to lower areas via the feedback connections, to integrate into conscious vision with scrutiny the detailed information available there. Thus, initial perception is based on spread attention (large receptive fields), guessing at details, and making binding or conjunction errors. Later vision incorporates details, overcoming such blindnesses


[1] Potter, M. C. (1976). Short-term conceptual memory for pictures. Journal of Experimental Psychology: Human Learning and Memory, 2(5), 509.

[2] Rensink, R. A., O'Regan, J. K., & Clark, J. J. (1997). To see or not to see: The need for attention to perceive changes in scenes. Psychological Science, 8(5), 368-373.

[3] Hochstein, S., & Ahissar, M. (2002). View from the top: Hierarchies and reverse hierarchies in the visual system. Neuron, 36(5), 791-804.


The phenomenon you describe is called the global precedence effect, and was first studied extensively by David Navon (1977). One way to measure this effect is to create conflict between global and local features. For example, Navon presented observers with compound letter stimuli in which small letters were globally organised into a different, larger letter (e.g., a large E built out of small H's).

Observers were instructed to indicate either (a) whether the smaller letters were E or H, or (b) whether the larger letter formed an E or H. Navon found that reaction times were generally faster when the global and local features were congruent. However, the conflict caused by the global form impaired reaction times in (a) much more than the conflict caused by the smaller letters impaired reaction times in (b). This greater interference from the global structure was interpreted as showing that the global form is processed before the local details.
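A minimal sketch of how such a compound-letter (Navon) stimulus can be generated is shown below; the 5 × 5 letter bitmaps are illustrative assumptions, not the stimuli used by Navon (1977).

```python
# Render a Navon-style figure: a large "global" letter whose pixels are
# themselves small "local" letters. Congruent and incongruent stimuli are
# produced simply by choosing the two letters.
GLYPHS = {
    "E": ["#####",
          "#    ",
          "#####",
          "#    ",
          "#####"],
    "H": ["#   #",
          "#   #",
          "#####",
          "#   #",
          "#   #"],
}

def navon(global_letter: str, local_letter: str) -> str:
    """Print the local letter at every 'on' cell of the global letter's bitmap."""
    return "\n".join(
        "".join(local_letter if cell == "#" else " " for cell in row)
        for row in GLYPHS[global_letter]
    )

if __name__ == "__main__":
    print(navon("E", "H"))   # a large E made of small H's (incongruent)
    print()
    print(navon("E", "E"))   # a large E made of small E's (congruent)
```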

This effect was studied in greater detail by Aude Oliva and Philippe Schyns, who presented hybrid images of natural scenes. These images were composed of the high spatial frequency information from one scene and the low spatial frequency information from another scene. For example, the low spatial frequency information of a highway might be combined with the high spatial frequency information of a picture of skyscrapers, and vice versa. They showed that the low spatial frequency information is more useful, particularly when the scenes were viewed only briefly or when participants had to make a very fast judgement.
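A minimal sketch of how such a hybrid image can be constructed is given below, assuming two grayscale scenes of equal size; the filter cutoffs (sigma values) are illustrative assumptions, not the parameters used by Schyns and Oliva (1994).

```python
# Hybrid image: low spatial frequencies from one scene plus high spatial
# frequencies from another (Schyns & Oliva-style stimulus construction).
import numpy as np
from scipy.ndimage import gaussian_filter

def hybrid(scene_low: np.ndarray, scene_high: np.ndarray,
           sigma_low: float = 8.0, sigma_high: float = 3.0) -> np.ndarray:
    """Combine the low-pass content of scene_low with the high-pass content
    of scene_high. Both inputs are grayscale float arrays of equal shape."""
    low = gaussian_filter(scene_low, sigma_low)                   # coarse structure only
    high = scene_high - gaussian_filter(scene_high, sigma_high)   # fine detail only
    return low + high

# Viewed briefly (or from a distance) the low-frequency scene tends to drive
# recognition; with prolonged scrutiny the high-frequency scene takes over.
```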

What is interesting about the study conducted by Schyns and Oliva is that it provides evidence for a neural explanation of why global features dominate over details. This explanation is based on two cell types in the retina that send axons to the thalamus. These cells are roughly divided into two types: the larger magnocellular neurons and the smaller parvocellular neurons, which have different spatial preferences and temporal characteristics. Magnocellular neurons prefer low spatial frequency input and show a rapid, transient response. Parvocellular neurons, on the other hand, prefer colourful, high spatial frequency input and show a slow, sustained response. So the idea is that the magnocellular pathway rapidly carries coarse, low spatial frequency information to the brain to form an initial interpretation of the world. This interpretation is then compared with the more detailed information carried by the parvocellular pathway as it arrives in the cortex.

References

Navon, D. (1977). Forest before trees: The precedence of global features in visual perception. Cognitive Psychology, 9(3), 353-383.

Schyns, P. G., & Oliva, A. (1994). From blobs to boundary edges: Evidence for time- and spatial-scale-dependent scene recognition. Psychological Science, 5(4), 195-200.


Human vision is also biased toward whatever moves first. If both large-scale and small-scale objects are present in the visual field, the object that shows the first sign of movement will be attended first. I believe this is a result of our evolutionary history: humans were hunters, and the mind evolved to detect animal movement in the periphery.


Discussion

The primary finding is that the presence of an attended sound matching the temporal rate of one of a pair of competing ambiguous visual stimuli allows subjects much more control over voluntarily holding that stimulus dominant. Attentional control over the other, temporally mismatched, visual pattern was also influenced by the sound but in the opposite manner. The size of this effect is remarkably large, given that attentional control over binocular rivalry is usually found to be quite weak (Meng and Tong, 2004; Chong et al., 2005; van Ee et al., 2005; Paffen et al., 2006). Importantly, we also showed that active attention to both the sound and the visual stimulus promoted enhanced voluntary control. Below, we argue that this may help to explain why other researchers in psychophysics have failed to find such intimate links between auditory and visual attentional control. We also demonstrated a facilitatory relationship in the opposite direction in that attentional control over audio ambiguity is markedly aided by a matching visual stimulus. Extending this generalization, we demonstrated that a matching tactile stimulus enhanced attentional control in perceptually selecting competing visual stimuli and that this control was further strengthened in a trimodal condition that combined congruent audio-tactile stimuli with the bistable visual stimulus. Figure 4 summarizes the generalization of results across different visual patterns, sound patterns, and sensory modalities.

When the sound was temporally delayed, subjects still sensed that vision and sound were linked because of their constant phase relationship (Fig. 2b). In addition, although we have only provided formal evidence for a mandatory involvement of directed attention in the sound-on-vision experiments (Fig. 3d), our pilot work (supplemental Fig. 5d, available at www.jneurosci.org as supplemental material) and the available literature suggest that attention must be engaged to promote cross-modal interactions (Calvert et al., 1997; Gutfreund et al., 2002; Degerman et al., 2007; Mozolic et al., 2008; for review, see Shinn-Cunningham, 2008). Nevertheless, although a systematic investigation of temporal offset and automation for the cross-modal effects goes beyond the scope of the present paper, it is interesting to note that the underlying rhythm mechanism for our rhythm-based effect may be different from the mechanism underlying automatically occurring coincidence-based auditory-visual interactions (such as in the reported enhanced perception of visual change by a coincident auditory tone pip) (van der Burg et al., 2008).

Our study is unique in that it uses competing bistable visual and bistable auditory stimuli, providing the opportunity to study how competing sensory processing in two modalities (related to percepts rather than physical stimuli) is influenced by signals from other modalities. How do our findings shed light on the mechanisms underlying the resolution of perceptual ambiguity? We suggest that the enhanced capacity for attentional selection of the congruent stimulus results from a boost of its perceptual gain, which is attributable to top-down feedback from multisensory attentional processes that select the congruent feature of the input signal. In support of this, for vision, it has been shown previously that the effect of top-down attention on extending dominance durations for perceptually competing stimuli is equivalent to a boost in stimulus contrast (Chong et al., 2005; Chong and Blake, 2006; Paffen et al., 2006). This is in line with recent studies on visual spatial and feature attention in psychophysics (Blaser et al., 1999; Carrasco et al., 2004; Boynton, 2005) and neurophysiology (Reynolds and Chelazzi, 2004), which demonstrate that the neural mechanism underlying attentional selection involves boosting the gain of the relevant neural population. This is observed in the early cortical stages of both visual (Treue and Maunsell, 1996; Treue and Martínez Trujillo, 1999; Lamme and Roelfsema, 2000; Womelsdorf et al., 2006; Wannig et al., 2007) and auditory processing (Bidet-Caulet et al., 2007). From the present results, we can conclude that the scope of this feedback process can be extended to incorporate relevant multimodal signals. Thus, it appears that voluntary control over ambiguity resolution can be modeled as an increase in effective contrast (perceptual gain) of stimulus elements involving feature attention, as opposed to spatial attention. Dovetailing with this, voluntary control in perceptual bistability depends multiplicatively on stimulus features (Suzuki and Peterson, 2000), and an equivalence between stimulus parameter effects and attentional control is evident even at the level of fit parameters to distributions of perceptual duration data (Brouwer and van Ee, 2006; van Ee et al., 2006). It can also be demonstrated quantitatively, as in a recently developed theoretical neural model (Noest et al., 2007), that attentional gain modulation at early cortical stages is sufficient to explain all reported data on attentional control of bistable visual stimuli (Klink et al., 2008). Thus, there is converging evidence that an early gain mechanism is involved in attentional control of perceptual resolution of ambiguous stimuli, although it is too early to entirely rule out high-level modification.
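To make the gain idea concrete, below is a minimal simulation sketch of the generic scheme discussed above: two populations coding the rival percepts compete through mutual inhibition and slow adaptation, and attention is implemented as a multiplicative gain on the input to one population. All parameter values are illustrative assumptions; this is not a reimplementation of the Noest et al. (2007) model or of the present study's analyses.

```python
# Two-population rivalry sketch: mutual inhibition + slow adaptation, with
# attention modelled as an input gain on population 1.
import numpy as np

def simulate(gain_1=1.15, gain_2=1.0, T=200.0, dt=0.01, seed=0):
    rng = np.random.default_rng(seed)
    n_steps = int(T / dt)
    trace = np.zeros((n_steps, 2))
    rates = np.zeros(2)           # firing rates of the two percept populations
    adapt = np.zeros(2)           # slow adaptation variables
    tau_r, tau_a = 0.02, 2.0      # fast rate dynamics, slow adaptation
    w_inh, w_adapt, noise = 1.5, 1.0, 0.05
    gains = np.array([gain_1, gain_2])
    for t in range(n_steps):
        drive = gains - w_inh * rates[::-1] - w_adapt * adapt   # cross-inhibition
        drive = np.maximum(drive + noise * rng.standard_normal(2), 0.0)
        rates = rates + dt / tau_r * (-rates + drive)
        adapt = adapt + dt / tau_a * (-adapt + rates)
        trace[t] = rates
    return trace

# With gain_1 > gain_2 (attention boosting percept 1), percept 1 dominates for
# a larger fraction of the simulated time, illustrating the idea that
# attentional control acts like an increase in effective contrast.
trace = simulate()
print("fraction of time percept 1 dominates:", np.mean(trace[:, 0] > trace[:, 1]))
```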

Although there is support for the idea that auditory and visual attention are processed separately (Shiffrin and Grantham, 1974; Bonnel and Hafter, 1998; Soto-Faraco et al., 2005; Alais et al., 2006; Pressnitzer and Hupé, 2006; Hupé et al., 2008), our findings support the neurophysiological literature (Calvert et al., 1997; Gutfreund et al., 2002; Shomstein and Yantis, 2004; Amedi et al., 2005; Brosch et al., 2005; Budinger et al., 2006; Degerman et al., 2007; Lakatos et al., 2007, 2008; Shinn-Cunningham, 2008) indicating that the mechanisms mediating multisensory attentional control are intimately linked. To understand these seemingly disparate results, note first that the psychophysical studies that found separate processing focused on spatial attention, as opposed to our study. Our findings concern feature attention and agree with recent findings that feature attention can more profoundly influence processing of stimuli than spatial attention (Melcher et al., 2005; Kanai et al., 2006). Note further that we presented the matched audio and visual stimuli simultaneously. The only other study on attentional control of ambiguous auditory and visual stimuli (Pressnitzer and Hupé, 2006) presented the stimuli from the two modalities separately in time, finding that results from the two modalities were unrelated. Although there are studies reporting that audiovisual stimulus combination is mandatory (Driver and Spence, 1998; Guttman et al., 2005), this is not a general view (Shiffrin and Grantham, 1974; Bonnel and Hafter, 1998; Soto-Faraco et al., 2005; Alais et al., 2006; Hupé et al., 2008). Our experiments address this by using perceptually ambiguous competing auditory and visual stimuli, thereby dissociating attention and stimulation to reveal that active attention to both modalities promotes audiovisual combination, in line with other recent studies (Calvert et al., 1997; Gutfreund et al., 2002; Degerman et al., 2007; Mozolic et al., 2008).

Our data suggest a functional role for neurons recently found in human posterior parietal, superior prefrontal, and superior temporal cortices that combine voluntarily initiated attentional functions across sensory modalities (Gutfreund et al., 2002; Shomstein and Yantis, 2004; Degerman et al., 2007). We suggest that when the brain can detect a rhythm in a task, attention feeds back to unisensory cortex to enforce coherent and amplified output of the matching perceptual interpretation. Recently, neurophysiologists were able to demonstrate that an attended rhythm in a task enforced the entrainment of low-level neuronal excitability oscillations across different sensory modalities (Lakatos et al., 2008). The fact that oscillations in V1 entrain to attended auditory stimuli just as well as to attended visual stimuli reinforces the view that the primary cortices are not the exclusive domain of a single modality input (Foxe and Schroeder, 2005; Macaluso and Driver, 2005; Ghazanfar and Schroeder, 2006; Kayser and Logothetis, 2007; Lakatos et al., 2007) and confirms the role of attention in coordinating heteromodal stimuli in the primary cortices (Brosch et al., 2005; Budinger et al., 2006; Lakatos et al., 2007, 2008; Shinn-Cunningham, 2008). We suggest that the same populations of neurons may control multimodal sensory integration and attentional control, suggesting that the neural network that creates multimodal sensory integration may also provide the interface for top-down perceptual selection. However, our understanding of multisensory neural architecture is still developing (Driver and Noesselt, 2008; Senkowski et al., 2008), and a competing view, rather than focusing on feedback from multisensory to unisensory areas, proposes that multisensory interactions can occur because of direct feedforward convergence at very early cortical areas previously thought to be exclusively unisensory (Foxe and Schroeder, 2005; Ghazanfar and Schroeder, 2006). Testing competing views will require further studies, possibly using neuroimaging techniques with high temporal resolution or neurodisruption techniques to temporarily lesion the putative higher-level area.

Conclusion

In sum, our novel paradigm involving ambiguous stimuli (either visual or auditory) enabled us to demonstrate that active attention to both the auditory and the visual pattern was necessary for enhanced voluntary control in perceptual selection. The audiovisual coupling that served awareness was therefore not fully automatic, not even when the auditory and visual streams had the same rate and phase. This suggests a functional role for neurons that combine voluntarily initiated attentional functions across different sensory modalities (Calvert et al., 1997; Gutfreund et al., 2002; Shomstein and Yantis, 2004; Amedi et al., 2005; Brosch et al., 2005; Budinger et al., 2006; Degerman et al., 2007; Lakatos et al., 2007, 2008), because in most of these studies congruency effects were not seen unless attention was actively used. This squares with psychophysics and neurophysiology showing intimate links between active attention and cross-modal integration (Spence et al., 2001; Kanai et al., 2007; Lakatos et al., 2007; Mozolic et al., 2008; Shinn-Cunningham, 2008). Thus, these attention-dependent multisensory mechanisms provide structure for attentional control of perceptual selection in two ways. First, in responding to intermodal congruency, they may boost the baseline response of the congruent alternative (as there is more “proof” for a perceptual interpretation when it is supported by two converging modality sources). Second, they may increase attentional control over perceptual selection because a multiplicative gain will be more significant when acting on a higher baseline, therefore allowing more attentional control.


Introduction

For centuries, researchers have tried to unravel the mechanics of the human visual system—a system that can successfully identify complex, naturalistic objects and materials across an unimaginably wide range of images. Many of the lower-level mechanisms within this system are now quite well understood [1–3]. For example, networks of cells have been identified that are specifically tuned to orientations, colours, spatial frequencies, temporal frequencies, motion directions, and disparities [4,5]. Cells further along the visual processing hierarchy are sensitive to more complex stimulus characteristics, and are much harder to characterize [6]. However, recent advances in artificial neural networks hold some promise for developing detailed, image-computable process models of sophisticated visual inferences, such as object recognition in arbitrary photographs [7–10].

Artificial neural networks provide an experimental platform for simulating complex visual abilities, and then carefully probing the role of specific objective functions, training sets and network architectures that yield human-like performance. By concentrating on a single task—such as the estimation of a particular physical property from the image—it becomes easier to single out the learned features of a network. Having developed a model that mimics human behaviour, the response properties of all units in the network can be measured with arbitrary precision over arbitrary conditions, like an idealised form of in vivo systems neuroscience performed on a model system rather than real tissue.

A particularly intriguing visual ability is the perception of liquids. Liquids can adopt an extraordinary range of different appearances because of their highly mutable shapes, which are influenced both by internal physical parameters, such as viscosity, and external forces, such as gravity. The most important physical property distinguishing different liquids is viscosity. Yet to estimate viscosity, the visual system must somehow discount the contributions of the external forces to the observed behaviour. For example, a viscous liquid can be made to flow and splash somewhat like a runny liquid if propelled with sufficient speed. The behaviour of liquids is governed by complex physical laws, and it is rather unlikely that we infer the viscosity of a given liquid by explicitly simulating the flow of particles within the liquid (although see [11,12]). Previously, we found that observers draw on a range of optical, shape and motion cues to identify liquids and infer their properties [13–16]. However, the stimulus features underlying such inferences are often only loosely defined. To date there is still no image-computable model that can predict the perception of liquids or their viscosity. Here, we sought to leverage recent advances in deep neural networks (DNNs) to develop such a model and then probe its inner workings to generate novel hypotheses about how the human visual system estimates viscosity.

In machine learning, most work on artificial neural networks concentrates on achieving the best possible performance in a given task. In this study, by contrast, rather than seeking to develop a network that is mathematically optimal at estimating viscosity, we seek to develop a feedforward convolutional network that most closely mimics the behaviour of the human visual system. To evaluate the extent to which models resembled humans, we asked observers to judge viscosity in the same movies that were shown to the trained neural networks.

The neural networks used here had a ‘slow-fusion’ architecture [17] for processing movie data (as opposed to static frames). They were trained on a dataset of 100,000 computer-generated fluid simulation animations, 20 frames long, depicting liquids interacting in ten different scene classes, which induced a wide variety of behaviours (pouring, stirring, sprinkling, etc.; Fig 1). Their training objective was to estimate the physical viscosity parameter in the simulations. To test generalization, the tenth scene was not used during training, and 0.8% of the simulations in each scene were withheld for validation during training. The training labels corresponded to the sixteen different physical viscosity steps that were simulated. For comparison, human observers performed a viscosity rating task, in which they viewed 800 of these stimuli and assigned perceived viscosity labels. The networks were trained on physical viscosity labels—not human ratings—but we used Bayesian optimization of the network’s hyperparameters (e.g., learning rate, momentum) and layer-specific settings (kernel sizes, number of filters) to search for networks that correlated well with humans on the 800 perceived viscosity labels. Importantly, training was relatively short, with only 30 epochs (30 repetitions of the entire training set). With the networks in hand, we then analysed their internal representations to identify characteristics that led to human-like behaviour.

Different liquid interactions were simulated, such as pouring, rain, stirring and dipping. Optical material properties and illumination maps were randomly assigned, with the white plane and square reservoir staying constant. S1 Video shows the moving stimuli.
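As a concrete illustration of the 'slow-fusion' idea, below is a minimal PyTorch sketch of a video network whose 3-D convolutions gradually collapse the temporal dimension of a 20-frame clip. The layer sizes and kernel choices are illustrative placeholders, not the hyperparameters found by the Bayesian search described above.

```python
# Slow-fusion style network for short movie clips: temporal information is
# merged gradually across successive 3-D convolutional stages.
import torch
import torch.nn as nn

class SlowFusionNet(nn.Module):
    def __init__(self, n_classes: int = 16):   # sixteen physical viscosity steps
        super().__init__()
        self.features = nn.Sequential(
            # input: (batch, channels=3, frames=20, height, width)
            nn.Conv3d(3, 32, kernel_size=(4, 7, 7), stride=(2, 2, 2)), nn.ReLU(),
            nn.Conv3d(32, 64, kernel_size=(4, 5, 5), stride=(2, 2, 2)), nn.ReLU(),
            nn.Conv3d(64, 128, kernel_size=(3, 3, 3), stride=(1, 2, 2)), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),            # collapse remaining time and space
        )
        self.classifier = nn.Linear(128, n_classes)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        return self.classifier(self.features(clips).flatten(1))

# Usage on a dummy batch of two 20-frame RGB clips at 64 x 64 resolution.
net = SlowFusionNet()
print(net(torch.randn(2, 3, 20, 64, 64)).shape)   # torch.Size([2, 16])
```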

Our main analyses and findings are as follows. To determine whether we have a model that is sufficiently close to human performance to warrant further analysis, we first compared the networks’ predictions with human perceptual judgments on a stimulus-by-stimulus basis. We find that a network trained to estimate physical viscosity indeed predicts average human viscosity judgments roughly as well as individual humans do. This need not have been the case. Humans learn to perform a much wider range of visual tasks on a much more diverse visual diet, so it is not trivial that such a network trained on physical labels and computer simulations predicts both the errors and successes of human performance. We also find that the best predictions arise when networks are trained for a relatively short duration.

Second, having established that the network mimics human performance, we sought to gain insights into the inner workings of the network by analysing the response properties of individual units at various stages of the network (‘virtual electrophysiology’). We did this by: (a) comparing their responses to a set of hand-engineered features and ground-truth scene properties, (b) identifying stimuli that most strongly or weakly drive units, and (c) directly visualizing features through activation maximization. Together, these analyses revealed that many units are tuned to interpretable spatiotemporal and colour features. Yet we also find a distinct population of units with nontrivial response properties (i.e., whose responses are poorly explained by any of the features we considered), and which are especially important for the performance of the network. We also show that linear combinations of the hand-engineered features are insufficient on their own to account for human viscosity perception, further reinforcing the importance of the additional units.
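For readers unfamiliar with activation maximization (point (c) above), a minimal sketch of the idea is given below: gradient ascent on the input so that one chosen unit responds as strongly as possible. The model, layer and unit index are placeholders for whatever trained network is being probed; this is not the paper's analysis code.

```python
# Activation maximization: optimise a noise input so that a chosen unit
# (here, one feature map in one layer) is driven as strongly as possible.
import torch

def activation_maximization(model, layer, unit_index, input_shape,
                            steps=200, lr=0.05):
    model.eval()
    acts = {}
    handle = layer.register_forward_hook(
        lambda module, inputs, output: acts.update(value=output))

    x = torch.randn(1, *input_shape, requires_grad=True)   # start from noise
    optimizer = torch.optim.Adam([x], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        model(x)                                           # hook fills acts["value"]
        loss = -acts["value"][0, unit_index].mean()        # maximise mean activation
        loss.backward()
        optimizer.step()
    handle.remove()
    return x.detach()   # the optimised input visualises the unit's preferred feature
```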

Third, we analysed network representations at the level of whole layers (‘virtual fMRI’), and studied the effects of network capacity (i.e., number of units) on the internal representation. The main findings are: (1) a gradual transition from low-level image descriptors to higher level features along the network hierarchy, and (2) a striking dependency of the internal representations on the number of units, practically independently of overall performance and the ability to predict human judgments. This suggests that caution is required in inferring the properties of biological visual systems from models with seemingly similar performance.

Finally, we compared representations at the level of entire networks, to confirm whether 100 instances of the same architecture trained on the same dataset yielded similar internal representations (‘virtual individual differences’). The results indeed reveal highly similar performance, with slightly declining similarity along the network hierarchy (i.e., low level representations are almost identical across networks, later stages differ more). We also compared our model against other network architectures (pre)trained on other datasets, finding that training the architecture studied here on the particular training set we used yields the closest correspondence to human judgments.


Materials and Methods

Participants

Twenty-one young adults participated in the study. All participants were right-handed, native English speakers with no history of psychiatric or neurological illness. Participants provided written informed consent in accordance with the Institutional Review Board of Duke University Medical Center. One participant was excluded for excessive head motion and one was excluded for problems with image acquisition, leaving data from 19 participants included in the analysis (9 female; ages 18 and older, M = 23.0, SD = 3.1). In addition, one participant was removed only from analyses that directly compare “remember” versus “high-confidence old” judgments due to having no “remember” responses in the neutral-semantic condition.

Stimuli

Stimuli included 630 pictures from the International Affective Picture System (Lang, Bradley, & Cuthbert, 2008) as well as from an in-house, standardized database that allowed us to equate the pictures better for visual complexity and content (e.g., human presence). Pictures were assigned on the basis of a 9-point normative valence scale to emotionally negative (valence: 1–4), neutral (valence: 4–6), and positive (valence: 6–9) conditions. In accordance with the picture selection procedure, standardized valence scores were lower for negative (M = 2.85, SD = .62) than neutral pictures (M = 5.14, SD = .43; t(418) = 43.98, p < .001), and higher for positive (M = 7.02, SD = .54) than neutral pictures (t(418) = 39.85, p < .001). Additionally, arousal scores (1 = calm, 9 = excited) were greater for negative (M = 5.72, SD = .49) than neutral pictures (M = 3.51, SD = .49; t(418) = 45.95, p < .001), greater for positive (M = 5.68, SD = .59) than neutral pictures (t(418) = 40.91, p < .001), and did not significantly differ between negative and positive pictures (t(418) = .62, p = .54).

Procedure

Participants performed both encoding and recognition memory tasks in the scanner, with a 2-day delay between tasks. During encoding, participants viewed 140 negative, 140 positive, and 140 neutral pictures. The encoding session consisted of 10 functional runs, across which negative, positive, and neutral pictures were evenly divided. Runs alternated between two distinct tasks, semantic and perceptual, described below. To avoid the induction of long-lasting mood states, the pictures within each block were pseudo-randomized so that no more than three pictures of the same valence were consecutively presented. The assignment of encoding stimulus lists to the semantic versus perceptual task was counterbalanced across participants.

Semantic and perceptual tasks are illustrated in Figure 1-A . In the semantic task, participants were instructed to analyze each picture carefully for its meaning and interpretation, so that after the picture was taken away, they could choose between two possible descriptions of the picture. In the perceptual task, participants were instructed to analyze each picture carefully for its perceptual features, particularly colors and lines, so that after the picture was taken away, they could decide, for example, whether there was more red versus green or more horizontal versus vertical lines in the picture. Critically, participants were cued before each run as to which task was next, so that they were able to tailor their processing of each picture to the current task.

Trial structure was similar between tasks ( Figure 1-A ). For each trial a picture was presented for 2 seconds. A jittered fixation interval followed each picture presentation, drawn from an exponential distribution with a mean of 2 seconds. After this interval the participant was instructed to rate the picture for its emotional arousal or intensity on a 4-point scale (1 = calm, 4 = excited). The rating screen remained on-screen for 1 second and was immediately followed by a question screen, which varied by task. In the semantic task, the question screen said, “Which word best describes the picture?” Two possible options were presented on-screen, both of which were written for each picture such that both could be related to the picture but only one described the true meaning of the picture. In the perceptual task, the question screen said, “Which feature are there more of?” Two possible options were presented on-screen: either two color names or the words horizontal and vertical. The question screen remained for 1 second, followed by another jittered fixation interval (mean = 2 s) before the next trial. Responses were collected until the next picture appeared.

Two days after encoding, participants completed a recognition task for the pictures (see Figure 1B ). An additional 70 emotionally negative, 70 positive, and 70 neutral pictures were presented as distracters. Pictures were each presented for 2 seconds, followed by a jittered fixation interval (mean = 2 s). Participants indicated whether the item was old or new using a 5-point scale, with 1 = definitely new, 2 = maybe new, 3 = maybe old, 4 = definitely old, and 5 = remember. Participants were instructed that a remember response indicated the recollection of a specific detail from when they saw that picture during the encoding period, whereas a definitely old response did not include any specific details.

Behavioral analyses

Average arousal ratings and question accuracy were calculated separately for each trial type. To measure differences in memory responding between conditions, hit rates, false alarm rates, and d’ scores were evaluated for each trial type. In signal detection models, sensitivity to the memory signal is measured as d’ (the difference between z-transformed hit and false alarm rates) (Macmillan & Creelman, 2005). Because the effect of emotion on memory tends to be strongest when only highly confident responses or recollection estimates are considered (Dolcos et al., 2005; Ochsner, 2000), d’ was evaluated with its criterion between 3 (‘maybe old’) and 4 (‘definitely old’). That is, responses of 4 and 5 (‘remember’) were taken as ‘old’ and the rest were taken as ‘new’ responses. Encoding response data and d’ scores were entered into separate repeated-measures ANOVAs with emotion (negative, neutral, positive) and task (deep, shallow) as factors. Subsequent post-hoc statistics consisted of repeated-measures ANOVAs with the corresponding factors and variables of interest.
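A minimal sketch of the signal-detection computation described above is given below: d' is the difference between the z-transformed hit and false-alarm rates, with responses of 4 and 5 counted as 'old'. The loglinear correction for extreme rates and the example counts are assumptions for illustration, not values from the study.

```python
# d' from trial counts: z(hit rate) - z(false-alarm rate).
from scipy.stats import norm

def d_prime(hits, misses, false_alarms, correct_rejections):
    # loglinear correction avoids infinite z-scores when a rate is 0 or 1
    hit_rate = (hits + 0.5) / (hits + misses + 1.0)
    fa_rate = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    return norm.ppf(hit_rate) - norm.ppf(fa_rate)

# Hypothetical counts: 50 hits / 20 misses on old items,
# 10 false alarms / 60 correct rejections on new items.
print(round(d_prime(50, 20, 10, 60), 2))
```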

fMRI Methods

Scanning

Images were collected using a 4T GE scanner. Stimuli were presented using liquid crystal display goggles (Resonance Technology, Northridge, CA), and behavioral responses were recorded using a four-button fiber optic response box (Resonance Technology). Scanner noise was reduced with earplugs and head motion was minimized using foam pads and a headband. Anatomical scanning started with a T2-weighted sagittal localizer series. The anterior (AC) and posterior commissures (PC) were identified in the midsagittal slice, and 34 contiguous oblique slices were prescribed parallel to the AC-PC plane. High-resolution T1-weighted structural images were collected with a 24-cm field of view (FOV), a 256 × 256 matrix, 68 slices, and a slice thickness of 1.9 mm. Functional images were acquired using an inverse spiral sequence with a 2-sec TR, a 31-msec TE, a 24-cm FOV, a 64 × 64 matrix, and a 60° flip angle. Thirty-four contiguous slices were acquired with the same slice prescription as the anatomical images. Slice thickness was 3.8 mm, resulting in 3.75 × 3.75 × 3.8 mm voxels.

Statistical analyses

Preprocessing and data analyses were performed using SPM5 software implemented in Matlab (www.fil.ion.ucl.ac.uk/spm/). After discarding the first 6 volumes, the functional images were slice-timing corrected and motion-corrected, spatially normalized to the Montreal Neurological Institute (MNI) template, spatially smoothed using an 8 mm isotropic Gaussian kernel, and resliced to a resolution of 3.75 × 3.75 × 3.8 mm voxels. For each subject, evoked hemodynamic responses to event types were modeled with a delta (stick) function corresponding to stimulus presentation convolved with a canonical hemodynamic response function within the context of the general linear model, as implemented in SPM5. Main event types were modeled at the fixed effects level, representing all possible combinations of emotion (negative, neutral, positive), encoding task (semantic, perceptual), and memory accuracy (hits, misses, false alarms, correct rejections). Given our focus on the amygdala and available fMRI evidence that this region contributes similarly to emotional memory for positive and negative pictures (e.g., Anders, Lotze, Erb, Grodd, & Birbaumer, 2004; Garavan, Pendergrass, Ross, Stein, & Risinger, 2001; Hamann, Ely, Grafton, & Kilts, 1999; Hamann & Mao, 2002), positive and negative trials were collapsed into a single emotion category in all statistical analyses. Confounding factors (head motion, magnetic field drift) were included in the model. Because the theoretical focus of the current analysis is on effects of arousal, rather than valence, positive and negative scenes were combined at the random effects level to form the emotional event type.
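To make the event-modelling step concrete, below is a minimal sketch of building one such regressor: a delta (stick) function at each stimulus onset convolved with a double-gamma canonical HRF. The HRF parameters are the commonly used defaults, and the onsets, TR, and run length are made-up example values rather than details from this study.

```python
# Build an event regressor: stick function at stimulus onsets convolved with
# a canonical double-gamma hemodynamic response function.
import numpy as np
from scipy.stats import gamma

def canonical_hrf(tr=2.0, duration=32.0):
    t = np.arange(0.0, duration, tr)
    peak = gamma.pdf(t, 6)               # positive response peaking around 5-6 s
    undershoot = gamma.pdf(t, 16) / 6.0  # late undershoot
    hrf = peak - undershoot
    return hrf / hrf.max()

def event_regressor(onsets_s, n_scans, tr=2.0):
    sticks = np.zeros(n_scans)
    sticks[(np.asarray(onsets_s) / tr).astype(int)] = 1.0   # delta at each onset
    return np.convolve(sticks, canonical_hrf(tr))[:n_scans]

# e.g., one event type with onsets at 10, 50, and 90 s in a 100-scan run:
regressor = event_regressor([10, 50, 90], n_scans=100)
```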

Our first goal was to investigate how perceptual versus semantic processing modulates the effects of emotion on retrieval-related activity. Given that the focus of this first goal was on quantitative memory differences, we used a parametric approach to identify activity that varied with memory strength and then investigated how this activity was affected by emotion and the encoding task. For each participant, a linear parametric regressor was used to model the recognition response to old items, with 1 = definitely new, 2 = maybe new, 3 = maybe old, and 4 + 5 collapsed together for definitely old. High-confidence responses were collapsed together in this model in order to investigate effects of memory strength, rather than recollection. Estimates for the parametric regressor were generated for each participant, and then entered into group-level t-tests to evaluate the effects of emotion (emotional vs. neutral pictures) as a function of previous encoding task (perceptual vs. semantic processing). To specify further the interaction between emotional arousal and prior processing type on memory success, a second model was run in which arousal ratings made for each scene during encoding were entered as a parametric regressor, and activations during high-confidence trials were contrasted as a function of encoding task (perceptual versus semantic).

Our second goal was to test whether prior perceptual versus semantic encoding of emotional stimuli differentially influences recollection- versus familiarity-based neural activations. Thus, whereas our first goal focused on quantitative differences in memory (memory strength), our second goal focused on qualitative differences (recollection vs. familiarity). For this goal, we used an ANOVA approach with emotion (emotional, neutral), encoding task (perceptual, semantic), and memory type (recollection, familiarity) as factors. As in previous fMRI studies (e.g., Yonelinas, Otten, Shaw, & Rugg, 2005), we measured recollection using Remember (5) responses (mean number of trials in each bin: 11 for neutral perceptual, 36 for emotional perceptual, 17 for neutral semantic, and 44 for emotional semantic) and familiarity using high-confidence (4) recognition responses (mean number of trials in each bin: 18 for neutral perceptual, 37 for negative perceptual, 21 for neutral semantic, and 40 for negative semantic). High-confidence (4) responses were described to the participants as being equally familiar as the Remember responses and differed only in recollection of specific details from the encoding period. Thus, this comparison is the cleanest way to discriminate between recollection and familiarity and can be interpreted in concert with the parametric strength analysis, which collapsed across these response types. Main effects and interactions were assessed by weighting condition types in the ANOVA framework. For visualization purposes only, regions-of-interest analyses were performed by extracting the mean beta value from all significantly active voxels within the functional cluster of interest and plotting these as a function of experimental condition.

Our third goal was to investigate the effects of perceptual vs. semantic processing on amygdala connectivity during successful emotional memory retrieval. A seed region for the functional connectivity analysis was selected from a general emotion (emotional, neutral) by retrieval success (hits, misses) interaction in the direction of emotional > neutral and hit > miss. This analysis identified a right amygdala cluster, which showed greater hit-miss differences for emotional than neutral stimuli (xyz = 23, 11, …) and was unbiased with respect to the effects of encoding task. Subsequently, each trial was modeled as a separate event, yielding different beta values for each trial and each subject in the seed cluster of interest (Rissman, Gazzaley, & D'Esposito, 2004), and correlations were examined between the time series activity of the seed with all other voxels in the brain. A box was built using all the voxels directly adjacent to the peak coordinate within the functional amygdala cluster from the general test of successful emotional memory (emotional > neutral, hits > misses). A correlation map was created for each condition that displayed the correlation magnitude between every voxel and the amygdala seed region over time. Correlation maps were subsequently entered into SPM to identify brain regions showing differential connectivity as a function of experimental condition. To determine amygdala connectivity effects for successful emotional retrieval, connectivity analyses were examined within the successful retrieval network, defined as hits > misses.
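Below is a minimal sketch of the beta-series correlation logic used here (Rissman, Gazzaley, & D'Esposito, 2004): the seed's series of single-trial betas is correlated with the corresponding series in every other voxel. The array shapes and the condition-splitting step are illustrative assumptions, not the study's actual pipeline.

```python
# Beta-series functional connectivity: correlate per-trial seed betas with
# per-trial betas of all other voxels.
import numpy as np

def beta_series_connectivity(seed_betas, voxel_betas):
    """seed_betas: shape (n_trials,); voxel_betas: shape (n_trials, n_voxels).
    Returns one Pearson correlation per voxel."""
    seed = (seed_betas - seed_betas.mean()) / seed_betas.std()
    vox = (voxel_betas - voxel_betas.mean(axis=0)) / voxel_betas.std(axis=0)
    return (seed[:, None] * vox).mean(axis=0)

# Condition-specific connectivity maps are obtained by restricting the trials
# entered here (e.g., emotional hits only) and then contrasting the maps.
```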

To control for family-wise error resulting from multiple comparisons, we performed a Monte Carlo simulation (Slotnick et al., 2003). This procedure determines the height and cluster extent threshold sufficient to yield a corrected threshold of p < 0.05. Based on the results of the simulation, clusters were considered if they exceeded an uncorrected threshold of p < 0.001 with 10 or more contiguous voxels (3.75 mm isotropic) for whole-brain analyses. In the case of the targeted analysis that assesses differences between “remember” versus “definitely old” responses on MTL activity, activations were considered if they exceeded an uncorrected threshold of p < 0.005 with 3 or more contiguous voxels in the focal, hypothesized region of interest (ROI) (bilateral MTL). Conjunction analyses were assessed by entering individual contrasts at p < .001 uncorrected, such that they formed a joint threshold probability of p < .000001. All activations are presented according to neurological convention. In the figures, statistically significant activity is projected onto a single-subject T1 structural image template. Brodmann Area (BA) and gyral localizations of activations were determined using the WFU PickAtlas and the Talairach Client (http://www.talairach.org/client.html).



Psychology: Chapter 9

A) Her general conclusion from specific evidence is not a causal inference.

B) The conclusion is corroborated by an independent party.

C) The general statement upon which she bases her specific premise is true.

A) A form of judgment that discounts the causal theory over correlational theory

B) The tendency to selectively attend to information that supports one's general beliefs while ignoring evidence that contradicts one's beliefs

C) The tendency of people to view events as being more predictable than they really are once they occur

A) They help us organize our perceptions of the world.

B) They consist of visual representations created by the brain once the original stimulus gets activated.

C) They are structures in the mind that stand for an external object or thing sensed in the present.

A) Compared to single-language 6- to 9-month-old infants, bilingual infants of the same age discriminate similar sounds.

B) People who are fluent in two languages apparently are capable of more efficient cognitive processing than those who speak only one.

C) Brains of bilingual babies are less responsive to a wide range of sounds.

A) It is the ability to imagine things that are not currently being perceived.

B) It is found that the brain is less active during visual imagery than it is during visual perception.

C) It usually occurs only through verbal formulation of thoughts.

A) males and females perform at the same skill level on mental rotation tasks.

B) females generally do better than males on mental rotation tasks.

C) both males and females are rarely if ever skilled at mental rotation tasks.

A) They are less useful for thinking about things one sensed in the past.

B) They usually do not allow one to imagine things in the future.

C) They are frequently not about things one is currently sensing.

A) Visual perception can be measured on an ordinary scale, whereas visual imagery is abstract, and it is difficult to determine its intensity.

B) Visual perception occurs through verbal formulation, whereas visual imagery primarily occurs through mental rotation.

C) Visual perception occurs in the absence of sensory stimulus, whereas visual imagery is imagining an object turning in three-dimensional space for a long period of time.


From Fragments to Objects

Shaun P. Vecera, Marlene Behrmann, in Advances in Psychology, 2001

WHAT IS AN OBJECT?

Before reviewing findings and accounts of object-based attention, we must be clear what the term “object” means. In the context of attentional selection, “objects” refer to perceptual groups or units (see Logan, 1996, for example). These perceptual groups are formed through the application of the well-known gestalt principles of organization, principles such as proximity, similarity, good continuation, closure, connectedness, and so forth. Multiple theoretical accounts and many empirical results suggest that gestalt principles operate early in visual processing at a preattentive level (e.g., Julesz, 1984; Neisser, 1967; Treisman & Gelade, 1980). Further, a single perceptual group may have a hierarchical organization. A perceptual group may contain parts, and there are perceptual principles that can be used to define the parts of a perceptual group (e.g., Hoffman & Richards, 1984; Hoffman & Singh, 1997; Vecera, Behrmann, & Filapek, in press; Vecera, Behrmann, & McGoldrick, 2000). These perceptual grouping principles allow visual space or spatiotopic features to be organized. We refer to this perceptual grouping definition of “object” as a “grouped array” representation. The grouped array is an array-format, or spatiotopic, representation that codes features in specific retinal locations, similar to Treisman’s (1988) feature maps. Various gestalt grouping principles organize this array into coherent chunks of visual information that correspond to objects or shapes. (Also see the next section of this volume for computational models of unit formation and grouping.) The spatial representations that underlie object-based attention may be shared with spatial attention (see Valdes-Sosa et al., 1997, for relevant results, which we discuss below).

Our definition of “object” points out a close connection between object segregation processes and object-based attention processes. Object segregation refers to the visual processes that determine which visual features combine to form a single shape and which features combine to form other shapes. Object segregation is synonymous with perceptual organization, the term used in conjunction with the gestalt principles of visual organization (e.g., Wertheimer, 1923/1958). The ability to perform figure-ground segregation and distinguish foreground shapes (‘figures’) from background regions also involves segregation processes (e.g., Rubin, 1915/1958), although figure-ground segregation may follow earlier image segregation processes (Vecera & O’Reilly, 1998). An example of object segregation appears in Figure 1, which contains two perceptual groups that are formed by the gestalt principles of proximity and good continuation.

Figure 1. An example of object segregation in which gestalt proximity and good continuation form two perceptual groups (two lines). The small line segments of the top line group together because they are closer to one another than they are to the small segments of the bottom line.

The features are individual line segments that are organized into two distinct shapes—two lines, a straight line and a squiggly line. Note that these two “objects” (lines) are approximately equal in their salience. Neither object appears to grab attention more effectively than the other object. However, in such a display, empirical evidence indicates that one of these objects could be selectively attended.

The fact that the two objects in Figure 1 have approximately equal salience indicates that the human visual system must be capable of somehow creating a processing bias favoring one of these objects over the other. Object-based attention (that is, directing attention to one of these objects) may provide a mechanism for favoring either the straight line or the squiggly line in Figure 1. Object-based attention refers to the visual processes that select a segregated shape from among several segregated shapes. As we noted above, object segregation and object-based attention likely are interrelated—before a shape can be selected, the features of the shape first must be segregated from features of other shapes to some extent. In Figure 1, before an observer could attend to the squiggly line, the features of that line must be grouped together (and grouped separately from the features of the straight line). Further, object-based attention is more efficient when it is directed to a single object; that is, observers can select either the straight line or the squiggly line with relatively little effort. In contrast, it is more difficult to divide object-based attention across multiple objects: if an observer needed to attend to both lines, object-based selection would be more effortful. Object-based attention either would have to shift between the two lines or would need to be divided between the two lines. Either shifting or division of attention causes performance to decline, and this declining performance is the basis of many object-based attentional effects reported in the literature (e.g., Baylis & Driver, 1993; Behrmann, Zemel, & Mozer, 1998; Duncan, 1984, 1993a, 1993b; Egly, Driver, & Rafal, 1994; Vecera, 1994; Vecera & Farah, 1994). Many of these object-based attentional effects are influenced by the spatial position of objects, indicating that object-based attention may involve the selection of grouped locations (Vecera, 1994; Vecera & Farah, 1994). However, the coordinate system of these grouped locations is poorly understood, and not all forms of object selection may involve attending to grouped locations (Vecera & Farah, 1994; Lee & Chun, in press).

In sum, any account of object-based attention needs to explain (1) the segregation processes that provide the input to object attention and (2) the object selection effect, in which one object and all of its features are more readily attended than multiple objects (or multiple features on different objects). We now turn to the key ideas behind the biased competition account that we will discuss in conjunction with behavioral studies of object attention. Because visual scenes contain many objects that compete with one another for attention, the visual system must allocate processing to one object over others. This allocation is achieved by biasing processing toward one object. This bias provides a resolution for the competition between objects. For example, the two objects in Figure 1 compete with one another for attention, yet observers can selectively process either of the lines, even though neither line has an ‘inherent’ processing advantage. The biased competition account attempts to explain how some objects are selected over others (also see Vecera, in press).


Results

The network used in this study—VGG-16 (Simonyan and Zisserman, 2014)—is shown in Figure 1A and explained in Materials and methods, 'Network Model'. Briefly, at each convolutional layer, the application of a given convolutional filter results in a feature map, which is a 2-D grid of artificial neurons that represent how well the bottom-up input at each location aligns with the filter. Each layer has multiple feature maps. Therefore a 'retinotopic’ layout is built into the structure of the network, and the same visual features are represented across that retinotopy (akin to how cells that prefer a given orientation exist at all locations across the V1 retinotopy). This network was explored by Güçlü and van Gerven (2015), who showed that early convolutional layers of this CNN are best at predicting activity of voxels in V1, while late convolutional layers are best at predicting activity of voxels in the object-selective lateral occipital area (LO).

Network architecture and feature-based attention task setup.

(A) The model used is a pre-trained deep neural network (VGG-16) that contains 13 convolutional layers (labelled in gray, number of feature maps given in parenthesis) and is trained on the ImageNet dataset to do 1000-way object classification. All convolutional filters are 3 × 3. (B) Modified architecture for feature-based attention tasks. To perform our feature-based attention tasks, the final layer that was implementing 1000-way softmax classification is replaced by binary classifiers (logistic regression), one for each category tested (two shown here, 20 total). These binary classifiers are trained on standard ImageNet images. (C) Test images for feature-based attention tasks. Merged images (left) contain two transparently overlaid ImageNet images of different categories. Array images (right) contain four ImageNet images on a 2 × 2 grid. Both are 224 × 224 pixels. These images are fed into the network and the binary classifiers are used to label the presence or absence of the given category. (D) Performance of binary classifiers. Box plots describe values over 20 different object categories (median marked in red, box indicates lower to upper quartile values and whiskers extend to full range, with the exception of outliers marked as dots). ‘Standard’ images are regular ImageNet images not used in the binary classifier training set.

The relationship between tuning and classification

The feature similarity gain model of attention posits that neural activity is modulated by attention in proportion to how strongly a neuron prefers the attended features, as assessed by its tuning. However, the relationship between a neuron’s tuning and its ability to influence downstream readouts remains a difficult one to investigate biologically. We use our hierarchical model to explore this question. We do so by using back propagation to calculate 'gradient values', which we compare to tuning curves (see Materials and methods, 'Object category gradient calculations' and 'Tuning values' for details). Gradient values indicate the ways in which feature map activities should change in order to make the network more likely to classify an image as being of a certain object category. Tuning values represent the degree to which the feature map responds preferentially to images of a given category. If there is a correspondence between tuning and classification, a feature map that prefers a given object category (that is, responds strongly to it) should also have a high positive gradient value for that category. In Figure 2A we show gradient values and tuning curves for three example feature maps. In Figure 2C, we show the average correlation coefficients between tuning values and gradient values for all feature maps at each of the 13 convolutional layers. As can be seen, tuning curves in all layers show higher correlation with gradient values than expected by chance (as assayed by shuffled controls), but this correlation is relatively low, increasing across layers from about .2 to .5. Overall tuning quality also increases with layer depth (Figure 2B), but less strongly.

Relationship between feature map tuning and gradient values.

(A) Example tuning values (green, left axis) and gradient values (purple, right axis) of three different feature maps from three different layers (identified in titles, layers as labelled in Figure 1A) over the 20 tested object categories. Tuning values indicate how the response to a category differs from the mean response; gradient values indicate how activity should change in order to classify input as from the category. Correlation coefficients between tuning curves and gradient values are given in titles. All gradient and tuning values are available in Figure 2—source data 1. (B) Tuning quality across layers. Tuning quality is defined per feature map as the maximum absolute tuning value of that feature map. Box plots show the distribution across feature maps for each layer. Average tuning quality for shuffled data: .372 ± .097 (this value does not vary significantly across layers). (C) Correlation coefficients between tuning curves and gradient value curves averaged over feature maps and plotted across layers (errorbars ± S.E.M., data values in blue and shuffled controls in orange). (D) Distributions of gradient values when tuning is strong. In red, histogram of gradient values associated with tuning values larger than one (i.e., for feature maps that strongly prefer the category), across all feature maps in layers 10, 11, 12, and 13. For comparison, histograms of gradient values associated with tuning values less than one are shown in black (counts are separately normalized for visibility, as the population in black is much larger than that in red).

Figure 2—source data 1

Object tuning curves and gradients.

Even at the highest layers, there can be serious discrepancies between tuning and gradient values. In Figure 2D, we show the gradient values of feature maps at the final four convolutional layers, segregated according to tuning value. In red are gradient values that correspond to tuning values greater than one (for example, category 12 for the feature map in the middle pane of Figure 2A). As these distributions show, strong tuning values can be associated with weak or even negative gradient values. Negative gradient values indicate that increasing the activity of that feature map makes the network less likely to categorize the image as the given category. Therefore, even feature maps that strongly prefer a category (and are only a few layers from the classifier) still may not be involved in its classification, or even be inversely related to it. This is aligned with a recent neural network ablation study that shows category selectivity does not predict impact on classification (Morcos et al., 2018).

Feature-based attention improves performance on challenging object classification tasks

To determine if manipulation according to tuning values can enhance performance, we created challenging visual images composed of multiple objects for the network to classify. These test images are of two types: merged (two object images transparently overlaid, such as in Serences et al., 2004) or array (four object images arranged on a grid) (see Figure 1C for examples). The task for the network is to detect the presence of a given object category in these images. It does so using a series of binary classifiers trained on standard images of these objects, which replace the last layer of the network (Figure 1B). The performance of these classifiers on the test images indicates that this is a challenging task for the network (64.4% on merged images and 55.6% on array images, Figure 1D; chance is 50%), and thus a good opportunity to see the effects of attention.
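For concreteness, a minimal sketch of such a binary readout, assuming scikit-learn and a hypothetical feature-extraction step that returns the activity vector normally fed to the network's final layer:

```python
# Illustrative sketch only; `extract_features` is a hypothetical helper that
# returns the penultimate-layer activity vector for an image.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_binary_detector(positive_images, negative_images, extract_features):
    """Train one detector for one category: positives are standard ImageNet
    images of that category, negatives are images of other categories."""
    X = np.vstack([extract_features(im)
                   for im in positive_images + negative_images])
    y = np.concatenate([np.ones(len(positive_images)),
                        np.zeros(len(negative_images))])
    return LogisticRegression(max_iter=1000).fit(X, y)

# At test time, the detector reports presence/absence of its category in a
# merged or array image:
# present = detector.predict(extract_features(test_image).reshape(1, -1))[0] == 1
```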

We implement feature-based attention in this network by modulating the activity of units in each feature map according to how strongly the feature map prefers the attended object category (see Materials and methods, 'Tuning values' and 'How attention is applied'). A schematic of this is shown in Figure 3A. The slope of the activation function of units in a given feature map is scaled according to the tuning value of that feature map for the attended category (positive tuning values increase the slope while negative tuning values decrease it). Thus the impact of attention on activity is multiplicative and bi-directional.
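A minimal sketch of this modulation, assuming a ReLU nonlinearity and NumPy; the floor at zero, which prevents attention from inverting a unit's response, is an assumption of this sketch rather than a statement of the exact implementation:

```python
import numpy as np

def attended_relu(layer_input, f_attended, beta):
    """
    layer_input : (n_feature_maps, H, W) pre-activation input to a conv layer
    f_attended  : per-feature-map tuning (or gradient) value for the attended
                  category at this layer, shape (n_feature_maps,)
    beta        : overall attention strength
    Each feature map's activation-function slope is scaled by 1 + beta * f,
    so positively tuned maps are amplified and negatively tuned maps are
    suppressed (multiplicative and bidirectional).
    """
    slope = np.maximum(0.0, 1.0 + beta * f_attended)   # assumed floor at zero
    return slope[:, None, None] * np.maximum(0.0, layer_input)
```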

Effects of applying feature-based attention on object category tasks.

(A) Schematic of how attention modulates the activity function. All units in a feature map are modulated the same way. The slope of the activation function is altered based on the tuning (or gradient) value, f^l_{kc}, of a given feature map (here, the k-th feature map in the l-th layer) for the attended category, c, along with an overall strength parameter β. I^l_{kij} is the input to this unit from the previous layer. For more information, see Materials and methods, 'How attention is applied'. (B) Average increase in binary classification performance as a function of the layer at which attention is applied (solid line represents using tuning values, dashed line using gradient values, errorbars ± S.E.M.). In all cases, the best performing strength from the range tested is used for each instance. Performance is shown separately for merged (left) and array (right) images. Gradients perform significantly (p < .05, N = 20) better than tuning at layers 5–8 (p = 4.6e-3, 2.6e-5, 6.5e-3, 4.4e-3) for merged images and 5–9 (p = 3.1e-2, 2.3e-4, 4.2e-2, 6.1e-3, 3.1e-2) for array images. Raw performance values in Figure 3—source data 1.

Figure 3—source data 1

Performance changes with attention.

The effects of attention are measured when attention is applied in this way at each layer individually (Figure 3B solid lines) or at all layers simultaneously (Figure 3—figure supplement 1A, red). For both image types (merged and array), attention enhances performance, and there is a clear increase in performance enhancement as attention is applied at later layers in the network (numbering is as in Figure 1A). In particular, attention applied at the final convolutional layer performs best, leading to an 18.8 percentage point increase in binary classification on the merged images task and a 22.8 percentage point increase on the array images task. Thus, FSGM-like effects can have large beneficial impacts on performance.

Attention applied at all layers simultaneously does not lead to better performance than attention applied at any individual layer (Figure 3—figure supplement 1A). We also performed a control experiment to ensure that nonspecific scaling of activity does not alone enhance performance (Figure 3—figure supplement 1C).

Some components of the FSGM are debated, for example whether attention impacts responses multiplicatively or additively (Boynton, 2009; Baruni et al., 2015; Luck et al., 1997; McAdams and Maunsell, 1999), and whether the activity of cells that do not prefer the attended stimulus is actually suppressed (Bridwell and Srinivasan, 2012; Navalpakkam and Itti, 2007). Comparisons of different variants of the FSGM can be seen in Figure 3—figure supplement 2. In general, multiplicative and bidirectional effects work best.

We also measure performance when attention is applied using gradient values rather than tuning values (these gradient values are calculated to maximize performance on the binary classification task, rather than to classify the image as a given category; therefore they technically differ from those shown in Figure 2, but in practice the two are strongly correlated. See Materials and methods, 'Object category gradient calculations' and 'Gradient values' for details). Attention applied using gradient values shows the same layer-wise trend as when using tuning values. It also reaches the same performance enhancement peak when attention is applied at the final layers. The major difference comes when attention is applied at middle layers of the network: here, attention applied according to gradient values outperforms attention applied according to tuning values.

Attention strength and the trade-off between increasing true and false positives

In the previous section, we examined the best possible effects of attention by choosing the strength for each layer and category that optimized performance. Here, we look at how performance changes as we vary the overall strength (β) of attention.

In Figure 4A we break the binary classification performance into true and false positive rates. Here, each colored line indicates a different category and increasing dot size represents increasing strength of attention. Ideally, true positives would increase without an equivalent increase (and possibly with a decrease) in false positive rates. If they increase in tandem, attention does not have a net beneficial effect. Looking at the effects of applying attention at different layers, we can see that attention at lower layers is less effective at moving performance in this space and that the movement is in somewhat random directions, although there is an average increase in performance with moderate attentional strength. With attention applied at later layers, true positive rates are more likely to increase for moderate attentional strengths, while substantial false positive rate increases occur only at higher strengths. Thus, when attention is applied with modest strength at layer 13, most categories see a substantial increase in true positives with only modest increases in false positives. As strength continues to increase, however, false positives increase substantially and eventually lead to a net decrease in overall classifier performance (represented as crossing the dotted line in Figure 4A).
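The sweep behind Figure 4A can be sketched as follows; the `predict_fn` wrapper, which runs the network with attention of a given strength and returns a binary detector output, is hypothetical:

```python
import numpy as np

def tp_fp_curve(images, has_category, predict_fn, betas):
    """
    images       : list of test images (merged or array)
    has_category : boolean array, True where the attended category is present
    predict_fn   : callable (image, beta) -> 0/1 detector output with attention
                   of strength beta applied (hypothetical wrapper around the
                   attended network plus binary classifier)
    betas        : attention strengths to sweep
    """
    curve = []
    for beta in betas:
        preds = np.array([predict_fn(im, beta) for im in images], dtype=bool)
        tp = preds[has_category].mean()     # true positive rate
        fp = preds[~has_category].mean()    # false positive rate
        curve.append((beta, tp, fp))
    return curve
```

Subtracting the beta = 0 point from each category's curve gives the baseline-corrected trajectories plotted in Figure 4A.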

Effects of varying attention strength

(A) Effect of increasing attention strength (β) in true and false positive rate space for attention applied at each of four layers (layer indicated in bottom right of each panel, attention applied using tuning values). Each line represents performance for an individual category (only 10 categories shown for visibility), with each increase in dot size representing a .15 increase in β. Baseline (no attention) values are subtracted for each category such that all start at (0,0). The black dotted line represents equal changes in true and false positive rates. (B) Comparisons from experimental data. The true and false positive rates from six experiments in four previously published studies are shown for conditions of increasing attentional strength (solid lines). Cat-Drawings = Lupyan and Ward (2013), Exp. 1; Cat-Images = Lupyan and Ward (2013), Exp. 2; Objects = Koivisto and Kahila (2017); Letter-Aud. = Lupyan and Spivey (2010), Exp. 1; Letter-Vis. = Lupyan and Spivey (2010), Exp. 2; Ori-Change = Mayo and Maunsell (2016). See Materials and methods, 'Experimental data' for details of experiments. Dotted lines show model results for merged images, averaged over all 20 categories, when attention is applied using either tuning (TC) or gradient (Grad) values at layer 13. Model results are shown for attention applied with increasing strengths (starting at 0, with each increasing dot size representing a .15 increase in β). The receiver operating curve (ROC) for the model using merged images, which corresponds to the effect of changing the threshold in the final, readout layer, is shown in gray. Raw performance values in Figure 3—source data 1.

Applying attention according to negated tuning values leads to a decrease in true and false positive values with increasing attention strength, which decreases overall performance (Figure 4—figure supplement 1A). This verifies that the effects of attention are not from non-specific changes in activity.

Experimentally, when switching from no or neutral attention, neurons in MT showed an average increase in activity of 7% when attending their preferred motion direction (and a similar decrease when attending the non-preferred direction) (Martinez-Trujillo and Treue, 2004). In our model, when β = .75 (roughly the value at which performance peaks when attention is applied at later layers; Figure 4—figure supplement 1B), given the magnitude of the tuning values (average magnitude: .38), attention scales activity by an average of 28.5%. This value refers to how much activity is modulated in comparison to the β = 0 condition, which is probably more comparable to passive or anesthetized viewing, as task engagement has been shown to scale neural responses generally (Page and Duffy, 2008). This complicates the relationship between modulation strength in our model and the values reported in the data.

To allow for a more direct comparison, in Figure 4B, we collected the true and false positive rates obtained experimentally during different object detection tasks (explained in Materials and methods, 'Experimental data'), and plotted them in comparison to the model results when attention is applied at layer 13 using tuning values (pink line) or gradient values (brown line). Five experiments (the second through sixth studies) are human studies. In all of these, uncued trials are those in which no information about the upcoming visual stimulus is given, and therefore attention strength is assumed to be low. In cued trials, the to-be-detected category is cued before the presentation of a challenging visual stimulus, allowing attention to be applied to that object or category.

The majority of these experiments show a concurrent increase in both true and false positive rates as attention strength is increased. The rates in the uncued conditions (smaller dots) are generally higher than the rates produced by the β = 0 condition in our model, consistent with neutrally cued conditions corresponding to β > 0. We find (see Materials and methods, 'Experimental data') that the average corresponding β value is .37 for the neutral conditions and .51 for the attended conditions. Because attention scales activity by 1 + βf^l_{kc} (where f^l_{kc} is the tuning value), these changes correspond to a ≈5% change in activity (a change in β of .14, multiplied by the average tuning value magnitude of .38, gives ≈.05).

The first dataset included in the plot (Ori-Change, yellow line in Figure 4B) comes from a macaque change detection study (see Materials and methods, 'Experimental data' for details). Because the attention cue was only 80% valid, attention strength could take one of three levels: low (for the uncued stimuli on cued trials), medium (for both stimuli on neutrally-cued trials), or high (for the cued stimuli on cued trials). Like the other studies, this study shows a concurrent increase in both true positive (correct change detection) and false positive (premature response) rates with increasing attention strength. For the model to achieve the performance changes observed between low and medium attention, a roughly 12% activity change is needed, but average V4 firing rates recorded during this task show an increase of only 3.6%. This discrepancy may suggest that changes in correlations (Cohen and Maunsell, 2009) or firing rate changes in areas aside from V4 also make important contributions to observed performance changes.

Thus, according to our model, the size of experimentally observed performance changes is broadly consistent with the size of experimentally observed neural changes. While other factors are likely also relevant for performance changes, this rough alignment between the magnitude of firing rate changes and magnitude of performance changes supports the idea that the former could be a major causal factor for the latter. In addition, the fact that the model can capture this relationship provides further support for its usefulness as a model of the biology.

Finally, we show the change in true and false positive rates when the threshold of the final layer binary classifier is varied (a ‘receiver operating characteristic’ analysis, Figure 4B, gray line; no attention was applied during this analysis). Comparing this to the pink line, it is clear that varying the strength of attention applied at the final convolutional layer has more favorable performance effects than altering the classifier threshold (which corresponds to an additive effect of attention at the classifier layer). This points to the limitations that could come from attention targeting only downstream readout areas.

Overall, the model roughly matches experiments in the amount of neural modulation needed to create the observed changes in true and false positive rates. However, it is clear that the details of the experimental setup are relevant, and changes aside from firing rate and/or outside the ventral stream also likely play a role (Navalpakkam and Itti, 2007).

Feature-based attention enhances performance on orientation detection task

Some of the results presented above, particularly those related to the layer at which attention is applied, may be influenced by the fact that we are using an object categorization task. To see if results are comparable using the simpler stimuli frequently used in macaque studies, we created an orientation detection task (Figure 5A). Here, binary classifiers trained on full-field oriented gratings are tested using images that contain two gratings of different orientation and color. The performance of these binary classifiers without attention is above chance (distribution across orientations shown in the inset of Figure 5A). The performance of the binary classifier associated with vertical orientation (0 degrees) was abnormally high (92% correct without attention, compared with an average of 60.25% for the other orientations), which likely reflects the over-representation of vertical lines in the training images; this orientation was therefore excluded from further performance analysis.

Attention task and results using oriented gratings.

(A) Orientation detection task. As with the object category detection tasks, separate binary classifiers trained to detect each of 9 different orientations replaced the final layer of the network. Test images included two oriented gratings of different color and orientation located at 2 of 4 quadrants. Inset shows performance over nine orientations without attention. (B) Orientation tuning quality as a function of layer. (C) Average correlation coefficient between orientation tuning curves and gradient curves across layers (blue). Shuffled correlation values in orange. Errorbars are ± S.E.M. (D) Comparison of performance on the orientation detection task when attention is determined by tuning values (solid line) or gradient values (dashed line) and applied at different layers. As in Figure 3B, the best performing strength is used in all cases. Errorbars are ± S.E.M. Gradients perform significantly (p = 1.9e-2) better than tuning at layer 7. Raw performance values available in Figure 5—source data 1. (E) Change in signal detection values and performance (percent correct) when attention is applied in different ways—spatial (red), feature according to tuning (solid blue), feature according to gradients (dashed blue), and both spatial and feature (according to tuning, black)—for the task of detecting a given orientation in a given quadrant. Top row is when attention is applied at layer 13 and bottom when applied at layer 4. Raw performance values available in Figure 5—source data 2.

Figure 5—source data 1

Performance on orientation detection task.

Figure 5—source data 2

Performance on spatial and feature-based attention task.

Attention is applied according to orientation tuning values of the feature maps (tuning quality by layer is shown in Figure 5B) and tested across layers. We find (Figure 5D, solid line and Figure 3—figure supplement 1B, red) that the trend in this task is similar to that of the object task: applying attention at later layers leads to larger performance increases (a 14.4 percentage point increase at layer 10). This is despite the fact that orientation tuning quality peaks in the middle layers.

We also calculate the gradient values for this orientation detection task. While overall the correlations between gradient values and tuning values are lower (and even negative for early layers), the average correlation still increases with layer (Figure 5C), as with the category detection task. Importantly, while this trend in correlation exists in both detection tasks tested here, it is not a universal feature of the network or an artifact of how these values are calculated. Indeed, an opposite pattern in the correlation between orientation tuning and gradient values is shown when using attention to orientation to classify the color of a stimulus with the attended orientation (see 'Recordings show how feature similarity gain effects propagate', and Materials and methods, 'Oriented grating attention tasks' and 'Gradient values').

The results of applying attention according to gradient values are shown in Figure 5D (dashed line). Here again, using gradient values creates similar trends as using tuning values, with gradient values performing better in the middle layers.

Feature-based attention primarily influences criteria and spatial attention primarily influences sensitivity

Signal detection theory is frequently used to characterize the effects of attention on performance (Verghese, 2001). Here, we use a joint feature-spatial attention task to explore the effects of attention in the model. The task uses the same two-grating stimuli described above. The same binary orientation classifiers are used, and the task of the model is to determine if a given orientation is present in a given quadrant of the image. Performance is then measured when attention is applied to an orientation, a quadrant, or both an orientation and a quadrant (effects are combined additively; for more, see Materials and methods, 'How attention is applied'). Two key signal detection measurements are computed: criteria and sensitivity. Criteria is a measure of the threshold that is used to mark an input as positive, with a higher criteria leading to fewer positives; sensitivity is a measure of the separation between the two populations (positives and negatives), with higher sensitivity indicating a greater separation.
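For reference, under the standard equal-variance Gaussian signal detection model these two quantities can be computed from hit and false-alarm rates as follows; this is a generic sketch, not necessarily the exact estimator used here:

```python
from scipy.stats import norm

def sdt_measures(hit_rate, fa_rate, eps=1e-4):
    """Sensitivity (d') and criterion from hit and false-alarm rates."""
    h = min(max(hit_rate, eps), 1 - eps)   # clamp to avoid infinite z-scores
    f = min(max(fa_rate, eps), 1 - eps)
    sensitivity = norm.ppf(h) - norm.ppf(f)            # separation of the populations
    criterion = -0.5 * (norm.ppf(h) + norm.ppf(f))     # higher => fewer positives
    return sensitivity, criterion
```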

Figure 5E shows that both spatial and feature-based attention influence sensitivity and criteria. However, feature-based attention decreases criteria more than spatial attention does. Intuitively, feature-based attention shifts the representations of all stimuli in the direction of the attended category, implicitly lowering the detection threshold. Starting from a high threshold, this can lead to the observed behavioural pattern wherein true positives increase before false positives do. Sensitivity increases more for spatial attention alone than for feature-based attention alone, indicating that spatial attention amplifies differences in the representation of whichever features are present. These general trends hold regardless of the layer at which attention is applied and whether feature-based attention is applied using tuning curves or gradients. Changes in true and false positive rates for this task can be seen explicitly in Figure 5—figure supplement 1.

In line with our results, spatial attention was found experimentally to increase sensitivity and (less reliably) decrease criteria (Hawkins et al., 1990; Downing, 1988). Furthermore, feature-based attention is known to decrease criteria, with lesser effects on sensitivity (Rahnev et al., 2011; Bang and Rahnev, 2017; though see Stein and Peelen, 2015). A study that looked explicitly at the different effects of spatial and category-based attention (Stein and Peelen, 2017) found that spatial attention increases sensitivity more than category-based attention (most visible in their Experiment 3c, which uses natural images), and the effects of the two are additive.

Attention and priming are known to impact neural activity beyond pure sensory areas (Krauzlis et al., 2013; Crapse et al., 2018). This idea is borne out by a study that aimed to isolate the neural changes associated with sensitivity and criteria changes (Luo and Maunsell, 2015). In this study, the authors designed behavioural tasks that encouraged changes in behavioural sensitivity or criteria exclusively: high sensitivity was encouraged by associating a given stimulus location with higher overall reward, while high criteria was encouraged by rewarding correct rejects more than hits (and vice versa for low sensitivity/criteria). Differences in V4 neural activity were observed between trials using high versus low sensitivity stimuli. No differences were observed between trials using high versus low criteria stimuli. This indicates that areas outside of the ventral stream (or at least outside V4) are capable of impacting criteria (Sridharan et al., 2017). Importantly, it does not mean that changes in V4 don’t impact criteria, but merely that those changes can be countered by the impact of changes in other areas. Indeed, to create sessions wherein sensitivity was varied without any change in criteria, the authors had to increase the relative correct reject reward (i.e., increase the criteria) at locations of high absolute reward, which may have been needed to counter a decrease in criteria induced by attention-related changes in V4 (similarly, they had to decrease the correct reject reward at low reward locations). Our model demonstrates clearly how such effects from sensory areas alone can impact detection performance, which, in turn, highlights the role downstream areas may play in determining the final behavioural outcome.

Recordings show how feature similarity gain effects propagate

To explore how attention applied at one location in the network impacts activity later on, we apply attention at various layers and 'record' activity at others (Figure 6A, in response to full field oriented gratings). In particular, we record activity of feature maps at all layers while applying attention at layers 2, 6, 8, 10, or 12 individually.

How attention-induced activity changes propagate through the network.

(A) Recording setup. The spatially averaged activity of feature maps at each layer was recorded (left) while attention was applied at layers 2, 6, 8, 10, or 12 individually. Activity was in response to a full field oriented grating. (B) Schematic of metric used to test for the feature similarity gain model. Activity when a given orientation is present and attended is divided by the activity when no attention is applied, giving a set of activity ratios. Ordering these ratios from most to least preferred orientation and fitting a line to them gives the slope and intercept values plotted in (C). Intercept values are plotted in terms of how they differ from 1, so positive values are an intercept greater than 1. (FSGM predicts negative slope and positive intercept). (C) The median slope (solid line) and intercept (dashed line) values as described in (B) plotted for each layer when attention is applied to the layer indicated by the line color as labelled in (A). On the left, attention applied according to tuning values and on the right, attention applied according to gradient values. Raw slope and intercept values when using tuning curves available in Figure 6—source data 1 and for gradients in Figure 6—source data 2. (D) Fraction of feature maps displaying feature matching behaviour at each layer when attention is applied at the layer indicated by line color. Shown for attention applied according to tuning (solid lines) and gradient values (dashed line).

Figure 6—source data 1

Intercepts and slopes from gradient-applied attention.

Figure 6—source data 2

Intercepts and slopes from tuning curve-applied attention.

To understand the activity changes occurring at each layer, we use an analysis from Martinez-Trujillo and Treue (2004) that was designed to test for FSGM-like effects and is explained in Figure 6B. Here, the activity of a feature map in response to a given orientation when attention is applied is divided by the activity in response to the same orientation without attention. These ratios are organized according to the feature map’s orientation preference (most to least preferred) and a line is fit to them. According to the FSGM of attention, this ratio should be greater than one for more preferred orientations and less than one for less preferred ones, creating a line with an intercept greater than one and a negative slope.
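A sketch of this analysis for a single feature map, with illustrative variable names:

```python
import numpy as np

def fsgm_slope_intercept(act_attended, act_baseline, preference_order):
    """
    act_attended, act_baseline : responses of one feature map to each
        orientation with and without attention (the presented orientation is
        also the attended one)
    preference_order : indices sorting orientations from most to least
        preferred for this feature map
    """
    ratios = act_attended[preference_order] / act_baseline[preference_order]
    x = np.arange(len(ratios))
    slope, intercept = np.polyfit(x, ratios, 1)   # line fit to ordered ratios
    # FSGM predicts intercept > 1 (gain for preferred stimuli) and a negative
    # slope (relative suppression for anti-preferred stimuli).
    return slope, intercept
```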

In Figure 6C, we plot the median value of the slopes and intercepts across all feature maps at a layer, when attention is applied at different layers (indicated by color). When attention is applied directly at a layer according to its tuning values (left), FSGM effects are seen by default (intercept values are plotted in terms of how they differ from one; comparable average values from Martinez-Trujillo and Treue (2004) are intercept: .06 and slope: 0.0166, but note that we are using β = 0 for the no-attention condition in the model, which, as mentioned earlier, is not necessarily the best analogue for no-attention conditions experimentally; therefore we use these measures to show qualitative effects). As these activity changes propagate through the network, however, the FSGM effects wear off, suggesting that activating units tuned for a stimulus at one layer does not necessarily activate cells tuned for that stimulus at the next. This misalignment between tuning at one layer and the next explains why attention applied at all layers simultaneously isn’t more effective (Figure 3—figure supplement 1). In fact, applying attention to a category at one layer can actually have effects that counteract attention at a later layer (see Figure 6—figure supplement 1).

In Figure 6C (right), we show the same analysis, but while applying attention according to gradient values. The effects at the layer at which attention is applied do not look strongly like FSGM, however FSGM properties evolve as the activity changes propagate through the network, leading to clear FSGM-like effects at the final layer. Finding FSGM-like behaviour in neural data could thus be a result of FSGM effects at that area or non-FSGM effects at an earlier area (here, attention applied according to gradients which, especially at earlier layers, are not aligned with tuning).

An alternative model of the neural effects of attention—the feature matching (FM) model—suggests that the effect of attention is to amplify the activity of a neuron whenever the stimulus in its receptive field matches the attended stimulus. In Figure 6D, we calculate the fraction of feature maps at a given layer that show feature matching behaviour (defined as having activity ratios greater than one when the stimulus orientation matches the attended orientation, for both preferred and anti-preferred orientations). As early as one layer post-attention, some feature maps start showing feature matching behaviour. The fact that the attention literature contains conflicting findings regarding the feature similarity gain model versus the feature matching model (Motter, 1994; Ruff and Born, 2015) may result from this finding that FSGM effects can turn into FM effects as they propagate through the network. In particular, this mechanism can explain the observations that feature matching behaviour is observed more in FEF than in V4 (Zhou and Desimone, 2011) and that match information is more easily read out from perirhinal cortex than from IT (Pagan et al., 2013).
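The feature-matching criterion used for Figure 6D can be sketched per feature map as follows (names illustrative):

```python
def is_feature_matching(ratio_by_orientation, preferred_idx, antipreferred_idx):
    """ratio_by_orientation[i]: attended/unattended response of one feature
    map when orientation i is both shown and attended.  A map counts as
    'feature matching' if attending the presented orientation boosts its
    response for both its preferred and its anti-preferred orientation."""
    return (ratio_by_orientation[preferred_idx] > 1.0 and
            ratio_by_orientation[antipreferred_idx] > 1.0)

# The fraction plotted in Figure 6D is then the mean of this boolean over all
# feature maps in a layer.
```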

We also investigated the extent to which measures of attention’s neural effects correlate with changes in performance (see Materials and methods, 'Correlating activity changes with performance'). For this we developed a new, experimentally-feasible way of calculating attention’s effects on neural activity that is inspired by the gradient-based approach to attention (that is, it focuses on classification rather than tuning). We show (Figure 6—figure supplement 2) that this new measure better correlates with performance changes than the FSGM measure of activity changes, particularly at earlier layers.

There is a simple experiment that would distinguish whether factors beyond tuning, such as gradients, play a role in guiding attention. It requires using two tasks with very different objectives (which should produce different gradients) but with the same attentional cue. An example is described in Figure 7. Here, the two tasks used would be an orientation-based color classification task (two gratings each with their own color and orientation are simultaneously shown, and the task is to report the color of the grating with the attended orientation) and an orientation detection task (report if the attended orientation is present or absent in the image). In both cases, attention is cued according to orientation. But gradient-based attention will produce different neural modulations for the two tasks, while the FSGM predicts identical modulations (Figure 7C). Thus, an experiment that recorded from the same neurons during both tasks could distinguish between tuning-based and gradient-based attention.

A proposed experiment to distinguish between tuning-based and gradient-based attention

(A) ‘Cross-featural’ attention task. Here, the final layer of the network is replaced with a color classifier and the task is to classify the color of the attended orientation in a two-orientation stimulus. Importantly, in both this and the orientation detection task (Figure 5A), a subject performing the task would be cued to attend to an orientation. (B) The correlation coefficient between the gradient values calculated for this task and orientation tuning values (as in Figure 5C). Correlation peaks at lower layers for this task. (C) Correlation between tuning values for the two tasks (blue) and between gradient values for the two tasks (orange). If attention does target cells based on tuning, the modulation would be the same in both the color classification task and the orientation detection task. If a gradient-based targeting is used, no (or even a slight anti-) correlation is expected. Tuning and gradient values available in Figure 7—source data 1.

Figure 7—source data 1

Orientation tuning curves and gradients.


Acknowledgments

The authors thank Micah Murray for helpful comments on the article. They also thank Nora Turoman, Alex Huth, Diane Quinn (© 2015 Trevor Day School), and Bridgette Archer (in the order of picture appearance, top left to bottom right) for providing images of different brain imaging and mapping methods and testing environments included in Figure 1. P. J. M. received support from Swiss National Science Foundation (grant PZ00P1_174150) as well as from the Pierre Mercier Foundation and the Fondation Asile des Aveugles. S. D.'s research is supported by The Netherlands Organization for Scientific Research Veni program (grant 275-89-018), the National Science Foundation INSPIRE Track 1 (grant 1344285), and NSF ECR-STEM (grant 1661016). C. P. is supported by the Sir Henry Wellcome Postdoctoral Fellowship from the Wellcome Trust (grant 110238/Z/15/Z) and A. G. H., by the Career Award at the Scientific Interface from the Burroughs-Wellcome Foundation.


METHODS

Participants

Thirty young healthy volunteers [mean age ± standard deviation (SD) = 25.6 ± 3.5 years] and 30 older healthy volunteers (mean age ± SD = 61.2 ± 4.6 years) who had undergone extensive clinical evaluations participated in this study. Recruitment evaluation included a complete history and physical examination, a detailed neurological exam, the Structured Clinical Interview for DSM-IV (SCID; First, Spitzer, Gibbon, & Williams, 1994), the WAIS-R, a neuropsychological assessment, and a clinical brain MRI scan. Exclusion criteria included a current or past history of neurological or psychiatric disorders, medical treatment pertaining to cerebral metabolism or blood flow, or a history of drug abuse. The two groups were matched for handedness (25 right-handers in each group, as measured by the Edinburgh Handedness Inventory; Oldfield, 1971), sex (16 men in each group), race (29 Caucasians, 1 Asian in each group), and intelligence quotient (IQ) [obtained using the Wechsler Adult Intelligence Scale; older group, mean ± SD = 116 ± 8.1; young group, mean ± SD = 116.0 ± 7.4; F(1, 59) = 0.08, p = .78]. Older participants also underwent a thorough neuropsychological assessment to evaluate cognitive status and exclude pathologic cognitive decline (see Table 1). A secondary analysis was performed in participants who were also matched for performance in addition to the above demographics across both groups. This analysis consisted of 32 participants (16 young and 16 older) from the original 60 who were matched for sex (8 men in each group), handedness (13 right-handers in each group), race (1 Asian in each group), and IQ [older group, mean ± SD = 117.9 ± 7.2; young group, mean ± SD = 116.0 ± 7.5; F(1, 31) = 0.55, p = .46].

Neuropsychological Status of Older Participants

Neuropsychology/Neurological Test    M (SD)    n
Cognitive Status
Mini-Mental State Examination (MMSE) 30.0 (0.2) 22
Executive Composite
Trail Making Test B (sec) 72.1 (30.6) 30
Word Fluency Test 48.5 (11.9) 29
Category Fluency Test 54.3 (11.1) 29
Letter and Number Sequencing 11.9 (2.4) 30
WAIS-IQ 116.6 (8.1) 30
Memory Composite
WMS-R Logical Memory Immediate Recall 12.4 (2.6) 26
WMS-R Logical Memory Delayed Recall 13.8 (2.5) 26
Processing Speed Composite
Trail Making Test A (sec) 32.2 (13.6) 30

All participants underwent fMRI while performing an incidental encoding and memory retrieval task. All participants gave written informed consent under a protocol approved by the National Institute of Mental Health Institutional Review Board.

Experimental Paradigm

Each participant underwent BOLD fMRI during the encoding and retrieval of aversive and neutral scenes selected from the International Affective Picture System (Lang, Bradley, & Cuthbert, 2005). For both the encoding and retrieval sessions, the scenes were presented in a blocked fashion, with two blocks of aversive/neutral scenes alternating with blocks of resting state. During experimental blocks, six scenes of similar valence (neutral or aversive) were presented serially to participants for 3 sec each. A Student's t test revealed that the selected aversive scenes were rated significantly less pleasant and more arousing than the selected neutral scenes, as determined by the standardized ratings described in Lang et al. (2005) [pleasure: mean ± SD aversive = 3.1 ± 0.9, neutral = 5.8 ± 1.1; arousal: mean ± SD aversive = 5.9 ± 0.7, neutral = 3.03 ± 0.8; p < .0001 for each measure]. In a recent study, Backs, da Silva, and Han (2005) reported no significant difference between the ratings of older participants (mean age ± SD: 66.3 ± 5.6 years) and younger participants (mean age ± SD: 20.0 ± 2.3 years) for negatively valenced stimuli from the International Affective Picture System picture set obtained by Lang et al. (2005). During resting blocks, participants were asked to attend to a fixation cross presented in the center of the screen for 18 sec. These fixation blocks were treated as a baseline in the fMRI analyses. During the encoding session, participants were instructed to determine whether each picture depicted an “indoor” or “outdoor” scene. During the retrieval session, participants were instructed to determine whether the scene presented had been seen during the encoding session; participants pressed the right button for scenes seen before during the encoding session (i.e., “old”) or the left button for scenes not seen during the encoding session (i.e., “new”). In each retrieval session, half the scenes were old (i.e., presented during the encoding session), whereas the other half were new (i.e., not presented during the encoding session). Each session (encoding or retrieval) consisted of 17 blocks (four aversive, four neutral, and nine rest conditions). Participants completed the entire encoding session before beginning the retrieval session after a brief delay (about 2 min). Before each session, participants were given verbal instructions, and each run was preceded by a brief 2-sec instruction screen, with a total scan time of 5 min 40 sec. For the encoding session, the presentation of “indoor” and “outdoor” scenes, and for the retrieval session, the presentation of “old” and “new” scenes, was counterbalanced within each block. In addition, the presentation order of aversive and neutral blocks was counterbalanced across participants. All participants responded with button presses using their dominant hand. Behavioral accuracy and RTs were recorded. This task has been shown to reliably engage the hippocampus as well as inferotemporal, parietal, and frontal cortices in healthy volunteers (Bertolino et al., 2006; Meyer-Lindenberg et al., 2006; Hariri et al., 2003).

FMRI Acquisition

BOLD fMRI was performed on a General Electric 3-Tesla Signa scanner (Milwaukee, WI) using a gradient-echo, echo-planar imaging sequence. Twenty-four axial slices covering the whole cerebrum and the majority of the cerebellum were acquired in an interleaved sequence with 4-mm thickness and a 1-mm gap (TR/TE = 2000/28 msec, FOV = 24 cm, matrix = 64 × 64). Scanning parameters were selected to optimize the quality of the BOLD signal while maintaining a sufficient number of slices to acquire whole-brain data.

Data Analysis

Behavioral Analysis

One-way factorial analyses of variance (ANOVAs) were performed on the behavioral data to explore the effects of age and stimulus valence on accuracy (ACC) and RT for both encoding and retrieval sessions. Two-way ANOVAs were also performed to assess an Age by Valence interaction on these measures. Statistical thresholds for significance were set at p < .05.
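A minimal sketch of such an analysis, assuming the behavioral data are in a pandas data frame with hypothetical column names, using statsmodels:

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.formula.api import ols

# df is assumed to have one row per participant/condition with hypothetical
# columns: 'rt' (or 'acc'), 'age_group', and 'valence'.
def age_by_valence_anova(df: pd.DataFrame, dv: str = "rt"):
    model = ols(f"{dv} ~ C(age_group) * C(valence)", data=df).fit()
    return sm.stats.anova_lm(model, typ=2)   # main effects and interaction
```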

Functional Imaging Analysis

Image analysis was completed using SPM2 (www.fil.ion.ucl.ac.uk/spm). For each session (encoding and retrieval), subsequent images were realigned to the first image in the series to correct for head motion. These images were then spatially normalized to the MNI template using a fourth-degree B-spline interpolation. The images were then smoothed using an isotropic 8-mm³ full-width half-maximum kernel. Each individual dataset was then carefully screened for data quality using a variety of parameters, including visual inspection for image artifacts, estimated indices for ghosting artifacts, signal-to-noise ratio across the time series, signal variance across individual sessions, and head motion (data from participants with head motion greater than 3 mm and/or head rotation greater than 2° were excluded).

For both the encoding and retrieval sessions, fMRI responses were modeled using the General Linear Model (GLM) with a canonical hemodynamic response function convolved with a boxcar function for the length of the block, normalized to the global signal across the whole brain, and temporally filtered to remove low-frequency signals (<1/84 Hz). Regressors were modeled for conditions of interest (for the encoding session: aversive encoding and neutral encoding; for the retrieval session: aversive retrieval and neutral retrieval) as well as six head motion regressors of no interest. Using this GLM, individual t contrast maps were generated for contrasts of interest: aversive encoding > baseline, neutral encoding > baseline, aversive encoding > neutral encoding, aversive retrieval > baseline, neutral retrieval > baseline, and aversive retrieval > neutral retrieval.

Second-level random effects analyses were performed using one-sample t tests to explore the main effect of task for the encoding aversive, encoding neutral, retrieval aversive, and retrieval neutral conditions. For the encoding session, the t contrast option under an ANOVA in SPM2 was used to assess the main effect of stimulus valence [(older aversive + young aversive) > (older neutral + young neutral)], the effect of age [(young aversive + young neutral) > (older aversive + older neutral) and (older aversive + older neutral) > (young aversive + young neutral)], and the effect of Age by Valence [young (aversive > neutral) > older (aversive > neutral) and older (aversive > neutral) > young (aversive > neutral)]. For the retrieval session, to control for a significant difference in performance, the t contrast option under an analysis of covariance (ANCOVA) in SPM2, using ACC and RT as covariates of no interest, was used to assess the effect of stimulus valence, the effect of age, and the effect of Age by Valence. All of the above ANOVAs were inclusively masked with conjunction maps of the effect of interest at p < .05, uncorrected.

Given the strong evidence for an important role of the amygdala during emotional memory processing (Dolcos, LaBar, & Cabeza, 2004b, 2005), a measure of functional connectivity was estimated to assess residual brain connectivity between the amygdala and other brain regions after adjusting for task-related activity (Bertolino et al., 2006; Pezawas et al., 2005; Meyer-Lindenberg et al., 2001). This measure quantifies the covariation between the median activity (after mean signal and drift correction) of a seed in the amygdala and the activity of the rest of the voxels in the brain across the time series. Seed regions in the amygdala were constructed using a two-step process. First, a mask of significantly active voxels (p < .05, FDR-corrected) for the main effect of task was created separately for the encoding and retrieval sessions across all participants. Then, seeds were constructed by determining each individual's functionally active voxels (p < .05) within the above mask. Following this, individual connectivity maps (covariance maps) were created by correlating the time series of the amygdala with the time series of the voxels in the rest of the brain. Error, namely the residual term in the GLM, was used after adjusting for task effects and confounds (e.g., global signal and realignment parameters) to estimate functional coupling across brain regions (see Caclin & Fonlupt, 2006; Pezawas et al., 2005, for more details on this approach). Functional coupling estimated in this manner is thought to reflect the inherent connectivity between brain regions rather than correlations mediated by the task. This analysis was performed separately for the encoding and retrieval sessions.
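Conceptually, this residual-based connectivity measure can be sketched as follows (array shapes and names are illustrative; this is not the SPM2 implementation):

```python
import numpy as np

def seed_connectivity(residuals, seed_voxel_idx):
    """
    residuals      : (n_timepoints, n_voxels) GLM residuals, i.e. the time
                     series after task effects and confounds are removed
    seed_voxel_idx : indices of the individually defined amygdala seed voxels
    Returns one Pearson r per voxel (the covariance map).
    """
    seed_ts = np.median(residuals[:, seed_voxel_idx], axis=1)
    seed_ts = seed_ts - seed_ts.mean()
    vox = residuals - residuals.mean(axis=0)
    r = (vox.T @ seed_ts) / (np.linalg.norm(vox, axis=0)
                             * np.linalg.norm(seed_ts) + 1e-12)
    return r
```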

To assess correlations between functional data and behavior, simple regressions were performed using individual participants' first-level contrast maps from the GLM and accuracy. For behavior–functional connectivity correlations, each individual's connectivity values were normalized to the sample mean using a Fisher r-to-z transform before being entered into the regression. Estimates of the weighted beta parameters and functional connectivity values were extracted from significant voxels (p < .05, uncorrected) within ROIs using the MARSBAR toolbox (http://marsbar.sourceforge.net) and exported into STATISTICA 6 (www.statsoft.com) to calculate Pearson's r for one-tailed analyses.
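The Fisher r-to-z step mentioned above is simply the inverse hyperbolic tangent of the correlation coefficient, for example:

```python
import numpy as np

r = np.array([0.12, 0.35, 0.58])   # example connectivity (correlation) values
z = np.arctanh(r)                  # Fisher r-to-z transform
```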

Statistical thresholds for all imaging analyses were set at p < .005 (uncorrected) within anatomical ROIs (see below) and p < .001 for all other regions. Results that survived p < .05, corrected for multiple comparisons (FDR-corrected, as described by Genovese, Lazar, & Nichols, 2002) are indicated within tables. All reported data were held to a cluster extent threshold of k > 5.

Given prior evidence of age-related changes in the circuits underlying episodic memory, ROIs of the hippocampal formation (hippocampus/parahippocampus) and the amygdala were created using the Wake Forest University PICKATLAS.