Wednesday, September 28, 2022

Neural responses in human superior temporal cortex support coding of voice representations


The ability to recognize abstract features of voice during auditory perception is an intricate feat of human audition. For the listener, this occurs in near-automatic fashion to seamlessly extract complex cues from a highly variable auditory signal. Voice perception depends on specialized regions of auditory cortex, including superior temporal gyrus (STG) and superior temporal sulcus (STS). However, the nature of voice encoding at the cortical level remains poorly understood. We leverage intracerebral recordings across human auditory cortex during presentation of voice and nonvoice acoustic stimuli to examine voice encoding at the cortical level in 8 patient-participants undergoing epilepsy surgery evaluation. We show that voice selectivity increases along the auditory hierarchy from the supratemporal plane (STP) to the STG and STS. Results show accurate decoding of vocalizations from human auditory cortical activity even in the complete absence of linguistic content. These findings show an early, less-selective temporal window of neural activity in the STG and STS followed by a sustained, strongly voice-selective window. Encoding models reveal divergence in the encoding of acoustic features along the auditory hierarchy, wherein STG/STS responses are best explained by voice category together with acoustics, as opposed to acoustic features of voice stimuli alone. This is in contrast to neural activity recorded from STP, in which responses were accounted for by acoustic features. These findings support a model of voice perception that engages categorical encoding mechanisms within STG and STS to facilitate feature extraction.


Vocalizations are an essential social signal and a fundamental driver of human and animal behavior. Humans and other animals can easily distinguish conspecific vocalizations from other complex sounds in their acoustic environment [1,2] and can deduce demographic, emotional, and behavioral intentions from voice [3]. These voice recognition abilities begin to develop prenatally [4], precede the development of linguistic abilities [5], and are shaped by the processing of acoustic and paralinguistic aspects of voice [6]. However, the neural organization of auditory cortex underlying human voice perception remains a central unanswered question.

Neuroimaging studies have identified regions of auditory cortex theorized to mediate voice processing. These regions include superior temporal sulcus (STS) and superior temporal gyrus (STG), collectively referred to as “temporal voice areas” (TVAs) [7–12], and demonstrate a robust BOLD response when listening to voice stimuli compared to nonvoice stimuli. Recent neuroimaging work suggests that TVAs exist in multiple primate species [1,2,12,13]. Furthermore, this region appears to categorize conspecific vocalizations apart from other primate vocalizations, a pattern that is preserved across species [2].

These regions exhibit robust connectivity with auditory areas in the supratemporal plane (STP), including Heschl’s gyrus (HG), and with higher-order association cortices implicated in voice perception and voice identity recognition [14–18]. Bilateral STS shows voice-selective responses, although some studies suggest hemispheric asymmetry in STS anatomical structure and function [10,19]. Current understanding of regional voice selectivity and of the temporal dynamics of these responses in auditory cortex is driven primarily by neuroimaging studies. Therefore, studies employing methods on physiologic timescales are needed to further examine the contribution of STS and STG to voice perception.

Whether activity in putative voice-selective regions of STG and STS actually demonstrates selectivity for voice, and not for the acoustic or linguistic properties of speech, remains under debate [12,20,21]. A speech-driven model of voice coding is supported by some neuroimaging studies [8,17,20,22,23]. Existing behavioral work also suggests that voice perception may depend heavily on linguistic content [24]. However, other studies have shown that voice selectivity persists when controlling for the distinctive acoustic properties of voice stimuli [12]. At a broader level, it remains unknown to what extent processing of complex auditory stimuli, such as speech, music, or naturally occurring environmental sounds, relies on shared or distinct neural mechanisms [18,20,25–27] and how these mechanisms are organized across the auditory cortical hierarchy [28–32]. Together, support for TVAs suggests that populations of neurons in STG/STS exhibit specialization for the rich information carried by vocalizations and substantiates the hypothesis that cortical representations of vocalizations reflect categorical encoding (voice versus nonvoice) beyond the contribution of encoding of vocal acoustics. Identification of the specific features driving neural encoding of voice, and of the timing and organization of this coding, will also be advanced by approaches with greater temporal resolution.

To understand the cortical representation of voice at physiologic timescales, we measured local neural activity directly from the STS, STG, and surrounding auditory cortex in patient-participants undergoing clinical intracerebral recordings as part of epilepsy surgery [33]. To date, evidence for voice coding in human auditory cortex has largely come from functional magnetic resonance imaging (fMRI) studies. The low temporal resolution of fMRI limits interpretation of the temporal dynamics of these responses, given the physiologic delay and low-pass filtered nature of BOLD responses relative to underlying neuronal spiking [22,34]. Here, we leverage intracerebral recordings that uniquely allow direct electrophysiological measurements across the auditory hierarchy, including from sulcal banks such as the STS and HG. We combined this recording technique with decoding models to measure how distinguishable, across channels, the neural responses are between voice and nonvoice sounds, as well as encoding models to estimate the stimulus features driving neural responses. Additionally, we performed single-channel analyses to examine voice separability, or the degree to which a channel responds more strongly to voice than nonvoice, and voice category preference strength, or the extent to which a channel responds exclusively to voice stimuli. We test the hypothesis that vocalizations are represented categorically in STG and STS. Here, we provide data in support of voice category-level encoding in neural recordings within STG/STS and describe the temporal dynamics of these responses across temporal voice-sensitive regions.


We recorded neural data from 8 patient-participants (ages 9 to 18 years) undergoing intracerebral recordings as part of routine epilepsy surgery evaluation. Recording sites in each participant included STP, STG, and STS, including HG (Fig 1A). Participants performed an auditory 1-back task of natural sounds stimuli adapted from Norman-Haignere and colleagues [20] (n = 8; Natural Sounds, NatS). A subset of 3 participants additionally performed a 1-back task using stimuli adapted from Belin and colleagues [7] (n = 3; Voice Localizer, VL). Each of these stimulus sets includes vocal and nonvocal sounds that can be used to assess vocal selectivity similar to previous studies [7,34]. To measure local neuronal responses to auditory stimuli, broadband high-gamma activity (HGA; 70 to 150 Hz) was extracted [35–37] (Fig 1B and 1C). We focused on HGA because it is an effective index of local neuronal processing, as it is highly correlated with local neuronal spiking activity [38]. While other frequency bands or aspects of the field potential (e.g., phase) have sometimes been implicated in auditory cortical processing, they are outside the scope of this work. Each channel’s auditory responsiveness was assessed using a 2-sample t test that compared mean HGA between a poststimulus onset window (0 to 500 ms) and a silent baseline (−600 to −100 ms relative to onset). Channels that exhibited a significant auditory response (p < 0.05, false discovery rate [FDR] corrected) were included in subsequent analyses. This resulted in between 28 and 72 auditory-responsive channels per patient, for a total of 399 channels with NatS recordings and 174 channels with VL recordings.
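The channel-selection step above can be sketched in code. The following is a minimal illustration, not the authors' analysis pipeline: it assumes trial-aligned HGA arrays sampled at a hypothetical 1 kHz with 600 ms of pre-stimulus baseline, and applies a 2-sample t test per channel followed by Benjamini–Hochberg FDR correction.

```python
import numpy as np
from scipy.stats import ttest_ind

def auditory_responsive_channels(hga, fs=1000.0, alpha=0.05):
    """Flag auditory-responsive channels with a 2-sample t test.

    hga: array of shape (n_channels, n_trials, n_samples), time-locked so
    that stimulus onset falls at sample int(0.6 * fs), i.e., 600 ms of
    silent baseline precedes onset. Mean HGA in a 0-500 ms post-onset
    window is compared against the -600 to -100 ms baseline, and p-values
    are corrected across channels with the Benjamini-Hochberg procedure.
    """
    onset = int(0.6 * fs)
    base = hga[:, :, :int(0.5 * fs)].mean(axis=2)              # -600 to -100 ms
    post = hga[:, :, onset:onset + int(0.5 * fs)].mean(axis=2)  # 0 to 500 ms
    tvals, pvals = ttest_ind(post, base, axis=1)

    # Benjamini-Hochberg FDR correction across channels
    n = len(pvals)
    order = np.argsort(pvals)
    passed = pvals[order] <= alpha * np.arange(1, n + 1) / n
    keep = np.zeros(n, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])
        keep[order[:k + 1]] = True
    return tvals, keep
```

Only channels with `keep == True` would enter subsequent analyses, mirroring the p < 0.05 FDR-corrected criterion described above.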


Fig 1. Auditory-evoked HGA.

(A) Example channel in left HG, patient P7, shown in coronal (upper panel) and axial (lower panel) slices. (B) Auditory-evoked spectral response averaged across all NatS stimuli in the channel from (A). Vertical lines represent stimulus onset and offset, with horizontal lines demarcating frequency boundaries for broadband HGA at 70 and 150 Hz. (C) Mean HGA in the same channel. (D) Auditory responsiveness, quantified as the 2-sample t-value between mean HGA in 500 ms pre- and poststimulus onset windows. Small black dots represent channels with no auditory response, i.e., t-values that failed to reach significance (p < 0.05, FDR corrected). Related data are located on Zenodo in the Fig 1B and 1C folder (doi: 10.5281/zenodo.6544488). FDR, false discovery rate; HG, Heschl’s gyrus; HGA, high-gamma activity; NatS, Natural Sounds; STG, superior temporal gyrus; STP, supratemporal plane; STS, superior temporal sulcus.

Decoding voice from nonvoice acoustic stimuli

We sought to establish the magnitude and temporal dynamics of ensemble neural response differences between vocal and nonvocal sounds, which we refer to as voice decoding. For each patient, HGA from all auditory-responsive channels was used to decode between vocal and nonvocal sounds using 2 types of models. Windowed models were built using mean HGA within a sliding window (width of 100 ms, overlapping and sliding by 50 ms), with individual models built at each time window. Full models refer to those that used all windows simultaneously.
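The windowed decoding scheme can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' code: it assumes a trial-by-channel-by-time HGA array, uses a hypothetical logistic regression classifier with 5-fold cross-validation, and averages HGA within each 100-ms window (50-ms step) as the input features.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def windowed_decoding(hga, labels, fs=1000.0, width_ms=100, step_ms=50):
    """Fit one vocal-vs-nonvocal classifier per sliding time window.

    hga: array (n_trials, n_channels, n_samples); labels: 1 = vocal,
    0 = nonvocal. Returns cross-validated accuracy for each window.
    """
    width = int(width_ms / 1000 * fs)
    step = int(step_ms / 1000 * fs)
    accs = []
    for start in range(0, hga.shape[2] - width + 1, step):
        # mean HGA per channel within this window -> one feature per channel
        X = hga[:, :, start:start + width].mean(axis=2)
        clf = LogisticRegression(max_iter=1000)
        accs.append(cross_val_score(clf, X, labels, cv=5).mean())
    return np.array(accs)
```

A full model, by contrast, would concatenate the per-window features of all windows into a single feature vector per trial before fitting one classifier.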

For full models (Fig 2A), classification accuracy reached significance in each participant at p < 0.01 (permutation tests, Bonferroni corrected), with accuracy ranging from 82% to 93% for the VL stimulus set and 82% to 92% for NatS stimuli (see “Acoustic stimuli” section). To address whether these results were driven by encoding of linguistic information (i.e., speech) rather than voice, the full-model decoding analysis was performed again with NatS data, excluding all speech stimuli (i.e., native and foreign speech, lyrical music), so that the vocal category contained only nonspeech vocal sounds such as laughter, crying, and coughing. Accuracy ranged from 65% to 80% and remained statistically significant for all patients (p < 0.01, Bonferroni corrected) (light blue bars, Fig 2A). It is worth noting that NatS full models may have benefited from the longer stimuli relative to VL, which resulted in more temporal windows and therefore more input features to the decoding model.


Fig 2. Decoding accuracy results.

(A) Full-model (i.e., all channels and time windows) decoding accuracy of vocal versus nonvocal for each patient. Dark and light blue bars correspond to NatS results with speech stimuli included or excluded, respectively (e.g., light blue is nonspeech human vocalizations versus nonvocal auditory stimuli). White dots represent statistical significance (p < 0.01, Bonferroni corrected, permutation tests). (B) Sliding window results. Vertical lines represent stimulus offset for the 2 tasks, with horizontal lines showing the fraction of patients with statistically significant decoding in that window (p < 0.001, FDR corrected, cluster-based permutation tests). (C) Cross-task decoding accuracy, with color indicating the training set (white: p < 0.01, purple: p < 0.05, Bonferroni corrected, permutation tests). Related data are located on Zenodo in the Fig 2 folder (doi: 10.5281/zenodo.6544488). FDR, false discovery rate; NatS, Natural Sounds; VL, Voice Localizer.

Time courses of decoding accuracy are shown in Fig 2B, averaged across all patients for both NatS and VL. Across patients and stimulus sets, significant decoding emerged as early as 50 ms (range of 50 to 150 ms), with decoding accuracy falling below chance between 25 ms before to 500 ms after stimulus offset. Decoding accuracy trajectories were significant throughout the stimulus duration for both NatS and VL, demonstrating that once voice decoding emerges, it persists throughout the sound.

Next, we examined the generalizability of our findings across different stimulus sets for the 3 participants who performed both VL and NatS tasks. Most notably, cross-task decoding shows that models trained on data from one stimulus set were able to decode vocal category membership (vocal–nonvocal, V–NV) on data from the other stimulus set (Fig 2C). Similar HGA response properties to voice were observed between tasks, despite entirely distinct stimulus sets. Furthermore, the sliding window accuracy profiles between VL and NatS tasks were highly correlated within patient during the VL stimulus window (550 ms; R = 0.80, 0.81, and 0.95, respectively; Fig 2B). While VL stimuli were considerably shorter than NatS, the rapid onset of significant decoding for both stimulus sets and the close correspondence of their temporal profiles (Fig 2B) suggest that this difference does not lead to meaningful differences in neural responses during the first 550 ms.

Distribution of voice-sensitive channels

Next, to test the hypothesis that vocal separability, or the extent to which a channel shows greater HGA responses to voice than nonvoice sounds, is driven by activity in STG and STS, we compared HGA between individual cortical recording sites in STP (comprised of HG and planum temporale, PT) and STG/STS. Across all channels, STG/STS had smaller overall auditory HGA responses (p < 10−5, rank-sum test) relative to STP. Although we had hypothesized that V–NV separability would be greater in STG/STS than STP, we found no statistically significant difference between these 2 regions of interest (ROIs; both auditory responsiveness and V–NV separability shown in Fig 3A). We suspected that the latter finding may have been driven by a large difference in the proportion of channels showing significant V–NV separability (87% STP versus 66% STG/STS; p = 0.004, Fisher exact test). However, even when excluding nonsignificant channels, V–NV separability was not significantly different between STG/STS and STP.


Fig 3. Single-channel results.

(A) HGA separability between vocal and nonvocal NatS stimuli, across all patients. Channel sizes are proportional to t-statistics comparing auditory response magnitude between 500 ms pre- and poststimulus onset windows, same as Fig 1D. (B) HGA for 2 example channels located in PT (upper panel) and uSTS (lower panel). Black bars show clusters of significantly different timepoints; V–NV separability (panels A, E) is the sum of all clusters for a given channel. Note that while both channels achieve V–NV separability throughout the duration of the stimulus, the magnitude of the nonvocal response differs between the 2 channels, with the NV response of the uSTS channel returning near baseline after the initial onset window. In contrast, the V response remains elevated in both onset and sustained windows, for both the PT and uSTS channels. (C) Mean HGA averaged across 2 different windows: onset (0 to 500 ms) and sustained (500 to 2,000 ms). (D) The HGA ratio is calculated as the difference between vocal and nonvocal responses, relative to their sum. This metric, spanning from −1 to 1, describes a channel’s vocal category preference strength: a value near 1 (or −1) represents a channel that responds only to vocal (or nonvocal) stimuli, while a value of 0 represents equal HGA responses to both stimulus categories. (E) All channels with V–NV separability exhibit onset responses to both stimulus categories: in this early window, HGA ratios show that STG and STS (compared to STP) exhibit a slightly reduced response to nonvocal relative to vocal stimuli. During the sustained window, a strong preference for vocal stimuli emerges in STG and STS, while nonvocal responses return near the silent baseline. Related data are located on Zenodo in the Fig 3B–3D folder (doi: 10.5281/zenodo.6544488).
HGA, high-gamma activity; NatS, Natural Sounds; PT, planum temporale; STG, superior temporal gyrus; STP, supratemporal plane; STS, superior temporal sulcus; uSTS, upper STS; V–NV, vocal–nonvocal.

V–NV separability describes how distinguishable vocal responses are from nonvocal responses but does not characterize the extent to which a channel responds exclusively to vocal (and not nonvocal) natural sounds; we refer to the latter attribute as the voice category preference strength. To illustrate the difference, consider the 2 channels in Fig 3B. Both channels exhibit V–NV separability because there is a relative difference between voice and nonvoice responses. However, a meaningful qualitative difference exists between the 2 channels: In the PT channel (upper plot), responses to both stimulus categories are elevated above baseline throughout the stimulus duration; in contrast, the upper STS (uSTS) channel (lower plot) shows elevated responses to only the vocal category, particularly after an initial onset window in which all natural sounds elicit some response. To characterize the category preference strength (i.e., the extent to which a channel responds exclusively to vocal sounds), we quantified the HGA ratio, a metric of the normalized neural response, as the difference of mean V and NV responses relative to their sum. Responses that are more exclusively confined to V stimuli exhibit HGA ratios closer to 1, while ratios near 0 represent much weaker category preference.
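The HGA ratio defined above is a simple normalized contrast; a minimal sketch (the function name is ours, not the authors'):

```python
def hga_ratio(v_mean, nv_mean):
    """HGA ratio = (V - NV) / (V + NV).

    Values near 1 indicate a channel responding almost exclusively to
    vocal stimuli, values near -1 almost exclusively to nonvocal stimuli,
    and values near 0 indicate equal responses to both categories.
    Assumes baseline-referenced mean HGA with a nonzero sum.
    """
    return (v_mean - nv_mean) / (v_mean + nv_mean)
```

For example, a channel with equal responses to both categories yields a ratio of 0, while one that responds only to vocal stimuli yields 1.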

As alluded to in the previous paragraph, responses exhibited 2 windows of distinct activity, consisting of an onset (0 to 500 ms) and a sustained response (500 to 2,000 ms). To characterize the voice category preference strength in each of these windows separately, mean HGA was calculated for each category (V and NV) and temporal window (Fig 3C, using the same channels as Fig 3B). These values were then used to calculate HGA ratios, shown in Fig 3D. Notice that the PT channel displays a weak category preference (HGA ratio < 0.25) throughout the stimulus duration, while the uSTS channel transitions from a moderate to a strong category preference between the onset and sustained windows.

Fig 3E shows HGA ratio results across all separable channels (those with nonzero V–NV separability), revealing that the trend from Fig 3B–3D generalizes across channels. Specifically, across the auditory hierarchy, V–NV separability in the onset window is broadly driven by a weak category preference, strengthening to a stronger preference in the sustained window; i.e., HGA ratios are greater in the sustained relative to the onset window (p = 1.6 × 10−5, signed-rank test). Furthermore, STG/STS channels displayed a stronger vocal category preference relative to STP channels in both windows (p < 10−5 for both onset and sustained, rank-sum tests). In summary, HGA responses are more exclusive to vocal sounds in STG/STS relative to STP, as well as in the sustained relative to the onset window.

Last, the onset of V–NV separability was also estimated in STP and STG/STS. This onset is highly sensitive to response strength; i.e., channels with poor signal-to-noise ratios might show later separability onsets due to noise contamination. Therefore, we calculated the median onset time for only the 50% most separable channels in a given ROI. This resulted in median onsets of HGA V–NV separability of 130 ms in STP and 150 ms in STG/STS.

Voice feature encoding demonstrates category-level representation of voice in the STS

Finally, we investigated whether categorical voice responses could be explained by lower-level acoustic processing. To this end, we used the audio processing software OpenSMILE to extract acoustic features of varying complexity [39,40]. Specifically, we used the functionals feature set, which produces statistical summaries (e.g., mean, standard deviation, and peak rate) of acoustic features (e.g., loudness, formants, mel-frequency cepstral coefficients, jitter, and shimmer) for each stimulus. An additional binary feature was included to indicate vocal category membership. Encoding models were built to predict onset and sustained mean HGA (as in Fig 3C) for each channel showing V–NV HGA separability in auditory cortex using this feature space. Full encoding models used the complete feature space, while nested encoding models were built without the categorical voice feature.

Two metrics were derived from this analysis that, when taken together, provide insight into a channel’s encoding properties. First, the percent of variance explained (R2) by the full model describes how well the input features explain the response magnitudes for a given channel. Second, the likelihood ratio test compares the nested to the full model and provides an estimate of the added value conferred by the introduction of the vocal category membership feature. Under the null hypothesis that both models fit the data equally well, this ratio is χ2 distributed.
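The nested-versus-full comparison can be sketched as a likelihood ratio test on two ordinary least squares fits. This is an illustrative simplification under stated assumptions (Gaussian residuals, one added regressor, hypothetical function names), not the authors' implementation:

```python
import numpy as np
from scipy.stats import chi2

def likelihood_ratio_test(y, X_acoustic, vocal_flag):
    """Compare a nested (acoustic-only) linear encoding model against a
    full model that adds a binary vocal-category regressor.

    Returns the LR statistic 2*(llf_full - llf_nested) and its p-value
    under a chi-squared distribution with 1 degree of freedom, using the
    Gaussian log-likelihood of least-squares fits with an intercept.
    """
    n = len(y)

    def loglik(X):
        X1 = np.column_stack([np.ones(n), X])
        beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
        resid = y - X1 @ beta
        sigma2 = resid @ resid / n  # ML estimate of residual variance
        return -0.5 * n * (np.log(2 * np.pi * sigma2) + 1)

    ll_nested = loglik(X_acoustic)
    ll_full = loglik(np.column_stack([X_acoustic, vocal_flag]))
    lr = 2 * (ll_full - ll_nested)
    return lr, chi2.sf(lr, df=1)
```

A channel whose response is driven by voice category yields a large LR statistic (small p-value), whereas a purely acoustics-driven channel yields an LR statistic near 0.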

Among full encoding models with significant R2 values (p < 0.05, Bonferroni corrected, permutation tests), 2 qualitatively different types of responses emerged in the STP and STG/STS. The first group, clustered in STP, represents auditory feature encoding and is characterized by a combination of large R2 and low χ2 values (Fig 4). These channels are well explained by encoding models in both the onset and sustained windows but show minimal improvement in model performance when a vocal category feature is added. The second group of channels, clustered in lateral STG and STS, demonstrates categorical encoding properties by showing substantial model improvement (large χ2) with the addition of categorical voice information (as well as large R2 values). ROI analysis confirms that χ2 values were significantly larger in STG/STS compared to STP in both the onset (p < 10−5) and sustained windows (p < 10−5, rank-sum tests). R2 values from full (acoustic + category) encoding models were not significantly different between regions in either window. In contrast, acoustic-only R2 values were significantly larger in STP compared to STG/STS in both the onset (p < 10−5) and sustained windows (p < 10−5, rank-sum tests).


Fig 4. Encoding model results.

Linear regression encoding models suggest that STP is primarily driven by acoustic features, while STG and STS responses are much more influenced by category-like information. Model inputs consisted of both low- and high-level acoustic features such as loudness, MFCCs, spectral flux, and relative formant ratios. Full models also included a binary feature indicating vocal category membership. Likelihood ratio test statistics compare this full model to a nested, acoustic-only model and thus describe the improvement conferred by V–NV category information. Well-fit channels in STP are modeled best by acoustic features throughout both the onset and sustained windows. Meanwhile, STG and STS channels also perform well and benefit from the addition of category-level information, with a slight skew toward the later sustained window. STG, superior temporal gyrus; STP, supratemporal plane; STS, superior temporal sulcus; V–NV, vocal–nonvocal.


The human TVAs have long been associated with vocal category preference, but the exact computations underlying this phenomenon remain debated [20,24,27,41]. One hypothesis is that, similar to face processing, TVAs perform a voice detection gating function in which incoming auditory stimuli are categorized as vocal or nonvocal prior to higher-level feature extraction [18,42–44]. If such a model were correct, one would expect category-level encoding of vocal stimuli in TVAs. Our results demonstrate that cortical regions in the STG and STS, the putative sites of TVAs, show strong vocal category preference across 2 distinct voice localizer tasks, NatS and VL. Notably, the data show that voice selectivity strengthens along the auditory cortical hierarchy from STP to STG/STS in a temporally dynamic fashion, with an initial, less specific onset response followed by a sustained response with pronounced voice category preference in STG/STS [45]. The auditory sEEG experimental data presented here demonstrate category selectivity to voice even in participants as young as 9, suggesting functional specialization of the STG/STS arises prior to adulthood. Importantly, our results demonstrate that separability between HGA responses to voice and nonvoice acoustic stimuli in the STG/STS is driven most robustly by voice category, rather than by lower-level acoustic features. In contrast, voice-sensitive sites in the STP were driven primarily by acoustic features rather than voice category.

Whether voice selectivity in human auditory cortex is driven by low-level acoustic features or whether neural response selectivity truly reflects a more abstract representation of voice category remains an open question [12,20,24]; some have suggested that voice specialization actually reflects specialization for speech [20,22–24]. First, our results show that while speech information plays a role in vocal neural coding, vocal decoding can occur even in the absence of speech. Second, our encoding model results demonstrate that in the STG/STS, voice category plays a far more influential role than low-level acoustic features. Thus, our results support a model in which there is a gradient of voice category selectivity across the auditory hierarchy, with lower-level acoustic features playing the most important role in STP and strong voice category selectivity emerging in bilateral STG/STS. Last, we found no significant hemispheric differences (left versus right STP plus STG/STS) in any metrics explored here, including V–NV separability, HGA ratio, and encoding model metrics (R2 and χ2). These findings are consistent with other work showing no left-right asymmetries in vocal category encoding [8]; however, this does not preclude the possibility of hemispheric specialization for higher-level voice representations such as speaker identity.

This study also sheds light on important temporal dynamics of voice processing across the auditory system. Previous high-density EEG work reported an N200 signal distinguishing vocal from nonvocal stimuli with an onset at 164 ms [46]. Meanwhile, a subsequent MEG study showed a dissociation in the neural activity evoked by vocal and nonvocal stimuli starting at 150 ms [47]. These studies both inferred a similar localization of this effect: The MEG study showed maximal dissociation around bilateral STS, while the HD-EEG study also proposed a similar anatomical locus. Norman-Haignere and colleagues examined temporal integration in STP and STG/STS across varying timescales and found that channels with “short-integration” windows (<200 ms) show selectivity for spectrotemporal modulation, while “long-integration” channels (>200 ms) show prominent category selectivity [32]. In agreement with these findings, we observed an onset of V–NV separability around 150 ms in STG/STS channels. Interestingly, our decoding results revealed a slightly earlier onset across auditory cortex, starting between 50 and 150 ms. These results may be due to decoding models exhibiting higher sensitivity to early, weak separability, given the inclusion of multiple channels simultaneously.

While we found that separability between vocal and nonvocal responses exists throughout auditory cortex, several response characteristics suggest that voice selectivity is localized to STG/STS. Vocal category preference strength and categorical voice encoding are both stronger in this region, particularly during the sustained window following onset responses. In support of this, Bodin and colleagues provide further evidence that TVAs encode ecologically salient vocalization categories by showing selectivity of anterior STG/STS to conspecific vocalizations over vocalizations from other primate species (human, macaque, and marmoset) [2]; for review, see Bodin and Belin [13]. In contrast, STP sites that display V–NV separability show a weak category preference strength, possibly related to acoustic feature encoding rather than true category specificity. This explanation is supported by the finding that responses in this region encode acoustic features more strongly than categorical features. Furthermore, our results support the idea of dynamic selectivity in the STS (i.e., there are 2 distinct phases of selectivity), whereby vocal category preference strength evolves from weak during the onset window to strong during the sustained response.

The NatS stimulus set was not designed as a voice localizer and thus contains a lower proportion of nonspeech vocal stimuli. To ensure this stimulus set functioned similarly to a purpose-built TVA localizer (like VL), we performed a cross-decoding analysis between the natural sounds and voice localizer paradigms, which shows that responses to vocal versus nonvocal sounds are similar across these separate stimulus sets (Fig 2C). This stands in contrast to functional neuroimaging work that used the NatS stimulus set to suggest that temporal voice areas may not exist [20]. Notably, Norman-Haignere and colleagues [27] recently published intracranial recordings using the same NatS stimulus set and showed voice selectivity not driven by speech. They suggested that the disparity between these results and previous fMRI work could be related to the relatively reduced granularity of fMRI approaches compared to sEEG. A recent fMRI study using artificially generated sounds demonstrated that temporal voice areas may encode vocal perceptual quality, i.e., the extent to which a sound is voice-like [21]. Since the environmental sound stimuli do not sufficiently sample across this perceptual continuum, the current data are unable to clarify this possibility directly. However, a weak (STP) versus strong (STG/STS) category preference strength could reflect encoding of acoustic and perceptual features, respectively.

We demonstrate dynamic category-driven encoding of voice in the human STG/STS. Further, with the spatiotemporal resolution of intracerebral recordings, our results reveal a gradient of selectivity across auditory processing regions, with distinct temporal dynamics underlying different aspects of voice processing. Taken together, our findings support a voice gating mechanism of voice coding by temporal voice areas.

Materials and methods

Participants and electrode implantation

sEEG recordings of the STS, STG, and STP (including HG) were performed in 8 neurosurgical patients with drug-resistant epilepsy as part of clinical evaluation for epilepsy surgery. See Table 1 for patient-participant characteristics. Written informed consent and assent (for patients >14 years old) was obtained from all participants. The research protocol was approved by the University of Pittsburgh Institutional Review Board (STUDY20030060).

All patients underwent preoperative neuropsychological evaluation. sEEG electrode implantation was performed as previously described [33]. Briefly, Dixi Medical Microdeep electrodes were used, with a diameter of 0.8 mm, contact length of 2 mm, and center-to-center spacing of 3.5 mm. Electrodes contained between 8 and 18 contacts each, which we refer to as channels. For 6 of 8 patients, there was incomplete sampling of implanted channels due to clinical hardware constraints (recording maximum of 128 channels, well below the number typically implanted). Clinical staff typically chose to record every other channel on a given electrode shaft, although they occasionally strayed from this heuristic for clinical reasons.

Acoustic stimuli

Two different types of stimuli were used in separate experiments, which we refer to as VL and NatS. VL stimuli were modified versions of the stimuli used in the Belin voice localizer [7]. These original stimuli were designed for fMRI experiments and consisted of 8-second clips, with each clip containing a sequence of either vocal (e.g., speech, laughter, and coughing) or nonvocal (e.g., mechanical, music, and animal vocalizations) sounds only. These stimuli were adapted to capitalize on the temporal resolution afforded by intracranial research: PRAAT silence-based segmentation was used to extract and save individual sounds from each clip [48]. Sounds with duration shorter than 550 ms were discarded; all other sounds were shortened to this duration, linear-ramped on and off over 50 ms, and RMS-normalized. This procedure generated 80 nonvocal and 72 vocal sounds; to ensure balanced classes, only the first 72 nonvocal sounds were selected.
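The trim-ramp-normalize step for a single extracted sound can be sketched as follows (a minimal NumPy sketch; the function name and the target RMS value are illustrative, not from the paper):

```python
import numpy as np

def prepare_stimulus(x, sr, dur_s=0.55, ramp_s=0.05, target_rms=0.05):
    """Trim a sound to a fixed duration, apply 50 ms linear on/off ramps,
    and RMS-normalize, mirroring the VL preprocessing described above.
    Sounds shorter than the target duration are discarded (raise)."""
    n = int(round(dur_s * sr))
    if len(x) < n:
        raise ValueError("sound shorter than target duration; discard it")
    x = x[:n].astype(float)
    r = int(round(ramp_s * sr))
    ramp = np.linspace(0.0, 1.0, r)
    x[:r] *= ramp          # linear onset ramp
    x[-r:] *= ramp[::-1]   # linear offset ramp
    x *= target_rms / np.sqrt(np.mean(x ** 2))  # RMS normalization
    return x
```

In practice the waveform `x` would come from a PRAAT-segmented clip; any positive `target_rms` works since only the relative level across stimuli matters.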

NatS stimuli were the same as those originally used by Norman-Haignere and colleagues [20]. Each of the 165 sounds is 2 seconds in duration and belongs to 1 of 11 categories, defined in the original study, which we grouped into superordinate categories of vocal and nonvocal sounds. Vocal categories consisted of English and foreign speech, human vocalizations, and lyrical music. Similar to VL, nonvocal sounds were more diverse and included categories such as mechanical sounds, nonlyrical music, and animal vocalizations.

Importantly, the NatS stimulus set contained human nonspeech vocal sounds that may not activate voice-selective regions of cortex. Specifically, crowd-generated cheering and laughter may be categorically different from vocal sounds generated by individuals. Additionally, following the heuristic outlined by Belin and colleagues [7], we excluded sounds without vocal fold vibrations, specifically breathing and whistling. Based on these 2 considerations, we reclassified 4 NatS stimuli from the vocal to the nonvocal category.


For each patient, cortical surfaces were reconstructed from a preoperative MRI using Freesurfer [49]. Using the MATLAB third-party package Brainstorm, the MRI was then co-registered with a postoperative CT scan, and channels were localized. MNI normalization was performed using Brainstorm's implementation of SPM12's nonlinear warping. This MNI deformation field was then used to warp the Julich volumetric atlas into patient space [50–52], and each channel was localized to an ROI by finding the ROI label of the nearest voxel. ROI labels for each channel were visually inspected and manually corrected where appropriate.
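The nearest-voxel label assignment amounts to a nearest-neighbor lookup; a minimal sketch (array names and shapes are assumptions, not from the paper):

```python
import numpy as np

def roi_label(channel_xyz, voxel_xyz, voxel_labels):
    """Assign each channel the atlas label of its nearest voxel.
    channel_xyz: (n_channels, 3); voxel_xyz: (n_voxels, 3) voxel centers;
    voxel_labels: (n_voxels,) ROI labels aligned with voxel_xyz."""
    # Pairwise distances, channels x voxels, then take the closest voxel.
    d = np.linalg.norm(voxel_xyz[None, :, :] - channel_xyz[:, None, :], axis=2)
    return voxel_labels[np.argmin(d, axis=1)]
```

For a whole-brain atlas this brute-force distance matrix would be replaced by a KD-tree, but the logic is the same.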

Data preprocessing

A common average reference (CAR) filter was used to remove noise common across channels. While bipolar montages have been shown to result in improvements in some signal metrics [53], a CAR filter was chosen given the incomplete channel sampling described earlier. Voltages were epoched by extracting a window from 1,000 ms before stimulus onset to 1,450 ms after offset in the case of VL, or 1,000 ms after offset for NatS. Each channel was then normalized relative to the prestimulus period across all trials. All channels whose centroid was farther than 3 mm from the nearest cortical vertex (either the pial surface or the gray-white matter boundary) were excluded.
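The CAR step itself is a one-liner: subtract the instantaneous mean across channels from every channel. A minimal sketch, assuming a channels × samples array:

```python
import numpy as np

def car_filter(data):
    """Common average reference: remove the across-channel mean at every
    sample (data shaped channels x samples)."""
    return data - data.mean(axis=0, keepdims=True)
```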

To estimate broadband HGA, epoched data were forward- and reverse-filtered using a bank of 8 bandpass Butterworth filters (6th order), with log-spaced center frequencies (70 to 150 Hz) and bandwidths (16 to 64 Hz). The analytic signal amplitude was extracted using the Hilbert transform. Each band was then normalized relative to a common baseline across all trials; in estimating the mean and standard deviation for normalization, the earliest 100 ms of baseline were discarded due to edge effects, and the 100 ms immediately preceding stimulus onset were discarded to prevent any contamination from low-latency responses. HGA was then calculated as the mean across these 8 bands, which was down-sampled to 100 Hz and clipped to a window of 900 ms before onset to 900 ms after offset.
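The filter-bank/envelope step can be sketched with SciPy (a simplified sketch: each band's envelope is z-scored against its own mean/SD here, standing in for the trial-wide baseline normalization described above; the downsampling and clipping steps are omitted):

```python
import numpy as np
from scipy.signal import butter, sosfiltfilt, hilbert

def high_gamma_amplitude(x, fs, n_bands=8):
    """Broadband HGA sketch: 8 zero-phase (forward-reverse) 6th-order
    Butterworth bandpass filters with log-spaced centers (70-150 Hz) and
    log-spaced bandwidths (16-64 Hz); Hilbert envelopes are z-scored per
    band and averaged across bands."""
    centers = np.logspace(np.log10(70), np.log10(150), n_bands)
    widths = np.logspace(np.log10(16), np.log10(64), n_bands)
    bands = []
    for c, w in zip(centers, widths):
        sos = butter(6, [c - w / 2, c + w / 2], btype="bandpass",
                     fs=fs, output="sos")
        env = np.abs(hilbert(sosfiltfilt(sos, x)))      # analytic amplitude
        bands.append((env - env.mean()) / env.std())    # per-band z-score
    return np.mean(bands, axis=0)
```

Second-order-sections (`sos`) form is used because high-order narrowband Butterworth filters are numerically unstable in transfer-function form.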

Auditory-responsive channels were identified using 2-sample t tests comparing mean HGA in a 500 ms window immediately following stimulus onset to a baseline period defined as −600 to −100 ms preonset. Only channels with p < 0.05 (FDR corrected) were used in subsequent analysis. For patients that completed both VL and NatS, channels were labeled auditory-responsive if they surpassed this threshold in at least one of the 2 tasks. At this point, HGA was averaged across all presentations of a stimulus, which we refer to as a stimulus response. Unless otherwise noted, stimulus responses (versus single-trial responses) were used in all subsequent analysis.
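The channel-selection test can be sketched as a per-channel t test with Benjamini-Hochberg FDR correction (a sketch assuming trials × channels arrays; the BH step is implemented inline rather than via a stats toolbox):

```python
import numpy as np
from scipy.stats import ttest_ind

def auditory_responsive(onset_hga, baseline_hga, alpha=0.05):
    """Flag channels whose mean HGA in the post-onset window differs from
    baseline: 2-sample t test per channel, Benjamini-Hochberg FDR.
    Inputs are (trials x channels) arrays of mean HGA per trial."""
    _, p = ttest_ind(onset_hga, baseline_hga, axis=0)
    m = len(p)
    order = np.argsort(p)
    ranked = p[order] * m / np.arange(1, m + 1)       # raw BH ratios
    adj = np.minimum.accumulate(ranked[::-1])[::-1]   # enforce monotone adjusted p
    out = np.empty(m)
    out[order] = adj
    return out < alpha
```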

Decoding analysis

For each patient, decoding was performed via L1-regularized logistic regression using the MATLAB package glmnet. Input features consisted of mean HGA in 100 ms windows, sliding every 50 ms. Relative to stimulus onset, window centers spanned from −50 to 1,500 ms for VL and −50 to 2,450 ms for NatS. Full models that included all channels and time windows, as well as sliding models that used single windows, were constructed. While the durations of VL (550 ms) and NatS stimuli (2,000 ms) differed, recent evidence suggests that this would not have resulted in a meaningful difference between NatS and VL sliding window decoding results (for the first 550 ms). Using the same NatS stimuli as this study, Norman-Haignere and colleagues demonstrated that temporal integration windows across auditory cortex showed an upper bound of 400 ms [45]; since both stimulus sets exceeded this window length, they both produced responses in which the sounds were fully contained within each channel's integration window. In contrast, NatS full models may have benefitted from the longer stimuli, which resulted in more temporal windows and therefore more input features to the decoding model.

In addition to regularization, cross validation (5-fold) was used to prevent overfitting and explore generalizability. Within each cross-validation fold, 20% of the data was held out as a testing set; the remaining 80% was further split into a 72% training and 8% validation set using a 10-fold inner-loop cross validation scheme. Before the inner-loop cross validation, input features in the training+validation and testing sets were z-scored relative to the training+validation set. Additionally, each observation was weighted by the inverse of its class prevalence to prevent models from biasing toward the most numerous class; these weights were calculated using the training+validation set. This step was especially important for NatS, due to a large class imbalance (37 vocal, 128 nonvocal stimuli). Last, the inner-loop cross validation was used to test 20 different regularization parameters; the parameter was chosen based on the model that minimized the mean deviance across inner folds. Balanced decoding accuracies were reported, in which the within-class accuracies for vocal and nonvocal stimuli were calculated separately and then averaged. Nonspeech vocal decoding (Fig 2A, light blue bars) was performed using single-trial (versus stimulus) responses, due to the scarcity of NSV exemplars in the NatS stimulus set. For cross-task decoding, input features were restricted to the shorter stimulus duration of VL, i.e., only windows within the first 550 ms. A single model was built on all data in one task and then tested on all data in the other task.
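The core of this pipeline can be sketched in Python (the study used MATLAB's glmnet; scikit-learn's `LogisticRegression` is used here as a stand-in, with a fixed regularization strength `C` instead of the 20-value inner-loop search, and `class_weight="balanced"` providing the inverse-prevalence observation weights):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def decode_vnv(X, y, C=1.0):
    """Sketch of the V-NV decoder: L1-regularized logistic regression with
    inverse-prevalence class weights, 5-fold cross-validation, scored by
    balanced accuracy (mean of within-class accuracies).
    X: (stimuli x features) mean-HGA windows; y: binary V/NV labels."""
    clf = LogisticRegression(penalty="l1", solver="liblinear",
                             class_weight="balanced", C=C)
    return cross_val_score(clf, X, y, cv=5,
                           scoring="balanced_accuracy").mean()
```

The per-fold z-scoring described in the text would be added via a `Pipeline` with a `StandardScaler` so that test folds never influence the scaling.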

Statistical significance was assessed for sliding window decoding via a permutation-based clustering approach [54]. Briefly, V–NV labels were shuffled, and sliding window decoding was performed 1,000 times. At each window, this generated separate accuracy null distributions from which critical values were drawn, identified as the upper 95th percentile. For each permutation, values that exceeded their window's critical threshold were kept, and temporally adjacent values were summed to create a cluster mass. The maximum cluster mass for each permutation was kept, producing a null distribution of 1,000 cluster masses (if a permutation contained no windows that exceeded threshold, the maximum cluster mass was set to 0). Finally, the same procedure was applied to the true (unshuffled) sliding window accuracies, and each resultant cluster mass was assigned a p-value equal to the proportion of null cluster masses that exceeded it.
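The cluster-mass step, i.e., summing suprathreshold values over runs of temporally adjacent windows, can be sketched as (a minimal sketch; the function name is illustrative):

```python
import numpy as np

def cluster_masses(values, thresholds):
    """Return one mass per contiguous cluster: values exceeding their
    window's critical threshold are summed over runs of adjacent windows.
    `thresholds` may be a scalar or a per-window array."""
    above = values > thresholds
    masses, current = [], 0.0
    for v, a in zip(values, above):
        if a:
            current += v          # extend the current cluster
        elif current:
            masses.append(current)  # cluster ended
            current = 0.0
    if current:
        masses.append(current)      # cluster ran to the final window
    return masses
```

The null distribution then collects `max(cluster_masses(...), default=0)` over permutations, and the true clusters are compared against it.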

Single-channel analysis

Single-channel V versus NV separability was estimated using a similar approach. However, rather than using decoding accuracy, time-varying 2-sample t-statistics were calculated on single-trial responses, and 10,000 permutations were used. V–NV separability (Fig 3A) was quantified as the summed cluster mass, i.e., the sum of masses for all significant (p < 0.001) clusters for a given channel. The importance of using this metric (versus the maximum cluster mass) can be appreciated in the upper panel of Fig 3B: Summing across this channel's 5 separate clusters gives a more accurate description of the overall separability.

Across V–NV separable channels, HGA response profiles appeared to share consistent morphological characteristics, namely onset and sustained responses of varying magnitudes. The longer stimulus durations in the NatS stimulus set provide a better estimate of sustained response properties, so that set is the focus of this analysis. We first averaged HGA across an onset window (initial 500 ms following stimulus onset) and a sustained window (the remainder of the stimulus length, 500 to 2,000 ms) and then calculated the mean HGA within V and NV stimuli. This window length is supported by a recent study which found that lateral STG exhibited integration windows spanning up to about 500 ms poststimulus onset (integration window length plus neural response delay) [45]. While these integration windows varied considerably across auditory cortex, we opted for a conservative window length of 500 ms to separate onset responses, where the integration window might still contain the prestimulus baseline, from sustained responses, where the integration window fully overlaps with the stimulus.

To investigate response characteristics to V versus NV stimuli, we then calculated the HGA ratio, defined as the difference in mean HGA between these stimulus categories normalized by the sum of these responses, i.e., (HGAv − HGAnv) / (HGAv + HGAnv). Normalization helps account for a potential signal-to-noise confound: If a given channel's overall response is scaled, the V–NV difference will be amplified as well.

Assuming positive values for both HGAv and HGAnv, this measure ranges between −1 and 1, with a value near 1 indicating a strong preference for vocal stimuli and a value near −1 indicating a strong preference for nonvocal stimuli. Among auditory-responsive channels that also demonstrated V–NV separability, a small fraction (7%) exhibited negative mean HGAs across NV stimuli, representing an NV-associated decrease in the HGA response relative to baseline. To constrain the HGA ratio between −1 and 1, these values were set to 0 before the calculation.
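The ratio and its clipping rule fit in a few lines (a sketch; clipping is applied to both inputs here for symmetry, while the text only reports negative NV means; the denominator is assumed to be positive after clipping):

```python
def hga_ratio(hga_v, hga_nv):
    """HGA ratio = (V - NV) / (V + NV). Negative mean HGAs are set to 0
    first so the ratio stays within [-1, 1]."""
    hga_v = max(hga_v, 0.0)
    hga_nv = max(hga_nv, 0.0)
    return (hga_v - hga_nv) / (hga_v + hga_nv)
```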

Encoding models

By averaging across stimulus categories, the acoustic variability between individual stimuli has thus far been ignored. One possibility is that channels exhibiting a strong preference for vocal sounds are actually encoding lower-level acoustic properties that are inherently different between vocal and nonvocal sounds. To explore this possibility, encoding models were used to predict stimulus responses, i.e., mean HGA in both onset and sustained windows, for each channel that exhibited V–NV separability. Our approach closely mirrored the encoding model methods reported by Staib and Frühholz [21].

The openSMILE acoustic processing package, implemented in Python, was used to extract the "functionals" set of 88 acoustic features for each NatS stimulus [39,40]. These features consist of statistical summaries (e.g., mean, standard deviation, and percentiles) of acoustic features of varying complexity (e.g., loudness, mel-frequency cepstral coefficients, spectral flux, and formant values). This feature space contained a high degree of collinearity between features; therefore, we used principal component analysis to reduce its dimensionality. The first n principal components that captured 99.99% of the variance in the original feature space were kept. Last, a categorical feature was added indicating vocal class membership. One stimulus (chopping food) was removed due to outliers in its acoustic features.
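The variance-thresholded PCA step can be sketched via the SVD (a NumPy sketch assuming a stimuli × features matrix; names are illustrative):

```python
import numpy as np

def reduce_features(F, var_kept=0.9999):
    """Project a (stimuli x features) matrix onto the first n principal
    components that capture `var_kept` of the variance."""
    Fc = F - F.mean(axis=0)                 # center each feature
    U, s, _ = np.linalg.svd(Fc, full_matrices=False)
    var = s ** 2 / np.sum(s ** 2)           # variance fraction per component
    n = int(np.searchsorted(np.cumsum(var), var_kept) + 1)
    return U[:, :n] * s[:n]                 # component scores
```

The resulting scores, plus the binary vocal-class column, form the encoding-model design matrix.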

Linear regression encoding models were then built in 1 of 2 ways, corresponding to 2 related measures of interest. First, overall model fit was calculated as the out-of-sample R² value using leave-one-out cross-validation. Statistical significance of R² values was assessed using Bonferroni-corrected p-values generated from permutation tests with 10,000 permutations, in which rows of the feature matrix were shuffled before model building.

Second, the likelihood ratio test statistic was calculated between the full model and a nested version that excluded the vocal class feature. This statistic estimates the likelihood that the vocal class feature provides additional information beyond the acoustic features and is χ²-distributed under the null hypothesis. Log likelihoods for both full and nested models were obtained from models trained on full stimulus sets.
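For Gaussian linear models, the likelihood ratio statistic reduces to a comparison of residual sums of squares, LR = n·log(RSS_nested / RSS_full), with degrees of freedom equal to the number of excluded predictors. A minimal sketch (function name illustrative; design matrices include an intercept column):

```python
import numpy as np
from scipy.stats import chi2

def lr_test(y, X_full, X_nested):
    """Likelihood-ratio test between full and nested OLS encoding models.
    Under the null (the excluded predictor adds nothing), the statistic is
    chi-square with df = difference in number of predictors."""
    def rss(X):
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return resid @ resid
    n = len(y)
    stat = n * np.log(rss(X_nested) / rss(X_full))
    df = X_full.shape[1] - X_nested.shape[1]
    return stat, chi2.sf(stat, df)
```

Here the nested model would contain only the acoustic principal components, and the full model would add the vocal-class column.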


References

1. Romanski LM, Averbeck BB. The primate cortical auditory system and neural representation of conspecific vocalizations. Annu Rev Neurosci. 2009;32:315–46. pmid:19400713
2. Bodin C et al. Functionally homologous representation of vocalizations in the auditory cortex of humans and macaques. Curr Biol. 2021. pmid:34506729
3. Mathias SR, von Kriegstein K. Voice Processing and Voice-Identity Recognition. In Timbre: Acoustics, Perception, and Cognition (eds. Siedenburg K, Saitis C, McAdams S, Popper AN, Fay RR) 175–209. Springer International Publishing; 2019.
4. Hepper PG, Scott D, Shahidullah S. Newborn and fetal response to maternal voice. J Reprod Infant Psychol. 1993;11:147–53.
5. Kuhl PK. Early language acquisition: Cracking the speech code. Nat Rev Neurosci. 2004. pmid:15496861
6. Zarate JM, Tian X, Woods KJP, Poeppel D. Multiple levels of linguistic and paralinguistic features contribute to voice recognition. Sci Rep. 2015. pmid:26088739
7. Belin P, Zatorre RJ, Lafaille P, Ahad P, Pike B. Voice-selective areas in human auditory cortex. Nature. 2000. pmid:10659849
8. Pernet CR et al. The human voice areas: Spatial organization and inter-individual variability in temporal and extra-temporal cortices. Neuroimage. 2015. pmid:26116964
9. Belin P, Zatorre RJ, Ahad P. Human temporal-lobe response to vocal sounds. Cogn Brain Res. 2002;13:17–26. pmid:11867247
10. Bodin C, Takerkart S, Belin P, Coulon O. Anatomo-functional correspondence in the superior temporal sulcus. Brain Struct Funct. 2018;223:221–32. pmid:28756487
11. Kriegstein KV, Giraud A-L. Distinct functional substrates along the right superior temporal sulcus for the processing of voices. Neuroimage. 2004;22:948–55. pmid:15193626
12. Agus TR, Paquette S, Suied C, Pressnitzer D, Belin P. Voice selectivity in the temporal voice area despite matched low-level acoustic cues. Sci Rep. 2017;7:11526. pmid:28912437
13. Bodin C, Belin P. Exploring the cerebral substrate of voice perception in primate brains. Philos Trans R Soc B Biol Sci. 2020;375:20180386. pmid:31735143
14. Seltzer B, Pandya DN. Parietal, temporal, and occipital projections to cortex of the superior temporal sulcus in the rhesus monkey: A retrograde tracer study. J Comp Neurol. 1994;343:445–63. pmid:8027452
15. Erickson LC, Rauschecker JP, Turkeltaub PE. Meta-analytic connectivity modeling of the human superior temporal sulcus. Brain Struct Funct. 2017;222:267–85. pmid:27003288
16. Romanski LM et al. Dual streams of auditory afferents target multiple domains in the primate prefrontal cortex. Nat Neurosci. 1999;2:1131–6. pmid:10570492
17. Perrodin C, Kayser C, Abel TJ, Logothetis NK, Petkov CI. Who's That? Brain Networks and Mechanisms for Identifying Individuals. Trends Cogn Sci. 2015;19. pmid:26454482
18. Zhang Y et al. Hierarchical cortical networks of "voice patches" for processing voices in human brain. Proc Natl Acad Sci U S A. 2021;118:e2113887118. pmid:34930846
19. von Kriegstein K, Eger E, Kleinschmidt A, Giraud AL. Modulation of neural responses to speech by directing attention to voices or verbal content. Cogn Brain Res. 2003;17:48–55. pmid:12763191
20. Norman-Haignere S, Kanwisher NG, McDermott JH. Distinct Cortical Pathways for Music and Speech Revealed by Hypothesis-Free Voxel Decomposition. Neuron. 2015;88:1281–96. pmid:26687225
21. Staib M, Frühholz S. Cortical voice processing is grounded in elementary sound analyses for vocalization relevant sound patterns. Prog Neurobiol. 2021;200:101982. pmid:33338555
22. Perrodin C, Kayser C, Logothetis NK, Petkov CI. Auditory and visual modulation of temporal lobe neurons in voice-sensitive and association cortices. J Neurosci. 2014;34:2524–37. pmid:24523543
23. Sadagopan S, Temiz-Karayol NZ, Voss HU. High-field functional magnetic resonance imaging of vocalization processing in marmosets. Sci Rep. 2015. pmid:26091254
24. Perrachione TK, Del Tufo SN, Gabrieli JDE. Human Voice Recognition Depends on Language Ability. Science. 2011;333:595. pmid:21798942
25. Peretz I, Vuvan D, Lagrois M-É, Armony JL. Neural overlap in processing music and speech. Philos Trans R Soc B Biol Sci. 2015;370:20140090. pmid:25646513
26. Zatorre RJ, Belin P, Penhune VB. Structure and function of auditory cortex: music and speech. Trends Cogn Sci. 2002;6:37–46. pmid:11849614
27. Norman-Haignere SV et al. A neural population selective for song in human auditory cortex. Curr Biol. 2022.
28. Chevillet M, Riesenhuber M, Rauschecker JP. Functional Correlates of the Anterolateral Processing Hierarchy in Human Auditory Cortex. J Neurosci. 2011;31:9345–52. pmid:21697384
29. Hickok G, Poeppel D. The cortical organization of speech processing. Nat Rev Neurosci. 2007;8:393–402. pmid:17431404
30. Staeren N, Renvall H, De Martino F, Goebel R, Formisano E. Sound Categories Are Represented as Distributed Patterns in the Human Auditory Cortex. Curr Biol. 2009;19:498–502. pmid:19268594
31. Fontolan L, Morillon B, Liegeois-Chauvel C, Giraud A-L. The contribution of frequency-specific activity to hierarchical information processing in the human auditory cortex. Nat Commun. 2014;5:4694. pmid:25178489
32. Norman-Haignere SV et al. Multiscale temporal integration organizes hierarchical computation in human auditory cortex. Nat Hum Behav. 2022. pmid:35145280
33. Abel TJ et al. Frameless robot-assisted stereoelectroencephalography in children: Technical aspects and comparison with Talairach frame technique. J Neurosurg Pediatr. 2018;22. pmid:29676681
34. Perrodin C, Kayser C, Logothetis NK, Petkov CI. Voice Cells in the Primate Temporal Lobe. Curr Biol. 2011;21:1408–15. pmid:21835625
35. Miller KJ, Abel TJ, Hebb AO, Ojemann JG. Rapid online language mapping with electrocorticography: Clinical article. J Neurosurg Pediatr. 2011;7:482–90. pmid:21529188
36. Crone NE, Sinai A, Korzeniewska A. High-frequency gamma oscillations and human brain mapping with electrocorticography. In Progress in Brain Research (eds. Neuper C, Klimesch W) vol. 159, 275–295. Elsevier; 2006.
37. Buzsáki G, Anastassiou CA, Koch C. The origin of extracellular fields and currents—EEG, ECoG, LFP and spikes. Nat Rev Neurosci. 2012;13:407–20. pmid:22595786
38. Ray S, Maunsell JHR. Different Origins of Gamma Rhythm and High-Gamma Activity in Macaque Visual Cortex. PLoS Biol. 2011;9:e1000610. pmid:21532743
39. Eyben F, Weninger F, Gross F, Schuller B. Recent developments in openSMILE, the Munich open-source multimedia feature extractor. In Proceedings of the 21st ACM International Conference on Multimedia, 835–838. Association for Computing Machinery; 2013.
40. Eyben F et al. The Geneva Minimalistic Acoustic Parameter Set (GeMAPS) for Voice Research and Affective Computing. IEEE Trans Affect Comput. 2016;7:190–202.
41. Frühholz S, Belin P. The Oxford Handbook of Voice Perception. Oxford University Press; 2018.
42. Tsao DY, Livingstone MS. Mechanisms of face perception. Annu Rev Neurosci. 2008;31:411–37. pmid:18558862
43. Young AW, Bruce V. Understanding person perception. Br J Psychol. 2011;102:959–74. pmid:21988395
44. Latinus M, McAleer P, Bestelmeyer PEG, Belin P. Norm-Based Coding of Voice Identity in Human Auditory Cortex. Curr Biol. 2013;23:1075–80. pmid:23707425
45. Norman-Haignere SV et al. Multiscale integration organizes hierarchical computation in human auditory cortex. 2020.
46. Charest I et al. Electrophysiological evidence for an early processing of human voices. BMC Neurosci. 2009;10:127. pmid:19843323
47. Capilla A, Belin P, Gross J. The Early Spatio-Temporal Correlates and Task Independence of Cerebral Voice Processing Studied with MEG. Cereb Cortex. 2013;23:1388–95. pmid:22610392
48. Boersma P, Weenink D. Praat: doing phonetics by computer. 2021.
49. Fischl B et al. Automatically Parcellating the Human Cerebral Cortex. Cereb Cortex. 2004;14:11–22. pmid:14654453
50. Amunts K, Mohlberg H, Bludau S, Zilles K. Julich-Brain: A 3D probabilistic atlas of the human brain's cytoarchitecture. Science. 2020;369:988–92. pmid:32732281
51. Eickhoff SB et al. A new SPM toolbox for combining probabilistic cytoarchitectonic maps and functional imaging data. Neuroimage. 2005;25:1325–35. pmid:15850749
52. Evans AC, Janke AL, Collins DL, Baillet S. Brain templates and atlases. Neuroimage. 2012;62:911–22. pmid:22248580
53. Li G et al. Optimal referencing for stereo-electroencephalographic (SEEG) recordings. Neuroimage. 2018;183:327–35. pmid:30121338
54. Maris E, Oostenveld R. Nonparametric statistical testing of EEG- and MEG-data. J Neurosci Methods. 2007;164:177–90. pmid:17517438

