When we speak, we engage nearly 100 muscles, continuously moving our lips, jaw, tongue, and throat to shape our breath into the fluent sequences of sounds that form our words and sentences. A new study by UC San Francisco scientists reveals how these complex articulatory movements are coordinated in the brain.
The new research reveals that the brain’s speech centers are organized more according to the physical needs of the vocal tract as it produces speech than by how the speech sounds (its “phonetics”). Linguists divide speech into abstract units of sound called “phonemes” and consider the /k/ sound in “keep” the same as the /k/ in “coop.” But in reality, your mouth forms the sound differently in these two words to prepare for the different vowels that follow, and this physical distinction now appears to be more important to the brain regions responsible for producing speech than the theoretical sameness of the phoneme.
The findings, which extend previous studies on how the brain interprets the sounds of spoken language, could help guide the creation of a new generation of prosthetic devices for those who are unable to speak: brain implants could monitor neural activity related to speech production and rapidly and directly translate those signals into synthetic spoken language.
The new study, published on May 17, 2018, in Neuron, was conducted by Josh Chartier and Gopala K. Anumanchipalli, PhD, both researchers in the laboratory of senior author Edward Chang, MD, professor of neurological surgery, Bowes Biomedical Investigator, and member of the UCSF Weill Institute for Neurosciences. They were joined by Keith Johnson, PhD, professor of linguistics at UC Berkeley.
A Neural Code for Vocal Tract Movements
Chang, a neurosurgeon at the UCSF Epilepsy Center, specializes in surgeries to remove brain tissue that causes seizures in patients with epilepsy. In some cases, to prepare for these operations, he places high-density arrays of tiny electrodes onto the surface of the patients’ brains, both to help identify the location triggering the patients’ seizures and to map out other important areas, such as those involved in language, to make sure the surgery avoids damaging them.
In addition to its clinical importance, this method, known as electrocorticography, or ECoG, is a powerful tool for research. “It’s a unique means of looking at thousands of neurons activating in unison,” Chartier said.
In the new study, Chartier and Anumanchipalli asked five volunteers awaiting surgery, with ECoG electrodes placed over a region of ventral sensorimotor cortex that is a key center of speech production, to read aloud a collection of 460 natural sentences. The sentences were expressly constructed to encapsulate nearly all the possible articulatory contexts in American English. This comprehensiveness was crucial to capture the complete range of “coarticulation,” the blending of phonemes that is essential to natural speech.
“Without coarticulation, our speech would be blocky and segmented to the point where we couldn’t really understand it,” said Chartier.
The research team was not able to simultaneously record the volunteers’ neural activity and their tongue, mouth and larynx movements. Instead, they recorded only audio of the volunteers speaking and developed a novel deep learning algorithm to estimate which movements were made during specific speaking tasks.
This approach allowed the researchers to identify distinct populations of neurons responsible for the specific vocal tract movement patterns needed to produce fluent speech sounds, a level of complexity that had not been seen in previous experiments that used simpler syllable-by-syllable speech tasks.
The experiments revealed that a remarkable diversity of movements was encoded by neurons surrounding individual electrodes. The researchers found four emergent groups of neurons that appeared to be responsible for coordinating movements of muscles of the lips, tongue, and throat into the four main configurations of the vocal tract used in American English. The researchers also identified neural populations associated with specific classes of phonetic phenomena, including separate clusters for consonants and vowels of different types, but their analysis suggested that these phonetic groupings were largely a byproduct of more natural groupings based on different types of muscle movement.
Regarding coarticulation, the researchers discovered that our brains’ speech centers coordinate different muscle movement patterns based on the context of what’s being said and the order in which different sounds occur. For example, the jaw opens more to say the word “tap” than to say the word “has”: although the two words share the same vowel sound (/ae/), the mouth has to get ready to close to make the /z/ sound in “has.” The researchers found that neurons in the ventral sensorimotor cortex were highly attuned to this and other coarticulatory features of English, suggesting that the brain cells are tuned to produce fluid, context-dependent speech as opposed to reading out discrete speech segments in serial order.
“During speech production, there is clearly another layer of neural processing that happens, which enables the speaker to merge phonemes together into something the listener can understand,” said Anumanchipalli.
Path to a Speech Prosthetic
“This study highlights why we need to take into account vocal tract movements and not just linguistic features like phonemes when studying speech production,” Chartier said. He thinks that this work not only paves the way for additional studies that tackle the sensorimotor aspect of speech production, but could also pay practical dividends.
“We know now that the sensorimotor cortex encodes vocal tract movements, so we can use that knowledge to decode cortical activity and translate that via a speech prosthetic,” said Chartier. “This would give voice to people who can’t speak but have intact neural functions.”
Ultimately, the study could represent a new research avenue for Chartier and Anumanchipalli’s team at UCSF. “It’s really made me think twice about how phonemes fit in. In a sense, these units of speech that we pin so much of our research on are just byproducts of a sensorimotor signal,” Anumanchipalli said.