Speech is characterized by 4 types of acoustic segments:
Context effects occur when the perception of one phoneme is altered by changing the acoustic characteristics of nearby sound segments. Trading relations occur when a phonemic percept remains unchanged while more than one acoustic feature of the signal is changed simultaneously: the features "trade against each other."
Consider the sentence:
from Repp, Liberman, Eccardt and Pesetsky (1978, JEP:HPP 4(4):621-637). Perception varies with the duration of the silent gap between "gray" and "ship" and with the duration of the fricative noise in "ship."
The original utterance lies in area 1, with no silence between the "ay" and "sh" and a fricative noise of about 122 msec.
great
When a silent interval was inserted between "gray" and "ship," listeners assimilated the silence and the "sh" into cues for a stop consonant, perceiving "gray" as "great." Given a noise duration of 160 msec, the "t" was perceived even after relatively long 100 msec gaps of silence. The "t" was then grouped with the preceding word "gray" rather than with the temporally contiguous "chip" signal.
Shortening the duration of the fricative noise switches the percept from "gray ship" or "great ship" to "gray chip." Looking at areas 2 and 3, for a given silence duration, shortening the noise duration caused the perceived stop "t" to leave the first syllable "grea" and latch onto the fricative "sh" to form the affricate consonant "ch" ("tsh"). Without changing the amount of silence separating the words, a variation in the initial segment of the second word can alter the perception of the first word!
Further, the boundary between areas 2 and 3 shows a trading relation between silence and noise durations. At longer silence durations, longer noise durations are required to cue the switch.
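The three perceptual areas and the trading relation can be illustrated with a toy decision rule. This is only a sketch: the function name `classify_percept` and the parameters `min_silence_ms`, `base_noise_ms`, and `trade_slope` are hypothetical, and their values were chosen by hand so that the rule agrees with the data points quoted above. They are not the boundaries measured by Repp et al.

```python
def classify_percept(silence_ms, noise_ms,
                     min_silence_ms=20.0,   # hypothetical: shortest gap that cues a stop
                     base_noise_ms=100.0,   # hypothetical: boundary intercept
                     trade_slope=0.5):      # hypothetical: ms of noise per ms of silence
    """Toy three-area map of the percept as a function of two durations."""
    # Area 1: too little silence to cue a stop consonant at all.
    if silence_ms < min_silence_ms:
        return "gray ship"
    # Trading relation: the noise duration needed to keep the "t" on the
    # first word grows as the silent gap grows.
    boundary_noise_ms = base_noise_ms + trade_slope * silence_ms
    if noise_ms >= boundary_noise_ms:
        return "great ship"   # area 2: stop grouped with "gray"
    return "gray chip"        # area 3: stop fused with "sh" into "ch"


# Data points mentioned in the text
print(classify_percept(silence_ms=0, noise_ms=122))    # original utterance
print(classify_percept(silence_ms=100, noise_ms=160))  # long noise, long gap
print(classify_percept(silence_ms=100, noise_ms=120))  # same gap, shorter noise
```

Because the boundary rises with `silence_ms`, the same noise duration can fall on either side of it depending on the gap, which is what it means for the two cues to trade against each other.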
Three questions:
One possible answer is a hierarchy of processing levels linked by bi-directional pathways:
At the lowest level, peripheral auditory neurons send signals to higher-order neurons that encode iconic sensory features. A pattern of activation across these feature detectors within a small time interval activates item representations, which are stored in working memory as a temporal succession of sounds. The working memory transforms the sequence of sounds into an evolving pattern of activation.
The activity patterns across these working memory items in turn activate list chunks (representing phonemes, syllables, or words), which are context-sensitive representations of a particular temporal sequence of items. Since sequence is encoded, these are actually list sequences. Active list chunks feed back down to the item working memories to support their neural representations while suppressing items in the working memories that are not represented by the active list (an inhibitory process).
This would explain phonemic restoration experiments where broadband noise is perceived as different phonemes depending upon context.
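The top-down support and suppression described above can be sketched in a few lines. This is a hedged illustration, not the model's actual equations: the function `apply_top_down`, the gains `support` and `suppression`, and the activity values are all assumptions made for the example.

```python
def apply_top_down(item_activity, expected_items, support=0.5, suppression=0.5):
    """Boost items the active list chunk expects; damp all the others."""
    adjusted = {}
    for item, activity in item_activity.items():
        if item in expected_items:
            adjusted[item] = activity * (1.0 + support)      # excitatory feedback
        else:
            adjusted[item] = activity * (1.0 - suppression)  # inhibitory feedback
    return adjusted


# Broadband noise is assumed to activate several item representations equally.
noise_evoked = {"s": 0.4, "t": 0.4, "p": 0.4}

# A hypothetical active list chunk whose stored sequence expects "s" here.
after_feedback = apply_top_down(noise_evoked, expected_items={"s"})

restored = max(after_feedback, key=after_feedback.get)
print(restored)  # prints: s
```

With a different active chunk, the same noise-evoked pattern would resolve to a different item, which is the context dependence seen in restoration experiments.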
When a phonemic sequence in working memory excites and then receives confirmatory top-down feedback from a list, the positive feedback loop enhances the activity in both fields through resonance. This model proposes that when listeners perceive fluent speech, a wave of resonant activity plays across the working memory, binding the phonemic items into larger language units and raising them into the listener's conscious perception.
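The enhancing effect of the feedback loop can be seen in a minimal two-unit simulation. This is a sketch under strong simplifying assumptions: one item unit and one list unit, linear leaky dynamics, and hand-picked gains (`top_down_gain`, the bottom-up weight 0.5). None of these come from the text, which does not give the model's equations.

```python
def simulate(top_down_gain, steps=2000, dt=0.01, bottom_up_input=1.0):
    """Euler-integrate a leaky item unit coupled to a leaky list unit."""
    item = 0.0
    chunk = 0.0
    for _ in range(steps):
        # Item unit: decay + sensory input + feedback from the list unit.
        d_item = -item + bottom_up_input + top_down_gain * chunk
        # List unit: decay + bottom-up drive from the item unit (weight 0.5).
        d_chunk = -chunk + 0.5 * item
        item += dt * d_item
        chunk += dt * d_chunk
    return item, chunk


# Feedforward only: the list unit is driven but sends nothing back.
item_ff, chunk_ff = simulate(top_down_gain=0.0)

# With confirmatory top-down feedback, both levels settle at higher activity.
item_res, chunk_res = simulate(top_down_gain=1.0)

print(f"no feedback:   item={item_ff:.2f}, chunk={chunk_ff:.2f}")
print(f"with feedback: item={item_res:.2f}, chunk={chunk_res:.2f}")
```

In this linear toy, the steady-state item activity is `bottom_up_input / (1 - 0.5 * top_down_gain)`, so the loop amplifies activity only while the product of the two gains stays below 1. The run with feedback also takes longer to settle, a small-scale hint of why the resonance has its own time scale.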
The key problem is that these levels operate on different time scales, which must be coordinated to form a unified speech percept. In this account, the rate of conscious speech equals the time scale of the resonance between processing levels.