Human existence

HUMAN SONG AND SPEECH

(adapted from section 6, "Respiratory, postural and spatio-kinetic motor stabilization, internal models, top-down timed motor coordination and expanded cerebello-cerebral circuitry: a review".)

Humans are unique vocalizers

The art of vocalization is widespread amongst animals. Birds, in particular, are considered to be exquisite songsters. While speech is appreciated to be biologically unique to humans, human song, albeit in a hidden way, is also biologically unique. Bird song, unlike human song, is produced with minibreaths between each syllable (with the exception of high frequency "trills" at ca. 30 per second in canaries and 16 per second in cardinals (Suthers, Goller et al. 1999)). Birds can do this because they use a different respiratory apparatus from mammals. This employs anterior and posterior air sacs to create a unidirectional airflow through their lungs, which allows for the insertion of such minibreaths in between their song notes (Suthers, Goller et al. 1999).

Such minibreaths suggest bird vocalization is built upon low-level reflexive processes that ensure adequate concurrent respiration. Humans, in contrast, whether in speech or song, produce multiple vocalizations upon prolonged single out-breaths, a phenomenon called “thoracic breathing” that, in terms of normal respiration, is distinct from the everyday nonvocal and reflex-controlled form of “quiet” respiratory breathing (Ladefoged 1960; Hixon 1973; Proctor 1986; Provine 1996; MacLarnon and Hewitt 1999; Ghazanfar and Rendall 2008). As with the human uniqueness in bipedality and dexterity, this respiratory phenomenon is argued here to link to a unique competence in accurately timed motor stabilization (in this case the stabilization of subglottal pulmonary pressure) that results from expanded cerebello-cerebral circuits and internal modeling that overrides a lower level of preflex and reflex motor control. Further, this capacity for timed control in vocalization, even more than for dexterity and bipedality, allows for the construction of novel kinds of complex motor executions; in this case, the interarticulator actions in the vocal tract that create the phonetic features that provide different phones with their distinct phonetic identities (Lofqvist and Gracco 1999).

There exist several unique related traits in human vocalization.

Hierarchical stringing of units

Humans generate vocalizations strung together at several levels of hierarchical organization. Such vocalization can be made up of speech phones (syllables, words, clauses, sentences), or song notes (beats, meter, phrases, melodies).

Diverse recombinable units

Human vocalization draws upon a large set of recombinable phone/note units (most languages contain 20 to 45 vowel, consonant, and diphthong phones; there are 12 semitones in an octave and most singers can range across several octaves). The International Phonetic Alphabet (International Phonetic Association 1999) lists for consonants 12 places and nine types of articulation that can be either voiced or unvoiced (plus five types of anterior release clicks); for vowels it lists five positions and seven manners (plus being rounded or not). In addition, it notes the existence of three kinds of suprasegmentals: stress (seven types), tone (15 types), and intonation (four types). Such features create a large pool of potential phones: for example, in one sample of 317 human languages, there were 757 different kinds of phones (Maddieson 1981).
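The size of the feature space implied by these counts can be worked out directly. The following is only a rough sketch: it takes the counts quoted above at face value, uses placeholder indices rather than real feature names, and treats every combination as a possible cell. Not every cell of such a grid is articulable or attested, so the totals are upper bounds on the feature grid and are not an estimate of the 757 phones reported by Maddieson (1981).

```python
from itertools import product

# Counts quoted in the text for the IPA chart. The index ranges below are
# placeholders for illustration, not actual feature labels.
CONSONANT_PLACES = 12
CONSONANT_TYPES = 9
VOICING = ("voiced", "unvoiced")
CLICK_TYPES = 5
VOWEL_POSITIONS = 5
VOWEL_MANNERS = 7
ROUNDING = ("rounded", "unrounded")


def consonant_grid():
    """Every (place, type, voicing) combination, articulable or not."""
    return list(product(range(CONSONANT_PLACES), range(CONSONANT_TYPES), VOICING))


def vowel_grid():
    """Every (position, manner, rounding) combination."""
    return list(product(range(VOWEL_POSITIONS), range(VOWEL_MANNERS), ROUNDING))


def grid_sizes():
    """Upper bounds on the number of feature cells, clicks added separately."""
    return {
        "consonant_cells": len(consonant_grid()) + CLICK_TYPES,
        "vowel_cells": len(vowel_grid()),
    }


if __name__ == "__main__":
    print(grid_sizes())
```

Suprasegmental stress, tone, and intonation would multiply these cell counts further, which is one way to see why a modest number of independently controllable features yields a very large pool of potential phones.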

Diverse uses and modes of production

Humans modify and use their vocalizations in diverse mastered ways as distinct as falsetto, esophageal speech (after laryngectomies), yodeling, whistle speech, throat singing, and entertainment ventriloquism. Further, some human hunters learn to imitate the vocalizations of their prey to stalk them (Willerslev 2004).

Optional subcomponents

The various components of human vocalization provided by the lung (pulmonary), glottis (vocal cords), larynx, and supralaryngeal vocal tract can be isolated, omitted, or used for other purposes. Humans can, for example, speak without glottal phonation, as in whistle speech, or without normal pulmonary pressure and phonation, as in esophageal speech or buccal-source speech (also called ‘Donald Duck’ talk). (In the latter, the vocal tract is partially blocked by the back of the tongue, teeth, and cheeks, and oral pressure created by the tongue causes the arches in the back of the mouth to vibrate (Smith 1994: p. 4221).) Following spinal injuries, the pulmonary control of thoracic breathing can shift to being based upon the diaphragm without employing the normally used abdominal and intercostal muscles (Meyer 2003). Respiration control is also used for nonvocalization activities such as playing woodwind and brass instruments. In addition to such respiratory control, saxophonists and clarinetists can modify their instrument’s sound in the altissimo register by changing their vocal tract resonance (Fritz and Wolfe 2005; Chen, Smith et al. 2008).

Unique amongst primates

Human vocal capacities are of particular biological interest because no nonhuman primate makes any comparable vocalizations. This is in spite of nonhuman primates already having many of the required competences: they can individually produce some of the phonetic units of human speech (Richman 1976), hear them (Steinschneider, Arezzo et al. 1982), and, if trained, can comprehend the pronunciation of spoken words (Savage-Rumbaugh and Lewin 1994) and intersperse vocalizations with human and conspecific interactors in a conversational manner (Savage-Rumbaugh, Fields et al. 2004). However, even with these vocal-related advantages, while nonhuman primates can be tutored to communicate with gesture and sign-board based languages, they cannot be tutored to talk (Hayes 1951). The language-tutored bonobo Kanzi, for example, can do no more in his vocal interactions than contextually modulate the spectral and temporal features of his vocalizations, a notable contrast to his considerable abilities to communicate manually with a sign board (Taglialatela, Savage-Rumbaugh et al. 2003). This is odd since gesturing and sign board pointing would seem of comparable motor complexity to speech, and nonhuman primates already use vocalization (unlike sign boards) to communicate. Indeed, evolution has enhanced nonhuman ape vocalization in a manner not found in humans, in the form of vocal sacs (Nishimura, Mikami et al. 2007; Ghazanfar and Rendall 2008). Interestingly, the shape of the hyoid bone in a partial Australopithecus afarensis skeleton suggests that pre-Homo hominins might also have possessed such vocal sacs (Alemseged, Spoor et al. 2006, p. 300). Such vocal sacs enable chimpanzees (and perhaps other apes) to produce very loud piercing calls that in the case of chimpanzees are made of two simultaneous tones that are three octaves distant from each other (Yerkes and Learned 1925, pp. 61-62).

Unlocked vocal chain

Much less is understood about motor stabilization in human vocalization than for bipedality and dexterity (it is known, though, that the vocal articulators adjust quickly after perturbation (Gracco and Lofqvist 1994)). Research upon attempts to teach higher apes to make voluntary vocalizations suggests a link to a unique human ability to control the respiratory/vocal tract musculoskeletal system. There are two such accounts (Furness 1916; Hayes 1951); both report difficulties in directly controlling the vocal apparatus. The account of Viki is the most detailed.

Viki could create some speech sounds but this depended upon her first being prompted with external help (Hayes 1951). Keith (her human speech tutor) trained Viki by positioning his fingers at her mouth to open and shut her lips to form speech syllables. This was because Viki could make an “asking sound” but without such external help she could not modify it on her own into other sounds. As his wife Catherine Hayes noted in her book upon Viki (1951: p. 67): "She soon got the idea and began to inhibit her asking sound until Keith’s fingers were on her lips. If he was too slow in getting ready, Viki often took his hand and put it in the helping positions". Much earlier, William Furness (1916) reported upon his attempts to teach an orangutan to speak. In order to get her to say “cup”, he used a spatula to push her tongue back to make the /k/ phone: “after several lessons .. she would draw back her tongue to the position even before the spatula had touched it, but she would not say ka unless I place my finger over her nose. The next advance was that she herself would place my finger over her nose and then said it without any use of the spatula” (Furness 1916, p. 284).

To take the case of Viki, she could create the pulmonary pressure and phonation needed for a particular “asking” vocalization, and she could also manipulate her lips to create a different one (as evidenced when triggered to do so by Keith’s hand). What she could not do, or found very difficult, was combine them as independent motor elements so that she could pronounce on her own a new type of nonevolved vocalization. The nearest she could do was use another part of her motor system (her hands) to get hold of Keith’s hand to reshape her mouth, and so use this indirect and external means to control her vocal articulation. A similar phenomenon seems also to have characterized the attempts of Furness’ orangutan to vocalize. This suggests that nonhuman apes have problems unlocking the separate musculoskeletal elements that make up the vocalization chain to create the motor coordination that underlies the production of human speech. That the nonhuman vocal chain should be locked in this way makes evolutionary sense in view of the critical importance of the links of respiration to cardiovascular function and locomotion (Lee and Banzett 1997), and given that the larynx is involved not only in phonation but also in several survival-critical reflexive actions such as swallowing, respiration and cough (Ludlow 2005).

Reflecting this innate locking, while breathing is under voluntary control in humans (Loucks, Poletto et al. 2007; Simonyan, Saad et al. 2007), it is difficult to train in nonhuman primates such as chimpanzees (Hayes 1951: p. 69). Humans also seem unique in related voluntary respiratory abilities such as suppressing and voluntarily activating (in the absence of sensory triggers) coughing and sniffing (Simonyan, Saad et al. 2007). Nonhuman vocalizations, moreover, are nearly always made in emotional contexts and performed in a highly stereotyped and genetically determined manner. This is evidenced in the strong correlations that exist between the vocalizations of chimpanzees and bonobos (in spite of their being two species), a correlation that does not exist, in contrast, for their manual gestures (Pollick and de Waal 2007). The human brain control needed for voluntary respiration, such as that for exhalation, and that needed for the production of sound syllables also seem to be closely related in that they involve similar cerebello-cerebral circuit activations (except for the auditory cortices) (Loucks, Poletto et al. 2007).

Subglottal pressure stabilization

To control pulmonary pressure requires that thoracic muscles can stabilize lung exhalation as a separate motor control element, in a time-sensitive manner, from the later elements in the vocal chain involved in phonation (voicing), vocal resonance change (vowels), and its gestural modification (consonants). There is here a direct parallel with the anticipatory adjustment used in human bipedality and dexterity, but in regard to the stabilization of the motor parameter of pulmonary pressure below the glottis (vocal cords). For functional speech, this pressure needs to be maintained at a constant level (for a given degree of loudness) throughout successive strings of vocalizations, even though these produce a considerable decrease in lung volume (Ladefoged 1960; Hixon 1973; Proctor 1986). For this pulmonary pressure stability to exist, the muscles controlling it must be anticipatorily adjusted for each upcoming vocalization and its particular subglottal pressure needs (which might vary, for example, with its individual phones, vocalization loudness, prosodic stress, and emotional emphasis). The thoracic muscles also need action planning, in regard to forthcoming pauses in speech and song, as to when to refill the lungs (Whalen and Kinsella-Shaw 1997).
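The stabilization problem described here can be illustrated with a toy calculation. The sketch below is not a physiological model: it assumes, purely for illustration, that the passive recoil pressure of the lungs falls linearly as lung volume falls, and every number in it (the slope, the offset, the target pressure, the volume range) is hypothetical. Its only point is that holding subglottal pressure constant while volume declines requires a muscular contribution that changes continuously through the utterance, which is the kind of adjustment that would have to be planned ahead.

```python
def recoil_pressure(volume_fraction, slope=40.0, offset=-14.0):
    """Hypothetical passive recoil pressure (cmH2O) at a given lung volume.

    volume_fraction is lung volume as a fraction of vital capacity.
    The linear form and both constants are illustrative assumptions.
    """
    return slope * volume_fraction + offset


def simulate_utterance(start=0.6, end=0.35, steps=6, target=8.0):
    """Tabulate the muscle pressure needed to hold a constant target.

    Returns rows of (volume, recoil, muscle, total). A negative muscle
    value stands for braking against recoil, a positive one for active
    expiratory effort. All defaults are hypothetical values.
    """
    rows = []
    for i in range(steps):
        volume = start + (end - start) * i / (steps - 1)
        recoil = recoil_pressure(volume)
        muscle = target - recoil  # what the muscles must add or remove
        rows.append((
            round(volume, 3),
            round(recoil, 2),
            round(muscle, 2),
            round(recoil + muscle, 2),
        ))
    return rows


if __name__ == "__main__":
    for row in simulate_utterance():
        print(row)
```

With these assumed values the required muscular contribution starts negative and ends positive while the total stays at the target throughout, so no single fixed muscle setting could keep the pressure constant across the out-breath.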

Time-scheduling and phone articulation construction

Humans not only engage in thoracic breathing but also, when articulating phones, engage in exquisite “dexterity” of the vocal tract. The reason for this, I suggest, is that in nonhuman animals, pulmonary pressure and the vocal tract are restricted by reflexes to articulating a limited set of evolved vocalizations. But because vocal tract actions are “unlocked” from such reflexes in humans by direct cortical control (Kuyper 1958; Liscic, Zidar et al. 1998; Ludlow 2005; Ghazanfar and Rendall 2008; Teitti, Maatta et al. 2008), they can be synchronized and coordinated in complex sequences of diverse and differently timed glottal, laryngeal and supralaryngeal movements. It is this ability to combine, as independent motor elements, glottal phonation, laryngeal/supralaryngeal gesture and vocal tract modifications (Lofqvist and Gracco 1999) with timed anticipatory motor adjustment that could be responsible for enabling the human motor system to create, and then string together, its rich diversity of speech phones into spoken words.

If glottal phonation, for example, can be adjusted independently of and anticipatorily to the rest of the vocal chain, it can be time-scheduled to create speech sounds that differ in the timing between their glottal onset and their acoustic shaping by vocal tract gestures (voiced/unvoiced contrast; glottal phones). Likewise, if the laryngeal shape is not reflexively locked to articulators higher up the vocal chain, then its resonance “vowel” quality can be changed independently of them, so that vowel vocalizations can be conjoined in a time-exact manner with a great variety of gestures in different vocal tract locations (bilabial, labio-dental, dental, alveolar, post-alveolar, retroflex, palatal, velar, uvular, pharyngeal, epiglottal, and glottal) and manners (nasal, plosive, fricative, approximant, trill, tap/flap, and their lateral variants). As a result, vowels can be provided with diverse kinds of associated consonantal sounds. For example, using data from the International Phonetic Alphabet (International Phonetic Association 1999), the movement of the lips (bilabiality) can create six consonants depending upon their timing with the onset of phonation in the glottis (voiced vs. unvoiced), the presence or not of nasality (/m/) (created by soft palate opening), and how that lip movement is carried out (plosive, /p/, /b/; fricative, /ɸ/, /β/; or trill, /ʙ/). The lips with such top-down control can create further pronunciations such as anterior release “click” consonants that do not even use pulmonary air pressure. This motor ability to independently stabilize different vocal components explains the diversity, noted above, with which the human vocal apparatus can be used.
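The bilabial example can be written out as a small lookup, to make explicit how a single lip gesture yields different consonants once voicing and manner are free to vary independently. This is only an illustrative sketch: the table name and the helper function are hypothetical, the symbols are the standard IPA bilabial consonants, and nasality is folded into the manner label for simplicity.

```python
# The six bilabial consonants, keyed by (manner, voicing).
# BILABIAL_CONSONANTS and bilabial_phone are names invented for this sketch.
BILABIAL_CONSONANTS = {
    ("plosive", "unvoiced"): "p",
    ("plosive", "voiced"): "b",
    ("nasal", "voiced"): "m",
    ("fricative", "unvoiced"): "ɸ",
    ("fricative", "voiced"): "β",
    ("trill", "voiced"): "ʙ",
}


def bilabial_phone(manner, voicing):
    """Return the IPA symbol for a bilabial gesture, or None if not listed."""
    return BILABIAL_CONSONANTS.get((manner, voicing))


if __name__ == "__main__":
    for (manner, voicing), symbol in BILABIAL_CONSONANTS.items():
        print(f"{voicing} bilabial {manner}: /{symbol}/")
```

The same place of articulation appears in every key; only the independently controlled components differ, which is the sense in which the elements of the vocal chain are said here to be unlocked from one another.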

In this context, it is interesting to note that internal models in the cerebellum upon the auditory signal of phone production have been suggested to underlie phone perception (Callan, Tsytsarev et al. 2006), vocal tract articulation (right side) (Callan, Kawato et al. 2007) and speech prosody (left side) (Callan, Kawato et al. 2007). There is evidence that phone perception involves processes used in its production (Liberman, Cooper et al. 1967; Pulvermuller, Huss et al. 2006). This research suggests that there may be a considerable opportunity to explain phenomena already identified in phonetic and speech sciences with the internal model processes that became more complex when the human brain expanded.

Possible link to syntax

As with knapping, the nature of internal models allows musculoskeletal-level predictive internal models to engage in complex hierarchical interaction with higher-level ones. As noted above, it is a peculiarity of human vocalization that it is made in the context of several layers of hierarchical organization that concern not only productive levels (such as speech syllable, word, phrase, and sentence) but also those involved in communication, such as semantics, syntax, pragmatics and emotions. There is even evidence that the speech production system not only aids the perception of speech (Liberman, Cooper et al. 1967; Pulvermuller, Huss et al. 2006) but also provides prediction and imitation abilities that aid higher level language comprehension (Pickering and Garrod 2007).

Of particular importance in this context is that strings of phones are made into units that are organized and arranged in planned syntactic ways. This syntax level directly interacts down upon the lower musculoskeletal ones, a phenomenon that can be seen in the way that syntactic tense can modify vowel vocalization, as in "swim", "swam", "swum". This suggests that the syntax and musculoskeletal levels are in some way closely interlinked. While any ideas in this area are necessarily preliminary, this raises the possibility that the internal models needed for low-level musculoskeletal control of the vocal tract could have created the opportunity for higher-level models to be constructed upon them in motor control, so that the speech units that they create can be structured to support communication and semantics. It is interesting to note that Broca’s area, a brain region in the premotor cortex traditionally associated with syntax and, more recently, with syntactic working memory (Fiebach, Schlesewsky et al. 2005), has also recently been identified as underlying the anticipatory hierarchization of actions (Fiebach and Schubotz 2006). This is consistent with lower-level motor models in vocalization providing the basis for the development of higher-level ones that, in organizing the lower ones, have come to possess what are analyzed as syntactic functions.

Summary of vocalization and internal models

These brief observations show that human vocalization and voluntary respiration control could gain their evolutionary novelty, like human dexterity and bipedality, from top-down internal model timed motor stabilization. As with those capacities, this is consistent with a link to cerebello-cerebral cortex circuits (Murphy, Corfield et al. 1997; Dresel, Castrop et al. 2005; Schulz, Varga et al. 2005; Callan, Tsytsarev et al. 2006; Callan, Kawato et al. 2007; Loucks, Poletto et al. 2007; Spencer and Slocomb 2007). Further, as with dexterity and bipedality, the kinematics of speech production continue to be refined into adolescence and after (Smith and Zelaznik 2004).


Alemseged, Z., F. Spoor, et al. (2006). "A juvenile early hominin skeleton from Dikika, Ethiopia." Nature 443: 296-301.

Callan, D. E., M. Kawato, et al. (2007). "Speech and song: The role of the cerebellum." Cerebellum: 1-7.

Callan, D. E., V. Tsytsarev, et al. (2006). "Song and speech: brain regions involved with perception and covert production." Neuroimage 31(3): 1327-42.

Chen, J. M., J. Smith, et al. (2008). "Experienced saxophonists learn to tune their vocal tracts." Science 319(5864): 776.

Dresel, C., F. Castrop, et al. (2005). "The functional neuroanatomy of coordinated orofacial movements: sparse sampling fMRI of whistling." Neuroimage 28(3): 588-97.

Fiebach, C. J., M. Schlesewsky, et al. (2005). "Revisiting the role of Broca's area in sentence processing: syntactic integration versus syntactic working memory." Hum Brain Mapp 24(2): 79-91.

Fiebach, C. J. and R. I. Schubotz (2006). "Dynamic anticipatory processing of hierarchical sequential events: A common role for Broca's area and ventral premotor cortex across domains?" Cortex 42: 499-502.

Fritz, C. and J. Wolfe (2005). "How do clarinet players adjust the resonances of their vocal tracts for different playing effects?" J Acoust Soc Am 118(5): 3306-15.

Furness, W. H. (1916). "Observations on the mentality of chimpanzees and orang-utans." Proceedings of the American Philosophical Society 55: 281-290.

Ghazanfar, A. A. and D. Rendall (2008). "Evolution of human vocal production." Current Biology 18: R457-R460.

Gracco, V. L. and A. Lofqvist (1994). "Speech motor coordination and control: evidence from lip, jaw, and laryngeal movements." Journal of Neuroscience 14: 6585-6597.

Hayes, C. (1951). The ape in our house. New York, Harper.

Hixon, T. J. (1973). "Kinematics of the chest wall during speech production: volume displacements of the rib cage, abdomen, and lung." J Speech Hear Res 16(1): 78-115.

International Phonetic Association (1999). Handbook of the International Phonetic Association. Cambridge, Cambridge University Press.

Kuyper, H. G. (1958). "Corticobulbar connexions to the pons and lower brain-stem in man." Brain 81: 364-388.

Ladefoged, P. (1960). "The regulation of sub-glottal pressure." Folia Phoniatrica 12: 169-175.

Lee, H.-t. and R. B. Banzett (1997). "Mechanical links between locomotion and breathing." News in Physiological Science 12: 273-.

Liberman, A. M., F. S. Cooper, et al. (1967). "Perception of the speech code." Psychological Review 74: 431-461.

Liscic, R. M., J. Zidar, et al. (1998). "Evidence of direct connection of corticobulbar fibers to orofacial muscles in man." Muscle and Nerve 21: 561-566.

Lofqvist, A. and V. L. Gracco (1999). "Interarticulator programming in VCV sequences: lip and tongue movements." J Acoust Soc Am 105(3): 1864-76.

Loucks, T. M., C. J. Poletto, et al. (2007). "Human brain activation during phonation and exhalation: Common volitional control for two upper airway functions." Neuroimage 15(131-143).

Ludlow, C. L. (2005). "Central nervous system control of the laryngeal muscles in humans." Respiratory Physiology & Neurobiology 147: 205-222.

MacLarnon, A. M. and G. P. Hewitt (1999). "The evolution of human speech: the role of enhanced breathing control." Am J Phys Anthropol 109(3): 341-63.

Maddieson, I. (1981). "UCLA phonological segment inventory database." UCLA Working Papers in Phonetics 53: 1-243.

Meyer, M. (2003). "Vertebrae and Language Ability in Early Hominids." PaleoAnthropology 1: 20-21.

Murphy, K., D. R. Corfield, et al. (1997). "Cerebral areas associated with motor control of speech in humans." Journal of Applied Physiology 85: 1438-1447.

Nishimura, T., A. Mikami, et al. (2007). "Development of the Laryngeal Air Sac in Chimpanzees." International Journal of Primatology 28: 483-492.

Pickering, M. J. and S. Garrod (2007). "Do people use language production to make predictions during comprehension?" Trends in Cognitive Science 11: 105-110.

Pollick, A. S. and F. B. M. de Waal (2007). "Ape gestures and language evolution." Proceedings of the National Academy of Sciences of the United States of America 104: 8164-6168.

Proctor, D. F. (1986). Modifications of breathing for phonation. Handbook of physiology, The respiratory system. A. P. Fishman. Bethesda. III: 587-647.

Provine, R. R. (1996). "Laughter." American Scientist 84: 38-45.

Pulvermuller, F., M. Huss, et al. (2006). "Motor cortex maps articulatory features of speech sounds." Proc Natl Acad Sci U S A 103(20): 7865-70.

Richman, B. (1976). "Some vocal distinctive features used by gelada monkeys." Journal of the Acoustical Society of America 60: 718-724.

Savage-Rumbaugh, S., W. M. Fields, et al. (2004). "The emergence of knapping and vocal expression embedded in a Pan/Homo culture." Biology and Philosophy 19: 541-575.

Savage-Rumbaugh, S. and R. Lewin (1994). Kanzi. London, Doubleday.

Schulz, G. M., M. Varga, et al. (2005). "Functional neuroanatomy of human vocalization: an H215O PET study." Cereb Cortex 15(12): 1835-47.

Simonyan, K., Z. S. Saad, et al. (2007). "Functional neuroanatomy of human voluntary cough and sniff production." Neuroimage 37: 401-409.

Smith, A. and H. N. Zelaznik (2004). "Development of functional synergies for speech motor coordination in childhood and adolescence." Developmental Psychobiology 45: 22-33.

Smith, B. L. (1994). Speech production, Atypical aspects. The encyclopedia of language and linguistics. R. E. Asher. Oxford, Pergamon Press: 4221-4231.

Spencer, K. A. and D. L. Slocomb (2007). "The neural basis of ataxic dysarthria." Cerebellum 6: 58-65.

Steinschneider, M., J. Arezzo, et al. (1982). "Speech evoked activity in the auditory radiations and cortex of the awake monkey." Brain Res 252(2): 353-65.

Suthers, R. A., F. Goller, et al. (1999). "The neuromuscular control of birdsong." Philosophical Transactions of the Royal Society of London. Series B: Biological Sciences 354: 927-939.

Taglialatela, J. P., S. Savage-Rumbaugh, et al. (2003). "Vocal production by a language-competent Pan paniscus." International Journal of Primatology 24: 1-47.

Teitti, S., S. Maatta, et al. (2008). "Non-primary motor areas in the human frontal lobe are connected directly to hand muscles." Neuroimage 40(3): 1243-50.

Whalen, D. H. and J. M. Kinsella-Shaw (1997). "Exploring the relationship of inspiration duration to utterance duration." Phonetica 54: 138-152.

Willerslev, R. (2004). "Not animal, not not-animal: Hunting, imitation and empathetic knowledge among the Siberian Yukaghirs." Journal of the Royal Anthropological Institute 10: 629-652.

Yerkes, R. M. and B. W. Learned (1925). Chimpanzee intelligence and its vocal expressions. Baltimore, MD, Williams & Wilkins.