This is a summary and synthesis of existing literature on the neural mechanisms underlying music perception and production. Originally written for a course on cognitive psychology, this paper is part of a broader effort to ground my materialist musicology in neuroscientific and evolutionary psychological research.
All known human cultures have produced music and dance (McDermott & Hauser, 2005), a combined assemblage that includes the perception and interpretation of hierarchically organized temporal groupings and pitch relations, as well as the embodied perception of and reaction to an imagined isochronous pulse. In this essay, I defend the stance that the capacity for music production in humans can be sufficiently explained as an exaptation of naturally selected biological mechanisms. I suggest that the experience of music can be understood in terms of a number of distinct cognitive processes: (1) the learned association of sounds with meaningful concepts, (2) the hierarchical organization of discrete sonic units in temporal relation, (3) the hierarchical organization of discrete sonic units in frequency relation, and (4) the interpolation of a regular rhythmic pulse that is felt in the body. The first of these is a natural part of auditory processing, common to all species and not specific to music. The remaining three, I argue, are byproducts of the faculty of language in the broad sense (FLB) in combination with auditory scene analysis (ASA). Specifically, process 2 relates to the rhythms of spoken language, process 3 relates to the pitch contours of spoken language alongside overtone-matching mechanisms used in ASA, and process 4 emerges from the connection between the regions of the brain used for auditory learning and those used in motor movement and interval timing, a connection which is necessary for speech production and which simultaneously gives rise to dance. After outlining the processes and mechanisms that underlie musical experience, I briefly discuss three modes of musical communication: semantic interpretation, effective interpretation, and metric interpretation. In the final section, I offer some critiques of and alternatives to this model.
Evolutionary Origins of Music
Darwin was the first to speculate about music’s evolutionary origins, hypothesizing that music, despite offering no survival benefits, serves as a seduction strategy (1871). Much like bird plumage, music can be explained by the handicap principle, which suggests that a trait may be attractive to mates precisely for its survival detriments (e.g. consuming energy, attracting predators) because it signals an excess of energy or fitness (Zahavi, 1975). The sexual selection hypothesis of music has received some empirical validation (Marin & Rathgeber, 2022), but fails to explain the conspicuous absence of fundamental systems for musical processing in our primate relatives, especially compared to experimental results (outlined below) demonstrating limited music processing in more distantly related animals.
More widely reproduced and accepted are the various theories that hold music to have emerged to strengthen social bonds between humans. Freeman (2001) argues that music originated as a collective experience that functions to transcend epistemological solipsism, i.e. the natural tendency for minds to develop along divergent trajectories as they accumulate individualized experiences. The notion that music is closely related to human sociability is supported by studies of developmental disorders like Williams syndrome, where subjects display severe cognitive impairment in most domains but remain unaffected in their social and musical abilities, as well as Autism Spectrum Disorder, which selectively impairs both social and musical abilities (see Karp, 2012). Social bonding theories of musical evolution face some challenge from empirical research on the functions of music, which shows that people listen to music primarily to regulate arousal and mood or to achieve self-awareness, and only secondarily for social purposes (Schäfer et al., 2013). This evidence is not damning, however, as it is plausible for an evolutionarily selected social function to have been overpowered in contemporary cultures by the affective and intellectual side effects.
Other accounts suggest that music is an evolutionary accident that emerged from other, more necessary functions, what Gould & Lewontin describe as an evolutionary spandrel (1979). This is the basis of Pinker’s (1997) famous description of music as “auditory cheesecake, an exquisite confection crafted to tickle the sensitive spots of at least six of our mental faculties” (534). That is, music hijacks existing reward systems but provides no reproductive value. Reviewing research on the innateness and domain specificity of various elements of the capacity for music production, Justus & Hutsler (2005) indeed found no compelling evidence that any aspect of the faculty for music is domain specific, meaning “it is impossible to argue categorically that the domain in question underwent natural selection” (21). Further, the vocal learning and rhythmic synchronization hypothesis (VLRSH) suggests that even the perception and spontaneous bodily synchronization to a regular beat – which plays little to no part in speech production or auditory scene analysis – can be attributed to the interaction between the auditory and motor regions of the brain (Patel, 2006; Patel et al., 2009).
A mechanism is an explanatory device, a theoretical tool used to understand a process at a particular scale of analysis. As Machamer, Darden, & Craver (2000) define the term, “[m]echanisms are entities and activities organized such that they are productive of regular changes from start or set-up to finish or termination conditions.” A mechanism consists of constitutive entities and their activities, functional roles in a larger mechanism, and conditions for the beginning and end of a specific process. Each of the phenomena involved in music cognition – the association of sounds with meaningful concepts, the hierarchical organization of discrete sonic units in temporal and frequency relation, and the embodied perception of a regular rhythmic pulse – can be understood as the product of one or more mechanisms. I argue that each of these phenomena can be explained in reference to the mechanisms underlying other, naturally selected traits.
The association of sounds with meanings is a natural part of auditory processing, not specific to music. Certain associations and inferences found in music are derived from simple analogies (e.g. fast tempos evoke energy) or ordinary auditory phenomena and learned associations (e.g. the rhythm of a heartbeat, or the cultural association between the sound of an organ and religious worship). Hierarchical syntax (i.e. recursion) is the only aspect of music production and processing thought to be entirely unique to humans. Hauser, Chomsky, & Fitch (2002) claim that, while the faculty of language in the broad sense (FLB) includes sensory-motor and conceptual-intentional traits common to nonhuman animals, there is a human-specific faculty of language in the narrow sense (FLN) constituted by a computational system capable of recursion. The temporal organization of notes into phrases is related to the FLN, and parallels the rhythmic groupings of sounds in spoken language. This is evidenced in part by empirical research on the relationship between a culture’s musical and linguistic rhythms (Patel & Daniele, 2003). Likewise, the general melodic contour within a phrase can be explained by the pitch relations of affective speech (Justus & Hutsler, 2005). While language and music deal largely with different representational systems and thus have different types of semantic meaning, syntactic interpretation of music and language has also been shown to draw on the same pool of cognitive resources, supporting the shared syntactic integration resource hypothesis (SSIRH) (Slevc, Rosenberg, & Patel, 2009). Together, these findings suggest that the linguistic property of recursion provides the necessary substrate for the hierarchical organization component of processes 2 and 3.
While the FLN can explain the hierarchical organization of rhythmic groupings and pitch relations, it does not explain the culturally universal associations of simple harmonic ratios as consonant and more complex ratios as more dissonant. Furthermore, language, unlike music, is not interpreted in relation to a steady pulse (Patel, 2003). The perception of simple frequency ratios as consonant and complex frequency ratios as dissonant can be attributed to the mechanisms underlying ASA, which group sounds together based on the overlap in their harmonic makeup. With regard to the limitations placed by every culture in the form of scales of discrete notes, Miller suggests that the fundamental constraints of short-term memory explain why humans can only manage the relationship between “7, plus or minus 2” pitches at a time (1956). Tension and release are highly contextual (both musicologically and culturally), but comprise the basic lexicon of musical grammar (Lerdahl, 2001).
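The idea that simple frequency ratios share more of their harmonic makeup than complex ones can be illustrated with a toy calculation. The sketch below is only an illustration under stated assumptions: the function name `harmonic_overlap`, the cutoff of 12 harmonics, and the use of a raw count of coinciding harmonics as the measure are all hypothetical choices made here, not a model taken from the literature cited above.

```python
from fractions import Fraction


def harmonic_overlap(ratio: Fraction, n_harmonics: int = 12) -> int:
    """Count the frequencies shared by the first n harmonics of two tones.

    The lower tone has fundamental 1; the upper tone has fundamental `ratio`.
    Both the harmonic cutoff and the counting measure are illustrative
    assumptions, not an established psychoacoustic model.
    """
    lower = {Fraction(k) for k in range(1, n_harmonics + 1)}
    upper = {ratio * k for k in range(1, n_harmonics + 1)}
    return len(lower & upper)


if __name__ == "__main__":
    intervals = {
        "octave (2:1)": Fraction(2, 1),
        "perfect fifth (3:2)": Fraction(3, 2),
        "perfect fourth (4:3)": Fraction(4, 3),
        "tritone (45:32)": Fraction(45, 32),
    }
    for name, ratio in intervals.items():
        print(f"{name}: {harmonic_overlap(ratio)} shared harmonics")
```

Under these assumptions the octave, fifth, and fourth share progressively fewer harmonics, while the 45:32 tritone shares none within the first twelve, which mirrors the ordering from consonant to dissonant described above.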
The vocal learning and rhythmic synchronization hypothesis (VLRSH) suggests that the tendency to interpret and physically move to a steady rhythmic pulse (beat perception and synchronization, or BPS) emerges from a robust connection between the circuits for auditory learning and motor movement (Patel, 2006). This connection is necessary in species that demonstrate vocal learning, such as humans as well as several bird species, including songbirds and parrots. Empirical evidence shows that BPS is absent from humans’ primate relatives, but common to species that evolved vocal learning in convergence, indicating that BPS is a byproduct of the connections between the auditory system and motor systems related to interval timing (Patel et al., 2009). Experiments demonstrating BPS in sea lions – which are not known to be vocal learners – offer a challenge to the VLRSH (Rouse, Cook, Large, & Reichmuth, 2016), but the lack of data on the full extent of vocal learning in sea lions and the evidence for vocal learning in other pinnipeds render this an ongoing debate (Patel, 2021).
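To make the notion of synchronization to a beat more concrete, one generic way to quantify it is with circular statistics: each movement is assigned a phase within the beat cycle, and the length of the mean phase vector indicates how tightly movements cluster around a consistent point in the cycle. The sketch below is a minimal illustration, not the analysis used in the studies cited above; the function name `vector_strength` and the example event times are hypothetical.

```python
import cmath
import math


def vector_strength(event_times, beat_period):
    """Mean resultant length of event phases relative to an isochronous beat.

    Returns a value between 0 and 1: values near 1 mean the events fall at a
    consistent phase of the beat, values near 0 mean they are spread evenly
    across the cycle. Times and period must share the same unit (e.g. seconds).
    """
    if not event_times:
        raise ValueError("event_times must not be empty")
    phases = [2 * math.pi * ((t % beat_period) / beat_period) for t in event_times]
    resultant = sum(cmath.exp(1j * p) for p in phases) / len(phases)
    return abs(resultant)


if __name__ == "__main__":
    period = 0.5  # hypothetical beat period in seconds (120 beats per minute)
    on_beat = [0.0, 0.5, 1.0, 1.5]         # events aligned with every beat
    scattered = [0.0, 0.125, 0.25, 0.375]  # events spread across the cycle
    print(vector_strength(on_beat, period))
    print(vector_strength(scattered, period))
```

A measure of this kind says nothing on its own about whether an animal is tracking the beat or merely moving periodically, so it would need to be paired with tempo changes or a chance baseline before supporting any claim about BPS.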
Music and Communication
Based on the research outlined above, I propose three types of parallel musical information processing: semantic interpretation, effective interpretation, and metric interpretation. In semantic interpretation, inferences about a state of being are drawn from a set of musical sounds. This includes the learned association of sounds with meanings, encompassing a variety of sonic metaphors (up/down, fast/slow), representations of real concepts (instruments, binaural positioning), and culturally or evolutionarily trained evocations (threat responses triggered by low frequency information [Leventhall, Pelmear, & Benton, 2003]). In the case of music with accompanying lyrics, the sung lyrics carry semantic information in the form of language as well as the character of the voice, though some elements of vocal timbre are effective, rather than semantic.
Effective interpretation does not relate to semantic meaning, but provokes an affective response. This type of interpretation is analogous (and, based on the research outlined above, biologically homologous) to birdsong, as well as other displays of seduction and threat projection found among nonhuman animals. To clarify the distinction between effective and semantic communication, I invoke Adams & Beighley’s (2013) distinction between meaningful human language and animal communication: “genuine meaning (linguistic meaning) has semantic requirements not met by animal signalling. Meaningful signals must be stimulus-free, must establish meaning and then must permit false tokening” (416). That is to say, with semantic communication it is possible to abstract from the present situation and make false assertions about the state of affairs (in music, consider the semantic operation of tone painting – when a song’s music metaphorically corresponds to the accompanying lyric – as well as its ironic disjunctive opposite). This is not possible in effective communication, because the information is intentional rather than meaningful; a minor key may evoke sadness, but it makes no falsifiable assertion as to who or what is sad. Meanwhile, the instruments heard in a piece of music or the spatial positioning of objects in a stereo mix cognitively correspond to actual states of affairs – the presence/absence of instruments and their position relative to the listener, respectively.
The third type of musical information processing is metric interpretation, which is equivalent to BPS. Per the SSIRH, the same mechanisms for syntactic computation are used in both effective and semantic interpretation. Because BPS lacks syntactic organization, this model predicts that metric interpretation will not draw on the same cognitive resources. While effective vocal communication is common among nonhuman animals, and while semantic vocal communication is thought to be unique to humans, the research previously outlined suggests that BPS is unique to species capable of vocal learning.
Problems and Critiques
While the evolution of music can be adequately explained in terms of the evolution of vocal learning, it remains unclear which of these two faculties came first. Brown (2017) hypothesizes a shared evolutionary ancestor, a "musilanguage" which contained characteristics shared by both music and speech: lexical tone, combinatorial formation of small phrases, and expressive phrasing principles. Lexical tone describes the importance of pitch in the construction of semantic meaning. Its connection to spoken language can be seen overtly in tonal languages like Mandarin, but nontonal languages also involve simple pitch information in semantic interpretation. Combinatorial formation has already been discussed, and involves the capacity for recursion that characterizes FLN. Expressive phrasing describes the semantic significance of emphasis and timing (e.g. the sentence “I only gave her flowers” changes meaning when different words are given emphasis). Mithen (2006) elaborates that music and language evolved to specialize in different areas of communication: language to express representational meaning and music to express emotional meaning. However, the musilanguage hypothesis does not seem to account for BPS, nor the connection between music and dance that is missing from spoken language. Another alternative perspective is offered by Dunbar (2004), who suggests that music preceded speech evolutionarily. This hypothesis holds that, as human social groups became larger and larger, music and dance succeeded grooming as the predominant mode of social bonding. The mechanisms outlined above, then, originated in the social use of music, and later served to aid the development of representational language.
While these alternatives offer compelling narratives, there remains no conclusive evidence as to whether or not musical production originated as an evolutionary adaptation for seduction or survival. Given the central role that music has played in human culture since its earliest records, the principles of natural selection have almost certainly shaped the development of music by selecting for those genes best able to express their culture’s music within accepted ritual frameworks. However, because all the faculties required for music can be explained in reference to other faculties with more obvious evolutionary benefits, the simplest and most effective explanation for the origins of music and dance is the co-option of conceptual and sensory-motor mechanisms first developed for language production, interval timing, and auditory scene analysis.
References
Adams, F., & Beighley, S. M. (2013). Information, meaning and animal communication. In U. Stegmann (Ed.), Animal communication theory: Information and influence (pp. 399–420). Cambridge University Press.
Brown, S. (2017). A joint prosodic origin of language and music. Frontiers in Psychology, 8. https://doi.org/10.3389/fpsyg.2017.01894
Darwin, C. (1871). The descent of man. Murray.
Dunbar, R. I. M. (2004). Language, music, and laughter in evolutionary perspective. In D. K. Oller & U. Griebel (Eds.), Evolution of communication systems: A comparative approach (pp. 257–274). MIT Press.
Freeman, W. (2001). A neurobiological role of music in social bonding. In N. L. Wallin & B. Merker (Eds.), The origins of music (pp. 411–424). MIT Press.
Gould, S. J., & Lewontin, R. C. (1979). The spandrels of San Marco and the Panglossian paradigm: A critique of the adaptationist programme. Proceedings of the Royal Society of London. Series B. Biological Sciences, 205(1161), 581–598. https://doi.org/10.1098/rspb.1979.0086
Hauser, M. D., Chomsky, N., & Fitch, W. T. (2002). The faculty of language: What is it, who has it, and how did it evolve? Science, 298(5598), 1569–1579. https://doi.org/10.1126/science.298.5598.1569
Justus, T., & Hutsler, J. J. (2005). Fundamental issues in the evolutionary psychology of music: Assessing innateness and domain specificity. Music Perception, 23(1), 1–27. https://doi.org/10.1525/mp.2005.23.1.1
Karp, M. (2012). Dubstep, Darwin, and the Prehistoric Invention of Music (thesis).
Lerdahl, F. (2001). Tonal pitch space. Oxford University Press.
Leventhall, G., Pelmear, P., & Benton, S. (2003). A review of published research on low frequency noise and its effects (Report). London: Department for Environment, Food and Rural Affairs.
Machamer, P., Darden, L., & Craver, C. F. (2000). Thinking about mechanisms. Philosophy of Science, 67(1), 1–25. https://doi.org/10.1086/392759
Marin, M. M., & Rathgeber, I. (2022). Darwin’s sexual selection hypothesis revisited: Musicality increases sexual attraction in both sexes. Frontiers in Psychology, 13. https://doi.org/10.3389/fpsyg.2022.971988
McDermott, J., & Hauser, M. (2005). The origins of music: Innateness, uniqueness, and evolution. Music Perception, 23(1), 29–59.
Miller, G. A. (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63(2), 81–97. https://doi.org/10.1037/h0043158
Mithen, S. (2007). The singing neanderthals: The origins of music, language, mind and body. Harvard University Press.
Patel, A. D. (2003). Rhythm in language and music: Parallels and differences. Annals of the New York Academy of Sciences, 999, 140–143. https://doi.org/10.1196/annals.1284.015
Patel, A. D. (2006). Musical rhythm, linguistic rhythm, and human evolution. Music Perception, 24(1), 99–104. https://doi.org/10.1525/mp.2006.24.1.99
Patel, A. D. (2021). Vocal learning as a preadaptation for the evolution of human beat perception and synchronization. Philosophical Transactions of the Royal Society B: Biological Sciences, 376(1835). https://doi.org/10.1098/rstb.2020.0326
Patel, A. D., & Daniele, J. R. (2003). An empirical comparison of rhythm in language and music. Cognition, 87(1), B35–B45. https://doi.org/10.1016/s0010-0277(02)00187-7
Patel, A. D., Iversen, J. R., Bregman, M. R., & Schulz, I. (2009). Experimental evidence for synchronization to a musical beat in a nonhuman animal. Current Biology, 19(10), 827–830. https://doi.org/10.1016/j.cub.2009.05.023
Pinker, S. (1997). How the mind works. Norton.
Rouse, A. A., Cook, P. F., Large, E. W., & Reichmuth, C. (2016). Beat keeping in a sea lion as coupled oscillation: Implications for comparative understanding of human rhythm. Frontiers in Neuroscience, 10. https://doi.org/10.3389/fnins.2016.00257
Schäfer, T., Sedlmeier, P., Städtler, C., & Huron, D. (2013). The psychological functions of music listening. Frontiers in Psychology, 4. https://doi.org/10.3389/fpsyg.2013.00511
Slevc, L. R., Rosenberg, J. C., & Patel, A. D. (2009). Making psycholinguistics musical: Self-paced reading time evidence for shared processing of linguistic and musical syntax. Psychonomic Bulletin & Review, 16(2), 374–381. https://doi.org/10.3758/16.2.374
Zahavi, A. (1975). Mate selection—a selection for a handicap. Journal of Theoretical Biology, 53(1), 205–214. https://doi.org/10.1016/0022-5193(75)90111-3