Theories of Audiovisual Perception
Many of the phenomena of audiovisual perception demonstrate how our perceptual system actively constructs our experience of reality by integrating information across different sensory modalities, rather than passively receiving information from each sense independently.
The effects and phenomena described in the following section have implications for user interface design, audiovisual art forms, virtual and augmented reality systems, film and media production, communication systems, and clinical applications.
Multisensory integration
Audio-Visual Speech Enhancement
Audio-Visual Speech Enhancement (AVSE) represents a phenomenon in multi-modal speech perception where visual information from a speaker's face, particularly their articulatory movements, significantly improves speech comprehension and recognition, especially in adverse acoustic conditions. This effect demonstrates the integrative nature of speech perception and the brain's ability to combine complementary information from multiple sensory modalities.
Visual speech cues typically precede the corresponding acoustic signal, providing predictive information about upcoming acoustic events. The brain is thought to build a mental "forward model" from this visual lead that enhances speech processing.
Articles
Sumby, W. H., & Pollack, I. (1954). "Visual contribution to speech intelligibility in noise." Journal of the Acoustical Society of America, 26(2), 212-215. First quantified the visual contribution as equivalent to a 15-20 dB improvement in signal-to-noise ratio.
Van Wassenhove, V., Grant, K. W., & Poeppel, D. (2007). "Temporal window of integration in auditory-visual speech perception." Neuropsychologia, 45(3), 598-607. Critical work on temporal aspects of integration.
McGurk Effect
Empirical studies of the McGurk effect demonstrate, among other things, how strongly the sensory modalities influence one another in speech perception.
"It stems from an observation that, on being shown a film of a young woman's talking head, in which repeated utterances of the syllable [ba] had been dubbed on to lip movements for [ga], normal adults reported hearing [da]. With the reverse dubbing process, a majority reported hearing [bagba] or [gaba]. When these subjects listened to the soundtrack from the film, without visual input, or when they watched untreated film, they reported the syllables accurately as repetitions of [ba] or [ga]." (McGurk & MacDonald, 1976, "Hearing lips and seeing voices," Nature, 264, 746-748)
Ventriloquism Effect
Despite the spatial separation of sound source (loudspeaker) and screen, recipients readily connect auditory and visual stimuli if they appear to "belong together". In this context, "...the localization of sounds is influenced to a large extent by the visual arrangement of the sound source" (Bullerjahn, 2008, p. 210). Empirical studies on the ventriloquism effect demonstrate the tendency of the perceptual apparatus to group temporally synchronous events together, even though they do not share the same origin and thus have congruent properties only in perception, not in the physical world (cf. Marks, 2004, p. 89).
In the sense of a holistic perception, this integration of stimuli represents the "normal state". Only the recording of sound made it possible to receive music detached from its original visual component; in the concert hall or on stage, music had always been holistic and audiovisual, not least because of the musicians' movements (cf. Rösing, 2003, p. 10). Hearing and seeing intertwine to form a holistic perception (cf. Krause, 2001, p. 106).
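A common formal account of the spatial ventriloquism effect is reliability-weighted cue combination: each modality's location estimate is weighted by its precision, and since vision localizes far more precisely than audition in azimuth, the combined percept is pulled toward the screen. The following minimal sketch illustrates this account; the specific variance values are illustrative assumptions, not figures from the sources cited above.

```python
def integrate(loc_a, var_a, loc_v, var_v):
    """Combine auditory and visual location estimates (in degrees azimuth),
    weighting each cue by its reliability (inverse variance)."""
    w_a = (1 / var_a) / (1 / var_a + 1 / var_v)
    w_v = 1 - w_a
    return w_a * loc_a + w_v * loc_v

# Loudspeaker at 10 degrees, screen at 0 degrees. Because the visual
# estimate is far more reliable, the perceived sound location is pulled
# almost entirely toward the visual event.
perceived = integrate(loc_a=10.0, var_a=16.0, loc_v=0.0, var_v=1.0)
print(round(perceived, 2))  # 0.59 - close to the visual location
```

The weighting scheme also predicts when ventriloquism weakens: if the visual cue is blurred (its variance raised), the auditory estimate regains influence.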
Temporal Ventriloquism is a perceptual phenomenon that demonstrates the influence of auditory timing on visual temporal perception, representing a temporal analog to the classical spatial ventriloquism effect.
Temporal ventriloquism manifests through auditory signals' capacity to modulate the perceived timing of visual events during multisensory processing. The effect operates via temporal displacement, wherein the perceived temporal position of visual stimuli shifts toward concurrent auditory events, demonstrating the neural mechanisms underlying cross-modal temporal integration. Experimental evidence indicates that the phenomenon functions optimally within a temporal integration window of approximately 100-200 milliseconds between auditory and visual stimuli presentation. The temporal binding demonstrates systematic asymmetry, with auditory signals exerting predominant influence over visual temporal perception, consistent with the differential temporal resolution capabilities of auditory versus visual sensory systems.
Empirical investigations reveal substantial inter-individual variability in temporal integration window parameters, indicating heterogeneous susceptibility to the effect across experimental populations. The magnitude of temporal ventriloquism exhibits inverse correlation with audiovisual temporal disparity, diminishing progressively as separation exceeds the optimal integration window. This temporal constraint aligns with established principles of multisensory integration, wherein temporal contiguity functions as a critical determinant of perceptual binding across sensory modalities.
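The temporal displacement described above can be sketched as a simple rule: within the integration window, the perceived onset of a flash shifts toward a nearby beep; outside it, no binding occurs. The window size follows the ~200 ms figure in the text, but the capture-strength fraction is an illustrative assumption.

```python
INTEGRATION_WINDOW_MS = 200  # approximate upper bound reported in the text
CAPTURE_STRENGTH = 0.5       # fraction of asynchrony absorbed (assumed value)

def perceived_flash_onset(flash_ms, beep_ms):
    """Return the perceived onset time (ms) of a flash accompanied by a beep."""
    asynchrony = beep_ms - flash_ms
    if abs(asynchrony) > INTEGRATION_WINDOW_MS:
        return flash_ms  # too far apart: no binding, no temporal shift
    # Audition dominates temporal perception, so the shift is toward the beep.
    return flash_ms + CAPTURE_STRENGTH * asynchrony

print(perceived_flash_onset(0, 80))   # 40.0 - flash perceived later, toward the beep
print(perceived_flash_onset(0, 300))  # 0 - outside the window, veridical timing
```

Making the capture strength decay with asynchrony, rather than a fixed fraction, would match the inverse relation between effect magnitude and temporal disparity noted above.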
Articles
Morein-Zamir, S., Soto-Faraco, S., & Kingstone, A. (2003). "Auditory capture of vision: Examining temporal ventriloquism." Cognitive Brain Research, 17(1), 154-163. Foundational work establishing the basic effect.
Bertelson, P., & Aschersleben, G. (2003). "Temporal ventriloquism: Crossmodal interaction on the time dimension." International Journal of Psychophysiology, 50(1-2), 147-155. Key study on auditory dominance in temporal perception.
Chen, L., & Vroomen, J. (2013). "Intersensory binding across space and time: A tutorial review." Attention, Perception, & Psychophysics, 75(5), 790-811. Comprehensive review of temporal binding mechanisms.
Double Flash Illusion
The Double Flash Illusion (or Sound-Induced Flash Illusion) shows auditory influence on visual perception wherein a single visual flash, when accompanied by multiple auditory beeps, is erroneously perceived as multiple flashes. The effect demonstrates the dominance of auditory temporal resolution over visual perception within specific temporal integration windows.
Basic Mechanism: When a single visual stimulus is presented concurrently with multiple auditory stimuli within a specific temporal window (~100ms between beeps), the brain integrates these signals in a way that generates the perception of multiple visual events, despite the presence of only one physical flash. This integration occurs due to the higher temporal resolution of the auditory system compared to the visual system, leading to a perceptual reorganization where the number of perceived visual events matches the number of auditory events.
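The mechanism above can be summarized as a toy predictor: when one flash is paired with several beeps whose spacing falls inside the binding window, the perceived event count follows audition. The full-capture rule is a deliberate simplification of the experimental findings.

```python
BINDING_WINDOW_MS = 100  # approximate inter-beep window from the text

def perceived_flashes(n_flashes, beep_times_ms):
    """Predict the number of flashes perceived, given the physical flash
    count and the onset times of accompanying beeps."""
    if n_flashes != 1 or len(beep_times_ms) < 2:
        return n_flashes  # the illusion needs one flash and multiple beeps
    gaps = [b - a for a, b in zip(beep_times_ms, beep_times_ms[1:])]
    if all(gap <= BINDING_WINDOW_MS for gap in gaps):
        return len(beep_times_ms)  # audition sets the perceived event count
    return n_flashes  # beeps too far apart: no illusory extra flashes

print(perceived_flashes(1, [0, 70]))   # 2 - one flash seen as two
print(perceived_flashes(1, [0, 400]))  # 1 - beeps outside the binding window
```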
Articles
Shams, L., Kamitani, Y., & Shimojo, S. (2000). "Illusions: What you see is what you hear." Nature, 408(6814), 788.
Hirst, R. J., et al. (2020). "What you see is what you hear: Twenty years of research using the Sound-Induced Flash Illusion." Neuroscience & Biobehavioral Reviews, 118, 759-774.
Intermodal analogies
Intermodal analogies are not fixed in the individual and can vary subjectively and depending on the situation. It is assumed that all people, insofar as there are no mental or physical impairments, are capable of forming analogies and do so intuitively as part of actively constructing perception (cf. Haverkamp, 2003, p. 184). Intermodal analogies differ fundamentally from synesthesia, which is said to occur when a stimulus inevitably triggers a sensation in another sensory modality (cf. Rösing, 2003, p. 11).
Human perception is thus able to experience objectively different auditory and visual stimuli as a subjective unit. There are neuronal correspondences in the brain's processing of information from all sensory channels (Marks, 1978, p. 6). Temporally or spatially slightly divergent stimuli are sometimes integrated into a single percept due to intermodal analogies in the process of cross-modal pairing (cf. Schlemmer, 2005, p. 176). Analogously, temporal synchronicity is also a criterion for the combination of auditory and visual stimuli in film (cf. Lipscomb, 2005). The combination of the two time-shaping arts of film and music thus offered strong starting points for the aesthetic treatment of coinciding or contrapuntal auditory and visual events from the very beginning.

In addition to these basic temporal connections, intermodal analogies between auditory and visual perception seem to ensure that some stimulus combinations are perceived as particularly plausible. Basic research in perception psychology has provided empirical results on this in recent years. This work builds on the theory of intersensory properties (cf. Werner, 1966), which states that both an auditory and a visual perceptual stimulus can be described by common qualities such as (cf. Haverkamp, 2002, p. 125):
intensity (loudness)
brightness (spectral centroid)
volume (sonority)
density
roughness
Recipients of audiovisual events can therefore perceive a sound and a (moving) visual object as similar with respect to their value on, for example, the dimension of density, if both have the corresponding properties. The interpersonal variance of analogies is relatively small and some analogies are perceived as particularly close by most individuals (cf. Haverkamp, 2003, p. 184). In particular, the empirical results of recent perceptual psychological research demonstrate that intermodal congruence of stimuli promotes the integration of auditory and visual stimuli into a common perceptual object (cf. Parise/Spence, 2009, p. 6).
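One way to make the idea of matching on shared intersensory dimensions concrete is a simple congruence score over normalized dimension values. The dimension names come from the list above; the scoring scheme itself is an illustrative assumption, not a model from the cited literature.

```python
SHARED_DIMS = ("intensity", "brightness", "density", "roughness")

def congruence(sound, visual, dims=SHARED_DIMS):
    """Mean similarity (1.0 = identical) between a sound and a visual
    object across shared intersensory dimensions scaled to 0-1."""
    sims = [1.0 - abs(sound[d] - visual[d]) for d in dims]
    return sum(sims) / len(sims)

# Hypothetical ratings: a bright, smooth tone and a bright, smooth shape.
bright_tone = {"intensity": 0.8, "brightness": 0.9, "density": 0.3, "roughness": 0.1}
smooth_shape = {"intensity": 0.8, "brightness": 0.9, "density": 0.3, "roughness": 0.2}
print(round(congruence(bright_tone, smooth_shape), 3))  # 0.975 - high congruence
```

On this view, stimulus pairs with high congruence scores would be the ones most readily bound into a common perceptual object.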

Taking into account the results of empirical perceptual psychological experiments of the last decades, intermodal analogies can be systematically compiled for the audiovisual domain. As the above table shows, the connections are sometimes ambiguous - several auditory parameters can show congruences with several visual parameters (cf. Kim/Iwamiya, 2008, p. 430).
The takete–maluma phenomenon (or bouba/kiki effect)
In audiovisual perception, Maluma and Takete refer to an important phenomenon first described by Wolfgang Köhler in 1929, which demonstrates a natural tendency for humans to associate certain sounds with specific visual shapes.
The experiment involves showing people two abstract shapes:
One is rounded and blob-like with curved edges
One is angular and spiky with sharp points
When people are asked which shape is "Maluma" and which is "Takete" (nonsense words created for the experiment), the vast majority of people across different cultures and languages consistently:
Associate "Maluma" with the rounded, curved shape
Associate "Takete" with the angular, spiky shape
This demonstrates what is also called the "bouba/kiki effect" (a later variation of the same concept) - a non-arbitrary mapping between speech sounds and visual shapes. The soft sounds in "maluma" feel congruent with curved shapes, while the sharp sounds in "takete" match angular shapes.
This finding has been influential in:
Understanding cross-modal perception (how different senses interact)
Research on sound symbolism (how sounds carry inherent meaning)
Cognitive psychology and neuroscience
It's considered one of the earliest and most famous demonstrations that the connection between words and what they represent isn't entirely arbitrary.
Articles
Köhler, Wolfgang (1929). Gestalt Psychology. New York: Liveright.
Ćwiek, Aleksandra; Fuchs, Susanne; Draxler, Christoph; Asu, Eva Liina; Dediu, Dan; Hiovain, Katri; Kawahara, Shigeto; Koutalidis, Sofia; Krifka, Manfred; Lippus, Pärtel; Lupyan, Gary; Oh, Grace E.; Paul, Jing; Petrone, Caterina; Ridouane, Rachid; Reiter, Sabine; Schümchen, Nathalie; Szalontai, Ádám; Ünal-Logacev, Özlem; Zeller, Jochen; Perlman, Marcus; Winter, Bodo (2022). "The bouba/kiki effect is robust across cultures and writing systems". Philosophical Transactions of the Royal Society B: Biological Sciences. 377 (1841). doi:10.1098/rstb.2020.0390
The SMARC effect
The SMARC (Spatial-Musical Association of Response Codes) effect refers to a cognitive phenomenon in which people associate pitches with spatial dimensions, particularly the vertical dimension. In this effect, higher pitches tend to be associated with "upward" spatial locations, while lower pitches are associated with "downward" spatial locations. The SMARC effect suggests that pitch perception is not purely auditory, but is linked to spatial processing in the brain.
The SMARC effect is often studied in contexts where people are asked to make spatial decisions (e.g. moving their hands up or down) in response to sounds of different pitch. People tend to respond faster and more accurately when the pitch and spatial direction are congruent - higher pitch with upward movements and lower pitch with downward movements - suggesting a strong cognitive link between these two dimensions.
This effect has been studied in areas such as music cognition, human-computer interaction and auditory display design, and helps researchers understand how pitch and spatial orientation influence perception and action.
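In auditory display design, the SMARC mapping suggests placing higher pitches higher on screen. The sketch below is a hypothetical illustration of that design choice; the function name, frequency range, and logarithmic scaling are all assumptions, not a published specification.

```python
import math

def pitch_to_y(freq_hz, f_low=110.0, f_high=1760.0, height_px=400):
    """Map a frequency to a vertical pixel position (0 = top of screen),
    using a logarithmic scale so equal pitch intervals map to equal steps."""
    span = math.log2(f_high) - math.log2(f_low)
    frac = (math.log2(freq_hz) - math.log2(f_low)) / span
    frac = min(max(frac, 0.0), 1.0)  # clamp out-of-range frequencies
    # SMARC-congruent layout: high pitch -> small y (near the top).
    return round((1.0 - frac) * height_px)

print(pitch_to_y(1760.0))  # 0   - highest pitch at the top
print(pitch_to_y(110.0))   # 400 - lowest pitch at the bottom
```

A congruent mapping like this should yield faster, more accurate responses than the reverse assignment, per the response-time findings described above.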
Rusconi et al. (2006) asked one group of musically naive participants to perform a pitch comparison task, and a second group of musically naive participants along with a group of musicians to perform a musical instrument identification task on sounds of different pitch. A SMARC effect (high pitches favouring up responses and low pitches favouring down responses) was present both when pitch was task relevant and when it was task irrelevant. Moreover, when pitch height was task irrelevant, a horizontal mapping of pitches appeared for musicians only.
The authors concluded that a representational dimension (pitch height) influences performance with vertically aligned responses irrespective of its relevance to the task, suggesting that our cognitive system maps pitch onto a mental representation of space. These results are consistent with studies pointing to the integral nature of spatial and spectral processing of auditory stimuli.
Articles
Rusconi, E., Kwan, B., Giordano, B. L., Umiltà, C., & Butterworth, B. (2006). "Spatial representation of pitch height: The SMARC effect." Cognition, 99(2), 113-129. doi:10.1016/j.cognition.2005.01.004