
We propose a novel framework that operates on bi-modal time windows spanning short video clips labeled with discrete emotions. The framework employs two networks, each dedicated to one modality. As input to each modality-specific network, we consider a time-dependent signal derived from the embeddings of the video and audio streams. Attention mechanisms are used to model the importance of each modality over time. In this way, the study exploits the temporal information of audio-visual cues and detects their informative time segments.
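To make the architecture concrete, the following is a minimal sketch in PyTorch. The GRU encoders, the simple linear-scoring attention, and all layer sizes are illustrative assumptions rather than the paper's exact configuration: each modality's embedding sequence is encoded by its own network, a temporal attention module weights the informative time steps of each stream, and the attended summaries are fused for discrete emotion classification.

```python
# Sketch of a bi-modal network with per-modality temporal attention.
# Layer choices and dimensions are assumptions for illustration only.
import torch
import torch.nn as nn


class TemporalAttention(nn.Module):
    """Scores each time step and returns an attention-weighted summary."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim); softmax over the time axis
        weights = torch.softmax(self.score(x), dim=1)   # (batch, time, 1)
        return (weights * x).sum(dim=1)                 # (batch, dim)


class BiModalEmotionNet(nn.Module):
    """Two modality-specific recurrent encoders whose attended outputs
    are fused for discrete emotion classification."""

    def __init__(self, video_dim=512, audio_dim=128, hidden=256, n_emotions=7):
        super().__init__()
        self.video_enc = nn.GRU(video_dim, hidden, batch_first=True)
        self.audio_enc = nn.GRU(audio_dim, hidden, batch_first=True)
        self.video_attn = TemporalAttention(hidden)
        self.audio_attn = TemporalAttention(hidden)
        self.classifier = nn.Linear(2 * hidden, n_emotions)

    def forward(self, video_emb, audio_emb):
        # video_emb: (batch, time, video_dim); audio_emb: (batch, time, audio_dim)
        v, _ = self.video_enc(video_emb)
        a, _ = self.audio_enc(audio_emb)
        fused = torch.cat([self.video_attn(v), self.audio_attn(a)], dim=-1)
        return self.classifier(fused)                   # emotion logits


# Example: a batch of 4 time windows, each 30 steps long.
model = BiModalEmotionNet()
logits = model(torch.randn(4, 30, 512), torch.randn(4, 30, 128))
print(logits.shape)  # torch.Size([4, 7])
```

The per-modality attention weights also expose which time segments each stream found informative, matching the stated goal of detecting informative segments in the audio-visual cues.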


Emotions, with their complex socio-psychological nature, play a crucial role in human-human communication. To enhance emotional communication in human-computer interaction, this paper studies emotion recognition from the audio and visual signals of video clips, drawing on facial expressions and vocal utterances.
