Zhanpeng Li, Yuming Qi, Sanpeng Deng, Xiumin Shi

Abstract

Multimodal emotion recognition plays a crucial role in human-computer interaction and related fields. However, deep learning models for this task often overfit severely because of insufficient feature interaction, data imbalance, and high model complexity. To address these challenges, this paper systematically explores the evolution from simple feature concatenation to advanced attention-based fusion and proposes a Multi-Path Attention Fusion Network (MPAF-Net). MPAF-Net processes the original unimodal features and the interactive features generated by cross-modal attention in parallel, yielding a more comprehensive information representation. We further integrate a set of optimization strategies: multi-level regularization, class-weighted Focal Loss, and learning rate warmup with cosine annealing. Experimental results on the IEMOCAP dataset show that MPAF-Net achieves a macro F1-score of 50.2% on the validation set and 47.42% on the test set, significantly outperforming the baseline. These results validate the effectiveness of the proposed method and indicate the potential of multi-path fusion architectures to improve model performance and robustness on complex multimodal tasks.
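Two of the optimization strategies named in the abstract, class-weighted Focal Loss and learning rate warmup with cosine annealing, have standard textbook forms that can be sketched briefly. The following is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the focusing parameter `gamma`, the class weights, and the warmup length are all hypothetical, since the abstract does not report them.

```python
import math


def focal_loss(probs, target, class_weights, gamma=2.0):
    """Class-weighted focal loss for a single sample (illustrative sketch).

    probs:         predicted class probabilities (assumed to sum to 1)
    target:        index of the true class
    class_weights: per-class weights, e.g. inverse class frequency (assumed)
    gamma:         focusing parameter (assumed value); gamma=0 reduces the
                   loss to weighted cross-entropy
    """
    p_t = max(probs[target], 1e-12)  # clamp to avoid log(0)
    # Well-classified samples (p_t near 1) are down-weighted by (1 - p_t)^gamma.
    return -class_weights[target] * (1.0 - p_t) ** gamma * math.log(p_t)


def lr_at_step(step, total_steps, warmup_steps, base_lr, min_lr=0.0):
    """Linear warmup followed by cosine annealing (illustrative sketch)."""
    if step < warmup_steps:
        # Ramp linearly up to base_lr over the warmup phase.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the annealing phase completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # A confidently correct prediction contributes far less loss than a poor one.
    weights = [1.0, 2.0]  # hypothetical: class 1 is the minority class
    print(focal_loss([0.9, 0.1], 0, weights))
    print(focal_loss([0.9, 0.1], 1, weights))
    # Learning rate at a few points of a hypothetical 100-step schedule.
    for s in (0, 9, 10, 55, 100):
        print(s, lr_at_step(s, total_steps=100, warmup_steps=10, base_lr=1e-3))
```

The focal term reduces the contribution of easy examples so that minority or hard classes dominate the gradient, while the class weight compensates for imbalance directly; the warmup phase avoids large early updates before the cosine decay takes over.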

Section
Articles