Recognizing emotion in human speech is difficult, and it is harder still for nonverbal vocal bursts such as laughs, gasps, and sighs. Conventional audio models struggle with these sounds because they are very short and acoustically complex. To address this, we propose the Hybrid Semantic-Acoustic Transformer. The system extracts acoustic features with a Wav2Vec 2.0 model and, in parallel, phonetic features with a Whisper ASR model, then fuses the two feature streams through a cross-attention layer. We evaluate the model on the EmoGator dataset, which contains 32,130 audio files across 30 emotion classes, using a strict 80%/10%/10% split for training, validation, and testing. The model achieves an overall accuracy of 74.8%. An ablation study shows that cross-attention fusion performs much better than simply adding the features together. Overall, the model improves the F1-score by 6.4% over the original EmoGator baseline, setting a new best result for classifying non-speech sounds across different noisy environments. It also exceeds 90% precision when distinguishing a ‘Sigh’ from a ‘Gasp’, a distinction that standard speech models usually fail to make.
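The fusion step described in the abstract can be illustrated with a minimal NumPy sketch of single-head scaled dot-product cross-attention. It is not the authors' implementation. The abstract does not say which stream supplies the queries, so the choice here (acoustic frames as queries, phonetic frames as keys and values) is an assumption, as are the sequence lengths, feature widths, and random projection matrices, which stand in for learned weights and real encoder outputs.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_attention(queries, context, w_q, w_k, w_v):
    """Single-head scaled dot-product cross-attention.

    queries: (T_q, d_q) frames from one stream (assumed: acoustic)
    context: (T_c, d_c) frames from the other stream (assumed: phonetic)
    Returns the fused features (T_q, d_model) and the attention
    weights (T_q, T_c).
    """
    q = queries @ w_q  # (T_q, d_model)
    k = context @ w_k  # (T_c, d_model)
    v = context @ w_v  # (T_c, d_model)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each query row sums to 1
    return weights @ v, weights


# Hypothetical sizes: placeholders, not values taken from the paper.
T_ACOUSTIC, T_PHONETIC = 50, 20
D_ACOUSTIC, D_PHONETIC, D_MODEL = 768, 512, 64

rng = np.random.default_rng(0)
acoustic = rng.standard_normal((T_ACOUSTIC, D_ACOUSTIC))  # stand-in for Wav2Vec 2.0 output
phonetic = rng.standard_normal((T_PHONETIC, D_PHONETIC))  # stand-in for Whisper encoder output

# Random projections standing in for learned weight matrices.
w_q = rng.standard_normal((D_ACOUSTIC, D_MODEL)) / np.sqrt(D_ACOUSTIC)
w_k = rng.standard_normal((D_PHONETIC, D_MODEL)) / np.sqrt(D_PHONETIC)
w_v = rng.standard_normal((D_PHONETIC, D_MODEL)) / np.sqrt(D_PHONETIC)

fused, weights = cross_attention(acoustic, phonetic, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

In this sketch the two streams differ in both length and width, which element-wise addition cannot handle without first resampling and projecting them to a common shape. Cross-attention instead lets each acoustic frame weight the phonetic frames individually.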
- A Hybrid Semantic-Acoustic Transformer for Vocal Burst Emotion Recognition Using Wav2Vec 2.0 and Whisper ASR