CNN-Transformer Based Speech Emotion Detection

Authors

  • Rojina Baral, Department of Electronics and Computer Engineering, Pulchowk Engineering Campus, IOE
  • Sanjivan Satyal, Department of Electronics and Computer Engineering, Pulchowk Engineering Campus, IOE
  • Anisha Pokhrel, Department of Electronics and Computer Engineering, Western Region Campus, IOE

DOI:

https://doi.org/10.3126/jacem.v10i1.76324

Keywords:

Speech Emotion Recognition, Convolutional Neural Network, Mel Frequency Cepstral Coefficient, Additive White Gaussian Noise

Abstract

In this study, a parallel network trained on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) was used to address an automatic speech emotion recognition (SER) task, classifying four distinct emotions. To capture both spatial and temporal information, the architecture combined a CNN-based network and an attention-based network running in parallel. Additive White Gaussian Noise (AWGN) was applied as an augmentation technique over multiple folds to improve the model's generalization. The model's input was MFCCs computed from the raw audio. The MFCCs were represented as images, with the two axes corresponding to the time and frequency dimensions of the MFCC, in order to take advantage of the proven effectiveness of CNNs in image classification. A Transformer encoder layer, an attention-based model, was used to capture temporal characteristics. The results showed that the Parallel CNN-Transformer network achieved an accuracy of 88.16% with 1-fold augmentation, 92.11% with 2-fold augmentation, and 86.84% with 3-fold augmentation.
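A minimal sketch of the kind of pipeline the abstract describes is given below, using librosa for MFCC extraction and PyTorch for the parallel CNN-Transformer model. The layer sizes, the 40-coefficient MFCC setting, the 20 dB SNR for AWGN, and the names mfcc_with_awgn and ParallelCNNTransformer are illustrative assumptions, not the authors' reported configuration.

    # Hypothetical sketch of a parallel CNN-Transformer SER pipeline (assumed settings).
    import numpy as np
    import librosa
    import torch
    import torch.nn as nn


    def mfcc_with_awgn(path, sr=22050, n_mfcc=40, snr_db=20.0, augment=True):
        """Load audio, optionally add white Gaussian noise, and return an MFCC 'image'."""
        y, sr = librosa.load(path, sr=sr)
        if augment:
            # Scale the noise power to reach the requested signal-to-noise ratio.
            signal_power = np.mean(y ** 2)
            noise_power = signal_power / (10 ** (snr_db / 10))
            y = y + np.random.normal(0.0, np.sqrt(noise_power), size=y.shape)
        mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)        # (n_mfcc, time)
        return torch.tensor(mfcc, dtype=torch.float32).unsqueeze(0)   # (1, freq, time)


    class ParallelCNNTransformer(nn.Module):
        """CNN branch for spatial features, Transformer-encoder branch for temporal ones."""

        def __init__(self, n_mfcc=40, n_classes=4, n_heads=4, n_layers=2):
            super().__init__()
            self.cnn = nn.Sequential(
                nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten(),            # -> (B, 32*4*4)
            )
            encoder_layer = nn.TransformerEncoderLayer(
                d_model=n_mfcc, nhead=n_heads, batch_first=True)
            self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
            self.classifier = nn.Linear(32 * 4 * 4 + n_mfcc, n_classes)

        def forward(self, x):                       # x: (B, 1, n_mfcc, time)
            cnn_feat = self.cnn(x)                  # spatial features from the MFCC image
            seq = x.squeeze(1).transpose(1, 2)      # (B, time, n_mfcc) for the encoder
            trans_feat = self.transformer(seq).mean(dim=1)   # average over time steps
            return self.classifier(torch.cat([cnn_feat, trans_feat], dim=1))

Concatenating the CNN features with the time-averaged Transformer-encoder features before a single linear classifier is one plausible way to realize the "parallel" fusion of spatial and temporal information described in the abstract.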


Published

2025-03-11

How to Cite

Baral, R., Satyal, S., & Pokhrel, A. (2025). CNN-Transformer Based Speech Emotion Detection. Journal of Advanced College of Engineering and Management, 10(1), 135–145. https://doi.org/10.3126/jacem.v10i1.76324

Issue

Vol. 10 No. 1 (2025)

Section

Articles