CNN-Transformer Based Speech Emotion Detection
DOI: https://doi.org/10.3126/jacem.v10i1.76324
Keywords: Speech Emotion Recognition, Convolutional Neural Network, Mel Frequency Cepstral Coefficient, Additive White Gaussian Noise
Abstract
In this study, a parallel network trained on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) was used to perform automatic speech emotion recognition (SER), classifying four distinct emotions. To capture both spatial and temporal information, the architecture combined an attention-based network and a CNN-based network running in parallel. Additive White Gaussian Noise (AWGN) was applied as an augmentation technique at multiple folds to improve the model's generalization. The model's input was Mel Frequency Cepstral Coefficients (MFCCs) computed from the raw audio. The MFCCs were represented as images, with height and width corresponding to the frequency and time dimensions, in order to exploit the proven effectiveness of CNNs in image classification. A Transformer encoder layer, an attention-based model, was used to capture temporal characteristics. The results showed that the parallel CNN-Transformer network achieved 88.16% accuracy with 1-fold augmentation, 92.11% with 2-fold augmentation, and 86.84% with 3-fold augmentation.
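The pipeline described in the abstract (AWGN augmentation, MFCCs treated as images, a CNN branch in parallel with a Transformer encoder branch, and a four-class emotion classifier) can be illustrated with a minimal PyTorch sketch. This is not the authors' implementation: the layer sizes, the 40-coefficient MFCC input, the 20 dB SNR, and the mean pooling of encoder outputs are assumptions, and noise is added directly to the MFCC tensor only to keep the example self-contained (in the paper, AWGN augments the raw audio before feature extraction).

```python
# Illustrative sketch of a parallel CNN-Transformer emotion classifier.
# All hyperparameters are assumptions, not the paper's settings.
import torch
import torch.nn as nn

class ParallelCNNTransformer(nn.Module):
    def __init__(self, n_mfcc: int = 40, n_classes: int = 4):
        super().__init__()
        # CNN branch: treats the (1, n_mfcc, n_frames) MFCC matrix as an image.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),            # -> (batch, 32)
        )
        # Transformer branch: each MFCC time frame is one token of dimension n_mfcc.
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=n_mfcc, nhead=4, dim_feedforward=128, batch_first=True
        )
        self.transformer = nn.TransformerEncoder(encoder_layer, num_layers=2)
        # Classifier over the concatenated CNN and Transformer embeddings.
        self.classifier = nn.Linear(32 + n_mfcc, n_classes)

    def forward(self, mfcc: torch.Tensor) -> torch.Tensor:
        # mfcc: (batch, n_mfcc, n_frames)
        cnn_feat = self.cnn(mfcc.unsqueeze(1))                # (batch, 32)
        tokens = mfcc.permute(0, 2, 1)                        # (batch, n_frames, n_mfcc)
        temporal = self.transformer(tokens).mean(dim=1)       # (batch, n_mfcc)
        return self.classifier(torch.cat([cnn_feat, temporal], dim=1))

def add_awgn(features: torch.Tensor, snr_db: float = 20.0) -> torch.Tensor:
    """Additive white Gaussian noise at a chosen SNR (demo on features, not raw audio)."""
    signal_power = features.pow(2).mean()
    noise_power = signal_power / (10 ** (snr_db / 10))
    return features + torch.randn_like(features) * noise_power.sqrt()

# Example forward pass on a random batch of MFCC "images".
model = ParallelCNNTransformer()
x = add_awgn(torch.randn(8, 40, 200))   # batch of 8, 40 MFCCs, 200 frames
print(model(x).shape)                   # torch.Size([8, 4])
```

The two branches produce complementary embeddings (spatial from the CNN, temporal from the Transformer encoder) that are concatenated before the final linear classifier, mirroring the parallel design described above.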
License
JACEM reserves the copyright for the published papers. Authors have the right to use the content of their published paper, in part or in full, for their own work.