Advancements in Nepali Speech Recognition: A Comparative Study of BiLSTM, Transformer, and Hybrid Models
DOI:
https://doi.org/10.3126/injet.v2i1.72525

Keywords:
Automatic Speech Recognition, Convolutional Neural Networks, Connectionist Temporal Classification, Mel-frequency cepstral coefficients, Residual Networks, Bidirectional Long Short-Term Memory

Abstract
In today's world, leveraging Automatic Speech Recognition (ASR) technology to process and understand spoken language is highly desirable. Our proposed Nepali Speech Recognition system applies advanced deep learning techniques to recognize and interpret spoken Nepali, allowing it to respond to user queries effectively. To achieve this, we employ a combination of neural network models. We extract Mel-frequency cepstral coefficients (MFCCs) from the preprocessed audio data; these MFCCs capture crucial spectral characteristics of Nepali speech and serve as the input features for our neural network model. To design an optimal model for text-based query processing, we make use of convolutional neural network (CNN), residual network (ResNet), and bidirectional long short-term memory (BiLSTM) layers. The CNN layers excel at extracting local patterns and spatial features from the MFCC input, and the ResNet layers capture deeper representations to enhance performance. The BiLSTM layers are employed to model temporal dependencies in the sequential data. We used the Connectionist Temporal Classification (CTC) loss function to enable sequence-to-sequence mapping, aligning the input speech with the corresponding text outputs. This approach allows our system to process text queries successfully and provide correct responses, enhancing its usefulness to users. The model, with 1.55 million parameters, was trained on approximately 157,000 audio samples for 47 epochs and achieved a character error rate of 17.98% (82.02% character accuracy).
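As a rough illustration of the feature-extraction step described above, the following sketch computes MFCCs from a preprocessed audio clip. The librosa toolkit, the 16 kHz sampling rate, and the choice of 40 coefficients are assumptions for illustration only; the abstract does not specify the exact toolkit or settings used.

# Minimal MFCC extraction sketch (assumed toolkit: librosa; assumed settings: 16 kHz, 40 coefficients)
import librosa

def extract_mfcc(wav_path, sr=16000, n_mfcc=40):
    """Load an audio file and return an (n_frames, n_mfcc) MFCC matrix."""
    audio, _ = librosa.load(wav_path, sr=sr)                      # resample to a fixed rate
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)    # (n_mfcc, n_frames)
    return mfcc.T                                                 # time-major input for the network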
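The acoustic model combines CNN, ResNet-style, and BiLSTM layers trained with CTC loss. A minimal Keras sketch of such a hybrid is given below; the layer widths, block count, and vocabulary size are placeholder assumptions and do not reflect the authors' exact 1.55-million-parameter configuration.

# Illustrative CNN + ResNet + BiLSTM + CTC sketch (not the authors' exact architecture)
import tensorflow as tf
from tensorflow.keras import layers

N_MFCC = 40       # assumed number of MFCC features per frame
VOCAB_SIZE = 70   # assumed Nepali character-set size (a CTC blank is added separately)

def residual_block(x, filters):
    """Two 1-D convolutions with a skip connection (ResNet-style)."""
    shortcut = x
    x = layers.Conv1D(filters, 3, padding="same", activation="relu")(x)
    x = layers.Conv1D(filters, 3, padding="same")(x)
    if shortcut.shape[-1] != filters:
        shortcut = layers.Conv1D(filters, 1, padding="same")(shortcut)
    return layers.Activation("relu")(layers.Add()([x, shortcut]))

def build_model():
    inputs = layers.Input(shape=(None, N_MFCC))                   # variable-length MFCC frames
    x = layers.Conv1D(64, 3, padding="same", activation="relu")(inputs)
    x = residual_block(x, 64)
    x = residual_block(x, 64)
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    outputs = layers.Dense(VOCAB_SIZE + 1, activation="softmax")(x)  # +1 for the CTC blank
    return tf.keras.Model(inputs, outputs)

def ctc_loss(y_true, y_pred, input_length, label_length):
    """Per-utterance CTC loss via Keras's built-in batch cost."""
    return tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_length, label_length)

During training, the CTC loss marginalizes over all frame-to-character alignments, so no manual alignment between the MFCC frames and the Nepali transcript is required.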
License
Copyright (c) 2024 International Journal on Engineering Technology
This work is licensed under a Creative Commons Attribution 4.0 International License.
This license enables reusers to distribute, remix, adapt, and build upon the material in any medium or format, so long as attribution is given to the creator. The license allows for commercial use.