Advancements in Nepali Speech Recognition: A Comparative Study of BiLSTM, Transformer, and Hybrid Models

Authors

  • Ankit Kafle, Department of Computer and Electronics Engineering, Kantipur Engineering College, Dhapakhel, Lalitpur, Nepal
  • Jenith Rajlawat, Department of Computer and Electronics Engineering, Kantipur Engineering College, Dhapakhel, Lalitpur, Nepal
  • Nawaraj Shah, Department of Computer and Electronics Engineering, Kantipur Engineering College, Dhapakhel, Lalitpur, Nepal
  • Neetish Paudel, Department of Computer and Electronics Engineering, Kantipur Engineering College, Dhapakhel, Lalitpur, Nepal
  • Bishal Thapa, Department of Computer and Electronics Engineering, Kantipur Engineering College, Dhapakhel, Lalitpur, Nepal

DOI:

https://doi.org/10.3126/injet.v2i1.72525

Keywords:

Automatic Speech Recognition, Convolutional Neural Networks, Connectionist Temporal Classification, Mel-frequency cepstral coefficients, Residual Networks, Bidirectional Long Short-Term Memory

Abstract

In today's world, leveraging Automatic Speech Recognition (ASR) technology to process and understand spoken language is highly desirable. Our proposed Nepali speech recognition system employs advanced deep learning techniques to recognize and interpret spoken Nepali, allowing it to respond to user queries effectively. To achieve this, we use a combination of advanced neural network models. We extract Mel-frequency cepstral coefficients (MFCCs) from the preprocessed audio; these MFCCs capture crucial spectral characteristics of Nepali speech and serve as the input features for our neural network model. To design an optimal model for text-based query processing, we make use of convolutional neural network (CNN), residual network (ResNet), and bidirectional long short-term memory (BiLSTM) layers. The CNN layers extract local patterns and spatial features from the MFCC input, the ResNet layers capture deeper representations to enhance performance, and the BiLSTM layers model temporal dependencies in the sequence data. We employ the Connectionist Temporal Classification (CTC) loss function to enable sequence-to-sequence mapping, aligning the input speech with the corresponding text output. This approach allows our system to process spoken queries and provide correct responses, enhancing its usefulness to users. The model, with 1.55 million parameters, was trained on approximately 157,000 audio samples for 47 epochs and achieved a character error rate of 17.98% (82.02% character accuracy).
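
The abstract describes an MFCC front end feeding CNN, residual, and BiLSTM layers trained with a CTC loss. The following is a minimal sketch of such a pipeline in Keras; the feature dimension, layer sizes, vocabulary size, and helper names (extract_mfcc, residual_block, build_model) are illustrative assumptions rather than the authors' exact configuration.

```python
# Minimal sketch: MFCC -> CNN -> residual block -> BiLSTM -> per-frame softmax, trained with CTC.
# All hyperparameters below are assumed for illustration, not taken from the paper.
import librosa
import tensorflow as tf
from tensorflow.keras import layers, Model

N_MFCC = 40        # MFCC coefficients per frame (assumed)
VOCAB_SIZE = 80    # Devanagari character inventory (assumed); +1 output for the CTC blank


def extract_mfcc(path, sr=16000):
    """Load an audio file and return a (time, N_MFCC) MFCC feature matrix."""
    audio, _ = librosa.load(path, sr=sr)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=N_MFCC)
    return mfcc.T  # time-major


def residual_block(x, filters, kernel_size=3):
    """1-D convolutional residual block over the time axis (ResNet-style shortcut)."""
    shortcut = layers.Conv1D(filters, 1, padding="same")(x)
    y = layers.Conv1D(filters, kernel_size, padding="same", activation="relu")(x)
    y = layers.Conv1D(filters, kernel_size, padding="same")(y)
    return layers.ReLU()(layers.Add()([shortcut, y]))


def build_model():
    """CNN + residual + BiLSTM acoustic model emitting per-frame character probabilities."""
    inputs = layers.Input(shape=(None, N_MFCC))  # variable-length MFCC sequences
    x = layers.Conv1D(128, 11, padding="same", activation="relu")(inputs)
    x = residual_block(x, 128)
    x = layers.Bidirectional(layers.LSTM(128, return_sequences=True))(x)
    outputs = layers.Dense(VOCAB_SIZE + 1, activation="softmax")(x)  # last index = CTC blank
    return Model(inputs, outputs)


def ctc_loss(y_true, y_pred, input_len, label_len):
    """CTC loss aligning frame-level predictions with character label sequences."""
    return tf.keras.backend.ctc_batch_cost(y_true, y_pred, input_len, label_len)


if __name__ == "__main__":
    model = build_model()
    model.summary()
```

In practice the frame-level softmax outputs would be decoded (greedy or beam search over the CTC lattice) to produce the final Nepali character sequence, and character accuracy would be computed against the reference transcripts.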


Published

2024-12-16

How to Cite

Kafle, A., Rajlawat, J., Shah, N., Paudel, N., & Thapa, B. (2024). Advancements in Nepali Speech Recognition: A Comparative Study of BiLSTM, Transformer, and Hybrid Models. International Journal on Engineering Technology, 2(1), 96–105. https://doi.org/10.3126/injet.v2i1.72525

Issue

Vol. 2 No. 1 (2024)
Section

Articles