1. To collect and annotate a large Urdu speech corpus.
2. To build an automatic speech recognition (ASR) model that can recognize and analyze human speech signals and convert them into text.
3. To develop a deep neural network-based ASR model that can understand the human voice in Urdu, the national language of Pakistan.
4. To develop a web-based interface that interacts with the end user and processes speech in the Urdu language.
Urdu is the lingua franca of the world's fifth-most populous country and the 21st most spoken first language in the world, with approximately 66 million speakers. In addition, Urdu is the medium of instruction in higher education institutions (HEIs) in Pakistan, with 65% precedence. Ideally, when we try to implement a new idea or pursue a dream, we approach it from every perspective by thinking about it and expressing it in our own language. In the digital era, however, people face a hindrance in the form of the language gap. The advancement and development that we are witnessing in the information age are possible because of widespread adoption around the world. Without overcoming the language gap, however, bilinguals can never truly explore the opportunities, prospects, and concerns of the digital world.
The growth in demand for ASR is both unprecedented and promising. Over the past decades, deep learning, artificial neural networks, and other speech recognition methods have yielded better results and solved many problems. However, only very small-scale work has been done on collecting and identifying Urdu data sets, which is reflected in the minimal progress in this domain. The few data sets that exist are small, and they are not even open source. This limited diversity of data sets and tools holds back innovative ideas and does not pave the way for collaborative research. In addition, the speech signal recorded by a microphone is generally contaminated by noise originating from various sources. Such contamination can change the characteristics of the speech signal and degrade speech quality and intelligibility. Hence, to sustain research and impactful projects in Urdu speech recognition, a large data set, careful pre-processing, and a comparative study are essential.
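One common way to quantify how strongly noise contaminates a recording is the signal-to-noise ratio (SNR), expressed in decibels. The sketch below is only an illustration: it assumes that a clean reference signal is available alongside the noisy recording, which is rarely true for field data, and the function name `snr_db` and the sample values are hypothetical. The proposal itself does not state which quality measure will be used.

```python
import math


def snr_db(clean, noisy):
    """Return the signal-to-noise ratio in dB for two equal-length sample lists.

    Assumption: `clean` is a noise-free reference and `noisy` is the same
    signal after noise has been added.
    """
    if len(clean) != len(noisy) or not clean:
        raise ValueError("signals must be non-empty and of equal length")

    # Average power of the clean signal.
    signal_power = sum(s * s for s in clean) / len(clean)
    # The noise is whatever the recording added on top of the clean signal.
    noise_power = sum((n - s) ** 2 for s, n in zip(clean, noisy)) / len(clean)

    if signal_power == 0:
        raise ValueError("clean signal has zero power")
    if noise_power == 0:
        return math.inf  # identical signals: no noise at all

    return 10 * math.log10(signal_power / noise_power)


# Hypothetical toy samples: a unit-amplitude signal with a 0.1 offset as noise.
clean = [1.0, -1.0, 1.0, -1.0]
noisy = [1.1, -0.9, 1.1, -0.9]
print(round(snr_db(clean, noisy), 2))
```

A higher value means a cleaner recording, so a measure of this kind could be used to compare audio before and after noise reduction.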
Speech recognition is the process of converting human speech signals into words or instructions. Currently, many researchers focus on developing ASR systems for high-resource languages such as English and Chinese, while only limited studies have been conducted to develop speech models for low-resource languages. In the last few years, speech recognition for the Urdu language has received noticeable attention with the emergence of deep neural networks (DNNs). We therefore aim to assemble an Urdu speech corpus and develop a DNN-based Urdu ASR model that can recognize and analyze human speech signals and convert them into text. Training high-quality ASR systems remains challenging, because the accuracy of an ASR model depends on the quality of its training data. First, we will collect a large corpus of Urdu speech from diverse environments and a wide variety of speakers. The data includes text streams, audio clips, video clips, time-series data, etc. Transcription of the speech data, which is required for training the model, is the most crucial step. Second, the gathered data is initially in raw format and must be cleaned before processing, a step referred to as pre-processing. Pre-processing includes lowering the noise and filtering the signal in order to enhance the audio signal. Background noise and differing accents in particular make it hard for the model to recognize speech. With advances in technology, neural networks have significantly improved speech recognition: Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), and, more recently, Transformer networks have demonstrated excellent performance, among other techniques. Third, we will feed the collected data to different acoustic models and language models to obtain good accuracy for the Urdu language. Finally, we will test the model on the test data and analyze its performance through comparative analysis, accuracy, and Word Error Rate (WER). Moreover, we will design a web-based interface that interacts with the end user, processes speech in the Urdu language, and shows the results as text.
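The Word Error Rate mentioned above is commonly defined as WER = (S + D + I) / N, where S, D and I are the numbers of substituted, deleted and inserted words needed to turn the reference transcript into the model output, and N is the number of words in the reference. The following is a minimal sketch of that calculation using a word-level edit distance. The function name `word_error_rate` and the romanized example sentences are hypothetical, and the sketch assumes whitespace tokenization with no text normalization; the proposal does not specify which scoring tool or normalization will be used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                       # deletion of a reference word
                curr[j - 1] + 1,                   # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,   # match or substitution
            )
        prev = curr

    return prev[-1] / len(ref)


# Hypothetical romanized Urdu example: the model drops one of four words.
print(word_error_rate("yeh aik kitab hai", "yeh kitab hai"))  # 0.25
```

Because insertions are counted, WER can exceed 1.0 when the model output is much longer than the reference, so it is an error rate and not a percentage bounded by 100.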