Enhanced Speech Command Recognition Using Convolutional Neural Networks
DOI: https://doi.org/10.31272/jeasd.28.6.8

Keywords: Convolutional Neural Network, Human voice, Machine voice, Mel-frequency cepstral coefficient

Abstract
In recent years, growing interest in automatic speech recognition (ASR) has been driven by its wide-ranging applications across many domains, and the integration of speech recognition into smart systems underscores the pivotal role of human-machine interaction. This study introduces a robust ASR system that combines convolutional neural networks (CNNs) with Mel-frequency cepstral coefficients (MFCCs). The model's architecture was refined through an extensive hyperparameter search and recognizes ten different spoken commands. The model was trained and evaluated on the Google Speech Commands dataset, comprising 65,000 audio clips collected from a wide range of speakers across the globe; this dataset captures the natural variations in speech found in real-world scenarios. The network comprises eight layers, including convolutional and fully connected layers, contains 183,345 weights in total, and uses ReLU activation. The average F1-scores obtained during the training, validation, and testing stages are 99.06%, 94.68%, and 95.27%, respectively. Furthermore, the proposed model achieves about a 1.3% improvement in test accuracy over existing methods, confirming its effectiveness in real-world applications.
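The MFCC front end described in the abstract can be sketched in plain numpy. This is an illustrative reimplementation of the standard MFCC steps (framing, windowing, power spectrum, mel filterbank, log compression, DCT), not the paper's actual code; the frame length, hop size, FFT size, and coefficient counts below are common textbook defaults, not values reported by the authors.

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, sr):
    # Triangular filters spaced evenly on the mel scale.
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_points) / sr).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):          # rising slope
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):         # falling slope
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

def mfcc(signal, sr=16000, frame_len=400, hop=160,
         n_fft=512, n_filters=26, n_ceps=13):
    # 1) Slice the signal into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop:i * hop + frame_len]
                       for i in range(n_frames)])
    frames = frames * np.hamming(frame_len)
    # 2) Power spectrum of each frame via the real FFT.
    power = np.abs(np.fft.rfft(frames, n_fft)) ** 2 / n_fft
    # 3) Mel filterbank energies, then log compression.
    log_energies = np.log(power @ mel_filterbank(n_filters, n_fft, sr).T + 1e-10)
    # 4) DCT-II decorrelates the log energies; keep the first n_ceps coefficients.
    n = np.arange(n_filters)
    dct = np.cos(np.pi * np.outer(np.arange(n_ceps), 2 * n + 1) / (2 * n_filters))
    return log_energies @ dct.T

# One second of audio at 16 kHz, the clip length used in Speech Commands.
x = np.random.randn(16000)
feats = mfcc(x)
print(feats.shape)  # (98, 13): 98 frames x 13 coefficients
```

The resulting 2-D coefficient matrix (frames x coefficients) is what a CNN like the one described above would consume as a single-channel input image.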
License
Copyright (c) 2024 Inas Jawad Kadhim, Tawfeeq E. Abdulabbas, Riyadh Ali, Ali F. Hassoon, Prashan Premaratne (Author)
This work is licensed under a Creative Commons Attribution 4.0 International License.