Visual Speech Recognition of Korean Words Using Convolutional Neural Network
Sung-Won Lee, Je-Hun Yu, Seung Min Park, and Kwee-Bo Sim
Department of Electronic and Electrical Engineering, Chung-Ang University, Seoul, Korea
Correspondence to: Kwee-Bo Sim (kbsim@cau.ac.kr)
Received: June 5, 2018; Revised: September 7, 2018; Accepted: September 7, 2018
This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.
Abstract
In recent studies, speech recognition performance has been greatly improved by using HMMs and CNNs. The HMM is used for statistical modeling of the voice to construct an acoustic model, and the error rate is reduced by predicting the voice from images of the mouth region using a CNN. In this paper, we propose visual speech recognition (VSR) using lip images. To implement VSR, we repeatedly recorded three subjects speaking 53 words chosen from an emergency medical service vocabulary book. To extract images of consonants, vowels, and final consonants from the recorded video, audio signals were used. The Viola–Jones algorithm was used for lip tracking on the extracted images. The lip tracking images were grouped and then classified using CNNs. To classify the components of a syllable, including consonants, vowels, and final consonants, the CNN structures used were VGG-s and a modified LeNet-5 with more layers. After all syllable components were classified, the word was found using the Euclidean distance. From this experiment, a classification rate of 72.327% over 318 total testing words was obtained when VGG-s was used. When LeNet-5 was applied as the word classifier, however, the classification rate was 22.327%.
Keywords: Convolutional neural network, Human–robot interaction, Korean word recognition, Viola–Jones algorithm, Visual speech recognition
1. Introduction
Many people have an interest in service robots owing to developments in artificial intelligence (AI). Thus, researchers on robots or AI are developing various robots to recognize human expressions, emotions, and speech. Such research is called human–robot interaction (HRI) [1, 2]. HRI can be applied to various fields such as factories, hospitals, and amusement parks.
To implement an HRI system, speech recognition is one of the important aspects. People require speech recognition because many products use this technology—e.g., vehicle navigation, voice recognition services on cell phones, and voice searching on the Internet. However, these speech recognition programs have the problem of inaccuracy [3–6]. In the presence of noise, these programs cannot hear and analyze the command of the user. Thus, such programs are not used in emergency situations.
To overcome this problem, many ideas have been proposed by researchers. One solution to the speech recognition problem is visual speech recognition (VSR). Current speech recognition technology uses people's voices. In contrast, VSR uses lip shapes to improve the accuracy of speech recognition.
Thus, VSR is used in human–computer interaction, speaker recognition, audio-visual speech recognition, sign language recognition, and video surveillance for convenience [5, 7]. VSR technology has two approaches: the visemic approach and the holistic approach. The visemic approach is the conventional and common method.
The visemic approach uses the mouth shapes of a word's phonemes. In contrast, the holistic approach uses the whole word. Thus, the holistic method gives a better result than the visemic approach [7]. However, the holistic method has not yet been developed for the Korean language.
In Korean, a syllable of a word consists of three parts. Figure 1 shows the structure of a word. The example word has three consonants, three vowels, and two final consonants. A syllable generally consists of a consonant, a vowel, and a final consonant. The pronunciation of a syllable is also a sequence of a consonant, vowel, and final consonant [8]. However, using only the information from the consonant, vowel, and final consonant images cannot determine the correct word, because the information from the consonant and final consonant is not exact and shows no difference except for bilabials [9].
In this paper, a holistic approach and the images of a syllable were combined to solve this problem. To use the holistic approach in Korean, the consonant, vowel, and final consonant parts of words were categorized. In addition, the words were classified by collecting the classification results of a syllable's components in temporal order.
Fifty-three Korean words were chosen for the holistic approach and recorded using a camera. The 53 Korean words were selected from an emergency medical service vocabulary book published by the National Emergency Medical Center in Korea [10]. Using the Viola–Jones detection algorithm, the lip shapes of the subjects were found. From the lip shapes, the 53 words were classified by a convolutional neural network.
2. Related Work
Previous research on speech recognition has focused on improving accuracy. Therefore, many results on speech recognition have been proposed using VSR for decades. For our VSR, the classification algorithm, lip extraction, and VSR method were investigated. In VSR, lip tracking is important because extraction of the lip can simplify classification and recognition. In the section on the VSR method, VSR methods from the literature are explained.
2.1 Convolutional Neural Network
For classification of Korean words, a convolutional neural network was used. The convolutional neural network is a powerful classification algorithm developed in 1998 by LeCun et al. [11]. However, the convolutional neural network attracted few researchers until a few years ago owing to the number of operations required for classification. Now, given the development of computer hardware, many researchers have paid attention to convolutional neural network theory.
A convolutional neural network has three types of stages: the convolution layer, the subsampling layer, and the fully connected layer. The convolution and subsampling layers are used for feature extraction from an input image. In addition, the fully connected layer classifies an input image. This is the advantage of the convolutional neural network, because it needs no separate feature extraction. Figure 2 shows a simple structure of a convolutional neural network, also known as the LeNet-5 model. However, this LeNet-5 was modified to increase the classification rates; the modified LeNet-5 has more layers than the conventional LeNet-5 [1, 2, 11].
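As an illustration of these three stage types, the following is a minimal sketch of a LeNet-5-style layer stack written in MatConvNet's SimpleNN format (the toolbox used later in this paper). The filter sizes and channel counts are illustrative assumptions for a 32 × 32 × 3 input with nine output classes; they are not the authors' exact modified architecture.

```matlab
% Minimal LeNet-5-style stack in MatConvNet's SimpleNN format.
% Filter sizes and channel counts are illustrative only; the paper's
% modified LeNet-5 adds more layers than shown here.
f = 1/100;                                   % small random-initialization scale
net.layers = {};
net.layers{end+1} = struct('type','conv', ...
    'weights', {{f*randn(5,5,3,20,'single'), zeros(1,20,'single')}}, ...
    'stride', 1, 'pad', 0);                  % convolution: 32x32x3 -> 28x28x20
net.layers{end+1} = struct('type','pool', 'method','max', ...
    'pool',[2 2], 'stride',2, 'pad',0);      % subsampling: 28x28 -> 14x14
net.layers{end+1} = struct('type','relu');
net.layers{end+1} = struct('type','conv', ...
    'weights', {{f*randn(5,5,20,50,'single'), zeros(1,50,'single')}}, ...
    'stride', 1, 'pad', 0);                  % convolution: 14x14x20 -> 10x10x50
net.layers{end+1} = struct('type','pool', 'method','max', ...
    'pool',[2 2], 'stride',2, 'pad',0);      % subsampling: 10x10 -> 5x5
net.layers{end+1} = struct('type','relu');
net.layers{end+1} = struct('type','conv', ...
    'weights', {{f*randn(5,5,50,9,'single'), zeros(1,9,'single')}}, ...
    'stride', 1, 'pad', 0);                  % fully connected: 5x5x50 -> 1x1x9 scores
net.layers{end+1} = struct('type','softmaxloss');   % training loss layer
```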
The convolutional neural network has other structures in addition to LeNet-5. In 2012, AlexNet was introduced and won the 2012 ImageNet Large-Scale Visual Recognition Challenge (ILSVRC). AlexNet used two GPUs to increase performance and obtained good results in classifying images [12]. In 2014, many structures using GPUs, such as GoogLeNet and VGGNet, were influenced by AlexNet. Moreover, GoogLeNet won the 2014 ILSVRC, and VGGNet ranked second [13–15].
2.2 Visual Speech Recognition Method
For VSR, various methods have been proposed. Most approaches used extracted mouth images. However, the methods proposed by various researchers differ in how they classify mouth images or extract the mouth from an image.
In 1994, Bregler and Konig [16] introduced word classification using "Eigenlips". The authors used an energy function of the measured image features and a contour model. To classify 2,955 German words, they used a multilayer perceptron and a hidden Markov model (HMM).
In 2011, Shin et al. [17] made an interface device for a vehicle navigation system. They used not only VSR but also audiovisual speech recognition (AVSR). AVSR was generally used when the acoustic data contained significant noise. Therefore, to overcome the noise problem, they used a robust lip tracker, such as the Lucas–Kanade (LK) method, and classifiers such as the hidden Markov model, artificial neural network, and k-nearest neighbor.
In 2015, Noda et al. [5] also used AVSR. The data for classification were Japanese speech videos recorded by six males. They used a convolutional neural network for classification of lip images and a multistream HMM for AVSR. The input sizes of the convolutional neural network were 16 × 16, 32 × 32, and 64 × 64.
In 2015, Kumaravel [7] classified English words using a histogram method for features and support vector machines for classification. For recognition, the data of English words were recorded using a camera. In addition, the video included images of 33 people, both men and women.
2.3 Viola–Jones Object Detection Algorithm
The Viola–Jones object detection algorithm is one of the most popular detection algorithms. In 2001, Viola and Jones [18] developed this algorithm, whose advantages are good performance and fast processing speed. This algorithm consists of Haar features, the integral image, AdaBoost, and a cascade classifier.
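As a brief illustration of the integral image idea (an assumption-level sketch, not code from the reviewed work), the following MATLAB fragment shows how the integral image is built with two cumulative sums and how it reduces any rectangular (Haar-like) sum to four lookups; the file name and coordinates are hypothetical.

```matlab
% Sketch: integral image used by the Viola-Jones detector.
I  = im2double(rgb2gray(imread('face.png')));  % hypothetical color input frame
ii = cumsum(cumsum(I, 1), 2);                  % ii(y,x) = sum of I(1:y, 1:x)

% Sum of pixels in rows r1..r2 and columns c1..c2 (assuming r1, c1 > 1)
r1 = 50; c1 = 80; r2 = 120; c2 = 160;
rectSum = ii(r2,c2) - ii(r1-1,c2) - ii(r2,c1-1) + ii(r1-1,c1-1);
```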
In 2016, Yu and Sim [1] used the Viola–Jones algorithm to extract and classify subjects' faces. The extracted faces of the subjects were classified using the classification algorithm. They also improved the performance of the Viola–Jones algorithm using a convolutional neural network to find and extract facial points [2].
3. Experimental Method
3.1 Database
In this paper, speech videos of Korean words were recorded as classification data. The speech data of Korean words include speech videos of three males speaking the 53 words five times for training data and two times for testing data. Then, nine people (six males and three females) were recorded speaking the 53 words three times for training data and two times for testing data to check the effectiveness of VSR. In this experiment, all six males and three females are native speakers of Korean.
To record the Korean speech of the subjects, a smartphone video camera was used. The camera recorded the voice and image of the subjects with a video frame size of 1920 × 1080. The recording environment included a white wall as the background. Basic lighting was used without additional lights. The proposed experimental process is shown schematically in Figure 3.
3.2 Set of Words
To classify the words, we recorded the subjects speaking 53 words from an emergency medical service vocabulary [10]. The words were selected to test their use in emergency situations. The list of words selected for the experiment is shown in Table 1.
4. VSR Method
To classify the words using lip shape, the proposed VSR method shown in Figure 4 was used in the experiment. First, each speech sound in the recorded video was categorized into consonants, vowels, and final consonants using audio speech signal analysis. The images of consonants, vowels, and final consonants were extracted using the categorized speech sounds. A tracking algorithm found the lip images in the extracted images. Using the lip images of consonants, vowels, and final consonants, a classifier was trained and tested. The words were then classified from the results of the classifier output.
4.1 Categorization of Images
To categorize the recorded video, the video sounds were used. First, the Daum PotEncoder video encoding program was used to extract the video sounds. In the experiment, the categorized images were divided into consonants, vowels, and final consonants using the extracted sound and MATLAB. In MATLAB, the audio data had a threshold of 0.8 to eliminate noise. To find frames containing consonant images, the starting point of each syllable was used. The final consonant images were found using the endpoint of each syllable. The vowel images were extracted using the mean value of the starting point and endpoint of each syllable. Figure 5 is an example of finding each image using MATLAB and the sound file of the recorded video.
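The following is a minimal MATLAB sketch of this step under stated assumptions: the audio track has already been exported to a WAV file, the video frame rate is 30 fps, and a single syllable is processed. The file name, frame rate, and variable names are illustrative, not the authors' script.

```matlab
% Sketch: locate the consonant, vowel, and final-consonant frames of one
% syllable from the extracted audio track (amplitude threshold of 0.8 as in the text).
[x, fs] = audioread('word01.wav');            % audio samples and sampling rate
fps = 30;                                     % assumed video frame rate
env = abs(x(:,1));                            % amplitude envelope of channel 1
active = env > 0.8;                           % suppress background noise
startIdx = find(active, 1, 'first');          % syllable starting point (samples)
endIdx   = find(active, 1, 'last');           % syllable endpoint (samples)
midIdx   = round((startIdx + endIdx) / 2);    % mean of start and end points

consFrame  = max(1, round(startIdx / fs * fps)); % frame with the consonant image
vowelFrame = round(midIdx / fs * fps);           % frame with the vowel image
finalFrame = round(endIdx / fs * fps);           % frame with the final-consonant image
```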
4.2 Lip Tracker
To track the lip images, the Viola–Jones algorithm was used. Using the Viola–Jones algorithm, the faces in the subject images are extracted. The lip images are then found using the extracted facial images and the Viola–Jones algorithm. The process to detect the lip images is shown in Figure 6.
The extracted facial images were resized to 400 × 400 × 3 (RGB data) in the case of VGG-s. The lip images were also resized to 224 × 224 × 3 (RGB data) because the input size of VGG-s is 224 × 224 × 3. However, the video data were encoded into 100 MB in the case of LeNet-5. The video was then resized to 272 × 480 × 3 by the encoding. In the video, the faces of the subjects were extracted and then resized to 300 × 280. For the input size of LeNet-5, the mouth images extracted from the faces were resized to 32 × 32 × 3. To extract fixed lip images, minimum and maximum detection sizes were set for VGG-s and LeNet-5.
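A minimal sketch of this two-stage detection using the cascade (Viola–Jones) detectors in MATLAB's Computer Vision Toolbox is shown below. The detector models, merge threshold, lower-face heuristic, and file name are illustrative assumptions rather than the authors' exact settings.

```matlab
% Sketch: detect the face, then the mouth inside the face, and resize for VGG-s.
faceDetector  = vision.CascadeObjectDetector('FrontalFaceCART');   % face model
mouthDetector = vision.CascadeObjectDetector('Mouth', 'MergeThreshold', 16);

frame   = imread('frame_consonant.png');   % image extracted using the audio signal
faceBox = step(faceDetector, frame);       % [x y w h] candidate face boxes
face    = imcrop(frame, faceBox(1,:));     % keep the first detected face
face    = imresize(face, [400 400]);       % 400 x 400 x 3 face image (VGG-s case)

mouthBox = step(mouthDetector, face);      % search for the mouth inside the face
[~, k]   = max(mouthBox(:,2));             % keep the lowest candidate (lower face half)
lip      = imcrop(face, mouthBox(k,:));
lip      = imresize(lip, [224 224]);       % 224 x 224 x 3 input image for VGG-s
```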
4.3 Grouping the Pronunciations
To group the pronunciations, the grouping of the vowels, consonants, and final consonants of the syllables depended on the lip shapes and pronunciations. The method to group the vowels is shown in Table 2. The number in Table 2 is the label order.
The consonant groups consisted of bilabials such as 'm', 'b', and 'p' as well as non-bilabials. The final consonant groups consisted of the bilabial, the other final consonants, and no final consonant. The components of the consonant and final consonant were given labels. The results of the categorized lip images are shown in Figure 7.
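As an illustration only, the grouping can be encoded as simple lookup tables. The vowel label numbers follow Table 2, the consonant and final consonant codes follow the two-way and three-way splits described above, and the variable names are assumptions.

```matlab
% Sketch: label codes for the syllable components.
vowelGroup = containers.Map( ...
    {'a','ya','eo','yeo','o','yo','u','yu','eu','i','e','ae','ye','oe','wo'}, ...
    { 1,  1,   2,   2,    3,  3,   4,  4,   5,   6,  7,  7,   7,   8,   9 });
consonantGroup = containers.Map({'m','b','p','other'}, {1, 1, 1, 2});        % bilabial vs. other
finalGroup = containers.Map({'m','b','p','other','none'}, {1, 1, 1, 2, 3});  % bilabial / other / none

vowelGroup('yeo')   % returns 2
```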
To classify the lip images, the convolutional neural network is used. The structure of the convolutional neural network is VGG-s, which was developed by the University of Oxford. Using the results of the lip tracker, the lip shape images are trained. The performance of the classification is then checked using the testing lip images. The proposed method of classifying the Korean words is shown in Figure 8. The structure for word classification consists of the bounds of the vowels, consonants, and final consonants. The consonant and final consonant bounds in Figure 8 are used to classify consonants such as 'm', 'b', and 'p'. In the case of the final consonant bound, the convolutional neural network also classifies the lack of a final consonant. The vowel bound finds vowels such as 'a', 'e', 'i', 'o', and 'u'. The recorded 53 words were then classified using the Euclidean distance between the desired labels, which are sets of each component's labels, and the estimated labels, which are the results of the classification.
The Euclidean distance used to calculate the difference between the desired and the resulting output can be written as

d = sqrt( sum_i (t_i - y_i)^2 ),

where t_i is the i-th desired label of a word and y_i is the i-th estimated label produced by the classifier.
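A minimal sketch of this matching step is given below. It assumes that every vocabulary word is represented by a fixed-length row of component labels (consonant, vowel, and final consonant for each syllable, padded to a common length); the function and variable names are illustrative, not the authors' implementation.

```matlab
% Sketch: choose the vocabulary word whose desired label sequence is closest
% (in Euclidean distance) to the label sequence estimated by the classifiers.
%   desiredLabels : 53 x L matrix, one row of component labels per word
%   estimated     : 1 x L vector of labels predicted for the spoken word
function wordIdx = matchWord(desiredLabels, estimated)
    nWords = size(desiredLabels, 1);
    dist = zeros(nWords, 1);
    for w = 1:nWords
        dist(w) = norm(desiredLabels(w,:) - estimated);   % Euclidean distance
    end
    [~, wordIdx] = min(dist);               % index of the closest vocabulary word
end
```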
For classification, the VGG-s and modified LeNet-5 convolutional neural network structures are used with MatConvNet based on MATLAB [19]. In the case of LeNet-5, the Daum PotEncoder program was used for video encoding owing to the input size of 32 × 32 × 3. The input image size of LeNet-5 was 32 × 32 × 3 (RGB data), and that of VGG-s was 224 × 224 × 3 (RGB data). The number of consonant output nodes was two, the number of final consonant output nodes was three, and the number of vowel output nodes was nine.
5. Experimental Results
For each subject, 1,989 lip images were used as consonant, final consonant, and vowel training data, and 804 images were used for testing classification of the consonant, final consonant, and vowel images. The LeNet-5 and VGG-s structures were trained for 50 iterations. To train the consonant classifier using VGG-s and LeNet-5, 5,967 mouth-shape images were used in total, and 2,412 images were used for testing the consonant classifier. The same numbers of images were used for training and testing the final consonant and vowel classifiers. Training and testing data were extracted from the recorded speech videos of the three subjects. Figure 9 shows the classification results for the test images. The classification results for each subject are shown in Table 3.
From these results, the performance of the classifiers could be compared. Moreover, we found that VGG-s is the more powerful structure for VSR. Using only the lip images, a total word classification rate of 72.327% was obtained. Among the subjects, subject 2 has an 80.189% rate, which is the highest value when VGG-s is used, and the lowest value with VGG-s is 65.094%. However, a maximum value of 24.528% and a minimum value of 19.811% were obtained when LeNet-5 was used, and the total classification rate was 22.327%.
In order to check the performance of this algorithm using VGG-s, the videos of nine subjects, including the three subjects from the previous experiment, were used. The nine subjects (six males and three females) spoke the 53 words three times for training data and two times for testing data. The total number of training images was 11,850, and the number of testing lip images was 7,962. VGG-s was trained for 30 iterations. Figure 11 shows the classification rates of the 53 words. From these results, subject 2 has the highest value of 48.0769%, and the average value is 32.9554%.
6. Conclusion
The classification rates show that the VGG-s structure of the convolutional neural network was a better method for visual speech recognition than the LeNet-5 structure. In addition, the performance of the algorithm was checked with nine subjects' videos. However, there was ambiguity in the images in the case of the final consonant, because images without a final consonant and images of a final consonant other than a bilabial show no difference. The vowel images also have similar lip shapes. Moreover, we found that the label order is important in word classification because of the similar lip shapes. If the labels are assigned randomly, the word classification results will differ and will not be good, because the Euclidean distance is used. In future research, other classification algorithms such as GoogLeNet, the deep belief network, and the restricted Boltzmann machine will be used for classification of Korean words. We plan to implement and use a new algorithm for accurate detection and extraction of lip images. Furthermore, studies on reducing the training time of the convolutional neural network will be conducted. The results of this research show that a machine can recognize a person's speech. With further research and experiments, this technology will be able to help speech-impaired and elderly people who have difficulty speaking. Furthermore, it will be possible to help people in noisy emergency situations and to support crime prevention. Thus, visual speech recognition has the potential to be adopted in various human–robot interaction areas and in assistive devices for rehabilitation.
Conflict of Interest
No potential conflict of interest relevant to this article was reported.
Figures
Fig. 2.
Example of a convolutional neural network.
Fig. 3.
The method of recording the Korean speech.
Fig. 6.
Procedure to extract the face and lip images using the Viola–Jones detection algorithm: (a) extracted image using the audio data, (b) face detection and extraction using the Viola–Jones detection algorithm, (c) mouth detection using the Viola–Jones detection algorithm, and (d) extracted mouth.
Fig. 7.
Results of the grouping method and the lip images of the vowels, consonants, and final consonants.
Fig. 8.
Method of word classification.
Fig. 9.
Classification results of pronunciation.
Fig. 10.
Classification results of 53 words by three subjects.
Fig. 11.
Classification results of 53 words.
Tables
Table. 1.
Table 1. Selected 53 words in emergency medical service vocabulary.
Korean | Pronunciation | English |
---|---|---|
가려움 | garyeoum | itch |
가슴 | gaseum | breast |
간호사 | ganhosa | nurse |
감각이상 | gamgag isang | paresthesia |
경련 | gyeonglyeon | convulsion |
경찰 | gyeongchal | police |
고름 | goleum | pus |
고열 | goyeol | high fever |
고혈압 | gohyeol-ab | high blood pressure |
골절 | goljeol | fracture |
구급차 | gugeubcha | ambulance |
구토 | guto | throw up |
긴급 | gingeub | emergency |
내장 | naejang | guts |
뇌진탕 | noejintang | concussion |
당뇨 | dangnyo | diabetes |
도와주세요 | dowajuseyo | help |
사고 | sago | accident |
살려주세요 | sallyeojuseyo | please spare |
설사 | seolsa | diarrhea |
소생 | sosaeng | revival |
소생술 | sosaengsul | resuscitation |
식중독 | sigjungdog | food poisoning |
신고 | singo | notify |
실신 | silsin | faint |
심폐소생술 | simpye sosaengsul | CPR |
어지럼 | eojileom | dizziness |
엠블런스 | embyulleonseu | ambulance |
열 | yeol | fever |
염증 | yeomjeung | inflammation |
응급실 | eung-geubsil | emergency room |
응급치료 | eung-geub chilyo | first aid |
의사 | uisa | doctor |
의식 | uisig | consciousness |
맥박 | maegbag | pulse |
멀미 | meolmi | motion sickness |
목구멍 | moggumeong | throat |
무감각 | mugamgag | stupor |
무기력 | mugilyeog | lethargy |
무의식 | muuisig | unconscious |
발열 | bal-yeol | fever |
발작 | baljag | seizure |
병원 | byeong-won | hospital |
빈혈 | binhyeol | anemia |
장염 | jang-yeom | enteritis |
저혈압 | jeohyeol-ab | hypotension |
전화 | jeonhwa | telephone |
주사 | jusa | injection |
지혈 | jihyeol | hemostasis |
진통 | jintong | throes |
환자 | hwanja | patient |
화상 | hwasang | burn |
환자 | hwanja | patient |
Table. 2.
Table 2. The grouping method to classify the vowel images.
Number | Korean | English |
---|---|---|
1 | ㅏ, ㅑ | a, ya |
2 | ㅓ, ㅕ | eo, yeo |
3 | ㅗ, ㅛ | o, yo |
4 | ㅜ, ㅠ | u, yu |
5 | ㅡ | eu |
6 | ㅣ | i |
7 | ㅔ, ㅐ, ㅖ | e, ae, ye |
8 | ㅚ | oe |
9 | ㅝ | wo |
Table. 3.
Table 3. Classification results of each subject (unit: %).
Structure | Subject1 | Subject2 | Subject3 | |
---|---|---|---|---|
VGG-s | Consonant | 93.657 | 95.149 | 94.030 |
Final consonant | 76.493 | 82.836 | 72.761 |
Vowel | 79.478 | 89.552 | 74.254 | |
Total | 83.209 | 89.179 | 80.348 | |
| ||||
LeNet-5 | Consonant | 96.269 | 94.776 | 94.776 |
Final consonant | 36.940 | 24.627 | 7.090 | |
Vowel | 48.507 | 36.194 | 22.388 | |
Total | 60.572 | 51.866 | 41.418 |
References
-
Yu, JH, and Sim, KB (2016). Face classification using cascade facial detection and convolutional neural network. Journal of Korean Institute of Intelligent Systems. 26, 70-75. https://doi.org/10.5391/jkiis.2016.26.1.070
-
Yu, JH, Ko, KE, and Sim, KB (2016). Facial point classifier using convolution neural network and cascade facial point detector. Journal of Institute of Control, Robotics and Systems. 22, 241-246. https://doi.org/10.5302/j.icros.2016.15.0156
-
Li, J, Deng, L, Gong, Y, and Haeb-Umbach, R (2014). An overview of noise-robust automatic speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 22, 745-777. https://doi.org/10.1109/TASLP.2014.2304637
-
Zhou, Z, Zhao, G, Hong, X, and Pietikainen, M (2014). A review of recent advances in visual speech decoding. Image and Vision Computing. 32, 590-605. https://doi.org/10.1016/j.imavis.2014.06.004
-
Noda, K, Yamaguchi, Y, Nakadai, K, Okuno, HG, and Ogata, T (2015). Audio-visual speech recognition using deep learning. Applied Intelligence. 42, 722-737. https://doi.org/10.1007/s10489-014-0629-7
-
Seltzer, ML, Yu, D, and Wang, Y 2013. An investigation of deep neural networks for noise robust speech recognition., Proceedings of 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, Canada, pp.7398-7402. https://doi.org/10.1109/ICASSP.2013.6639100
-
Kumaravel, SS 2015. Visual speech recognition using histogram of oriented displacements. MS thesis. Clemson University. Clemson, SC.
-
Yi, KO (1998). The internal structure of Korean syllables: rhyme or body?. Korean Journal of Experimental & Cognitive Psychology. 10, 67-83.
-
Kwon, YM (2010). Development of bilabialization (labialization). Korean Linguistics. 47, 93-130.
-
National Emergency Medical Centre (2005). Emergency Medical Dictionary. Seoul: National Emergency Medical Centre
-
LeCun, Y, Bottou, L, Bengio, Y, and Haffner, P (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE. 86, 2278-2324. https://doi.org/10.1109/5.726791
-
Krizhevsky, A, Sutskever, I, and Hinton, GE (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems. 25, 1097-1105.
-
Simonyan, K, and Zisserman, A (2014). Very deep convolutional networks for large-scale image recognition. Available: https://arxiv.org/abs/1409.1556
-
Szegedy, C, Liu, W, Jia, Y, Sermanet, P, Reed, S, Anguelov, D, Erhan, D, Vanhoucke, V, and Rabinovich, A 2015. Going deeper with convolutions., Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, pp.1-9. https://doi.org/10.1109/CVPR.2015.7298594
-
Long, J, Shelhamer, E, and Darrell, T 2015. Fully convolutional networks for semantic segmentation., Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, pp.3431-3440.
-
Bregler, C, and Konig, Y 1994. "Eigenlips" for robust speech recognition., Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, Adelaide, Australia, pp.669-672. https://doi.org/10.1109/icassp.1994.389567
-
Shin, J, Lee, J, and Kim, D (2011). Real-time lip reading system for isolated Korean word recognition. Pattern Recognition. 44, 559-571. https://doi.org/10.1016/j.patcog.2010.09.011
-
Viola, P, and Jones, M 2001. Rapid object detection using a boosted cascade of simple features., Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Kauai, HI, pp.511-518. https://doi.org/10.1109/CVPR.2001.990517
-
Vedaldi, A, and Lenc, K 2015. MatConvNet: convolutional neural networks for MATLAB., Proceedings of the 23rd ACM International Conference on Multimedia, Brisbane, Australia, pp.689-692. https://doi.org/10.1145/2733373.2807412
Biographies
E-mail: sungwon8912@cau.ac.kr
E-mail: yjhoon651@cau.ac.kr
E-mail: sminpark@cau.ac.kr
E-mail: kbsim@cau.ac.kr