
Visual Speech Recognition of Korean Words Using Convolutional Neural Network

Sung-Won Lee, Je-Hun Yu, Seung Min Park, and Kwee-Bo Sim

Department of Electronic and Electrical Engineering, Chung-Ang University, Seoul, Korea

Correspondence to:
Kwee-Bo Sim (kbsim@cau.ac.kr)

Received: June 5, 2018; Revised: September 7, 2018; Accepted: September 7, 2018

This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/3.0/) which permits unrestricted noncommercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

Abstract

In recent studies, speech recognition performance has been greatly improved by using HMM and CNN. HMM studies the statistical modeling of voice to construct an acoustic model, and the error rate is reduced by predicting voice through images of the mouth region using CNN. In this paper, we propose visual speech recognition (VSR) using lip images. To implement VSR, we repeatedly recorded three subjects speaking 53 words chosen from an emergency medical service vocabulary book. To extract images of consonants, vowels, and final consonants from the recorded video, audio signals were used. The Viola–Jones algorithm was used for lip tracking on the extracted images. The lip tracking images were grouped and then classified using CNNs. To classify the components of a syllable, including consonants, vowels, and final consonants, the CNN structures used were VGG-s and a modified LeNet-5, which has more layers. After all syllable components were classified, the word was found by the Euclidean distance. From this experiment, a classification rate of 72.327% over 318 total testing words was obtained when VGG-s was used. When LeNet-5 was applied to this word classifier, however, the classification rate was 22.327%.

Keywords: Convolutional neural network, Human–robot interaction, Korean word recognition, Viola–Jones algorithm, Visual speech recognition

1. Introduction

Many people have an interest in service robots owing to developments in artificial intelligence (AI). Thus, researchers on robots or AI are developing diverse robots to recognize human expressions, emotions, and speech. Such research is called human–robot interaction (HRI) [1, 2]. HRI can be applied to various fields such as factories, hospitals, and amusement parks.

To implement an HRI system, speech recognition is one of the important aspects. People require speech recognition because many products use this technology, e.g., vehicle navigation, voice recognition services on cell phones, and voice search on the Internet. However, these programs for speech recognition have the problem of inaccuracy [3–6]. In the presence of noise, these programs cannot hear and analyze the command of the user. Thus, such programs are not used in emergency situations.

To overcome this problem, many ideas have been proposed by researchers. The solution to the speech recognition problem is visual speech recognition (VSR). Current speech recognition technology uses people's voices. In contrast, VSR uses lip shapes to improve the accuracy of speech recognition.

Thus, VSR is used in human–computer interaction, speaker recognition, audio-visual speech recognition, sign language recognition, and video surveillance for convenience [5, 7]. VSR technology has two methods of approach: the visemic approach and the holistic approach. The visemic approach is the conventional and common method.

The visemic approach uses the mouth shapes of a word's phonemes (visemes). In contrast, the holistic approach uses the whole word. Thus, the holistic method gives a better result than the visemic approach [7]. However, the holistic method has not yet been applied to the Korean language.

In Korean, the syllables of a word consist of three parts. Figure 1 shows the structure of a word. The word has three consonants, three vowels, and two final consonants. Moreover, a syllable generally consists of a consonant, a vowel, and a final consonant. The pronunciation of a syllable is also a sequence of a consonant, vowel, and final consonant [8]. However, using only the information from the consonant, vowel, and final consonant images cannot determine the correct word, because the information from the consonant and final consonant is not exact and shows no difference except for bilabials [9].

In this paper, a holistic approach and the images of a syllable were combined to solve this problem. To use the holistic approach in Korean, the consonant, vowel, and final consonant parts of words were categorized. In addition, the words were classified by collecting the classification results of a syllable's components in temporal order.

Fifty-three Korean words were chosen for the holistic approach and recorded using a camera. The 53 Korean words were selected from an emergency medical service vocabulary book that was published by the National Medical Center in Korea [10]. Using the Viola–Jones detection algorithm, the lip shapes of the subjects were found. From the lip shapes, the 53 words were classified by a convolutional neural network.

2. Related Work

Previous research on speech recognition has focused on improving accuracy. Therefore, many results on speech recognition have been proposed using VSR for decades. For our VSR, the classification algorithm, lip extraction, and VSR method were investigated. In VSR, lip tracking is important because extraction of the lips can simplify classification and recognition. In the section on the VSR method, VSR methods from the literature are explained.

2.1 Convolutional Neural Network

For classification of Korean words, a convolutional neural network was used. The convolutional neural network is a powerful classification algorithm developed in 1998 by LeCun et al. [11]. However, the convolutional neural network attracted few researchers until a few years ago owing to its number of operations for classification. Now, given the development of computer hardware, many researchers have paid attention to convolutional neural network theory.

A convolutional neural network has three stages: the convolution layer, subsampling layer, and fully connected layer. The convolution and subsampling layers are used for feature extraction from an input image. In addition, the fully connected layer classifies an input image. This is the advantage of the convolutional neural network, because it requires no separate feature extraction. Figure 2 shows a simple structure of a convolutional neural network, also known as the LeNet-5 model. However, this LeNet-5 was modified to increase the classification rates. Moreover, this LeNet-5 has more layers than the conventional LeNet-5 [1, 2, 11].
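As a concrete illustration of these three stages, below is a minimal sketch of a small LeNet-style stack declared in MatConvNet's SimpleNN format (the toolbox used later in this paper). The filter sizes, channel counts, and the nine-way output are illustrative assumptions, not the authors' exact configuration.

% Minimal sketch (assumed configuration): convolution, subsampling, and a
% final fully connected (1x1 output) layer for a 32x32x3 lip image.
net.layers = {};
net.layers{end+1} = struct('type', 'conv', ...
    'weights', {{0.01*randn(5,5,3,20,'single'), zeros(1,20,'single')}}, ...
    'stride', 1, 'pad', 0);                          % convolution: 32x32x3 -> 28x28x20
net.layers{end+1} = struct('type', 'pool', 'method', 'max', ...
    'pool', [2 2], 'stride', 2, 'pad', 0);           % subsampling: 28x28x20 -> 14x14x20
net.layers{end+1} = struct('type', 'relu');
net.layers{end+1} = struct('type', 'conv', ...
    'weights', {{0.01*randn(14,14,20,9,'single'), zeros(1,9,'single')}}, ...
    'stride', 1, 'pad', 0);                          % fully connected: 14x14x20 -> 1x1x9
net.layers{end+1} = struct('type', 'softmaxloss');   % loss attached during training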

The convolutional neural network has other structures in addition to LeNet-5. In 2012, AlexNet was introduced and won the 2012 ImageNet Large-Scale Visual Recognition Challenge (ILSVRC). AlexNet used two GPUs to increase performance and obtained good results in classifying images [12]. In 2014, many structures using GPUs, such as GoogLeNet and VGGNet, were influenced by AlexNet. Moreover, GoogLeNet won the 2014 ILSVRC, and VGGNet ranked second [13–15].

2.2 Visual Speech Recognition Method

For VSR, various methods have been proposed. Most approaches used extracted mouth images. However, the proposed methods of various researchers differ regarding how to classify mouth images or extract the mouth from an image.

In 1994, Bregler and Konig [16] introduced word classification using "Eigenlips". The authors used an energy function of the measured image features and a contour model. To classify 2,955 German words, they used a multilayer perceptron and a hidden Markov model (HMM).

In 2011, Shin et al. [17] made an interface device for a vehicle navigation device. They used not only VSR but also audiovisual speech recognition (AVSR). AVSR was generally used when the acoustic data contained significant noise. Therefore, to overcome the noise problem, they used a robust lip tracker based on the Lucas–Kanade (LK) method and classifiers such as the hidden Markov model, artificial neural network, and k-nearest neighbor.

In 2015, Noda et al. [5] also used AVSR. The data for classification were Japanese speech videos recorded by six males. They used the convolutional neural network for classification of lip images and used a multistream HMM for AVSR. The input sizes of the convolutional neural network were 16 × 16, 32 × 32, and 64 × 64.

In 2015, Kumaravel [7] classified English words using a histogram method for features and support vector machines for classification. For recognition, the data of English words were recorded using a camera. In addition, the video included images of 33 people, both men and women.

2.3 Viola–Jones Object Detection Algorithm

The Viola–Jones object detection algorithm is one of the most popular algorithms. In 2001, Viola and Jones [18] developed this algorithm, whose advantages are good performance and fast processing speed. This algorithm consists of Haar features, the integral image, AdaBoost, and a cascade classifier.

In 2016, Yu and Sim [1] used the Viola–Jones algorithm to extract and classify subject faces. The extracted faces of subjects were classified using the classifying algorithm. They also improved the performance of the Viola–Jones algorithm using the convolutional neural network to find and extract facial points [2].

3. Experimental Method

3.1 Database

In this paper, speech videos of Korean words were recorded as classification data. The speech data of Korean words include videos of three males speaking 53 words five times for training data and two times for testing data. Then, nine people (six males and three females) were recorded speaking the 53 words three times for training data and two times for testing data to check the effectiveness of VSR. In this experiment, all nine subjects (six males and three females) are native speakers of Korean.

To record the Korean speech of the subjects, a smartphone video camera was used. The camera recorded the voice and image of the subjects with a video frame size of 1920 × 1080. The recording environment included a white wall as the background. Basic lighting was used without additional lights. The proposed experimental process is shown schematically in Figure 3.

3.2 Set of Words

To classify the words, we recorded the subjects speaking 53 words from an emergency medical service vocabulary [10]. The words were selected to test their use in emergency situations. The list of words selected for the experiment is shown in Table 1.

4. VSR Method

To classify the words using lip shape, the proposed VSR method shown in Figure 4 was used in the experiment. First, each speech sound in the recorded video was categorized into consonants, vowels, and final consonants using audio speech signal analysis. The images of consonants, vowels, and final consonants were extracted using the categorized speech sounds. A tracking algorithm found the lip images in the extracted images. Using the lip images of consonants, vowels, and final consonants, a classifier was trained and tested. The words were then classified from the results of the classifier output.

4.1 Categorization of Images

To categorize the recorded video, the video sounds were used. First, the Daum PotEncoder video encoding program was used to extract the video sounds. In the experiment, the categorized images were divided into consonants, vowels, and final consonants using the extracted sound and MATLAB. In MATLAB, the audio data had a threshold of 0.8 to eliminate noise. To find frames containing consonant image information, the starting points of each syllable were used. The final consonant images were found using the endpoint of each syllable. The vowel images were extracted using the mean value of each starting point and endpoint of the syllables. Figure 5 is an example of finding each image using MATLAB and the sound file of the recorded video.
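A minimal sketch of this segmentation step is given below. It assumes the audio track has already been extracted to a WAV file; the file name and the video frame rate are assumptions, while the threshold of 0.8 follows the text.

% Minimal sketch (assumed file name and frame rate): locate consonant,
% vowel, and final-consonant frames from the extracted audio of one word.
[x, fs] = audioread('word_audio.wav');   % audio track extracted by the encoder
env = abs(x(:, 1));                      % simple amplitude envelope
active = env > 0.8;                      % threshold of 0.8 to suppress noise

% Start and end samples of each voiced segment (one segment per syllable)
d = diff([0; active; 0]);
startSample = find(d == 1);
endSample   = find(d == -1) - 1;

videoFps = 30;                           % assumed video frame rate
consonantFrame = round(startSample / fs * videoFps);                   % syllable start
finalFrame     = round(endSample   / fs * videoFps);                   % syllable end
vowelFrame     = round((startSample + endSample) / 2 / fs * videoFps); % midpoint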

4.2 Lip Tracker

To track the lip images, the Viola–Jones algorithm was used. Using the Viola–Jones algorithm, the faces in the subject images are extracted. The lip images are then found by applying the Viola–Jones algorithm again to the extracted facial images. The process to detect the lip images is shown in Figure 6.

The extracted facial images were resized to 400 × 400 × 3 (RGB data) in the case of VGG-s. The lip images were then resized to 224 × 224 × 3 (RGB data) because the input size of VGG-s is 224 × 224 × 3. However, the video data were encoded into 100 MB in the case of LeNet-5. The video was then resized to 272 × 480 × 3 by the encoding. In the video, the faces of the subjects were extracted and then resized to 300 × 280. For the input size of LeNet-5, the mouth images extracted from the faces were resized to 32 × 32 × 3. To extract fixed lip images, minimum and maximum detection sizes were decided for VGG-s and LeNet-5.
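The following sketch shows one way this two-stage detection and resizing could be implemented with MATLAB's Computer Vision Toolbox. It is an assumed sketch rather than the authors' code; the frame file name and the merge threshold are illustrative.

% Minimal sketch (assumed implementation): cascade face detection followed
% by mouth detection inside the face region, then resizing for VGG-s.
faceDetector  = vision.CascadeObjectDetector();                    % frontal-face model
mouthDetector = vision.CascadeObjectDetector('Mouth', 'MergeThreshold', 16);

frame   = imread('vowel_frame.png');        % frame selected by the audio step
faceBox = step(faceDetector, frame);
if ~isempty(faceBox)
    faceImg  = imresize(imcrop(frame, faceBox(1, :)), [400 400]);  % face size for VGG-s
    mouthBox = step(mouthDetector, faceImg);
    if ~isempty(mouthBox)
        [~, idx] = max(mouthBox(:, 2));     % keep the lowest detection (the mouth)
        lipImg   = imresize(imcrop(faceImg, mouthBox(idx, :)), [224 224]);  % VGG-s input
    end
end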

4.3 Grouping the Pronunciations

To group the pronunciations, the vowels, consonants, and final consonants of the syllables were grouped depending on the lip shapes and pronunciations. The method to group the vowels is shown in Table 2. The number in Table 2 is the label order.

The consonant group consisted of bilabials such as 'm', 'b', and 'p' as well as non-bilabials. The final consonant group consisted of the bilabial, the other final consonants, and no final consonant. The components of the consonant and final consonant groups were given labels. The results of the categorized lip images are shown in Figure 7.
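To make the labeling concrete, a small sketch is given below. The vowel groups follow Table 2, while the exact consonant and final-consonant group numbers are assumptions for illustration.

% Minimal sketch (assumed numbering for the consonant and final-consonant
% groups): map each syllable component to its group label.
vowelGroup = containers.Map( ...
    {'a','ya','eo','yeo','o','yo','u','yu','eu','i','e','ae','ye','oe','wi'}, ...
    { 1,  1,   2,   2,   3,  3,   4,  4,   5,   6,  7,  7,   7,   8,   9 });
consonantGroup = containers.Map({'bilabial', 'other'}, {1, 2});
finalGroup     = containers.Map({'bilabial', 'other', 'none'}, {1, 2, 3});

% Example: one syllable with a bilabial onset, vowel 'a', and no final consonant.
label = [consonantGroup('bilabial'), vowelGroup('a'), finalGroup('none')];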

To classify the lip images, the convolutional neural network is used. The structure of the convolutional neural network is VGG-s, which was developed by the University of Oxford. Using the results of the lip tracker, the lip shape images are trained. The performance of the classification is then checked using the testing lip images. The proposed method of classifying the Korean language is shown in Figure 8. The structure for word classification consists of the bounds of the vowels, consonants, and final consonants. The consonant and final consonant bounds in Figure 8 are used to classify consonants such as 'm', 'b', and 'p'. In the case of the final consonant bound, the convolutional neural network also classified the lack of a final consonant. The vowel bound finds the vowels 'a', 'e', 'i', 'o', and 'u'. The 53 spoken words were then classified using the Euclidean distance between the desired labels, which are sets of each component's labels, and the estimated labels, which are the results of classification.

The Euclidean distance used to calculate the difference between the desired and resulting outputs can be described as follows:

$$\sqrt{(t_1^{c}-o_{k1}^{c})^2+(t_1^{v}-o_{k1}^{v})^2+(t_1^{f}-o_{k1}^{f})^2+\cdots+(t_n^{c}-o_{kn}^{c})^2+(t_n^{v}-o_{kn}^{v})^2+(t_n^{f}-o_{kn}^{f})^2}, \quad (1)$$

where $t_n$ and $o_{kn}$ are the resulting and desired outputs, respectively. The subscript $n$ indexes the letters (syllables) of a word, and the subscript $k$ denotes the label number of one of the 53 words. The superscripts $c$, $v$, and $f$ denote the consonant, vowel, and final consonant, respectively. From the results of (1), we selected the answer with the minimum value.
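Assuming the component labels of a word are stacked into a single vector, Eq. (1) and the minimum selection could be implemented as in the short sketch below; the variable names and data layout are assumptions.

% Minimal sketch (assumed data layout): choose the word whose desired label
% sequence is closest to the labels estimated by the classifiers.
% estLabels : 1-by-3N vector [c1 v1 f1 ... cN vN fN] estimated by the CNNs
% wordLabels: 53-by-3N matrix of desired label sequences, one row per word
dist = sqrt(sum(bsxfun(@minus, wordLabels, estLabels).^2, 2));  % Eq. (1) for every word
[~, bestWord] = min(dist);                                      % index of the recognized word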

For classification, the VGG-s and modified LeNet-5 convolutional neural network structures are used with MatConvNet based on MATLAB [19]. In the case of LeNet-5, the Daum PotEncoder program was used for video encoding owing to its input size of 32 × 32 × 3. The input image size of LeNet-5 was 32 × 32 × 3 (RGB data), and the size for VGG-s was 224 × 224 × 3 (RGB data). The number of consonant output nodes was two, the number of final consonant nodes was three, and the number of vowel nodes was nine.
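At test time, each lip image is passed through the trained network, and the component label is taken as the index of the largest output. A minimal sketch using MatConvNet's vl_simplenn is shown below; the trained SimpleNN model net and the cropped lip image lipImg are assumed to exist.

% Minimal sketch (assumed variables): classify one lip image with a trained
% MatConvNet SimpleNN model to obtain a component label.
im  = single(imresize(lipImg, [224 224]));   % 32x32 for LeNet-5, 224x224 for VGG-s
res = vl_simplenn(net, im);                  % forward pass
scores = squeeze(res(end).x);                % output of the last layer
[~, componentLabel] = max(scores);           % estimated consonant/vowel/final label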

5. Experiment Result

The 1,989 lip images were used for consonant, final consonant, and vowel training data. For testing classification of the consonant, final consonant, and vowel images, 804 images were used. The LeNet-5 and VGG-s structures had 50 iterations for training. To train the consonant using VGG-s and LeNet-5, 5,967 images of mouth shapes were used. In addition, 2,412 images were used for testing the consonant. For training and testing the final consonant and vowel, the same images were used. Training and testing data were extracted from the recorded speech videos of three subjects. Figure 9 shows the classification results of the test images. The classification results of each subject are shown in Table 3.

From these results, the performance of these classifiers could be compared. Moreover, we could see that VGG-s is the more powerful structure for VSR. Using only the lip images, a total classification rate of 72.327% was obtained. In the result, subject 2 has an 80.189% rate, which is the highest value using VGG-s. When VGG-s was used, the lowest value was 65.094%. However, a maximum value of 24.528% and a minimum value of 19.811% were obtained when LeNet-5 was used. In addition, the total classification rate was 22.327%.

To check the performance of this algorithm using VGG-s, videos of nine subjects, which included the three subjects from the previous experiments, were used. The 53-word videos of the nine subjects (six males and three females), recorded three times for training data and two times for testing data, were used in this experiment. The total number of training images was 11,850, and the number of testing lip images was 7,962. VGG-s had 30 iterations for training. Figure 10 shows the classification rates of the 53 words by three subjects. From these results, subject 2 has 48.0769%, which is the highest value, and the average value is 32.9554%.

6. Conclusion

The classification rates show that VGG-s, a convolutional neural network structure, was a better method than the LeNet-5 structure for visual speech recognition. In addition, the performance of this algorithm was checked on nine subjects' videos. However, there was ambiguity in the images in the case of the final consonant, because images without a final consonant and images of a non-bilabial final consonant show no difference. The vowel images also have similar lip shapes. Moreover, we found that the label order is important because of the similar lip shapes in word classification. If the labels are decided randomly, the word classification results will differ and will not be good, because the Euclidean distance was used. In future research, other classification algorithms such as GoogLeNet, the deep belief network, and the restricted Boltzmann machine will be used for classification of Korean words. We plan to implement and use a new algorithm for accurate detection and extraction of lip images. Furthermore, studies on reducing the delay time needed for training the convolutional neural network will be conducted. The results of this research showed that a machine can recognize a person's speech. With further research and experiments, this technology will be able to help speech-impaired people and elderly people who have difficulty speaking. Furthermore, it will be possible to help people in noisy emergency situations and in crime prevention. Thus, visual speech recognition has the potential to be adopted in various human–robot interaction areas and in assistive devices for rehabilitation.

Conflict of Interest

No potential conflict of interest relevant to this article was reported.

Figures

Fig. 2.

Example of a convolutional neural network.


Fig. 3.

The method of recording the Korean speech.


Fig. 6.

Procedure to extract the face and lip images using the Viola–Jones detection algorithm: (a) extracted image using the audio data, (b) face detection and extraction using the Viola–Jones detection algorithm, (c) mouth detection using the Viola–Jones detection algorithm, and (d) extracted mouth.


Fig. 7.

Results of the grouping method and the lip images of the vowels, consonants, and final consonants.


Fig. 8.

Method of word classification.


Fig. 9.

Classification results of pronunciation.


Fig. 10.

Classification results of 53 words by three subjects.


Fig. 11.

Classification results of 53 words.


Tables

Table. 1.

Table 1. Selected 53 words in emergency medical service vocabulary.

Korean Pronunciation English
가려움 garyeoum itch
가슴 gaseum breast
간호사 ganhosa nurse
감각이상 gamgag isang paresthesia
경련 gyeonglyeon convulsion
경찰 gyeongchal police
고름 goleum pus
고열 goyeol high fever
고혈압 gohyeol-ab high blood pressure
골절 goljeol fracture
구급차 gugeubcha ambulance
구토 guto throw up
긴급 gingeub emergency
내장 naejang guts
뇌진탕 noejintang concussion
당뇨 dangnyo diabetes
도와주세요 dowajuseyo help
사고 sago accident
살려주세요 sallyeojuseyo please spare
설사 seolsa diarrhea
소생 sosaeng revival
소생술 sosaengsul resuscitation
식중독 sigjungdog food poisoning
신고 singo notify
실신 silsin faint
심폐소생술 simpye sosaengsul CPR
어지럼 eojileom dizziness
엠블런스 embyulleonseu ambulance
열 yeol fever
염증 yeomjeung inflammation
응급실 eung-geubsil emergency room
응급치료 eung-geub chilyo first aid
의사 uisa doctor
의식 uisig consciousness
맥박 maegbag pulse
멀미 meolmi motion sickness
목구멍 moggumeong throat
무감각 mugamgag stupor
무기력 mugilyeog lethargy
무의식 muuisig unconscious
발열 bal-yeol fever
발작 baljag seizure
병원 byeong-won hospital
빈혈 binhyeol anemia
장염 jang-yeom enteritis
저혈압 jeohyeol-ab hypotension
전화 jeonhwa telephone
주사 jusa injection
지혈 jihyeol hemostasis
진통 jintong throes
환자 hwanja patient
화상 hwasang burn
환자 hwanja patient

Table. 2.

Table 2. The grouping method to classify the vowel images.

Number Korean English
1 ㅏ, ㅑ a, ya
2 ㅓ, ㅕ eo, yeo
3 ㅗ, ㅛ o, yo
4 ㅜ, ㅠ u, yu
5 ㅡ eu
6 ㅣ i
7 ㅔ, ㅐ, ㅖ e, ae, ye
8 ㅚ oe
9 ㅟ wi

Table. 3.

Table 3. Classification results of each subject (unit: %).

Structure Subject1 Subject2 Subject3
VGG-s Consonant 93.657 95.149 94.030
Final consonant 76.493 82.836 72.761
Vowel 79.478 89.552 74.254
Total 83.209 89.179 80.348

LeNet-5 Consonant 96.269 94.776 94.776
Final consonant 36.940 24.627 7.090
Vowel 48.507 36.194 22.388
Total 60.572 51.866 41.418

References

  1. Yu, JH, and Sim, KB (2016). Face classification using cascade facial detection and convolutional neural network. Journal of Korean Institute of Intelligent Systems. 26, 70-75. https://doi.org/10.5391/jkiis.2016.26.1.070

  2. Yu, JH, Ko, KE, and Sim, KB (2016). Facial point classifier using convolution neural network and cascade facial point detector. Journal of Institute of Control, Robotics and Systems. 22, 241-246. https://doi.org/10.5302/j.icros.2016.15.0156

  3. Li, J, Deng, L, Gong, Y, and Haeb-Umbach, R (2014). An overview of noise-robust automatic speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing. 22, 745-777. https://doi.org/10.1109/TASLP.2014.2304637

  4. Zhou, Z, Zhao, G, Hong, X, and Pietikainen, M (2014). A review of recent advances in visual speech decoding. Image and Vision Computing. 32, 590-605. https://doi.org/10.1016/j.imavis.2014.06.004

  5. Noda, K, Yamaguchi, Y, Nakadai, K, Okuno, HG, and Ogata, T (2015). Audio-visual speech recognition using deep learning. Applied Intelligence. 42, 722-737. https://doi.org/10.1007/s10489-014-0629-7

  6. Seltzer, ML, Yu, D, and Wang, Y (2013). An investigation of deep neural networks for noise robust speech recognition. Proceedings of 2013 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, Canada, pp. 7398-7402. https://doi.org/10.1109/ICASSP.2013.6639100

  7. Kumaravel, SS (2015). Visual speech recognition using histogram of oriented displacements. MS thesis. Clemson University, Clemson, SC.

  8. Yi, KO (1998). The internal structure of Korean syllables: rhyme or body?. Korean Journal of Experimental & Cognitive Psychology. 10, 67-83.

  9. Kwon, YM (2010). Development of bilabialization (labialization). Korean Linguistics. 47, 93-130.

  10. National Emergency Medical Center (2005). Emergency Medical Dictionary. Seoul: National Emergency Medical Center.

  11. LeCun, Y, Bottou, L, Bengio, Y, and Haffner, P (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE. 86, 2278-2324. https://doi.org/10.1109/5.726791

  12. Krizhevsky, A, Sutskever, I, and Hinton, GE (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems. 25, 1097-1105.

  13. Simonyan, K, and Zisserman, A (2014). Very deep convolutional networks for large-scale image recognition. Available: https://arxiv.org/abs/1409.1556

  14. Szegedy, C, Liu, W, Jia, Y, Sermanet, P, Reed, S, Anguelov, D, Erhan, D, Vanhoucke, V, and Rabinovich, A (2015). Going deeper with convolutions. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, pp. 1-9. https://doi.org/10.1109/CVPR.2015.7298594

  15. Long, J, Shelhamer, E, and Darrell, T (2015). Fully convolutional networks for semantic segmentation. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, Boston, MA, pp. 3431-3440.

  16. Bregler, C, and Konig, Y (1994). "Eigenlips" for robust speech recognition. Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing, Adelaide, Australia, pp. 669-672. https://doi.org/10.1109/icassp.1994.389567

  17. Shin, J, Lee, J, and Kim, D (2011). Real-time lip reading system for isolated Korean word recognition. Pattern Recognition. 44, 559-571. https://doi.org/10.1016/j.patcog.2010.09.011

  18. Viola, P, and Jones, M (2001). Rapid object detection using a boosted cascade of simple features. Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Kauai, HI, pp. 511-518. https://doi.org/10.1109/CVPR.2001.990517

  19. Vedaldi, A, and Lenc, K (2015). MatConvNet: convolutional neural networks for MATLAB. Proceedings of the 23rd ACM International Conference on Multimedia, Brisbane, Australia, pp. 689-692. https://doi.org/10.1145/2733373.2807412

Biographies

Sung-Won Lee received his B.S. degree in electrical and electronic engineering from Seo-Kyeong University, Seoul, Korea, in 2015. He received his M.S. degree in electrical and electronic engineering from Chung-Ang University, Seoul, Korea, in 2017. He is currently pursuing the Ph.D. degree in electrical and electronics engineering at Chung-Ang University, Seoul, Korea. His research interests include IoT, sensor networks, embedded systems, and security algorithms.

E-mail: sungwon8912@cau.ac.kr


Je-Hun Yu received his M.S. degree in electrical and electronic engineering from Chung-Ang University, Seoul, Korea, in 2017. His research interests include brain-computer interface, intention recognition, emotion recognition, intelligent robots, intelligent systems, Internet of Things, and big data.

E-mail: yjhoon651@cau.ac.kr


Seung Min Park received B.S., M.S., and Ph.D. degrees from the Department of Electrical and Electronics Engineering, Chung-Ang University, Seoul, Korea, in 2010, 2012, and 2019, respectively. In 2017 and 2018, he joined the Department of Electrical and Electronics Engineering, Chung-Ang University, as a Lecturer. His current research interests include machine learning, brain-computer interface, pattern recognition, intention recognition, and deep learning. Dr. Park was a recipient of best paper prizes from the Korean Institute of Intelligent Systems Conference in 2010, 2011, 2015, 2016, and 2018, and the Student Paper Award from the 13th International Conference on Control, Automation and Systems in 2013. He was a Session Chair of the 7th International Conference on Natural Computation and the 8th International Conference on Fuzzy Systems and Knowledge Discovery (ICNC & FSKD '11) held in Shanghai, China. He is a member of the Korean Institute of Intelligent Systems (KIIS) and the Institute of Control, Robotics and Systems (ICROS). He became an IEEE member in 2018.

E-mail: sminpark@cau.ac.kr


Kwee-Bo Sim received the Ph.D. degree in electronic engineering from the University of Tokyo, Japan, in 1990. He was a President of the Korean Institute of Intelligent Systems in 2007. Since 1991, he has been a Professor with the Department of Electrical and Electronic Engineering, Chung-Ang University, Seoul. His research interests include artificial life, ubiquitous robotics, intelligent systems, soft computing, big data, deep learning, and recognition. He is a member of the IEEE, SICE, RSJ, IEEK, KIEE, KIIS, KROS, and IEMEK, and is an ICROS Fellow.

E-mail: kbsim@cau.ac.kr

