Lip Reading
38 papers with code • 3 benchmarks • 4 datasets
Lip Reading is a task to infer the speech content in a video by using only the visual information, especially the lip movements. It has many crucial applications in practice, such as assisting audio-based speech recognition, biometric authentication and aiding hearing-impaired people.
Source: Mutual Information Maximization for Effective Lip Reading
Most implemented papers
Combining Residual Networks with LSTMs for Lipreading
We propose an end-to-end deep learning architecture for word-level visual speech recognition.
Deep Audio-Visual Speech Recognition
The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio.
End-to-end Audio-visual Speech Recognition with Conformers
In this work, we present a hybrid CTC/Attention model based on a ResNet-18 and Convolution-augmented transformer (Conformer), that can be trained in an end-to-end manner.
LRW-1000: A Naturally-Distributed Large-Scale Benchmark for Lip Reading in the Wild
This benchmark shows large variation in several aspects, including the number of samples in each class, video resolution, lighting conditions, and speakers' attributes such as pose, age, gender, and make-up.
Lipreading using Temporal Convolutional Networks
We present results on the largest publicly-available datasets for isolated word recognition in English and Mandarin, LRW and LRW1000, respectively.
AuthNet: A Deep Learning based Authentication Mechanism using Temporal Facial Feature Movements
Biometric systems based on Machine learning and Deep learning are being extensively used as authentication mechanisms in resource-constrained environments like smartphones and other small computing devices.
Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction
The lip-reading WER is further reduced to 26.9% when using all 433 hours of labeled data from LRS3 and combined with self-training.
MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition
exgc/avmust-ted • ICCV 2023
However, despite researchers exploring cross-lingual translation techniques such as machine translation and audio speech translation to overcome language barriers, there is still a shortage of cross-lingual studies on visual speech.
Estimating speech from lip dynamics
Dirivian/dynamic_lips • 3 Aug 2017
The goal of this project is to develop a limited lip reading algorithm for a subset of the English language.
XFlow: Cross-modal Deep Neural Networks for Audiovisual Classification
Our work improves on existing multimodal deep learning algorithms in two essential ways: (1) it presents a novel method for performing cross-modality (before features are learned from individual modalities) and (2) extends the previously proposed cross-connections which only transfer information between streams that process compatible data.
European Conference on Computer Vision
ECCV 2022: Computer Vision – ECCV 2022, pp 576–593
Speaker-Adaptive Lip Reading with User-Dependent Padding
- Minsu Kim (ORCID: orcid.org/0000-0002-6514-0018)
- Hyunjun Kim (ORCID: orcid.org/0000-0001-6524-8689)
- Yong Man Ro (ORCID: orcid.org/0000-0001-5306-6853)
- Conference paper
- First Online: 29 October 2022
Part of the Lecture Notes in Computer Science book series (LNCS,volume 13696)
Lip reading aims to predict speech based on lip movements alone. As it focuses on visual information to model the speech, its performance is inherently sensitive to personal lip appearances and movements. This makes lip reading models show degraded performance when they are applied to unseen speakers, due to the mismatch between training and testing conditions. Speaker adaptation techniques aim to reduce this mismatch between train and test speakers, thus guiding a trained model to focus on modeling the speech content without interference from speaker variations. In contrast to the efforts made in audio-based speech recognition for decades, speaker adaptation methods have not been well studied in lip reading. In this paper, to remedy the performance degradation of lip reading models on unseen speakers, we propose a speaker-adaptive lip reading method, namely user-dependent padding. The user-dependent padding is a speaker-specific input that can participate in the visual feature extraction stage of a pre-trained lip reading model. Therefore, the lip appearance and movement information of different speakers can be considered during visual feature encoding, adaptively for individual speakers. Moreover, the proposed method does not need 1) any additional layers, 2) modification of the learned weights of the pre-trained model, or 3) the speaker labels of the training data used during pre-training. It can directly adapt to unseen speakers by learning the user-dependent padding only, in a supervised or unsupervised manner. Finally, to alleviate the speaker information insufficiency in public lip reading databases, we label the speakers of a well-known audio-visual database, LRW, and design an unseen-speaker lip reading scenario named LRW-ID. The effectiveness of the proposed method is verified on sentence- and word-level lip reading, and we show it can further improve the performance of a well-trained model with large speaker variations.
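To make the idea of a speaker-specific input at the feature extraction stage more concrete, here is a minimal Python sketch. It is not the authors' implementation. It assumes that the user-dependent padding takes the place of the zero border a convolution would normally add around its input, so that a frozen kernel sees speaker-specific values at the edges. The function names, the 3 x 3 kernel, and the toy values are all hypothetical.

```python
def pad_with_border(feature_map, border):
    """Place an H x W feature map inside an (H+2) x (W+2) border grid.

    `border` supplies the outer ring of values. With all zeros it behaves
    like ordinary zero padding; with per-speaker values it plays the role
    assumed here for user-dependent padding.
    """
    h, w = len(feature_map), len(feature_map[0])
    padded = [row[:] for row in border]  # copy so the border is not mutated
    for i in range(h):
        for j in range(w):
            padded[i + 1][j + 1] = feature_map[i][j]
    return padded


def conv3x3(padded, kernel):
    """Apply a fixed (frozen) 3 x 3 kernel to an already padded input."""
    h, w = len(padded) - 2, len(padded[0]) - 2
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            out[i][j] = sum(
                padded[i + di][j + dj] * kernel[di][dj]
                for di in range(3)
                for dj in range(3)
            )
    return out


# Toy data: the kernel stands in for pre-trained weights and never changes.
feature_map = [[1.0, 2.0], [3.0, 4.0]]
kernel = [[1.0] * 3 for _ in range(3)]

zero_border = [[0.0] * 4 for _ in range(4)]
user_border = [[0.5] * 4 for _ in range(4)]  # hypothetical per-speaker values

baseline = conv3x3(pad_with_border(feature_map, zero_border), kernel)
adapted = conv3x3(pad_with_border(feature_map, user_border), kernel)
```

In this toy setting only `user_border` differs between speakers. The kernel is shared and untouched, which mirrors the abstract's point that no extra layers or weight changes are needed.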
- Visual speech recognition
- Lip reading
- Speaker-adaptive training
- Speaker adaptation
- User-dependent padding
This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2022-0-00124, Development of Artificial Intelligence Technology for Self-Improving Competency-Aware Learning Capabilities).
Authors and affiliations.
Image and Video Systems Lab, School of Electrical Engineering, KAIST, Daejeon, South Korea
Minsu Kim, Hyunjun Kim & Yong Man Ro
Correspondence to Yong Man Ro .
Editors and affiliations.
Tel Aviv University, Tel Aviv, Israel
University College London, London, UK
Google AI, Accra, Ghana
University of Catania, Catania, Italy
Giovanni Maria Farinella
Facebook (United States), Menlo Park, CA, USA
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper.
Kim, M., Kim, H., Ro, Y.M. (2022). Speaker-Adaptive Lip Reading with User-Dependent Padding. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13696. Springer, Cham. https://doi.org/10.1007/978-3-031-20059-5_33
DOI : https://doi.org/10.1007/978-3-031-20059-5_33
Published : 29 October 2022
Publisher Name : Springer, Cham
Print ISBN : 978-3-031-20058-8
Online ISBN : 978-3-031-20059-5
eBook Packages : Computer Science Computer Science (R0)
Lip Reading Sentences Using Deep Learning With Only Visual Cues
- J Acoust Soc Am
Some normative data on lip-reading skills (L)
The ability to obtain reliable phonetic information from a talker’s face during speech perception is an important skill. However, lip-reading abilities vary considerably across individuals. There is currently a lack of normative data on lip-reading abilities in young normal-hearing listeners. This letter describes results obtained from a visual-only sentence recognition experiment using CUNY sentences and provides the mean number of words correct and the standard deviation for different sentence lengths. Additionally, the method for calculating T-scores is provided to facilitate the conversion between raw and standardized scores. This metric can be utilized by clinicians and researchers in lip-reading studies. This statistic provides a useful benchmark for determining whether an individual’s lip-reading score falls within the normal range, or whether it is above or below this range.
Evidence from studies in audiovisual speech perception has shown that visual speech cues, provided by optical information in the talker’s face, have facilitatory effects in terms of accuracy across a wide range of auditory signal-to-noise ratios (Grant and Seitz, 1998; Sumby and Pollack, 1954). In their seminal study, Sumby and Pollack reported that the obtained benefit from the visual speech signal is related to the quality of the auditory information, with a more noticeable gain observed for lower signal-to-noise ratios. Their findings are theoretically important as they also show that visual information provides benefits across many signal-to-noise ratios.
In a more recent study involving neural measures of visual enhancement, van Wassenhove et al. (2005) compared peak amplitudes in an EEG task and observed that lip-reading information “speeds up” the neural processing of auditory speech signals. The effects of visual enhancement of speech continue to be explored using more recent methodologies and tools (for an analysis using fMRI, see also Bernstein et al., 2002). Although numerous studies have demonstrated that visual information about speech enhances and facilitates auditory recognition of speech in both normal-hearing and clinical populations (Bergeson and Pisoni, 2004; Kaiser et al., 2003), there is currently a lack of basic information regarding more fundamental aspects of visual speech processing. Most surprising, perhaps, is that even after decades of research there are no normative data on lip-reading ability available to researchers and clinicians to serve as benchmarks of performance.
When a researcher obtains a cursory assessment of lip-reading ability, how does the score compare to the rest of the population? Simply put, what exactly constitutes a “good” or otherwise above-average lip-reader? Although there is a growing body of literature investigating perceptual and cognitive factors associated with visual-only performance (e.g., Bernstein et al., 1998 ; Feld and Sommers, 2009 ) exactly what constitutes superior, average, and markedly below-average lip-reading ability has yet to be quantified in any precise manner. Auer and Bernstein (2007) did report lip-reading data from sentence recognition tasks using normal-hearing and hearing-impaired populations, and provided some initial descriptive statistics from both populations. In this letter, we go a step further by reporting standardized T-scores, and, additionally, recognition scores for sentences of different word lengths.
Visual-only sentence recognition
To answer the question of what accuracy level makes a good lip reader, we carried out a visual-only sentence recognition task designed to assess lip-reading skills in an ecologically valid manner. Eighty-four young normal-hearing undergraduates were presented with 25 CUNY sentences (with the auditory track removed) of variable length spoken by a female talker (Boothroyd et al., 1988). The use of CUNY sentence materials provides a more ecologically valid measure of language processing than the perception of words or syllables in isolation. One potential objection to using sentences is their predictability. However, language processing requires both sensory processing and the integration of contextual information over time. Therefore, the use of less predictable or anomalous sentences might be of interest in future studies, although it remains beyond the scope of our present report. We shall now describe the details of the study, and provide the results and method for converting raw scores to T-scores.
Eighty-four college-age participants were recruited at Indiana University and were either given course credit or paid for their participation. All participants reported normal hearing and had normal or corrected vision at the time of testing.
The stimulus set consisted of 25 sentences obtained from a database of pre-recorded audiovisual sentences (CUNY sentences) ( Boothroyd et al., 1988 ) spoken by a female talker. The auditory track was removed from each of the 25 sentences using Final Cut Pro HD. The set of 25 sentences was then subdivided into the following word lengths: 3, 5, 7, 9, and 11 words with five sentences for each length. We did this because sentence length naturally varies in everyday conversation. Sentences were presented randomly for each participant and we did not provide any cues with regard to sentence length or semantic content. The sentence materials are shown in the Appendix.
Design and procedure
Data from the 25 visual-only sentences were obtained from a pre-screening session in two experiments designed to test hypotheses related to visual-only sentence recognition abilities. The stimuli were digitized from a laser video disk and rendered into a 720 × 480 pixel movie at a rate of 30 frames∕s. The movies were displayed on a Macintosh monitor with a refresh rate of 75 Hz. Participants were seated approximately 16–24 in. from the computer monitor. Each trial began with the presentation of a fixation cross (+) for approximately 500 ms followed by a video of a female talker, with the sound removed, speaking one of the 25 sentences listed in the Appendix. After the talker finished speaking the sentence, a dialog box appeared in the center of the screen instructing the participant to type in the words they thought the talker said by using a keyboard. Each sentence was given to the participant only once. No feedback was provided on any of the test trials.
Scoring was carried out in the following manner: If the participant correctly typed a word in the sentence, then that word was scored as “correct.” The proportion of words correct was scored across sentences. For the sentence “Is your sister in school,” if the participant typed in “Is the…” only the word “Is” would be scored as correct. In this example, one out of five words would be correct, making the proportion correct = 1∕5 = 0.20. Word order was not a critical criterion for a word to be scored as accurate. However, upon inspection of the data, participants almost never switched word order in their responses. Subject responses were manually corrected for any misspellings. These visual-only word-recognition scores provide a valuable benchmark for assessing overall lip-reading ability in individual participants and can be used as normative data for other research purposes.
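The scoring rule described above can be sketched in a few lines of Python. This is an illustrative reading of the procedure, not the authors' scoring script. It assumes that matching is case-insensitive, that ASCII punctuation is ignored, and that each typed word can earn credit for at most one target word. The function name `proportion_words_correct` is hypothetical, and misspellings are assumed to have been corrected by hand beforehand, as in the study.

```python
import string
from collections import Counter


def _tokens(text):
    """Lower-case the text, strip ASCII punctuation, and split into words."""
    table = str.maketrans("", "", string.punctuation)
    return text.lower().translate(table).split()


def proportion_words_correct(target, response):
    """Return the fraction of target words that appear in the response.

    Word order is ignored. Each response word is consumed once, so typing
    a word a single time cannot be credited against two target words.
    """
    target_words = _tokens(target)
    if not target_words:
        raise ValueError("target sentence contains no words")
    remaining = Counter(_tokens(response))
    correct = 0
    for word in target_words:
        if remaining[word] > 0:
            remaining[word] -= 1
            correct += 1
    return correct / len(target_words)


# The example from the text: only "Is" matches, so the score is 1/5.
example = proportion_words_correct("Is your sister in school", "Is the")
```

Averaging this proportion over a participant's 25 sentences would give one overall visual-only score per participant, under the same assumptions.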
The results revealed that the mean lip-reading score in visual-only sentence recognition was 12.4% correct with a standard deviation of 6.67%. Figure 1 shows a box plot of the results where the lines indicate the mean, 75th and 25th percentile, as well as 1.5 times the interquartile range. Two outliers denoted by open circles, each close to 30% correct, are also plotted. The proportion of words identified correctly was not identical across sentence length. The mean and standard deviation of the accuracy scores for each sentence length are provided in Table I. Correct identification across sentence length differed, with increased accuracy for longer sentences (up to nine words) before decreasing again for sentence lengths of 11 words [F(4,83) = 21.46, p < 0.001]. This interesting finding is consistent with the hypothesis that language processing involves the use of higher-order cognitive resources to integrate semantic context over time. Hence, shorter sentences might not provide enough contextual information, whereas longer sentences may burden working memory capacity (see Feld and Sommers, 2009). Although sentences do provide contextual cues, such information is quite difficult to obtain in the visual modality, especially when sentence length increases. Incidentally, this reasoning explains why auditory-only accuracy, but not visual-only accuracy, improves for sentence recognition compared to single-word recognition in isolation.
Table I. The mean and standard deviation of words correctly identified for each sentence length, including the mean and standard deviation collapsed across all lengths (“Overall”).
Figure 1. The line in the middle of the box shows the mean visual-only sentence recognition score across all 84 participants. The 75th and 25th percentile are represented by the line above and below the middle line, respectively. The small bars on the top and bottom denote a value of 1.5 times the interquartile range.
Conversion of raw scores to T-scores
In order to determine individual performance relative to a standard benchmark, the method and rationale for calculating T-scores will be provided. Standardized T-scores have a mean of 50 and a standard deviation of 10. These standardized scores are generally preferred by clinicians and psychometricians over Z-scores due to the relative ease of their interpretability and appeal to intuition. For example, T-scores are positive, whereas Z-scores below the mean yield negative numbers, which does not make intuitive sense for visual-only accuracy scores.
The T-scores were computed in the following manner: The overall mean was subtracted from each individual raw score and divided by the standard deviation, thereby converting the raw score into a Z-score. Taking this score, multiplying it by a factor of 10, and then adding 50 provides us with the T-score for that individual:

$$T = 10 \times \frac{X - \bar{X}}{SD} + 50$$

where X is the individual's raw score, \bar{X} is the overall mean, and SD is the standard deviation.
For example, the mean score of 12.4% correct-word recognition gives us a T-score of 50, whereas an accuracy level of just over 2% yields a T-score of 35 (1.5 standard deviations below the mean) and an accuracy level of just over 22% correct yields a T-score of 65 (1.5 standard deviations above the mean). Computing T-scores is straightforward, and the conversion can be used to turn a raw CUNY lip-reading score obtained from an open-set sentence recognition test into an interpretable standardized score. This score tells clinicians and researchers where an individual stands relative to the population of young healthy participants.
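The conversion described above can be sketched in a few lines of Python. This is a minimal illustration, assuming the normative mean (12.4%) and standard deviation (6.67%) reported in the text; the function names `z_score`, `t_score`, and `raw_from_t` are hypothetical and are not part of the original study.

```python
# Normative values reported in the text (percent of words correct).
NORM_MEAN = 12.4
NORM_SD = 6.67


def z_score(raw_percent, mean=NORM_MEAN, sd=NORM_SD):
    """Standardize a raw percent-correct score against the normative sample."""
    if sd <= 0:
        raise ValueError("standard deviation must be positive")
    return (raw_percent - mean) / sd


def t_score(raw_percent, mean=NORM_MEAN, sd=NORM_SD):
    """Convert a raw percent-correct score to a T-score (mean 50, SD 10)."""
    return 10.0 * z_score(raw_percent, mean, sd) + 50.0


def raw_from_t(t, mean=NORM_MEAN, sd=NORM_SD):
    """Invert the conversion: the raw score that corresponds to a T-score."""
    return mean + sd * (t - 50.0) / 10.0


if __name__ == "__main__":
    # Raw scores taken from the worked examples in the text.
    for raw in (12.4, 2.4, 22.4):
        print(f"raw = {raw:5.1f}%  ->  T = {t_score(raw):.1f}")
```

The inverse function is included because it makes the worked examples easy to check: a T-score of 35 maps back to a raw score of about 2.4% and a T-score of 65 to about 22.4%, in line with the "just over 2%" and "just over 22%" figures quoted above.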
Qualitatively, the scores reflect the difficulty of lip reading in an open-set sentence recognition task. Mean word-recognition accuracy scores were barely greater than 10% correct. Further, any individual who achieved a CUNY lip-reading score of 30% correct is considered an outlier, giving them a T-score of nearly 80, almost three standard deviations above the mean. A lip-reading recognition accuracy score of 45% correct places an individual about 5 standard deviations above the mean.
These results quantify the inherent difficulty in visual-only sentence recognition. One potential concern is that CUNY sentences tend to yield lower visual-only accuracy than other sentence materials (see, e.g., Auer and Bernstein, 2007). However, the major contribution of our study is that it provides clinicians and researchers with a valuable benchmark for assessing lip-reading skills using a database that has a well-established history in the clinical and behavioral research community (see, e.g., Bergeson and Pisoni, 2004; Boothroyd et al., 1988; Kaiser et al., 2003). CUNY sentences are used widely in sentence perception tasks with normal-hearing listeners, elderly listeners, and patients with cochlear implants as participants. With these results, it is now possible to quantify the lip-reading ability of an individual participant relative to a normal-hearing population.
One potentially fruitful application would be to determine where a specific hearing-impaired listener falls on a standardized distribution. Research on visual-only speech recognition, for example, has shown that individuals with a progressive, rather than sudden, hearing loss have higher lip-reading recognition scores (Bergeson et al., 2003). Other research has also demonstrated that lip-reading ability serves as an important behavioral predictor of who will benefit from a cochlear implant (see Bergeson and Pisoni, 2004). How might the scores from each individual in these populations compare with the standard scores from a normal-hearing population? These examples and numerous other scenarios suggest the importance of having some normative data on lip-reading ability readily available for the speech research community including basic researchers, as well as clinicians, who work with hearing-impaired listeners to determine strengths, weaknesses, and milestones.
Although this study only employed CUNY sentences, the use of sentences from this well-established and widely used database provided a generalized measure of language processing ability, and a T-score conversion method that should be applicable to other open-set sentence identification tasks. Future studies might consider establishing norms for visually presented anomalous sentences, isolated words, and syllables. It will also be worthwhile to obtain normative data for elderly listeners who have been found to have poorer lip-reading skills than younger listeners (Sommers et al., 2005).
This study was supported by the National Institute of Health (Grant No. DC-00111) and the National Institute of Health Speech Training (Grant No. DC-00012) and by NIMH (057717-07) and AFOSR (FA9550-07-1-0078) grants to J.T.T. We would like to acknowledge the members of the Speech Research Laboratory and James T. Townsend’s Psychological Modeling Laboratory at Indiana University for their helpful and insightful discussion. We also wish to thank two anonymous reviewers for their valuable input on an earlier version of this manuscript.
What will we make for dinner when our neighbors come over
Is your sister in school
Does your boss give you a bonus every year
Do not spend so much on new clothes
What is your recipe for cheesecake
Is your nephew having a birthday party next week
What is the humidity
Let the children stay up for Halloween
He plays the bass in a jazz band every Monday night
How long does it take to roast a turkey
Which team won
Take your vitamins every morning after breakfast
People who invest in stocks and bonds now take some risks
Those albums are very old
Aren’t dishwashers convenient
Is it snowing or raining right now
The school will be closed for Washington’s Birthday and Lincoln’s Birthday
Your check arrived by mail
Professional musicians must practice at least three hours everyday
Are whales mammals
Did the basketball game go into overtime
When he went to the dentist he had his teeth cleaned
We’ll plant roses this spring
I always mail in my loan payments on time
Sneakers are comfortable
- Auer, E. T., and Bernstein, L. E. (2007). “Enhanced visual speech perception in individual with early-onset hearing impairment,” J. Speech Lang. Hear. Res. 50, 1157–1165. doi:10.1044/1092-4388(2007/080)
- Bergeson, T. R., and Pisoni, D. B. (2004). “Audiovisual speech perception in deaf adults and children following cochlear implantation,” in The Handbook of Multisensory Processes, edited by G. A. Calvert, C. Spence, and B. E. Stein (The MIT Press, Cambridge, MA), pp. 153–176.
- Bergeson, T. R., Pisoni, D. B., Reese, L., and Kirk, K. I. (2003). “Audiovisual speech perception in adult cochlear implant users: Effects of sudden vs. progressive hearing loss,” Poster presented at the Annual Midwinter Research Meeting of the Association for Research in Otolaryngology, Daytona Beach, FL.
- Bernstein, L. E., Auer, E. T., Moore, J. K., Ponton, C., Don, M., and Singh, M. (2002). “Visual speech perception without primary auditory cortex activation,” NeuroReport 13, 31–315.
- Bernstein, L. E., Demorest, M. E., and Tucker, P. E. (1998). “What makes a good speechreader? First you have to find one,” in Hearing by Eye II: Advances in the Psychology of Speechreading and Audio-Visual Speech, edited by R. Campbell, B. Dodd, and D. Burnham (Psychology Press, Erlbaum, UK), pp. 211–227.
- Boothroyd, A., Hnath-Chisolm, T., Hanin, L., and Kishon-Rabin, L. (1988). “Voice fundamental frequency as an auditory supplement to the speechreading of sentences,” Ear Hear. 9, 306–312. doi:10.1097/00003446-198812000-00006
- Feld, J. E., and Sommers, M. S. (2009). “Lipreading, processing speed, and working memory in younger and older adults,” J. Speech Lang. Hear. Res. 52, 1555–1565. doi:10.1044/1092-4388(2009/08-0137)
- Grant, K. W., and Seitz, P. F. (1998). “Measures of auditory-visual integration in nonsense syllables and sentences,” J. Acoust. Soc. Am. 104, 2438–2450. doi:10.1121/1.423751
- Kaiser, A., Kirk, K., Lachs, L., and Pisoni, D. (2003). “Talker and lexical effects on audiovisual word recognition by adults with cochlear implants,” J. Speech Lang. Hear. Res. 46, 390–404. doi:10.1044/1092-4388(2003/032)
- Sommers, M., Tye-Murray, N., and Spehar, B. (2005). “Auditory-visual speech perception and auditory-visual enhancement in normal-hearing younger and older adults,” Ear Hear. 26, 263–275. doi:10.1097/00003446-200506000-00003
- Sumby, W. H., and Pollack, I. (1954). “Visual contribution to speech intelligibility in noise,” J. Acoust. Soc. Am. 26, 12–15.
- van Wassenhove, V., Grant, K., and Poeppel, D. (2005). “Visual speech speeds up the neural processing of auditory speech,” Proc. Natl. Acad. Sci. USA 102, 1181–1186. doi:10.1073/pnas.0408949102
Title: Sub-word Level Lip Reading with Visual Attention
Abstract: The goal of this paper is to learn strong lip reading models that can recognise speech in silent videos. Most prior works deal with the open-set visual speech recognition problem by adapting existing automatic speech recognition techniques on top of trivially pooled visual features. Instead, in this paper we focus on the unique challenges encountered in lip reading and propose tailored solutions. To this end, we make the following contributions: (1) we propose an attention-based pooling mechanism to aggregate visual speech representations; (2) we use sub-word units for lip reading for the first time and show that this allows us to better model the ambiguities of the task; (3) we propose a model for Visual Speech Detection (VSD), trained on top of the lip reading network. Following the above, we obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets, and even surpass models trained on large-scale industrial datasets by using an order of magnitude less data. Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models, significantly reducing the performance gap between lip reading and automatic speech recognition. Moreover, on the AVA-ActiveSpeaker benchmark, our VSD model surpasses all visual-only baselines and even outperforms several recent audio-visual methods.
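The abstract reports results as word error rate (WER). As a reading aid, the sketch below computes WER under its standard definition: the word-level edit distance (substitutions, deletions, and insertions) between a hypothesis and a reference transcript, divided by the number of reference words. The abstract does not say which scoring tool or text normalization the authors used, so the function name `word_error_rate` and the lowercase, whitespace-split normalization are assumptions made for illustration only.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words.

    Normalization (lowercasing, whitespace tokenization) is an assumption
    of this sketch, not something specified in the abstract.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words consumed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)
```

For instance, scoring the hypothesis "which team one" against the reference "which team won" counts one substitution over three reference words. Because insertions are counted, WER can exceed 100% when the hypothesis is much longer than the reference.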