Lip Reading

38 papers with code • 3 benchmarks • 4 datasets

Lip Reading is the task of inferring the speech content of a video using only visual information, especially the lip movements. It has many crucial applications in practice, such as assisting audio-based speech recognition, biometric authentication, and aiding hearing-impaired people.

Source: Mutual Information Maximization for Effective Lip Reading

Most implemented papers

Combining Residual Networks with LSTMs for Lipreading

We propose an end-to-end deep learning architecture for word-level visual speech recognition.

Deep Audio-Visual Speech Recognition

The goal of this work is to recognise phrases and sentences being spoken by a talking face, with or without the audio.

End-to-end Audio-visual Speech Recognition with Conformers

In this work, we present a hybrid CTC/Attention model based on a ResNet-18 and Convolution-augmented transformer (Conformer), that can be trained in an end-to-end manner.

LRW-1000: A Naturally-Distributed Large-Scale Benchmark for Lip Reading in the Wild

This benchmark shows large variation in several aspects, including the number of samples in each class, video resolution, lighting conditions, and speaker attributes such as pose, age, gender, and make-up.

Lipreading using Temporal Convolutional Networks

We present results on the largest publicly-available datasets for isolated word recognition in English and Mandarin, LRW and LRW1000, respectively.

AuthNet: A Deep Learning based Authentication Mechanism using Temporal Facial Feature Movements

Biometric systems based on Machine learning and Deep learning are being extensively used as authentication mechanisms in resource-constrained environments like smartphones and other small computing devices.

Learning Audio-Visual Speech Representation by Masked Multimodal Cluster Prediction

The lip-reading WER is further reduced to 26.9% when using all 433 hours of labeled data from LRS3 and combined with self-training.

MixSpeech: Cross-Modality Self-Learning with Audio-Visual Stream Mixup for Visual Speech Translation and Recognition

exgc/avmust-ted • ICCV 2023

However, despite researchers exploring cross-lingual translation techniques such as machine translation and audio speech translation to overcome language barriers, there is still a shortage of cross-lingual studies on visual speech.

Estimating speech from lip dynamics

Dirivian/dynamic_lips • 3 Aug 2017

The goal of this project is to develop a limited lip reading algorithm for a subset of the English language.

XFlow: Cross-modal Deep Neural Networks for Audiovisual Classification

Our work improves on existing multimodal deep learning algorithms in two essential ways: (1) it presents a novel method for performing cross-modality (before features are learned from individual modalities) and (2) extends the previously proposed cross-connections which only transfer information between streams that process compatible data.

European Conference on Computer Vision

ECCV 2022: Computer Vision – ECCV 2022, pp. 576–593

Speaker-Adaptive Lip Reading with User-Dependent Padding

  • Minsu Kim (ORCID: 0000-0002-6514-0018),
  • Hyunjun Kim (ORCID: 0000-0001-6524-8689) &
  • Yong Man Ro (ORCID: 0000-0001-5306-6853)
  • Conference paper
  • First Online: 29 October 2022


Part of the Lecture Notes in Computer Science book series (LNCS, volume 13696)

Lip reading aims to predict speech from lip movements alone. Because it relies solely on visual information to model speech, its performance is inherently sensitive to individual lip appearance and movement, so lip reading models degrade when applied to unseen speakers due to the mismatch between training and testing conditions. Speaker adaptation techniques aim to reduce this mismatch between train and test speakers, guiding a trained model to focus on modeling the speech content without being distracted by speaker variations. In contrast to the decades of effort devoted to speaker adaptation in audio-based speech recognition, such methods have not been well studied in lip reading. In this paper, to remedy the performance degradation of lip reading models on unseen speakers, we propose a speaker-adaptive lip reading method, namely user-dependent padding. The user-dependent padding is a speaker-specific input that participates in the visual feature extraction stage of a pre-trained lip reading model, so the lip appearance and movement of different speakers can be taken into account during visual feature encoding, adaptively for each speaker. Moreover, the proposed method requires neither 1) additional layers, 2) modification of the learned weights of the pre-trained model, nor 3) speaker labels for the training data used during pre-training. It adapts directly to unseen speakers by learning only the user-dependent padding, in a supervised or unsupervised manner. Finally, to alleviate the lack of speaker information in public lip reading databases, we label the speakers of a well-known audio-visual database, LRW, and design an unseen-speaker lip reading scenario named LRW-ID. The effectiveness of the proposed method is verified on sentence- and word-level lip reading, and we show that it can further improve the performance of a well-trained model with large speaker variations.

  • Visual speech recognition
  • Lip reading
  • Speaker-adaptive training
  • Speaker adaptation
  • User-dependent padding
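The core mechanism summarized in the abstract above, a speaker-specific padding fed into the visual feature extractor of a frozen, pre-trained model, can be pictured with a minimal PyTorch sketch. This is only an illustration under our own assumptions (the class name, the per-channel parameterization of the padding, and the wrapping of a single Conv2d layer are choices made for brevity), not the authors' implementation.

```python
import torch
import torch.nn as nn


class UserDependentPadding(nn.Module):
    """Illustrative sketch: wrap a frozen, pre-trained Conv2d so that its spatial
    padding is a learnable, speaker-specific value instead of zeros."""

    def __init__(self, pretrained_conv: nn.Conv2d):
        super().__init__()
        self.conv = pretrained_conv
        for p in self.conv.parameters():
            p.requires_grad = False              # pre-trained weights stay fixed
        self.ph, self.pw = self.conv.padding     # original padding size (assumed to be a tuple)
        self.conv.padding = (0, 0)               # padding is applied manually below
        # one learnable padding value per input channel (a simplification for this sketch)
        self.user_pad = nn.Parameter(torch.zeros(self.conv.in_channels, 1, 1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, height, width) lip-region frames
        b, c, h, w = x.shape
        # build a border filled with the speaker-specific padding value ...
        canvas = self.user_pad.expand(b, c, h + 2 * self.ph, w + 2 * self.pw).clone()
        # ... and place the original content in its centre
        canvas[:, :, self.ph:self.ph + h, self.pw:self.pw + w] = x
        return self.conv(canvas)


# Example: wrap the first conv of a (hypothetical) pre-trained visual front-end
conv = nn.Conv2d(1, 64, kernel_size=3, padding=1)
adapted = UserDependentPadding(conv)
out = adapted(torch.randn(2, 1, 88, 88))         # shape: (2, 64, 88, 88)
```

During adaptation, only `user_pad` would be optimized, for example with `torch.optim.Adam([adapted.user_pad], lr=1e-3)`, which mirrors the property claimed above that neither extra layers nor changes to the pre-trained weights are required.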


Acknowledgment

This work was supported by Institute of Information & communications Technology Planning & Evaluation (IITP) grant funded by the Korea government (MSIT) (No. 2022-0-00124, Development of Artificial Intelligence Technology for Self-Improving Competency-Aware Learning Capabilities).

Author information

Authors and Affiliations

Image and Video Systems Lab, School of Electrical Engineering, KAIST, Daejeon, South Korea

Minsu Kim, Hyunjun Kim & Yong Man Ro


Corresponding author

Correspondence to Yong Man Ro.

Editor information

Editors and Affiliations

Tel Aviv University, Tel Aviv, Israel

Shai Avidan

University College London, London, UK

Gabriel Brostow

Google AI, Accra, Ghana

Moustapha Cissé

University of Catania, Catania, Italy

Giovanni Maria Farinella

Facebook (United States), Menlo Park, CA, USA

Tal Hassner

Electronic Supplementary Material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 325 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Kim, M., Kim, H., Ro, Y.M. (2022). Speaker-Adaptive Lip Reading with User-Dependent Padding. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13696. Springer, Cham. https://doi.org/10.1007/978-3-031-20059-5_33


DOI: https://doi.org/10.1007/978-3-031-20059-5_33

Published: 29 October 2022

Publisher Name: Springer, Cham

Print ISBN: 978-3-031-20058-8

Online ISBN: 978-3-031-20059-5

eBook Packages: Computer Science, Computer Science (R0)



J. Acoust. Soc. Am.

Some normative data on lip-reading skills (L)

The ability to obtain reliable phonetic information from a talker’s face during speech perception is an important skill. However, lip-reading abilities vary considerably across individuals. There is currently a lack of normative data on lip-reading abilities in young normal-hearing listeners. This letter describes results obtained from a visual-only sentence recognition experiment using CUNY sentences and provides the mean number of words correct and the standard deviation for different sentence lengths. Additionally, the method for calculating T-scores is provided to facilitate the conversion between raw and standardized scores. This metric can be utilized by clinicians and researchers in lip-reading studies. This statistic provides a useful benchmark for determining whether an individual’s lip-reading score falls within the normal range, or whether it is above or below this range.

INTRODUCTION

Evidence from studies in audiovisual speech perception has shown that visual speech cues, provided by optical information in the talker’s face, have facilitatory effects on accuracy across a wide range of auditory signal-to-noise ratios (Grant and Seitz, 1998; Sumby and Pollack, 1954). In their seminal study, Sumby and Pollack reported that the benefit obtained from the visual speech signal is related to the quality of the auditory information, with a more noticeable gain observed at lower signal-to-noise ratios. Their findings are theoretically important because they also show that visual information provides benefits across many signal-to-noise ratios.

In a more recent study involving neural measures of visual enhancement, van Wassenhove et al. (2005) compared peak amplitudes in an EEG task and observed that lip-reading information “speeds up” the neural processing of auditory speech signals. The effects of visual enhancement of speech continue to be explored using more recent methodologies and tools (for an analysis using fMRI, see also Bernstein et al., 2002). Although numerous studies have demonstrated that visual information about speech enhances and facilitates auditory recognition of speech in both normal-hearing and clinical populations (Bergeson and Pisoni, 2004; Kaiser et al., 2003), there is currently a lack of basic information regarding more fundamental aspects of visual speech processing. Perhaps most surprising is that, even after decades of research, there are no normative data on lip-reading ability available to researchers and clinicians to serve as benchmarks of performance.

When a researcher obtains a cursory assessment of lip-reading ability, how does the score compare to the rest of the population? Simply put, what exactly constitutes a “good” or otherwise above-average lip-reader? Although there is a growing body of literature investigating perceptual and cognitive factors associated with visual-only performance (e.g., Bernstein et al., 1998; Feld and Sommers, 2009), exactly what constitutes superior, average, and markedly below-average lip-reading ability has yet to be quantified in any precise manner. Auer and Bernstein (2007) did report lip-reading data from sentence recognition tasks using normal-hearing and hearing-impaired populations, and provided some initial descriptive statistics for both populations. In this letter, we go a step further by reporting standardized T-scores and, additionally, recognition scores for sentences of different word lengths.

Visual-only sentence recognition

To answer the question of what accuracy level makes a good lip reader, we carried out a visual-only sentence recognition task designed to assess lip-reading skills in an ecologically valid manner. Eighty-four young normal-hearing undergraduates were presented with 25 CUNY sentences (with the auditory track removed) of variable length spoken by a female talker (Boothroyd et al., 1988). The use of CUNY sentence materials provides a more ecologically valid measure of language processing than the perception of words or syllables in isolation. One potential objection to using sentences is their predictability. However, language processing requires both sensory processing and the integration of contextual information over time. Therefore, the use of less predictable or anomalous sentences might be of interest in future studies, although it remains beyond the scope of the present report. We now describe the details of the study and provide the results and the method for converting raw scores to T-scores.

Participants

Eighty-four college-age participants were recruited at Indiana University and were either given course credit or paid for their participation. All participants reported normal hearing and had normal or corrected vision at the time of testing.

Stimulus materials

The stimulus set consisted of 25 sentences obtained from a database of pre-recorded audiovisual sentences (CUNY sentences; Boothroyd et al., 1988) spoken by a female talker. The auditory track was removed from each of the 25 sentences using Final Cut Pro HD. The set of 25 sentences was then subdivided into the following word lengths: 3, 5, 7, 9, and 11 words, with five sentences for each length. We did this because sentence length naturally varies in everyday conversation. Sentences were presented randomly for each participant, and we did not provide any cues with regard to sentence length or semantic content. The sentence materials are shown in the Appendix.

Design and procedure

Data from the 25 visual-only sentences were obtained from a pre-screening session in two experiments designed to test hypotheses related to visual-only sentence recognition abilities. The stimuli were digitized from a laser video disk and rendered into a 720 × 480 pixel movie at a rate of 30 frames/s. The movies were displayed on a Macintosh monitor with a refresh rate of 75 Hz. Participants were seated approximately 16–24 in. from the computer monitor. Each trial began with the presentation of a fixation cross (+) for approximately 500 ms, followed by a video of a female talker, with the sound removed, speaking one of the 25 sentences listed in the Appendix. After the talker finished speaking the sentence, a dialog box appeared in the center of the screen instructing the participant to type in the words they thought the talker said using a keyboard. Each sentence was presented to the participant only once. No feedback was provided on any of the test trials.

Scoring was carried out in the following manner: if the participant correctly typed a word in the sentence, then that word was scored as “correct.” The proportion of words correct was scored across sentences. For the sentence “Is your sister in school,” if the participant typed in “Is the…” only the word “Is” would be scored as correct. In this example, one out of five words would be correct, making the proportion correct 1/5 = 0.20. Word order was not a critical criterion for a word to be scored as accurate; however, upon inspection of the data, participants almost never switched word order in their responses. Subject responses were manually corrected for any misspellings. These visual-only word-recognition scores provide a valuable benchmark for assessing overall lip-reading ability in individual participants and can be used as normative data for other research purposes.
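As a concrete illustration of this scoring rule, a small Python helper could look like the sketch below. This is our own sketch, not part of the study's materials: the function name is made up, the handling of repeated words is an assumption, and misspellings are assumed to have been corrected beforehand as described above.

```python
def proportion_words_correct(response: str, target: str) -> float:
    """Score a typed response against the target sentence: a target word counts
    as correct if it appears anywhere in the response (order is not enforced)."""
    target_words = target.lower().split()
    remaining = response.lower().split()   # consume matches so repeated words are not double-counted
    correct = 0
    for word in target_words:
        if word in remaining:
            remaining.remove(word)
            correct += 1
    return correct / len(target_words)


# Example from the text: "Is the..." against "Is your sister in school" -> 0.20
print(proportion_words_correct("Is the", "Is your sister in school"))  # 0.2
```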

The results revealed that the mean lip-reading score in visual-only sentence recognition was 12.4% correct with a standard deviation of 6.67%. Figure 1 shows a box plot of the results, where the lines indicate the mean, the 75th and 25th percentiles, and 1.5 times the interquartile range. Two outliers denoted by open circles, each close to 30% correct, are also plotted. The proportion of words identified correctly was not identical across sentence lengths. The mean and standard deviation of the accuracy scores for each sentence length are provided in Table I. Correct identification differed across sentence lengths, with increased accuracy for longer sentences (up to nine words) before decreasing again for sentence lengths of 11 words [F(4,83) = 21.46, p < 0.001]. This interesting finding is consistent with the hypothesis that language processing involves the use of higher-order cognitive resources to integrate semantic context over time. Hence, shorter sentences might not provide enough contextual information, whereas longer sentences may burden working memory capacity (see Feld and Sommers, 2009). Although sentences do provide contextual cues, such information is quite difficult to obtain in the visual modality, especially when sentence length increases. Incidentally, this reasoning explains why auditory-only accuracy, but not visual-only accuracy, improves for sentence recognition compared to single-word recognition in isolation.

Table I. The mean and standard deviation of words correctly identified for each sentence length, including the mean and standard deviation collapsed across all lengths (“Overall”).

Figure 1. The line in the middle of the box shows the mean visual-only sentence recognition score across all 84 participants. The lines above and below the middle line represent the 75th and 25th percentiles, respectively. The small bars on the top and bottom denote 1.5 times the interquartile range.

Conversion of raw scores to T-scores

In order to determine individual performance relative to a standard benchmark, the method and rationale for calculating T-scores will be provided. Standardized T-scores have a mean of 50 and a standard deviation of 10. These standardized scores are generally preferred by clinicians and psychometricians over Z-scores due to the relative ease of their interpretability and appeal to intuition. For example, T-scores are positive, whereas Z-scores below the mean yield negative numbers, which does not make intuitive sense for visual-only accuracy scores.

The T-scores were computed in the following manner: the overall mean was subtracted from each individual raw score, and the difference was divided by the standard deviation, thereby converting the raw score into a Z-score. Multiplying this Z-score by a factor of 10 and then adding 50 provides the T-score for that individual:

T = 50 + 10 × (raw score − mean) / standard deviation.

For example, the mean score of 12.4% correct-word recognition gives us a T-score of 50, whereas an accuracy level of just over 2% yields a T-score of 35 (1.5 standard deviations below the mean) and an accuracy level of just over 22% correct yields a T-score of 65 (1.5 standard deviations above the mean). Computing T-scores is quite convenient, and can be utilized to convert a raw CUNY lip-reading score obtained from an open set sentence recognition test into an interpretable standardized score. This can inform clinicians and researchers where an individual stands relative to the population of young healthy participants.
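The conversion can be sketched in a few lines of Python using the normative mean (12.4%) and standard deviation (6.67%) reported above; the function name and default arguments are our own choices, not part of the original letter.

```python
def t_score(raw_percent_correct: float,
            mean: float = 12.4, sd: float = 6.67) -> float:
    """Convert a raw visual-only score (% words correct) to a T-score (T = 50 + 10 * z),
    using the normative mean and standard deviation reported above."""
    z = (raw_percent_correct - mean) / sd
    return 50 + 10 * z


print(round(t_score(12.4)))   # 50  (the mean)
print(round(t_score(2.4)))    # 35  (about 1.5 SD below the mean)
print(round(t_score(22.4)))   # 65  (about 1.5 SD above the mean)
```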

CONCLUSIONS

Qualitatively, the scores reflect the difficulty of lip reading in an open-set sentence recognition task. Mean word-recognition accuracy was barely greater than 10% correct. Further, any individual who achieved a CUNY lip-reading score of 30% correct is considered an outlier, corresponding to a T-score of nearly 80, close to three standard deviations above the mean. A lip-reading recognition accuracy score of 45% correct places an individual 5 standard deviations above the mean.

These results quantify the inherent difficulty of visual-only sentence recognition. One potential concern is that CUNY sentences tend to yield lower visual-only accuracy than other sentence materials (see, e.g., Auer and Bernstein, 2007). However, the major contribution of our study is that it provides clinicians and researchers with a valuable benchmark for assessing lip-reading skills using a database that has a well-established history in the clinical and behavioral research community (see, e.g., Bergeson and Pisoni, 2004; Boothroyd et al., 1988; Kaiser et al., 2003). CUNY sentences are widely used in sentence perception tasks with normal-hearing listeners, elderly listeners, and patients with cochlear implants as subjects. With these results, it is now possible to quantify the lip-reading ability of an individual participant relative to a normal-hearing population.

One potentially fruitful application would be to determine where a specific hearing-impaired listener falls on a standardized distribution. Research on visual-only speech recognition, for example, has shown that individuals with a progressive, rather than sudden, hearing loss have higher lip-reading recognition scores ( Bergeson et al., 2003 ). Other research has also demonstrated that lip-reading ability serves as an important behavioral predictor of who will benefit from a cochlear implant (see Bergeson and Pisoni, 2004 ). How might the scores from each individual in these populations compare with the standard scores from a normal-hearing population? These examples and numerous other scenarios suggest the importance of having some normative data on lip-reading ability readily available for the speech research community including basic researchers, as well as clinicians, who work with hearing-impaired listeners to determine strengths, weaknesses, and milestones.

Although this study only employed CUNY sentences, the use of sentences from this well-established and widely used database provided a generalized measure of language processing ability, and a T-score conversion method that should be applicable to other open-set sentence identification tasks. Future studies might consider establishing norms for visually presented anomalous sentences, isolated words, and syllables. It will also be worthwhile to obtain normative data for elderly listeners who have been found to have poorer lip-reading skills than younger listeners ( Sommers et al., 2005 ).

ACKNOWLEDGMENTS

This study was supported by the National Institute of Health (Grant No. DC-00111) and the National Institute of Health Speech Training (Grant No. DC-00012) and by NIMH (057717-07) and AFOSR (FA9550-07-1-0078) grants to J.T.T. We would like to acknowledge the members of the Speech Research Laboratory and James T. Townsend’s Psychological Modeling Laboratory at Indiana University for their helpful and insightful discussion. We also wish to thank two anonymous reviewers for their valuable input on an earlier version of this manuscript.

APPENDIX

What will we make for dinner when our neighbors come over

Is your sister in school

Does your boss give you a bonus every year

Do not spend so much on new clothes

What is your recipe for cheesecake

Is your nephew having a birthday party next week

What is the humidity

Let the children stay up for Halloween

He plays the bass in a jazz band every Monday night

How long does it take to roast a turkey

Which team won

Take your vitamins every morning after breakfast

People who invest in stocks and bonds now take some risks

Those albums are very old

Aren’t dishwashers convenient

Is it snowing or raining right now

The school will be closed for Washington’s Birthday and Lincoln’s Birthday

Your check arrived by mail

Professional musicians must practice at least three hours everyday

Are whales mammals

Did the basketball game go into overtime

When he went to the dentist he had his teeth cleaned

We’ll plant roses this spring

I always mail in my loan payments on time

Sneakers are comfortable

  • Auer, E. T., and Bernstein, L. E. (2007). “Enhanced visual speech perception in individuals with early-onset hearing impairment,” J. Speech Lang. Hear. Res. 50, 1157–1165. doi:10.1044/1092-4388(2007/080)
  • Bergeson, T. R., and Pisoni, D. B. (2004). “Audiovisual speech perception in deaf adults and children following cochlear implantation,” in The Handbook of Multisensory Processes, edited by G. A. Calvert, C. Spence, and B. E. Stein (The MIT Press, Cambridge, MA), pp. 153–176.
  • Bergeson, T. R., Pisoni, D. B., Reese, L., and Kirk, K. I. (2003). “Audiovisual speech perception in adult cochlear implant users: Effects of sudden vs. progressive hearing loss,” Poster presented at the Annual Midwinter Research Meeting of the Association for Research in Otolaryngology, Daytona Beach, FL.
  • Bernstein, L. E., Auer, E. T., Moore, J. K., Ponton, C., Don, M., and Singh, M. (2002). “Visual speech perception without primary auditory cortex activation,” NeuroReport 13, 311–315.
  • Bernstein, L. E., Demorest, M. E., and Tucker, P. E. (1998). “What makes a good speechreader? First you have to find one,” in Hearing by Eye II: Advances in the Psychology of Speechreading and Audio-Visual Speech, edited by R. Campbell, B. Dodd, and D. Burnham (Psychology Press, Erlbaum, UK), pp. 211–227.
  • Boothroyd, A., Hnath-Chisolm, T., Hanin, L., and Kishon-Rabin, L. (1988). “Voice fundamental frequency as an auditory supplement to the speechreading of sentences,” Ear Hear. 9, 306–312. doi:10.1097/00003446-198812000-00006
  • Feld, J. E., and Sommers, M. S. (2009). “Lipreading, processing speed, and working memory in younger and older adults,” J. Speech Lang. Hear. Res. 52, 1555–1565. doi:10.1044/1092-4388(2009/08-0137)
  • Grant, K. W., and Seitz, P. F. (1998). “Measures of auditory-visual integration in nonsense syllables and sentences,” J. Acoust. Soc. Am. 104, 2438–2450. doi:10.1121/1.423751
  • Kaiser, A., Kirk, K., Lachs, L., and Pisoni, D. (2003). “Talker and lexical effects on audiovisual word recognition by adults with cochlear implants,” J. Speech Lang. Hear. Res. 46, 390–404. doi:10.1044/1092-4388(2003/032)
  • Sommers, M., Tye-Murray, N., and Spehar, B. (2005). “Auditory-visual speech perception and auditory-visual enhancement in normal-hearing younger and older adults,” Ear Hear. 26, 263–275. doi:10.1097/00003446-200506000-00003
  • Sumby, W. H., and Pollack, I. (1954). “Visual contribution to speech intelligibility in noise,” J. Acoust. Soc. Am. 26, 12–15.
  • van Wassenhove, V., Grant, K. W., and Poeppel, D. (2005). “Visual speech speeds up the neural processing of auditory speech,” Proc. Natl. Acad. Sci. USA 102, 1181–1186. doi:10.1073/pnas.0408949102


Computer Science > Computer Vision and Pattern Recognition

Title: Sub-word Level Lip Reading With Visual Attention

Abstract: The goal of this paper is to learn strong lip reading models that can recognise speech in silent videos. Most prior works deal with the open-set visual speech recognition problem by adapting existing automatic speech recognition techniques on top of trivially pooled visual features. Instead, in this paper we focus on the unique challenges encountered in lip reading and propose tailored solutions. To this end, we make the following contributions: (1) we propose an attention-based pooling mechanism to aggregate visual speech representations; (2) we use sub-word units for lip reading for the first time and show that this allows us to better model the ambiguities of the task; (3) we propose a model for Visual Speech Detection (VSD), trained on top of the lip reading network. Following the above, we obtain state-of-the-art results on the challenging LRS2 and LRS3 benchmarks when training on public datasets, and even surpass models trained on large-scale industrial datasets by using an order of magnitude less data. Our best model achieves 22.6% word error rate on the LRS2 dataset, a performance unprecedented for lip reading models, significantly reducing the performance gap between lip reading and automatic speech recognition. Moreover, on the AVA-ActiveSpeaker benchmark, our VSD model surpasses all visual-only baselines and even outperforms several recent audio-visual methods.
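Contribution (1), attention-based pooling of visual speech representations, can be pictured with a generic sketch in which a single learnable query scores each frame and the pooled output is the weighted sum of the frame features. This is our own minimal illustration (the class name, the single query vector, and the toy dimensions are assumptions), not the paper's actual module.

```python
import torch
import torch.nn as nn


class AttentionPooling(nn.Module):
    """Generic sketch of attention-based pooling over per-frame visual features:
    a learnable query scores every frame and the output is their weighted sum."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Parameter(torch.randn(dim))   # learnable pooling query
        self.scale = dim ** -0.5                      # standard dot-product scaling

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, dim) visual speech representations
        scores = (frames @ self.query) * self.scale   # (batch, time) similarity to the query
        weights = scores.softmax(dim=-1).unsqueeze(-1)
        return (weights * frames).sum(dim=1)          # (batch, dim) pooled vector


# Toy usage: pool 75 frames of 512-dim features for a batch of two clips
pooled = AttentionPooling(dim=512)(torch.randn(2, 75, 512))
print(pooled.shape)  # torch.Size([2, 512])
```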


COMMENTS

  1. Lipreading

    Lipreading (28 papers with code • 7 benchmarks • 6 datasets). Lipreading is the process of extracting speech by watching the lip movements of a speaker in the absence of sound. Humans lipread all the time without even noticing; it is a big part of communication, albeit not as dominant as audio.

  2. Deep Learning-Based Automated Lip-Reading: A Survey

    Abstract: A survey of automated lip-reading approaches is presented in this paper, with the main focus on deep learning methodologies, which have proven to be more fruitful for both feature extraction and classification.

  3. (PDF) A Survey of Research on Lipreading Technology

    Lipreading is a visual speech recognition technology that recognizes the speech content based on the motion characteristics of the speaker's lips without speech signals. Therefore, lipreading can...

  4. [2110.07879] Advances and Challenges in Deep Lip Reading

    Marzieh Oghbaie, Arian Sabaghi, Kooshan Hashemifard, Mohammad Akbari. Driven by deep learning techniques and large-scale datasets, recent years have witnessed a paradigm shift in automatic lip reading.

  5. [1611.01599] LipNet: End-to-End Sentence-level Lipreading

    Lipreading is the task of decoding text from the movement of a speaker's mouth. Traditional approaches separated the problem into two stages: designing or learning visual features, and prediction. More recent deep lipreading approaches are end-to-end trainable (Wand et al., 2016; Chung & Zisserman, 2016a). However, existing work on models trained end-to-end perform only word classification ...

  6. Lipreading: A Review of Its Continuing Importance for Speech

    Here, we describe lipreading and theoretically motivated approaches to its training, as well as examples of successful training paradigms. We discuss some extensions to auditory-only (AO) and audiovisual (AV) speech recognition. Method: Visual speech perception and word recognition are described.

  7. PDF Computer Vision Lip Reading

    This project compares and contrasts several leading research papers in the realm of Lip Reading and combined Audio Visual Recognition using either small convolutional neural networks or state-of-the-art deep learning models. This dialogue includes my own thoughts on potential reasoning of their results.

  8. Lipreading using a comparative machine learning approach

    In this paper, machine learning approaches are applied to lip reading; nine different classifiers have been implemented and tested, reporting their confusion matrices among different groups of words. The classification process used more than one classifier, but these three classifiers got the best results, which are ...

  9. Review on research progress of machine lip reading

    This paper studies the development of lip reading in detail, especially the latest research results of lip reading. We focus on the lip reading datasets and their comparison, including some recently released datasets. At the same time, we introduce the feature extraction methods of lip reading and compare various methods in detail.

  10. PDF arXiv:2110.07879v1 [cs.CV] 15 Oct 2021

    In Section 2, we define lip reading as a research problem and review the usual modules of a VSR pipeline. In Section 3, popular datasets, data-related challenges, synthetic data ... For simplicity, we use 'lip reading' or VSR in the rest of this paper. Figure 1: Baseline VSR Pipeline: in a custom lip reading system, the input video usually ...

  11. Lip Reading

    Lip Reading 38 papers with code • 3 benchmarks • 4 datasets Lip Reading is a task to infer the speech content in a video by using only the visual information, especially the lip movements. It has many crucial applications in practice, such as assisting audio-based speech recognition, biometric authentication and aiding hearing-impaired people.

  12. Lip-Reading Driven Deep Learning Approach for Speech Enhancement

    This paper proposes a novel lip-reading driven deep learning framework for speech enhancement. The approach leverages the complementary strengths of both deep learning and analytical acoustic modeling (filtering-based approach) as compared to benchmark approaches that rely only on deep learning. The proposed audio-visual (AV) speech enhancement framework operates at two levels. In the first ...

  13. A Comprehensive Dataset for Machine-Learning-based Lip-Reading

    Therefore, this paper carries out research on the construction of a lip-reading dataset. First, frames are extracted from the original videos using Scikit-Video. Then face detection is performed with dlib. Lip images are captured by processing the feature points to achieve lip cropping.

  14. Speaker-Adaptive Lip Reading with User-Dependent Padding

    2.1 Lip Reading. Lip reading is the task of recognizing speech by watching lip movements only, which is regarded as one of the most challenging problems. With the great development of deep learning, many research efforts have been made to improve the performance of lip reading [23, 28, 43, 44]. In word-level lip reading, [] constructed an architecture consisting of a 3D convolution layer and a 2D ResNet ...

  15. A Review on Deep Learning-Based Automatic Lipreading

    Automatic Lip-Reading (ALR), also known as Visual Speech Recognition (VSR), is the technological process to extract and recognize speech content, based solely on the visual recognition of the...

  16. Lip Reading Sentences Using Deep Learning With Only Visual Cues

    The main contributions of this paper are: 1) The classification of visemes in continuous speech using a specially designed transformer with a unique topology; 2) The use of visemes as a classification schema for lip reading sentences; and 3) The conversion of visemes to words using perplexity analysis.

  17. Research on a Lip Reading Algorithm Based on Efficient-GhostNet

    The earliest research on lip reading originated in the 1950s, when Sumby, W. H. [1] first proposed that the series of lip movements during human speech could be used as a means of information acquisition; thus the concept of lip reading was introduced, opening a new chapter of research in the field of lip reading.

  18. (PDF) Vision based Lip Reading System using Deep Learning

    Lip reading is an approach for understanding speech by visually interpreting lip movements. A vision-based lip reading system takes a video (without audio) of a person speaking as input...

  19. Some normative data on lip-reading skills (L)

    RESULTS. The results revealed that the mean lip-reading score in visual-only sentence recognition was 12.4% correct with a standard deviation of 6.67%. Figure 1 shows a box plot of the results where the lines indicate the mean, 75th and 25th percentile, as well as 1.5 times the interquartile range.

  20. A Review on Deep Learning Based Lip-Reading

    The use of Deep Learning in lip reading is a recent concept and solves upcoming challenges in real-world such as Virtual Reality system, assisted driving systems, sign language recognition,...

  21. (PDF) Lip reading as reinforcement for speech reproduction in deaf

    Results: All 49 children completed the speech therapy sessions; 55% of the students scored high on the lip reading session (mean score 8.9 ± 4.37, range 1-19 out of 25) compared to less than 35% ...

  22. [2110.07603] Sub-word Level Lip Reading With Visual Attention

    The goal of this paper is to learn strong lip reading models that can recognise speech in silent videos. Most prior works deal with the open-set visual speech recognition problem by adapting existing automatic speech recognition techniques on top of trivially pooled visual features.

  23. Advances and Challenges in Deep Lip Reading

    This paper provides a comprehensive survey of the state-of-the-art deep learning based VSR research with a focus on data challenges, task-specific complications, and the corresponding...