Computer Engineering and Applications, 2022, Vol. 58, Issue (4): 157-162. DOI: 10.3778/j.issn.1002-8331.2009-0334

• Pattern Recognition and Artificial Intelligence •

Research on Lip Reading Based on Visual Characteristics of Chinese Pronunciation

HE Shan, YUAN Jiabin, LU Yaoyao   

  1. College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
  2. Information Department, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China
  • Online: 2022-02-15 Published: 2022-02-15




Abstract: With the development of deep learning, lip reading has made great progress for English. For Chinese, however, it still lags behind in both the richness of available datasets and recognition accuracy. Based on an analysis of the visual characteristics of Chinese pronunciation, this paper proposes "visual pinyin" to avoid the ambiguity of Chinese in its visual expression. To verify the effectiveness of visual pinyin, a Chinese sentence-level lip reading model, CHSLR-VP, is built. The model has an end-to-end structure in which visual pinyin serves as the intermediate representation for converting a sequence of video frames into the final sentence of Chinese characters. Experiments show that CHSLR-VP outperforms other lip reading methods, demonstrating that visual pinyin can significantly improve the accuracy of Chinese lip reading and providing a benchmark for future related work.

Key words: lip reading, visual pinyin, deep learning, convolutional neural networks (CNN), sequence-to-sequence model, attention mechanism
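
As a rough illustration of the idea behind visual pinyin, the sketch below merges pinyin initials that look alike on the lips into a single class, so that syllables such as "ba", "pa" and "ma" share one visual token. The grouping table, the class labels, and the function names are all hypothetical: the abstract does not give the paper's actual visual pinyin table, and only the initial is merged here while the final is kept unchanged.

```python
# Illustrative sketch only: this grouping is an assumption, NOT the paper's
# actual visual pinyin table, which the abstract does not specify.
VISUAL_GROUPS = {
    "B": ["b", "p", "m"],            # bilabials: the lips close completely
    "F": ["f"],                      # labiodental
    "D": ["d", "t", "n", "l"],       # alveolar sounds
    "G": ["g", "k", "h"],            # velars, articulated at the back of the mouth
    "J": ["j", "q", "x"],            # alveolo-palatals
    "R": ["zh", "ch", "sh", "r"],    # retroflexes
    "Z": ["z", "c", "s"],            # dental sibilants
}

# Reverse lookup: pinyin initial -> hypothetical visual class label.
INITIAL_TO_VISUAL = {
    initial: label
    for label, initials in VISUAL_GROUPS.items()
    for initial in initials
}


def split_syllable(syllable):
    """Split a toneless pinyin syllable into (initial, final).

    The initial is '' when the syllable starts with a letter that is not
    in the table (for example a vowel, 'y' or 'w').
    """
    # Two-letter initials must be checked before single letters.
    for initial in ("zh", "ch", "sh"):
        if syllable.startswith(initial):
            return initial, syllable[2:]
    if syllable[:1] in INITIAL_TO_VISUAL:
        return syllable[0], syllable[1:]
    return "", syllable


def to_visual_pinyin(syllable):
    """Replace the initial with its visual class label and keep the final."""
    initial, final = split_syllable(syllable)
    return INITIAL_TO_VISUAL.get(initial, "") + final


def sentence_to_visual_pinyin(pinyin_sentence):
    """Convert a space-separated toneless pinyin sentence token by token."""
    return [to_visual_pinyin(s) for s in pinyin_sentence.split()]
```

Under this toy table, `to_visual_pinyin("ba")`, `to_visual_pinyin("pa")` and `to_visual_pinyin("ma")` all return the same token. A video-to-visual-pinyin stage therefore would not need to separate sounds that the lips do not distinguish, and resolving the merged token to a specific Chinese character would be left to the later stage that uses sentence context.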
