Research on Lip Reading Based on Visual Characteristics of Chinese Pronunciation

doi:10.3778/j.issn.1002-8331.2009-0334

Abstract

With the development of deep learning, lip reading has made great progress for English. For Chinese, however, a large gap remains in both the richness of available datasets and recognition accuracy. Based on the visual characteristics of Chinese pronunciation, this paper proposes “visual pinyin” to avoid the ambiguity of Chinese visual expression. To verify the effectiveness of visual pinyin, a Chinese sentence-level lip reading model, CHSLR-VP, is built. The model has an end-to-end structure in which visual pinyin serves as the intermediate representation for converting a sequence of video frames into a sentence of Chinese characters. In experiments, CHSLR-VP outperforms other lip reading methods, which shows that visual pinyin can significantly improve the accuracy of Chinese lip reading. The model also provides a benchmark for future related work.
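
The idea of merging pinyin units that look alike on the lips can be illustrated with a minimal sketch. The grouping table below is a hypothetical assumption made for illustration (for example, the bilabial initials b, p and m are all produced with closed lips and are hard to tell apart visually); it is not the visual pinyin inventory defined in the paper, and the names `VISUAL_CLASS` and `to_visual_pinyin` are illustrative only.

```python
# Hypothetical grouping of pinyin initials by lip appearance.
# Illustrative assumption only; NOT the grouping defined in the paper.
VISUAL_CLASS = {
    "b": "B", "p": "B", "m": "B",            # bilabials: lips close fully
    "f": "F",                                # labiodental: lower lip meets teeth
    "d": "D", "t": "D", "n": "D", "l": "D",  # alveolars: assumed to look alike
}


def to_visual_pinyin(syllable: str) -> str:
    """Replace a syllable's initial with its (assumed) visual class.

    Syllables whose first letter is not in the table are returned unchanged.
    """
    initial = syllable[:1]
    if initial in VISUAL_CLASS:
        return VISUAL_CLASS[initial] + syllable[1:]
    return syllable


def sentence_to_visual(pinyin_syllables: list[str]) -> list[str]:
    """Map a sequence of toneless pinyin syllables to visual pinyin units."""
    return [to_visual_pinyin(s) for s in pinyin_syllables]


# Syllables that differ only in a visually similar initial collapse
# to the same unit, which removes that ambiguity from the visual target.
print(sentence_to_visual(["ba", "pa", "ma", "fa", "an"]))
```

Under this assumed table, "ba", "pa" and "ma" share one target unit, so a visual front end is not asked to separate classes it cannot see; a later stage would then have to resolve the merged units into characters from context.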

Key words: lip reading, visual pinyin, deep learning, convolutional neural networks (CNN), sequence-to-sequence model, attention mechanism

HE Shan, YUAN Jiabin, LU Yaoyao. Research on Lip Reading Based on Visual Characteristics of Chinese Pronunciation[J]. Computer Engineering and Applications, 2022, 58(4): 157-162.

何珊, 袁家斌, 陆要要. 基于中文发音视觉特点的唇语识别方法研究[J]. 计算机工程与应用, 2022, 58(4): 157-162.
