Computer Engineering and Applications ›› 2012, Vol. 48 ›› Issue (13): 118-124.

• Database, Signal and Information Processing •

Speaker tracking based on audio-video information fusion

CAO Jie, ZHENG Jingrun   

  1. College of Electrical and Information Engineering, Lanzhou University of Technology, Lanzhou 730050, China
  • Online:2012-05-01 Published:2012-05-09

Abstract: To overcome the limitations of tracking with audio or video information alone, a speaker tracking algorithm based on audio-video information fusion with an importance particle filter is proposed. The algorithm runs in a closed-loop tracking framework consisting of five modules: low-level tracking, fusion, importance particle filtering, tracking output and feedback. In the low-level tracking module, exploiting the complementarity between a speaker's speech and image, mean shift tracking based on facial skin color and sound source localization based on the time delay of arrival of the speech signal at a microphone array are performed in parallel. The fusion module integrates the two streams of tracking information to obtain an audio-video fused importance function and a fused likelihood model. The fused data are then filtered by the importance particle filter, the tracking result is output, and the result is fed back dynamically to the skin color tracking and sound source localization modules. This pipelined closed-loop process keeps the algorithm real-time. Experiments on the AMI Meeting Corpus show that the algorithm achieves an average mis-tracking rate of only 9.32% and is more robust and accurate than trackers that use audio or video information alone.
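The abstract describes the fusion and filtering steps only at a high level. The following is a minimal, hypothetical Python sketch of those steps, not the authors' implementation: it assumes a 2-D image-plane state, isotropic Gaussian noise models with hand-picked standard deviations and mixture weights, and noisy point estimates standing in for the actual mean-shift (skin color) and TDOA (microphone array) trackers.

```python
"""
Minimal sketch (not the paper's code) of the fusion and importance particle
filtering steps described in the abstract.  Assumptions not taken from the
paper: 2-D image-plane state, isotropic Gaussian noise models with hand-picked
standard deviations, and noisy point estimates in place of the real low-level
mean-shift (skin color) and TDOA (microphone array) trackers.
"""
import numpy as np

rng = np.random.default_rng(0)

N_PARTICLES = 200
DYN_STD = 8.0     # random-walk dynamics noise (pixels), assumed
AUDIO_STD = 20.0  # assumed accuracy of the TDOA sound source localizer
VIDEO_STD = 5.0   # assumed accuracy of the mean-shift skin color tracker
MIX = np.array([0.4, 0.3, 0.3])  # assumed mixture weights: dynamics / audio / video


def gauss2d(x, mean, std):
    """Isotropic 2-D Gaussian density, vectorized over the rows of x."""
    d = np.atleast_2d(x - mean)
    return np.exp(-0.5 * np.sum(d * d, axis=-1) / (std * std)) / (2.0 * np.pi * std * std)


def fused_importance_sample(prev, audio_est, video_est):
    """Fused importance function: a mixture that proposes particles from the
    motion model, around the audio estimate, and around the video estimate."""
    n = len(prev)
    comp = rng.choice(3, size=n, p=MIX)
    prop = np.empty_like(prev)
    prop[comp == 0] = prev[comp == 0] + rng.normal(0.0, DYN_STD, (np.sum(comp == 0), 2))
    prop[comp == 1] = audio_est + rng.normal(0.0, AUDIO_STD, (np.sum(comp == 1), 2))
    prop[comp == 2] = video_est + rng.normal(0.0, VIDEO_STD, (np.sum(comp == 2), 2))
    return prop


def track_step(particles, audio_est, video_est):
    """One closed-loop iteration: propose, weight, estimate, resample."""
    prop = fused_importance_sample(particles, audio_est, video_est)

    # Fused likelihood model: audio and video cues treated as conditionally
    # independent given the state, so their likelihoods multiply.
    lik = gauss2d(prop, audio_est, AUDIO_STD) * gauss2d(prop, video_est, VIDEO_STD)

    # Importance weights: likelihood * transition prior / proposal density.
    trans = gauss2d(prop, particles, DYN_STD)
    q = (MIX[0] * gauss2d(prop, particles, DYN_STD)
         + MIX[1] * gauss2d(prop, audio_est, AUDIO_STD)
         + MIX[2] * gauss2d(prop, video_est, VIDEO_STD))
    w = lik * trans / q
    s = w.sum()
    w = w / s if s > 0.0 else np.full(len(w), 1.0 / len(w))

    # Tracking output; in the closed loop this estimate would be fed back to
    # the skin color tracking and sound source localization modules.
    estimate = (w[:, None] * prop).sum(axis=0)
    idx = rng.choice(len(prop), size=len(prop), p=w)  # resample for the next step
    return prop[idx], estimate


if __name__ == "__main__":
    true_pos = np.array([160.0, 120.0])
    particles = true_pos + rng.normal(0.0, 30.0, (N_PARTICLES, 2))
    for t in range(20):
        true_pos = true_pos + np.array([2.0, 0.5])            # speaker drifts slowly
        audio_est = true_pos + rng.normal(0.0, AUDIO_STD, 2)  # stand-in for TDOA localization
        video_est = true_pos + rng.normal(0.0, VIDEO_STD, 2)  # stand-in for mean shift
        particles, est = track_step(particles, audio_est, video_est)
        print(f"t={t:02d}  estimate = ({est[0]:6.1f}, {est[1]:6.1f})")
```

Because particles are drawn from the fused importance function rather than from the motion model alone, each weight is the fused likelihood multiplied by the transition prior and divided by the proposal density; without that correction the sketch would reduce to a plain bootstrap filter.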

Key words: object tracking, sound source localization, skin color tracking, mean shift, importance particle filter