Computer Engineering and Applications ›› 2023, Vol. 59 ›› Issue (18): 154-162. DOI: 10.3778/j.issn.1002-8331.2205-0212

• Graphics and Image Processing •

Full Reference Video Quality Assessment Based on Multi-Scale Spatiotemporal Feature Aggregation

ZHANG Wei, ZHAO Shiling, LIU Yinhao, WANG Hongkui, YIN Haibing   

  1. School of Communication Engineering, Hangzhou Dianzi University, Hangzhou 310000, China
  • Online: 2023-09-15 Published: 2023-09-15




Abstract: Perceived video quality depends on how observers perceive a video at multiple time scales, yet current video quality assessment models generally describe distortion at a single fixed scale, and features of a single granularity are insufficient to represent the global information of a video. To fully extract and aggregate multi-granularity information and thereby characterize the complex human perception mechanism, this paper proposes a multi-scale spatiotemporal feature aggregation network for full-reference video quality assessment. To address the loss of key frames caused by fixed-interval sampling in traditional quality assessment algorithms, the method combines image structure distortion and perceived motion energy to sample sequences adaptively. Long short-term memory layers perform multi-scale spatiotemporal feature extraction, and features are passed between layers along forward and reverse paths. Finally, a self-attention network regresses the video quality score. On multiple datasets, the model's Spearman rank-order correlation coefficient (SRCC) exceeds 0.93, achieving the best or second-best performance on each.
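The abstract does not state the exact rule used to combine structure distortion and motion energy, so the following is only a minimal sketch of the general idea of adaptive sampling. It assumes two per-frame scores are already available, min-max normalizes them, mixes them with a weight, and keeps the highest-scoring frames in temporal order. The names `adaptive_sample`, `min_max_normalize`, and the mixing weight `alpha` are hypothetical and do not come from the paper.

```python
from typing import List, Sequence


def min_max_normalize(values: Sequence[float]) -> List[float]:
    """Scale values to [0, 1]; a constant sequence maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def adaptive_sample(
    structure_distortion: Sequence[float],
    motion_energy: Sequence[float],
    num_frames: int,
    alpha: float = 0.5,
) -> List[int]:
    """Pick the num_frames most important frames and return their indices
    in temporal order.

    Assumption: importance is a weighted sum of the two normalized scores.
    alpha is a hypothetical mixing weight, not a value from the paper.
    """
    if len(structure_distortion) != len(motion_energy):
        raise ValueError("per-frame inputs must have the same length")
    if not 0 < num_frames <= len(structure_distortion):
        raise ValueError("num_frames must be between 1 and the sequence length")

    d = min_max_normalize(structure_distortion)
    m = min_max_normalize(motion_energy)
    importance = [alpha * di + (1.0 - alpha) * mi for di, mi in zip(d, m)]

    # Rank frames by importance, keep the top ones, then restore time order.
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:num_frames])


# Example with made-up per-frame scores for a six-frame clip.
distortion = [0.1, 0.9, 0.2, 0.1, 0.8, 0.1]
motion = [0.0, 0.5, 0.1, 0.0, 1.0, 0.2]
selected = adaptive_sample(distortion, motion, num_frames=2)
```

Unlike fixed-interval sampling, which would pick frames such as 0 and 3 regardless of content, this selection follows the per-frame scores, so heavily distorted or high-motion frames are less likely to be skipped.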

Key words: video quality assessment, adaptive sampling, visual neural perception, feature pyramid, multi-scale spatiotemporal feature, long short-term memory network

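Abstract (translated from the Chinese): A video quality score is the result of an observer perceiving the video at multiple time scales, whereas current quality assessment models generally describe distortion at a single fixed scale, and single-granularity features do not adequately represent global information. To fully extract and aggregate multi-granularity information and characterize the complex human perception mechanism, a full-reference video quality assessment method based on multi-scale spatiotemporal feature aggregation is proposed. To address the loss of key frames under fixed-interval sampling in traditional quality assessment algorithms, sequences are sampled adaptively by combining image structure distortion with perceived motion energy. To extract features of different granularities that characterize distortion, and to explore effective ways of aggregating multi-granularity features, stacked long short-term memory layers extract features from the sequence, and the features of the network layers are fused by simulating the forward and reverse perceptual iteration mechanism of the visual nervous system. A multi-channel self-attention network then regresses the predicted score. The model's SRCC exceeds 0.93 on multiple datasets, achieving the best or second-best performance.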
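For readers unfamiliar with the reported metric, SRCC (Spearman rank-order correlation coefficient) measures how well the ordering of predicted scores matches the ordering of subjective scores. The sketch below computes it as the Pearson correlation of average ranks, using only the standard library. The function names and the sample scores are illustrative and are not taken from the paper or its datasets.

```python
import math
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def srcc(predicted: Sequence[float], subjective: Sequence[float]) -> float:
    """Spearman rank-order correlation: Pearson correlation of the ranks."""
    n = len(predicted)
    if n != len(subjective) or n < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rx, ry = average_ranks(predicted), average_ranks(subjective)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("SRCC is undefined when one sequence is constant")
    return cov / math.sqrt(vx * vy)


# Illustrative scores only: the last two videos are ranked in swapped order.
model_scores = [1.0, 2.0, 3.0, 4.0, 5.0]
mean_opinion_scores = [1.0, 2.0, 3.0, 5.0, 4.0]
rho = srcc(model_scores, mean_opinion_scores)
```

Because SRCC depends only on ranks, any monotonic rescaling of the predicted scores leaves it unchanged, which is why it is commonly reported alongside linear correlation measures in quality assessment.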
