Full Reference Video Quality Assessment Based on Multi-Scale Spatiotemporal Feature Aggregation

doi:10.3778/j.issn.1002-8331.2205-0212

Abstract

Perceived video quality results from an observer's perception of a video at multiple temporal scales, yet current video quality assessment models generally describe distortion at a single fixed scale, and single-granularity features are insufficient to represent a video's global information. To fully extract and aggregate multi-granularity information, and thereby characterize the complex human perception mechanism, this paper proposes a full-reference video quality assessment method based on multi-scale spatiotemporal feature aggregation. To address the loss of key frames under the fixed-interval sampling used by traditional quality assessment algorithms, the method samples the sequence adaptively by combining image structural distortion with perceived motion energy. To represent distortion with features of different granularities and to explore an effective way of aggregating them, stacked long short-term memory (LSTM) layers extract multi-scale spatiotemporal features from the sequence, and the features of the network layers are fused along forward and reverse paths, simulating the forward and reverse perceptual iteration of the visual nervous system. Finally, a multi-channel self-attention network regresses the video quality score. On multiple datasets, the model's Spearman rank-order correlation coefficient (SRCC) exceeds 0.93, giving the best or second-best performance in each case.

Key words: video quality assessment, adaptive sampling, visual neural perception, feature pyramid, multi-scale spatiotemporal feature, long short-term memory network
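
The SRCC figure quoted in the abstract measures how well the ranking of predicted quality scores agrees with the ranking of subjective scores. The following is a minimal sketch of that computation, not the authors' evaluation code: the helper names `average_ranks` and `srcc` are hypothetical, and the five score pairs are made-up illustration values rather than data from the paper.

```python
import numpy as np


def average_ranks(values):
    """Return 1-based ranks; tied values receive the mean of their positions."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="mergesort")
    ranks = np.empty(len(values), dtype=float)
    start = 0
    while start < len(values):
        end = start
        # Extend the run while the next sorted value ties with the current one.
        while end + 1 < len(values) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Sorted positions start..end (0-based) share one mean 1-based rank.
        ranks[order[start:end + 1]] = (start + end) / 2.0 + 1.0
        start = end + 1
    return ranks


def srcc(predicted, subjective):
    """Spearman rank-order correlation: Pearson correlation of the two rank vectors."""
    rank_p = average_ranks(predicted)
    rank_s = average_ranks(subjective)
    return float(np.corrcoef(rank_p, rank_s)[0, 1])


# Hypothetical scores for five videos (illustration only, not from the paper).
predicted_scores = [0.82, 0.41, 0.67, 0.30, 0.95]
subjective_scores = [78.0, 45.0, 40.0, 33.0, 90.0]
print(srcc(predicted_scores, subjective_scores))
```

In this toy example the two rankings differ only by one swapped pair, so the coefficient is close to but below 1. Because SRCC depends only on rank order, it is unaffected by any monotonic rescaling of the predicted scores.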

ZHANG Wei, ZHAO Shiling, LIU Yinhao, WANG Hongkui, YIN Haibing. Full Reference Video Quality Assessment Based on Multi-Scale Spatiotemporal Feature Aggregation[J]. Computer Engineering and Applications, 2023, 59(18): 154-162.

张威, 赵世灵, 刘银豪, 王鸿奎, 殷海兵. 多尺度时空特征聚合的全参考视频质量评价[J]. 计算机工程与应用, 2023, 59(18): 154-162.
