[1] JAIN S D, XIONG B, GRAUMAN K. FusionSeg: learning to combine motion and appearance for fully automatic segmentation of generic objects in videos[C]//Proceedings of the 30th IEEE Conference on Computer Vision and Pattern Recognition, 2017: 3664-3673.
[2] XU Y S, FU T J, YANG H K, et al. Dynamic video segmentation network[C]//Proceedings of the 31st IEEE Conference on Computer Vision and Pattern Recognition, 2018: 6556-6565.
[3] ZHOU T, PORIKLI F, CRANDALL D J, et al. A survey on deep learning technique for video segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45: 7099-7122.
[4] LIU S, QI L, QIN H F, et al. Path aggregation network for instance segmentation[C]//Proceedings of the 31st IEEE Conference on Computer Vision and Pattern Recognition, 2018: 8759-8768.
[5] YANG L J, FAN Y C, XU N. Video instance segmentation[C]//Proceedings of the 17th IEEE International Conference on Computer Vision, 2019: 5188-5197.
[6] OH S W, LEE J Y, XU N, et al. Video object segmentation using space-time memory networks[C]//Proceedings of the 17th IEEE International Conference on Computer Vision, 2019: 9226-9235.
[7] PERAZZI F, PONT-TUSET J, MCWILLIAMS B, et al. A benchmark dataset and evaluation methodology for video object segmentation[C]//Proceedings of the 29th IEEE Conference on Computer Vision and Pattern Recognition, 2016: 724-732.
[8] SIBECHI R, BOOIJ O, BAKA N, et al. Exploiting temporality for semi-supervised video segmentation[C]//Proceedings of the 17th IEEE International Conference on Computer Vision, 2019: 933-941.
[9] GAVRILYUK K, GHODRATI A, LI Z Y, et al. Actor and action video segmentation from a sentence[C]//Proceedings of the 31st IEEE Conference on Computer Vision and Pattern Recognition, 2018: 5958-5966.
[10] LI D Z, LI R Q, WANG L J, et al. You only infer once: cross-modal meta-transfer for referring video object segmentation[C]//Proceedings of the 36th AAAI Conference on Artificial Intelligence, 2022: 1297-1305.
[11] GUO M H, XU T X, LIU J J, et al. Attention mechanisms in computer vision: a survey[J]. Computational Visual Media, 2022, 8: 331-368.
[12] ZHU X Z, CHENG D Z, ZHANG Z, et al. An empirical study of spatial attention mechanisms in deep networks[C]//Proceedings of the 17th IEEE International Conference on Computer Vision, 2019: 6688-6697.
[13] BRAUWERS G, FRASINCAR F. A general survey on attention mechanisms in deep learning[J]. IEEE Transactions on Knowledge and Data Engineering, 2023, 35: 3279-3298.
[14] WANG H, DENG C, YAN J C, et al. Asymmetric cross-guided attention network for actor and action video segmentation from natural language query[C]//Proceedings of the 17th IEEE International Conference on Computer Vision, 2019: 3938-3947.
[15] 蔡腾, 陈慈发, 董方敏. 结合Transformer和动态特征融合的低照度目标检测[J]. 计算机工程与应用, 2024, 60(9): 135-141.
CAI T, CHEN C F, DONG F G. Low-light object detection combining transformer and dynamic feature fusion[J]. Computer Engineering and Applications, 2024, 60(9): 135-141.
[16] BOTACH A, ZHELTONOZHSKII E, BASKIN C. End-to-end referring video object segmentation with multimodal transformers[C]//Proceedings of the 35th IEEE Conference on Computer Vision and Pattern Recognition, 2022: 4985-4995.
[17] YANG Z, WANG J Q, TANG Y S, et al. LAVT: language-aware vision transformer for referring image segmentation[C]//Proceedings of the 35th IEEE Conference on Computer Vision and Pattern Recognition, 2022: 18155-18165.
[18] WANG Z Q, LU Y, LI Q, et al. CRIS: clip-driven referring image segmentation[C]//Proceedings of the 35th IEEE Conference on Computer Vision and Pattern Recognition, 2022: 11686-11695.
[19] 邱爽, 赵耀, 韦世奎. 图像指代分割研究综述[J]. 信号处理, 2022, 38(6): 1144-1154.
QIU S, ZHAO Y, WEI S K. A survey of referring image segmentation[J]. Journal of Signal Processing, 2022, 38(6): 1144-1154.
[20] WANG H, DENG C, YANG Y. Context modulated dynamic networks for actor and action video segmentation with language queries[C]//Proceedings of the 34th AAAI Conference on Artificial Intelligence, 2020: 12152-12159.
[21] SEO S, LEE J Y, HAN B. URVOS: unified referring video object segmentation network with a large-scale benchmark[C]//Proceedings of the 16th European Conference on Computer Vision, 2020: 208-223.
[22] HUI T R, HUANG S F, LIU S, et al. Collaborative spatial-temporal modeling for language-queried video actor segmentation[C]//Proceedings of the 34th IEEE Conference on Computer Vision and Pattern Recognition, 2021: 4187-4196.
[23] DING Z H, HUI T R, HUANG J C, et al. Language-bridged spatial-temporal interaction for referring video object segmentation[C]//Proceedings of the 35th IEEE Conference on Computer Vision and Pattern Recognition, 2022: 4964-4973.
[24] LIANG C, WANG W G, ZHOU T F, et al. Local-global context aware transformer for language-guided video segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(8): 10055-10069.
[25] YANG J, HUANG Y, NIU K, et al. Actor and action modular network for text-based video segmentation[J]. IEEE Transactions on Image Processing, 2022, 31: 4474-4489.
[26] ZHAO W B, WANG K, CHU X X, et al. Modeling motion with multimodal features for text-based video segmentation[C]//Proceedings of the 35th IEEE Conference on Computer Vision and Pattern Recognition, 2022: 11737-11746.
[27] WU D M, DONG X P, SHAO L, et al. Multi-level representation learning with semantic alignment for referring video object segmentation[C]//Proceedings of the 35th IEEE Conference on Computer Vision and Pattern Recognition, 2022: 4996-5005.
[28] YANG X, WANG H, XIE D, et al. Object-agnostic transformers for video referring segmentation[J]. IEEE Transactions on Image Processing, 2022, 31: 2839-2849.
[29] WU J N, JIANG Y, SUN P Z, et al. Language as queries for referring video object segmentation[C]//Proceedings of the 35th IEEE Conference on Computer Vision and Pattern Recognition, 2022: 4974-4984.
[30] MIAO B, BENNAMOUN M, GAO Y, et al. Spectrum-guided multi-granularity referring video object segmentation[C]//Proceedings of the 19th IEEE International Conference on Computer Vision, 2023: 920-930.
[31] WANG L M, XIONG Y J, WANG Z, et al. Temporal segment networks: towards good practices for deep action recognition[C]//Proceedings of the 14th European Conference on Computer Vision, 2016: 20-36.
[32] BRATTOLI B, TIGHE J, ZHDANOV F, et al. Rethinking zero-shot video classification: end-to-end training for realistic applications[C]//Proceedings of the 33rd IEEE Conference on Computer Vision and Pattern Recognition, 2020: 4613-4623.
[33] GAO J, ZHANG T, XU C. Learning to model relationships for zero-shot video classification[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021, 43(10): 3476-3491.
[34] ZHAO Y, XIONG Y J, WANG L M, et al. Temporal action detection with structured segment networks[C]//Proceedings of the 16th IEEE International Conference on Computer Vision, 2017: 2914-2923.
[35] ZHAO C, DU D W, HOOGS A, et al. Open set action recognition via multi-label evidential learning[C]//Proceedings of the 36th IEEE Conference on Computer Vision and Pattern Recognition, 2023: 22982-22991.
[36] LIU N, NAN K, ZHAO W, et al. Learning complementary spatial-temporal transformer for video salient object detection[J]. IEEE Transactions on Neural Networks and Learning Systems, 2024, 35(8): 10663-10673.
[37] HU M, JIANG K, WANG Z, et al. CycMuNet+: cycle-projected mutual learning for spatial-temporal video super-resolution[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, 45(11): 13376-13392.
[38] 王彩红, 沈燕飞, 王毅, 等. 基于时空上下文的手势跟踪与识别[J]. 计算机工程与应用, 2016, 52(9): 202-207.
WANG C H, SHEN Y F, WANG Y, et al. Gesture tracking and recognition based on spatio-temporal context[J]. Computer Engineering and Applications, 2016, 52(9): 202-207.
[39] TRAN D, BOURDEV L, FERGUS R, et al. Learning spatiotemporal features with 3D convolutional networks[C]//Proceedings of the 15th IEEE International Conference on Computer Vision, 2015: 4489-4497.
[40] CARREIRA J, ZISSERMAN A. Quo vadis, action recognition? a new model and the kinetics dataset[C]//Proceedings of the 31st IEEE Conference on Computer Vision and Pattern Recognition, 2017: 6299-6308.
[41] LIU W, ANGUELOV D, ERHAN D, et al. SSD: single shot multibox detector[C]//Proceedings of the 14th European Conference on Computer Vision, 2016: 21-37.
[42] REN S, HE K, GIRSHICK R, et al. Faster R-CNN: towards real-time object detection with region proposal networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017, 39(6): 1137-1149.
[43] HE K M, GKIOXARI G, DOLLAR P, et al. Mask R-CNN[C]//Proceedings of the 16th IEEE International Conference on Computer Vision, 2017: 2961-2969.
[44] DOSOVITSKIY A, FISCHER P, ILG E, et al. FlowNet: learning optical flow with convolutional networks[C]//Proceedings of the 15th IEEE International Conference on Computer Vision, 2015: 2758-2766.
[45] ILG E, MAYER N, SAIKIA T, et al. FlowNet 2.0: evolution of optical flow estimation with deep networks[C]//Proceedings of the 31st IEEE Conference on Computer Vision and Pattern Recognition, 2017: 2462-2470.
[46] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Proceedings of the 31st International Conference on Neural Information Processing Systems, 2017: 5998-6008.
[47] 杨文涛, 雷雨琦, 李星月, 等. 融合汉字输入法的BERT与BLCG的长文本分类研究[J]. 计算机工程与应用, 2024, 60(9): 196-202.
YANG W T, LEI Y Q, LI X Y, et al. Chinese long text classification model based on BERT fused Chinese input methods and BLCG[J]. Computer Engineering and Applications, 2024, 60(9): 196-202.
[48] LIU Z, NING J, CAO Y, et al. Video swin transformer[C]//Proceedings of the 35th IEEE Conference on Computer Vision and Pattern Recognition, 2022: 3202-3211.
[49] DAI Z H, YANG Z L, YANG Y M, et al. Transformer-XL: attentive language models beyond a fixed-length context[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019: 2978-2988.
[50] LIU Y, YU R, WANG J H, et al. Global spectral filter memory network for video object segmentation[C]//Proceedings of the 17th European Conference on Computer Vision, 2022: 648-665.
[51] KHOREVA A, ROHRBACH A, SCHIELE B. Video object segmentation with language referring expressions[C]//Proceedings of the 14th Asian Conference on Computer Vision, 2018: 123-141.
[52] XU C L, HSIEH S H, XIONG C M, et al. Can humans fly? action understanding with multiple classes of actors[C]//Proceedings of the 28th IEEE Conference on Computer Vision and Pattern Recognition, 2015: 2264-2273.
[53] JHUANG H, GALL J, ZUFFI S, et al. Towards understanding action recognition[C]//Proceedings of the 14th IEEE International Conference on Computer Vision, 2013: 3192-3199.
[54] LIN T Y, MAIRE M, BELONGIE S, et al. Microsoft COCO: common objects in context[C]//Proceedings of the 13th European Conference on Computer Vision, 2014: 740-755.