Cross-Attention Fusion Learning of Transformer-CNN Features for Person Re-Identification

doi:10.3778/j.issn.1002-8331.2311-0452

Abstract

Convolutional neural networks (CNNs) focus on local features and have difficulty capturing global structural information, whereas Transformer networks model long-range feature dependencies but tend to overlook local feature details. This paper proposes a person re-identification algorithm based on cross-attention fusion learning that combines the strengths of CNN and Transformer feature learning networks, enriching the local features of pedestrians while improving the global feature representation. The proposed model consists of three parts: the CNN branch mainly extracts local details; the Transformer branch focuses on global feature information; and the cross-attention fusion branch uses the self-attention mechanism to compute the correlation between the features of the two branches, fuses them accordingly, and thereby improves the representation ability of the model. Ablation studies and experimental results on the Market1501 and DukeMTMC-reID datasets demonstrate the effectiveness of the proposed method.

Key words: person re-identification, convolutional neural network (CNN), Transformer, cross-attention fusion learning
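
The cross-attention fusion idea summarized in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the layer details, so the code below assumes plain scaled dot-product attention with identity query/key/value projections (no learned weights) and a residual-add fusion. The function names `cross_attention` and `fuse`, and the toy token values, are hypothetical and chosen only for illustration.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys_values):
    """Each query token attends over the tokens of the other branch.

    queries:     list of n tokens, each a list of d floats
    keys_values: list of m tokens, each a list of d floats
    Returns n tokens of dimension d.
    Assumption: identity projections for Q, K and V (no learned weights).
    """
    d = len(queries[0])
    scale = 1.0 / math.sqrt(d)
    out = []
    for q in queries:
        # Scaled dot-product correlation between this query and every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k))
                  for k in keys_values]
        weights = softmax(scores)
        # Attention-weighted sum of the value tokens.
        out.append([sum(w * v[j] for w, v in zip(weights, keys_values))
                    for j in range(d)])
    return out

def fuse(cnn_tokens, trans_tokens):
    """Hypothetical fusion: add the cross-attended features back (residual)."""
    attended = cross_attention(cnn_tokens, trans_tokens)
    return [[c + a for c, a in zip(ct, at)]
            for ct, at in zip(cnn_tokens, attended)]

# Toy inputs: three "local" CNN tokens and two "global" Transformer tokens.
cnn_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
trans_tokens = [[2.0, 0.0], [0.0, 2.0]]
fused = fuse(cnn_tokens, trans_tokens)
```

In this sketch the CNN tokens act as queries and the Transformer tokens as keys and values; swapping the two arguments of `cross_attention` gives the opposite direction. A real model would typically add learned projections and multiple heads.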

XIANG Jun, ZHANG Jincheng, JIANG Xiaoping, HOU Jianhua. Cross-Attention Fusion Learning of Transformer-CNN Features for Person Re-Identification[J]. Computer Engineering and Applications, 2024, 60(16): 94-104.
