[1] LI Z J, HUANG B, YE Z F, et al. Physical human-robot interaction of a robotic exoskeleton by admittance control[J]. IEEE Transactions on Industrial Electronics, 2018, 65(12): 9614-9624.
[2] 郑远攀, 李广阳, 李晔. 深度学习在图像识别中的应用研究综述[J]. 计算机工程与应用, 2019, 55(12): 20-36.
ZHENG Y P, LI G Y, LI Y. Survey of application of deep learning in image recognition[J]. Computer Engineering and Applications, 2019, 55(12): 20-36.
[3] PETRIU E M, PAYEUR P, CRETU A M, et al. Complementary tactile sensor and human interface for robotic telemanipulation[C]//Proceedings of the 2009 IEEE International Workshop on Haptic Audio Visual Environments and Games. Piscataway: IEEE, 2009: 164-169.
[4] YUAN W Z, DONG S Y, ADELSON E H. GelSight: high-resolution robot tactile sensors for estimating geometry and force[J]. Sensors, 2017, 17(12): 2762.
[5] CUI S W, WANG R, HU J Y, et al. In-hand object localization using a novel high-resolution visuotactile sensor[J]. IEEE Transactions on Industrial Electronics, 2022, 69(6): 6015-6025.
[6] ERNST M O, BANKS M S. Humans integrate visual and haptic information in a statistically optimal fashion[J]. Nature, 2002, 415(6870): 429-433.
[7] 葛同澳, 李辉, 郭颖, 等. 基于双融合框架的多模态3D目标检测算法[J]. 电子学报, 2023, 51(11): 3100-3110.
GE T A, LI H, GUO Y, et al. A multimodal 3D object detection method based on double-fusion framework[J]. Acta Electronica Sinica, 2023, 51(11): 3100-3110.
[8] 朱文霖, 刘华平, 王博文, 等. 基于视-触跨模态感知的智能导盲系统[J]. 智能系统学报, 2020, 15(1): 33-40.
ZHU W L, LIU H P, WANG B W, et al. An intelligent blind guidance system based on visual-touch cross-modal perception[J]. CAAI Transactions on Intelligent Systems, 2020, 15(1): 33-40.
[9] 任泽裕, 王振超, 柯尊旺, 等. 多模态数据融合综述[J]. 计算机工程与应用, 2021, 57(18): 49-64.
REN Z Y, WANG Z C, KE Z W, et al. Survey of multimodal data fusion[J]. Computer Engineering and Applications, 2021, 57(18): 49-64.
[10] 沈书馨, 宋爱国, 阳雨妍, 等. 面向空间机械臂的视触融合目标识别系统[J]. 载人航天, 2022, 28(2): 213-222.
SHEN S X, SONG A G, YANG Y Y, et al. Visual-tactile fusion target recognition system for space manipulator[J]. Manned Spaceflight, 2022, 28(2): 213-222.
[11] GAO J L, HUANG Z J, TANG Z N, et al. Visuo-tactile-based slip detection using a multi-scale temporal convolution network[J]. arXiv:2302.13564, 2023.
[12] HAN Y H, YU K L, BATRA R, et al. Learning generalizable vision-tactile robotic grasping strategy for deformable objects via transformer[J]. IEEE/ASME Transactions on Mechatronics, 2025, 30(1): 554-566.
[13] CUI S W, WEI J H, LI X C, et al. Generalized visual-tactile transformer network for slip detection[J]. IFAC-PapersOnLine, 2020, 53(2): 9529-9534.
[14] CHEN Y Z, SIPOS A, VAN DER MERWE M, et al. Visuo-tactile transformers for manipulation[C]//Proceedings of the Conference on Robot Learning, 2022.
[15] LI B J, BAI J B, QIU S J, et al. VITO-Transformer: a visual-tactile fusion network for object recognition[J]. IEEE Transactions on Instrumentation and Measurement, 2023, 72: 2530810.
[16] GAO J, LI P, CHEN Z K, et al. A survey on deep learning for multimodal data fusion[J]. Neural Computation, 2020, 32(5): 829-864.
[17] DONG J H, CONG Y, SUN G, et al. Lifelong robotic visual-tactile perception learning[J]. Pattern Recognition, 2022, 121: 108176.
[18] CUI S W, WANG R, WEI J H, et al. Grasp state assessment of deformable objects using visual-tactile fusion perception[C]//Proceedings of the 2020 IEEE International Conference on Robotics and Automation. Piscataway: IEEE, 2020: 538-544.
[19] GAO S, DAI Y N, NATHAN A. Tactile and vision perception for intelligent humanoids[J]. Advanced Intelligent Systems, 2022, 4(2): 2100074.
[20] CUI C, YANG H C, WANG Y H, et al. Deep multimodal fusion of image and non-image data in disease diagnosis and prognosis: a review[J]. Progress in Biomedical Engineering, 2023, 5(2): 022001.
[21] 陈玲玲, 毕晓君. 多模态融合网络的睡眠分期研究[J]. 智能系统学报, 2022, 17(6): 1194-1200.
CHEN L L, BI X J. Sleep staging model based on multimodal fusion[J]. CAAI Transactions on Intelligent Systems, 2022, 17(6): 1194-1200.
[22] VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[C]//Advances in Neural Information Processing Systems, 2017: 5998-6008.
[23] 刘文婷, 卢新明. 基于计算机视觉的Transformer研究进展[J]. 计算机工程与应用, 2022, 58(6): 1-16.
LIU W T, LU X M. Research progress of Transformer based on computer vision[J]. Computer Engineering and Applications, 2022, 58(6): 1-16.
[24] AL-RFOU R, CHOE D, CONSTANT N, et al. Character-level language modeling with deeper self-attention[C]//Proceedings of the AAAI Conference on Artificial Intelligence, 2019: 3159-3166.
[25] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]//Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg: ACL, 2019: 4171-4186.
[26] DAI Z H, YANG Z L, YANG Y M, et al. Transformer-XL: attentive language models beyond a fixed-length context[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: ACL, 2019: 2978-2988.
[27] BELTAGY I, PETERS M E, COHAN A. Longformer: the long-document Transformer[J]. arXiv:2004.05150, 2020.
[28] DOSOVITSKIY A, BEYER L, KOLESNIKOV A, et al. An image is worth 16×16 words: Transformers for image recognition at scale[J]. arXiv:2010.11929, 2020.
[29] LIU Z, LIN Y T, CAO Y, et al. Swin Transformer: hierarchical vision transformer using shifted windows[C]//Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 9992-10002.
[30] NAGRANI A, YANG S, ARNAB A, et al. Attention bottlenecks for multimodal fusion[C]//Advances in Neural Information Processing Systems, 2021, 34: 14200-14213.
[31] CHANG Z H, FENG Z X, YANG S Y, et al. AFT: adaptive fusion Transformer for visible and infrared images[J]. IEEE Transactions on Image Processing, 2023, 32: 2077-2092.
[32] LIU H P, WANG F, ZHANG X Y, et al. Weakly-paired deep dictionary learning for cross-modal retrieval[J]. Pattern Recognition Letters, 2020, 130: 199-206.
[33] YANG L J, HUANG Y F, SUGANO Y, et al. Interact before align: leveraging cross-modal knowledge for domain adaptive action recognition[C]//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition. Piscataway: IEEE, 2022: 14702-14712.
[34] SHI R W, YANG S C, CHEN Y Y, et al. CNN-Transformer for visual-tactile fusion applied in road recognition of autonomous vehicles[J]. Pattern Recognition Letters, 2023, 166: 200-208.
[35] BERTASIUS G, WANG H, TORRESANI L. Is space-time attention all you need for video understanding?[C]//Proceedings of the International Conference on Machine Learning, 2021: 813-824.
[36] ARNAB A, DEHGHANI M, HEIGOLD G, et al. ViViT: a video vision transformer[C]//Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision. Piscataway: IEEE, 2021: 6816-6826.
[37] YANG F Y, MA C Y, ZHANG J C, et al. Touch and Go: learning from human-collected vision and touch[J]. arXiv:2211.12498, 2022.
[38] LI J H, DONG S Y, ADELSON E. Slip detection with combined tactile and visual information[C]//Proceedings of the 2018 IEEE International Conference on Robotics and Automation. Piscataway: IEEE, 2018: 7772-7777.
[39] YAN G, SCHMITZ A, TOMO T P, et al. Detection of slip from vision and touch[C]//Proceedings of the 2022 International Conference on Robotics and Automation. Piscataway: IEEE, 2022: 3537-3543.