
Computer Engineering and Applications ›› 2025, Vol. 61 ›› Issue (24): 29-39. DOI: 10.3778/j.issn.1002-8331.2503-0101
• Research Hotspots and Reviews •
Progress and Challenges in 3D Large Language Model Research
GUO Ming1,2,3, ZHANG Yaru1, ZHU Li1, WANG Guoli1,2,3+, HUANG Ming1,2,3
Online: 2025-12-15
Published: 2025-12-15
GUO Ming, ZHANG Yaru, ZHU Li, WANG Guoli, HUANG Ming. Progress and Challenges in 3D Large Language Model Research[J]. Computer Engineering and Applications, 2025, 61(24): 29-39.
URL: http://cea.ceaj.org/EN/10.3778/j.issn.1002-8331.2503-0101