[1] 崔磊,徐毅恒,吕腾超,等.文档智能:数据集、模型和应用[J].中文信息学报, 2022, 36(6): 1-19.
CUI L, XU Y H, LV T C, et al. Document AI: benchmarks, models and applications[J]. Journal of Chinese Information Processing, 2022, 36(6): 1-19.
[2] APPALARAJU S, JASANI B, BHARGAVA U K, et al. DocFormer: end-to-end Transformer for document understanding[C]//Proceedings of the 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021.
[3] KIM G, HONG T, YIM M, et al. OCR-free document understanding transformer[C]//Proceedings of the 17th European Conference on Computer Vision, 2022: 498-517.
[4] MISTRY J, ARZENO N M. Document understanding for healthcare referrals[C]//Proceedings of the 11th IEEE International Conference on Healthcare Informatics (ICHI), 2023.
[5] ŠIMSA Š, ŠULC M, UŘIČÁŘ M, et al. DocILE benchmark for document information localization and extraction[C]//Proceedings of the 17th International Conference on Document Analysis and Recognition (ICDAR 2023), San José, CA, USA, August 21-26, 2023.
[6] NAJEM-MEYER S, ROMANELLO M. Page layout analysis of text-heavy historical documents: a comparison of textual and visual approaches[C]//Proceedings of the Computational Humanities Research Conference, 2022.
[7] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]//Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019.
[8] XU Y H, LI M H, CUI L, et al. LayoutLM: pre-training of text and layout for document image understanding[C]//Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (KDD’20), 2020.
[9] XU Y, XU Y H, LV T C, et al. LayoutLMv2: multi-modal pre-training for visually-rich document understanding[C]//Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2021: 2579-2591.
[10] HUANG Y P, LV T C, CUI L, et al. LayoutLMv3: pre-training for document AI with unified text and image masking[C]//Proceedings of the 30th ACM International Conference on Multimedia, 2022.
[11] LIU X, ZHENG Y N, DU Z X, et al. GPT understands, too[J]. arXiv:2103.10385, 2021.
[12] GU Z X, MENG C H, WANG K, et al. XYLayoutLM: towards layout-aware multimodal networks for visually-rich document understanding[C]//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[13] LUO C W, CHENG C X, ZHENG Q, et al. GeoLayoutLM: geometric pre-training for visual information extraction[C]//Proceedings of the 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
[14] XU Y H, LV T C, CUI L, et al. LayoutXLM: multimodal pre-training for multilingual visually-rich document understanding[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing (ACL 2021), 2021.
[15] REYNOLDS L, MCDONELL K. Prompt programming for large language models: beyond the few-shot paradigm[C]//Proceedings of the Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems (CHI EA’21), 2021.
[16] LI X L, LIANG P. Prefix-tuning: optimizing continuous prompts for generation[C]//Proceedings of the Annual Meeting of the Association for Computational Linguistics and the International Joint Conference on Natural Language Processing (ACL 2021), 2021.
[17] LESTER B, AL-RFOU R, CONSTANT N. The power of scale for parameter-efficient prompt tuning[C]//Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, 2021.
[18] WANG L, HE J B, XU X, et al. Alignment-enriched tuning for patch-level pre-trained document image models[C]//Proceedings of the Thirty-Seventh AAAI Conference on Artificial Intelligence and Thirty-Fifth Conference on Innovative Applications of Artificial Intelligence and Thirteenth Symposium on Educational Advances in Artificial Intelligence (AAAI’23/IAAI’23/EAAI’23), 2023.
[19] XU L, JIE Z M, LU W, et al. Better feature integration for named entity recognition[C]//Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL), 2021.
[20] KUMAR R, GOYAL S, VERMA A, et al. ProtoNER: few shot incremental learning for named entity recognition using prototypical networks[C]//Proceedings of the International Conference on Business Process Management (BPM), 2023.
[21] BROWN T B, MANN B, RYDER N, et al. Language models are few-shot learners[C]//Proceedings of the 34th International Conference on Neural Information Processing Systems (NIPS’20), 2020: 1877-1901.
[22] SILAJEV I, VICTOR N, MORTIMER P. Semantic table detection with LayoutLMv3[C]//Proceedings of the 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.