Computer Engineering and Applications, 2025, Vol. 61, Issue (9): 186-193. DOI: 10.3778/j.issn.1002-8331.2401-0380

• Pattern Recognition and Artificial Intelligence •

Application of Multimodal Pre-Trained Models in Financial Invoice Information Extraction

YAN Zhengjin, YE Zheng, GE Jun   

  1. College of Computer Science & Information Physics Fusion Intelligent Computing Key Laboratory of the National Ethnic Affairs Commission, South-Central Minzu University, Wuhan 430074, China
  2. College of International Business and Economics, Wuhan Textile University, Wuhan 437100, China
  • Online: 2025-05-01 Published: 2025-04-30

Abstract: Extracting information from financial receipts is a complex and challenging task whose goal is to accurately recover the key information that receipts in financial documents contain. Financial receipts are important carriers of information in commercial activities, and extracting them accurately matters for business decision-making and financial analysis. However, receipt formats are irregular, which in practice can cause key information to be lost, for example through incomplete or missing key-value pairs in the data. LayoutLMv3 is currently one of the mainstream methods for information extraction. It combines natural language processing with multimodal techniques and can extract information from large-scale financial documents. However, its accuracy drops on documents with complex layouts, and in long texts, which contain many characters, it may fail to capture the important information. To address these problems, this paper takes LayoutLMv3 as the baseline model and introduces the P-Tuning V1 technique. The resulting method can handle specific problems, such as key-value relationships in financial receipts, can adapt to different contexts and tasks, and can use multimodal text, image, and layout information to understand receipt content more comprehensively. P-Tuning V1 introduces trainable continuous prompt embeddings (“prompts”) as part of the model input to represent the “key” information in the text data, and uses discrete prompts as part of the “value”; together the two form complete key-value pairs. Experimental results show that, compared with the LayoutLMv3-based method, the combined method raises the F1 score on the Finance-Receipts dataset from 95.95% to 96.69%, a gain of 0.74 percentage points.

Key words: information extraction, multimodal, pre-training, LayoutLMv3, P-Tuning V1
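
The abstract describes continuous prompt embeddings that are fed to the model as part of its input. The following is a minimal sketch of that idea only, not the authors' implementation: the names `ContinuousPrompt` and `build_inputs`, the vector sizes, and the choice to give virtual prompt tokens an all-zero bounding box are assumptions made for illustration. Plain Python lists stand in for tensors, and the prompt encoder that P-Tuning typically applies to the prompt vectors before they reach the model is omitted.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ContinuousPrompt:
    """Trainable prompt vectors (hypothetical helper, not the paper's code)."""
    num_virtual_tokens: int
    hidden_size: int
    seed: int = 0
    vectors: list = field(init=False)

    def __post_init__(self):
        # Small random initialisation; in real training these values would be
        # updated by gradient descent while the backbone is frozen or tuned.
        rng = random.Random(self.seed)
        self.vectors = [
            [rng.gauss(0.0, 0.02) for _ in range(self.hidden_size)]
            for _ in range(self.num_virtual_tokens)
        ]


def build_inputs(prompt, token_embeddings, bboxes, attention_mask):
    """Prepend the prompt vectors to one sequence of token embeddings.

    Assumption: virtual tokens have no position on the page, so they get an
    all-zero layout box, and they are always attended to (mask value 1).
    """
    n = prompt.num_virtual_tokens
    embeddings = prompt.vectors + token_embeddings
    boxes = [[0, 0, 0, 0] for _ in range(n)] + bboxes
    mask = [1] * n + attention_mask
    return embeddings, boxes, mask


# Usage: three receipt tokens (the last one is padding) plus four virtual tokens.
prompt = ContinuousPrompt(num_virtual_tokens=4, hidden_size=8)
tokens = [[0.1] * 8 for _ in range(3)]
bboxes = [[10, 20, 110, 40], [120, 20, 200, 40], [0, 0, 0, 0]]
mask = [1, 1, 0]

embeddings, boxes, new_mask = build_inputs(prompt, tokens, bboxes, mask)
print(len(embeddings), len(boxes), new_mask)  # 7 7 [1, 1, 1, 1, 1, 1, 0]
```

The point of the sketch is that the prompt lives in embedding space rather than in the vocabulary, so it can be optimised directly; how the paper pairs these vectors with discrete prompts for the "value" side is not specified in the abstract and is not modelled here.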