Computer Engineering and Applications, 2024, Vol. 60, Issue (3): 331-339. DOI: 10.3778/j.issn.1002-8331.2210-0253

• Engineering and Applications •

Dataset Enhancement Quality Evaluation Method for Chinese Error Correction Task as Example

SONG Cheng, XIE Zhenping   

  1. School of Artificial Intelligence and Computer Science, Jiangnan University, Wuxi, Jiangsu 214000, China
  2. Jiangsu Key Laboratory of Media Design and Software Technology, Jiangnan University, Wuxi, Jiangsu 214000, China
  • Online: 2024-02-01 Published: 2024-02-01

中文纠错任务为例的数据集增强质量评价方法


Abstract: Data augmentation is widely regarded as an effective way to improve model performance. However, selecting among generated data requires considering both the inherent characteristics of the data and its relevance to the target task. To address this problem, this paper takes the Chinese error correction task as an example and proposes a method for evaluating dataset enhancement quality. The method uses a pre-trained model optimized by contrastive learning to extract feature vectors from the dataset, defines three basic evaluation indicators (mutual coverage, total dispersion, and self-support), and combines them into a comprehensive dataset quality fusion indicator. Experiments on four data augmentation methods, two Chinese error correction datasets, and three Chinese error correction models show that the proposed evaluation method works independently of test-set performance evaluation and provides an important basis for choosing among different enhanced datasets.

Key words: dataset enhancement, machine learning, quality evaluation, Chinese error correction, deep learning
