Computer Engineering and Applications ›› 2019, Vol. 55 ›› Issue (19): 121-127. DOI: 10.3778/j.issn.1002-8331.1807-0140

• Big Data and Cloud Computing •

结合L1和L2正则化约束的隐语义预测模型研究

王德贤,何先波,贺春林,周坤,陈敏治   

  1. 西华师范大学 计算机学院,四川 南充 637000
  • Publication date: 2019-10-01 Release date: 2019-09-30

Latent Factor Prediction Model Combining L1 and L2 Regularization Constraints

WANG Dexian, HE Xianbo, HE Chunlin, ZHOU Kun, CHEN Minzhi   

  1. School of Computer Science, China West Normal University, Nanchong, Sichuan 637000, China
  • Online: 2019-10-01 Published: 2019-09-30

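Abstract (translated from Chinese): In the big data field, missing entries of a high-dimensional sparse matrix are usually predicted with a latent factor (LF) model built by stochastic gradient descent (SGD). Regularization terms are often added while solving the model with SGD to improve its performance. Because the L1 regularization term is not differentiable, latent factor models are currently built mainly by adding only the L2 regularization term (SGD_LF). However, the L1 regularization term can increase the sparsity of the model and strengthen its solving ability, so a latent factor model based on both L1 and L2 regularization constraints (SPGD_LF) is proposed. Both the L1 and L2 regularization terms are introduced when the objective function is constructed. Since the objective function satisfies the Lipschitz condition, it is approximated by a second-order Taylor expansion to construct a stochastic gradient descent solver, and while the latent factor model is solved by stochastic gradient descent, soft thresholding handles the boundary optimization problem associated with the L1 regularization term. With this optimization scheme, the features of the known data of the target matrix in the latent factor space, and the corresponding community membership relations, are expressed better, which improves the generalization ability of the model. Experiments on large industrial datasets show that the prediction accuracy, sparsity, and convergence speed of the SPGD_LF model are all significantly improved.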

Key words (translated from Chinese): big data application, high-dimensional sparse matrix, latent factor

Abstract: A latent factor (LF) model is usually built with stochastic gradient descent (SGD) and used to predict the missing data of high-dimensional sparse matrices in the big data field. An LF model needs regularization terms to improve its performance. Because the L1 regularization term is non-differentiable, usually only the L2 regularization term is integrated into an LF model. However, the L1 regularization term can improve the sparsity and solving ability of an LF model. To address this, this paper proposes an SPGD_LF model that integrates both L1 and L2 regularization terms into an LF model. Since the objective function satisfies the Lipschitz condition, it is approximated by a second-order Taylor expansion, and a stochastic gradient descent solver is constructed. During stochastic gradient descent, soft thresholding handles the boundary optimization problem corresponding to the L1 regularization term while the latent factor model is solved. With this optimization scheme, the characteristics of the known data of the target matrix in the latent factor space, and the corresponding community relationships, are expressed better, and the generalization ability of the model is improved. Empirical studies on two datasets from industrial applications show that the prediction accuracy, sparsity, and convergence rate of the SPGD_LF model are improved significantly.

Key words: big data application, high-dimensional and sparse matrix, latent factor
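
The abstract names the ingredients (SGD on a latent factor model, L1 and L2 penalties, soft thresholding for the non-differentiable L1 term) but does not give the update rules. The following is therefore only a minimal sketch of one common way to combine them, not the paper's SPGD_LF algorithm: each known entry triggers a gradient step on the smooth part (squared error plus L2 penalty), followed by a soft-thresholding step for the L1 penalty. The function names, the objective's exact scaling, the fixed learning rate `lr` (standing in for the inverse of a Lipschitz constant), and all default hyperparameters are assumptions made for illustration.

```python
import numpy as np


def soft_threshold(x, tau):
    """Proximal operator of tau * ||x||_1: shrink every entry toward zero by tau."""
    return np.sign(x) * np.maximum(np.abs(x) - tau, 0.0)


def train_lf_l1_l2(entries, n_rows, n_cols, rank=4, lr=0.02,
                   lam1=0.01, lam2=0.02, epochs=200, seed=0):
    """Proximal SGD for a latent factor model with L1 and L2 penalties.

    entries: list of (row, col, value) triples for the known cells only.
    Hyperparameter names and defaults are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    P = 0.1 * rng.standard_normal((n_rows, rank))  # row latent factors
    Q = 0.1 * rng.standard_normal((n_cols, rank))  # column latent factors
    for _ in range(epochs):
        for idx in rng.permutation(len(entries)):
            u, i, r = entries[idx]
            p_u = P[u].copy()
            q_i = Q[i].copy()
            err = r - p_u @ q_i
            # Gradient step on the smooth part: squared error + L2 penalty.
            p_new = p_u + lr * (err * q_i - lam2 * p_u)
            q_new = q_i + lr * (err * p_u - lam2 * q_i)
            # Proximal step: soft thresholding handles the L1 penalty.
            P[u] = soft_threshold(p_new, lr * lam1)
            Q[i] = soft_threshold(q_new, lr * lam1)
    return P, Q


def rmse(entries, P, Q):
    """Root-mean-square error of the model on the given known entries."""
    sq = [(r - P[u] @ Q[i]) ** 2 for u, i, r in entries]
    return float(np.sqrt(np.mean(sq)))
```

Calling `train_lf_l1_l2(entries, n_rows, n_cols)` on a list of known `(row, col, value)` triples returns the two factor matrices, and `P[u] @ Q[i]` is the predicted value for a missing cell. Setting `lam1=0` reduces the sketch to plain L2-regularized SGD, which makes it easy to compare how many factor entries the soft-thresholding step drives exactly to zero.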