Review of Data Normalization Methods

doi:10.3778/j.issn.1002-8331.2207-0179

Abstract

In recent years, artificial intelligence has been widely applied across many fields with remarkable results. Data normalization is an important step in deploying artificial intelligence applications: it helps prevent neural networks from modeling data incorrectly because of the differing units and scales of the features. In big data scenarios, a considerable portion of the data arrives at the training point sequentially in the form of a stream, so data normalization in the streaming scenario is a key problem that urgently needs to be solved. Many reviews of normalization research exist, but most focus only on the normalization of batch data and lack a summary of normalization methods for stream data, which limits their value as references. Building on research on batch data normalization, this paper systematically organizes and analyzes in detail the literature on stream data normalization, proposes a classification of normalization methods for stream data, and divides data normalization methods into batch data normalization methods and stream data normalization methods. It also compares the principles, advantages, and main problems addressed by these methods. Finally, future research directions for data normalization in different scenarios are discussed.

Key words: normalization, data stream, deep learning, data mining
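
To illustrate the distinction the abstract draws between batch and stream normalization, the following is a minimal Python sketch, not a method taken from the paper. It assumes z-score normalization as the example technique. The names `batch_zscore` and `StreamingZScore` are hypothetical, and the streaming version uses Welford's running mean and variance as one common way to maintain statistics incrementally.

```python
import math


def batch_zscore(values):
    """Z-score normalization when the whole batch is available up front."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0.0:
        # All values identical: map everything to 0 (an assumption of this sketch).
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


class StreamingZScore:
    """Z-score normalization for a stream, with statistics updated per sample."""

    def __init__(self):
        self.n = 0       # number of samples seen so far
        self.mean = 0.0  # running mean
        self.m2 = 0.0    # running sum of squared deviations from the mean

    def update(self, x):
        # Welford's update: one pass, no need to store past samples.
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    def normalize(self, x):
        # Until the stream shows some spread, return 0 (an assumption of this sketch).
        if self.n < 2 or self.m2 == 0.0:
            return 0.0
        std = math.sqrt(self.m2 / self.n)
        return (x - self.mean) / std


if __name__ == "__main__":
    data = [1.0, 2.0, 3.0, 4.0, 5.0]
    print(batch_zscore(data))

    scaler = StreamingZScore()
    for x in data:
        scaler.update(x)
        print(x, scaler.normalize(x))
```

Once the stream has seen the same values as the batch, the two sets of statistics coincide. Earlier in the stream they differ, because only a prefix of the data has been observed, which is the practical difficulty that stream normalization methods have to address.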

YANG Hanyu, ZHAO Xiaoyong, WANG Lei. Review of Data Normalization Methods[J]. Computer Engineering and Applications, 2023, 59(3): 13-22.
