Massive data parallel random sampling based on Hadoop

Abstract

In today's "information explosion" society, massive data volumes pose new challenges for data mining. As data mining moves to cloud computing platforms for parallel processing, parallel random sampling offers a way to further reduce the amount of data that must be processed. This paper presents a MapReduce parallel sampling algorithm that cleans up dirty data and achieves equal-probability sampling in a single scan of the data. The algorithm is implemented on the Hadoop platform and compared with an ordinary random sampling method; the results show that it is highly time-efficient. It is an effective method that lays a foundation for future research on sampling in data mining and helps advance data mining on massive data.

Keywords: cloud computing, Hadoop, MapReduce, parallel computing, data mining, random sampling
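
The abstract does not spell out the algorithm, so the following is only a minimal sketch of one common way to get single-scan, equal-probability sampling in a map/reduce style. It is not the authors' implementation. It assumes a fixed sample size `k`, and the names `is_clean`, `map_split`, and `reduce_candidates` are hypothetical. Each map task drops dirty records, tags every remaining record with a uniform random key, and keeps only the `k` smallest keys; the reduce step keeps the `k` smallest keys overall, which yields a uniform sample without replacement.

```python
import heapq
import random
from typing import Callable, Iterable, List, Tuple

Tagged = Tuple[float, str]  # (random key, record)


def is_clean(record: str) -> bool:
    """Hypothetical dirty-data rule: treat blank records as dirty."""
    return bool(record.strip())


def map_split(records: Iterable[str], k: int, rng: random.Random,
              is_valid: Callable[[str], bool] = is_clean) -> List[Tagged]:
    """Map phase: one pass over a split, keeping the k smallest random keys."""
    heap: List[Tuple[float, str]] = []  # max-heap on key, via negated keys
    for rec in records:
        if not is_valid(rec):
            continue  # cleaning happens in the same scan as sampling
        key = rng.random()
        if len(heap) < k:
            heapq.heappush(heap, (-key, rec))
        elif key < -heap[0][0]:
            # New key beats the largest key currently kept: swap it in.
            heapq.heapreplace(heap, (-key, rec))
    return [(-neg_key, rec) for neg_key, rec in heap]


def reduce_candidates(partials: Iterable[List[Tagged]], k: int) -> List[str]:
    """Reduce phase: merge per-split candidates, keep the k smallest keys."""
    merged = [pair for part in partials for pair in part]
    return [rec for _, rec in heapq.nsmallest(k, merged)]


if __name__ == "__main__":
    rng = random.Random(2014)
    splits = [["a", "", "b", "c"], ["d", "   ", "e"], ["f"]]
    partials = [map_split(split, 3, rng) for split in splits]
    print(reduce_candidates(partials, 3))
```

Because every clean record receives an independent uniform key, each one is equally likely to be among the `k` smallest, regardless of which split it came from or how large that split is. Each map task emits at most `k` candidates, so the reduce step handles only a small fraction of the input.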

WAN Wan (宛婉), ZHOU Guoxiang (周国祥). Massive data parallel random sampling based on hadoop (Hadoop平台的海量数据并行随机抽样)[J]. Computer Engineering and Applications (计算机工程与应用), 2014, 50(20): 115-118.
