ISBN: 978-981-11-3671-9 DOI: 10.18178/wcse.2017.06.006
A Sampling Strategy for Skewed Data Problem in MapReduce
Abstract— As an efficient and reliable parallel computing model, MapReduce is widely used across many
fields. However, when MapReduce processes skewed data, the efficiency of the whole cluster is reduced,
and load imbalance among the reducer nodes often occurs once the results of the map stage have been
assigned. This paper uses a reservoir sampling algorithm, which samples every record with equal
probability even when the data set is skewed and not known in advance. From the sample we estimate the
frequency of each key in the overall data, and then reallocate the tasks of the processing nodes to
achieve load balance. Finally, in a comparison with the traditional sampling strategy, the experimental
results show that the proposed method is more effective on skewed data sets, and that its advantage
becomes more obvious as the data set grows.
Index Terms— MapReduce, Skewed data, Reservoir sampling, Load balancing
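The pipeline summarized in the abstract (sample, estimate key frequencies, rebalance) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, the sampler is the textbook single-pass reservoir algorithm (Algorithm R), and the greedy "heaviest key to the lightest reducer" assignment is an assumed stand-in for the paper's reallocation step, which the abstract does not specify.

```python
import random
from collections import Counter


def reservoir_sample(stream, k, rng):
    """Algorithm R: uniform sample of k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i survives with probability k / (i + 1).
            j = rng.randint(0, i)  # both bounds inclusive
            if j < k:
                reservoir[j] = item
    return reservoir


def estimate_key_frequencies(sample):
    """Relative frequency of each key in the sample, used as an estimate for the full data."""
    n = len(sample)
    return {key: count / n for key, count in Counter(sample).items()}


def assign_keys(freqs, num_reducers):
    """Assumed greedy balancing: heaviest key first, onto the currently lightest reducer."""
    loads = [0.0] * num_reducers
    assignment = {}
    for key, freq in sorted(freqs.items(), key=lambda kv: kv[1], reverse=True):
        target = min(range(num_reducers), key=loads.__getitem__)
        assignment[key] = target
        loads[target] += freq
    return assignment, loads


if __name__ == "__main__":
    rng = random.Random(42)
    # Hypothetical skewed key stream: one hot key plus many rare keys.
    keys = ["hot"] * 6000 + [f"k{i % 40}" for i in range(4000)]
    rng.shuffle(keys)
    sample = reservoir_sample(iter(keys), 500, rng)
    assignment, loads = assign_keys(estimate_key_frequencies(sample), 4)
    print("estimated load share per reducer:", [round(x, 3) for x in loads])
```

Note that a single key cannot be split by this assignment, so a very hot key still bounds the best achievable balance; handling that case would need a further step that this sketch does not attempt.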
Cheng Wenjuan, Tong Bing, Zhou Miaomiao
School of Computer and Information, Hefei University of Technology, CHINA
Zhu Junhong
School of Management, Hefei University of Technology, CHINA
Cite: Cheng Wenjuan, Tong Bing, Zhou Miaomiao, Zhu Junhong, "A Sampling Strategy for Skewed Data Problem in MapReduce," Proceedings of 2017 the 7th International Workshop on Computer Science and Engineering, pp. 37-41, Beijing, 25-27 June, 2017.