Study of K-Means Algorithm with Canopy Clustering Algorithm in Hadoop


Traditional data mining algorithms hit severe bottlenecks when they deal with large data sets. Canopy clustering is a technique for clustering large, high-dimensional data sets. The main idea is to use an inexpensive, approximate distance measure to efficiently partition the data into overlapping subsets called canopies. Once the canopies are formed, the desired clustering is performed by measuring exact distances only between points that occur in a common canopy. Using canopies, large clustering problems that were formerly impossible become practical and efficient. K-Means is a typical distance-based clustering algorithm. Here, the canopy clustering algorithm is implemented as an efficient clustering technique by means of knowledge integration. Having studied canopy clustering together with the K-Means paradigm, we find the combination appropriate for implementing a clustering algorithm. This paper shows some advantages that canopy clustering brings to the K-Means clustering mechanism and proposes it as a pre-clustering step for the K-Means method. We use Hadoop’s MapReduce programming model to run K-Means clustering with canopy clustering.
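The canopy-forming step described above can be sketched as follows. This is a minimal in-memory illustration, not the paper's Hadoop implementation. The two thresholds `t1` (loose) and `t2` (tight), the Manhattan distance standing in for the inexpensive measure, and all function names are assumptions chosen for the example.

```python
def manhattan(a, b):
    """Cheap, approximate distance (an assumed stand-in for the inexpensive measure)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def canopy_clustering(points, t1, t2, cheap_distance=manhattan):
    """Partition points into overlapping canopies.

    t1 is the loose threshold (canopy membership) and t2 the tight one
    (removal from the candidate list); both are hypothetical parameters.
    Returns a list of (center_index, member_indices) pairs.
    """
    if not (t1 > t2 > 0):
        raise ValueError("thresholds must satisfy t1 > t2 > 0")

    remaining = list(range(len(points)))
    canopies = []
    while remaining:
        center = remaining[0]  # any remaining point may serve as a center
        members = []
        still_remaining = []
        for i in remaining:
            d = cheap_distance(points[center], points[i])
            if d < t1:
                members.append(i)  # loosely close: joins this canopy
            if d >= t2:
                still_remaining.append(i)  # not tightly close: may join later canopies
        canopies.append((center, members))
        remaining = still_remaining
    return canopies


if __name__ == "__main__":
    data = [(0, 0), (1, 0), (3, 0), (10, 0), (11, 0)]
    for center, members in canopy_clustering(data, t1=4, t2=2):
        print(center, members)
```

With these sample points, the point at index 2 lands in two canopies, which is the overlap the abstract refers to. Exact distances would then only need to be computed between points that share a canopy.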

Published In : IJCAT Journal Volume 2, Issue 2

Date of Publication : March 2015

Pages : 111 - 117

Figures : 01

Tables : 05

Publication Link : Study of K-Means Algorithm with Canopy Clustering Algorithm in Hadoop

Aniket Gavhale : appeared for the B. E. in Computer Technology at Rajiv Gandhi College of Engg. & Research, which is affiliated with RTM Nagpur University, Nagpur, India, in 2015. His main areas of interest are Big Data and Data Mining.

Punam Pofale : appeared for the B. E. in Computer Technology at Rajiv Gandhi College of Engg. & Research, which is affiliated with RTM Nagpur University, Nagpur, India, in 2015. Her main areas of interest are Big Data and Data Mining.

Sushil Samarth : appeared for the B. E. in Computer Technology at Rajiv Gandhi College of Engg. & Research, which is affiliated with RTM Nagpur University, Nagpur, India, in 2015. His main area of interest is Data Mining.

Sayali Baitule : appeared for the B. E. in Computer Technology at Rajiv Gandhi College of Engg. & Research, which is affiliated with RTM Nagpur University, Nagpur, India, in 2015. Her main area of interest is Data Mining.

Data Mining

Clustering

K-Means Clustering

Canopy Clustering

Hadoop

Map Reduce Program Model

The canopy clustering algorithm is an unsupervised pre-clustering algorithm, often used as a preprocessing step for the K-Means algorithm or for hierarchical clustering. It is intended to speed up clustering operations on large data sets. Since the algorithm uses distance functions and requires the specification of distance thresholds, its applicability to high-dimensional data is limited by the curse of dimensionality. Only when a cheap, approximate (low-dimensional) distance function is available will the produced canopies preserve the clusters produced by K-Means. The method reduces the number of instance comparisons at each step, and there is some evidence that the resulting clusters are improved. Canopy clustering is a very simple, fast and surprisingly accurate method for grouping objects into clusters, so it can be used with the MapReduce model on a Hadoop cluster to enhance clustering techniques. The canopies idea can also be used to speed up prototype-based clustering methods such as K-Means and Expectation-Maximization (EM).
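One way to picture canopy clustering as a preprocessing step for K-Means is to use canopy centers as the initial centroids and then run K-Means iterations in a map/reduce style. The sketch below does this in a single Python process rather than on Hadoop. It assumes the canopy centers have already been produced by a pre-clustering pass, and the function names (`kmeans_map`, `kmeans_reduce`, `kmeans`) and the `max_iter` parameter are hypothetical, not taken from the paper or from any library.

```python
from collections import defaultdict


def sq_euclidean(a, b):
    """Exact (squared Euclidean) distance used by K-Means."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_map(point, centroids):
    """Map step: emit (index of the nearest centroid, point)."""
    nearest = min(range(len(centroids)),
                  key=lambda k: sq_euclidean(point, centroids[k]))
    return nearest, point


def kmeans_reduce(assigned_points):
    """Reduce step: average all points assigned to one centroid."""
    n = len(assigned_points)
    return tuple(sum(coords) / n for coords in zip(*assigned_points))


def kmeans(points, seeds, max_iter=20):
    """Iterate map and reduce steps until the centroids stop moving.

    `seeds` are the initial centroids; here they are assumed to be
    canopy centers from an earlier pre-clustering pass.
    """
    centroids = list(seeds)
    for _ in range(max_iter):
        groups = defaultdict(list)
        for p in points:
            key, value = kmeans_map(p, centroids)
            groups[key].append(value)  # stands in for the shuffle phase
        updated = [kmeans_reduce(groups[k]) if k in groups else centroids[k]
                   for k in range(len(centroids))]
        if updated == centroids:
            break
        centroids = updated
    return centroids


if __name__ == "__main__":
    data = [(0, 0), (1, 0), (3, 0), (10, 0), (11, 0)]
    canopy_seeds = [(0, 0), (3, 0), (10, 0)]  # assumed output of a canopy pass
    print(kmeans(data, canopy_seeds))
```

Seeding from canopy centers also fixes the number of clusters without choosing it by hand, at the cost of making the result depend on the distance thresholds used in the canopy pass.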

[1] Mahesh Maurya, Sunita Mahajan, “Performance analysis of MapReduce Programs on Hadoop cluster”, IEEE.
[2] Jing Zhang, Xindong Wu, “A 2-Tier Clustering Algorithm with Map-Reduce”, IEEE, 2010.
[3] A. McCallum, K. Nigam and L. H. Ungar, “Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching”, Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining, 2000.
[4] A. M. Fahim, A. M. Salem, F. A. Torkey and M. A. Ramadan, “An Efficient enhanced k-means clustering algorithm”, journal of Zhejiang
[5] K. A. Abdul Nazeer and M. P. Sebastian, “Improving the accuracy and efficiency of the k-means clustering algorithm”, in International Conference on Data Mining and Knowledge Engineering (ICDMKE), Proceedings of the World Congress on Engineering (WCE-2009).
[6] Apache Hadoop. http://hadoop.apache.org/
[7] http://mahout.apache.org/users/clustering/canopyclustering.html
[8] http://en.wikipedia.org/wiki/Canopy_clustering_algorithm
[9] Tom White, “Hadoop: The Definitive Guide”, O’Reilly Media, Inc., 1005 Gravenstein Highway North, Sebastopol, CA 95472, 2009.