Traditional data mining algorithms hit severe bottlenecks when
they deal with large data sets. Canopy clustering is a technique
for clustering large, high-dimensional data sets. The main idea
is to use an inexpensive, approximate distance measure to
efficiently partition the data into overlapping subsets called
canopies. Once the canopies are built, the desired clustering is
performed by measuring exact distances only between points that
occur in a common canopy. Using canopies, large clustering
problems that were formerly impossible become practical and
efficient. K-Means is a typical distance-based clustering
algorithm. Here, the canopy clustering algorithm is implemented
as an efficient clustering technique by means of knowledge
integration. From our study of canopy clustering, we find that
the K-Means paradigm of computing is appropriate for the
implementation of a clustering algorithm. This paper shows some
advantages of canopy clustering over the K-Means clustering
mechanism and proposes a pre-clustering approach to the K-Means
clustering method. We use Hadoop’s MapReduce programming model
for K-Means clustering with canopy clustering.
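The canopy construction described above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: it assumes the usual two-threshold formulation (a loose threshold `t1` and a tight threshold `t2`), and the function names, the threshold values, and the choice of "first coordinate only" as the cheap distance are all hypothetical.

```python
import random


def cheap_distance(a, b):
    # Assumed cheap measure: compare only the first coordinate.
    return abs(a[0] - b[0])


def build_canopies(points, t1, t2, distance=cheap_distance, seed=0):
    """Group point indices into overlapping canopies (requires 0 < t2 < t1)."""
    if not 0 < t2 < t1:
        raise ValueError("thresholds must satisfy 0 < t2 < t1")
    rng = random.Random(seed)
    candidates = list(range(len(points)))  # points that may still become centers
    canopies = []
    while candidates:
        center = rng.choice(candidates)
        dists = [distance(points[center], p) for p in points]
        # Every point within the loose threshold joins this canopy,
        # so a point can belong to several canopies.
        members = [i for i, d in enumerate(dists) if d < t1]
        canopies.append((center, members))
        # Points within the tight threshold can no longer start a canopy.
        candidates = [i for i in candidates if dists[i] >= t2]
    return canopies


def candidate_pairs(canopies):
    """Index pairs that share a canopy; only these need an exact distance."""
    pairs = set()
    for _, members in canopies:
        for a in members:
            for b in members:
                if a < b:
                    pairs.add((a, b))
    return pairs


# Hypothetical toy data: two groups that are well separated on the first axis.
points = [(0.0, 0.0), (0.5, 9.0), (1.0, 0.2), (10.0, 0.0), (10.4, 5.0)]
canopies = build_canopies(points, t1=3.0, t2=1.5)
pairs = candidate_pairs(canopies)
print(len(canopies), sorted(pairs))
```

On this toy input the five points fall into two canopies, so only 4 of the 10 possible point pairs would need an exact distance computation.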
Published In : IJCAT Journal Volume 2, Issue 2
Date of Publication : March 2015
Pages : 111 - 117
Figures : 01
Tables : 05
Publication Link : Study of K-Means Algorithm with Canopy Clustering
Algorithm in Hadoop
Aniket Gavhale : appeared for the B.E. in Computer
Technology from Rajiv Gandhi College of
Engg. & Research, which is affiliated to RTM
Nagpur University, Nagpur, India, in 2015.
His main areas of interest are Big Data and Data
Mining.
Punam Pofale : appeared for the B.E. in Computer
Technology from Rajiv Gandhi College of
Engg. & Research, which is affiliated to RTM
Nagpur University, Nagpur, India, in 2015.
Her main areas of interest are Big Data and Data
Mining.
Sushil Samarth : appeared for the B.E. in Computer
Technology from Rajiv Gandhi College of
Engg. & Research, which is affiliated to RTM
Nagpur University, Nagpur, India, in 2015.
His main area of interest is Data Mining.
Sayali Baitule : appeared for the B.E. in Computer
Technology from Rajiv Gandhi College of
Engg. & Research, which is affiliated to RTM
Nagpur University, Nagpur, India, in 2015.
Her main area of interest is Data Mining.
Data Mining
Clustering
K-Means Clustering
Canopy Clustering
Hadoop
Map Reduce Program Model
The canopy clustering algorithm is an unsupervised
pre-clustering algorithm, often used as a preprocessing step
for the K-means algorithm or a hierarchical clustering
algorithm. It is intended to speed up clustering
operations on large data sets. Since the algorithm uses
distance functions and requires the specification of
distance thresholds, its applicability to high-dimensional
data is limited by the curse of dimensionality. Only when
a cheap, approximate (low-dimensional) distance
function is available will the produced canopies preserve
the clusters produced by K-means. The new method
reduces the number of instance comparisons at
each step, and there is some evidence that the resulting
clusters are improved. Canopy clustering is a very simple,
fast and surprisingly accurate method for grouping objects
into clusters, so it can be used with the MapReduce model
on a Hadoop cluster to enhance clustering
techniques. The canopies idea can also be used to speed
up prototype-based clustering methods such as K-means and
Expectation-Maximization (EM).
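To make the pre-clustering idea concrete, the sketch below runs K-means as a sequence of map and reduce steps in plain Python, starting from centroids that stand in for canopy centers. It is an illustration under stated assumptions, not the paper's Hadoop job: seeding K-means with canopy centers is one common way to combine the two algorithms, and the function names, the sample points, and the initial centroids are hypothetical.

```python
import math
from collections import defaultdict


def euclidean(a, b):
    # Exact (expensive) distance used by K-means.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans_map(point, centroids):
    """Map step: emit (index of the nearest centroid, (point, 1))."""
    nearest = min(range(len(centroids)),
                  key=lambda k: euclidean(point, centroids[k]))
    return nearest, (point, 1)


def kmeans_reduce(values):
    """Reduce step: average all points assigned to one centroid."""
    dims = len(values[0][0])
    sums = [0.0] * dims
    count = 0
    for point, n in values:
        for d in range(dims):
            sums[d] += point[d]
        count += n
    return tuple(s / count for s in sums)


def kmeans_iteration(points, centroids):
    """One map -> shuffle -> reduce round, simulated in memory."""
    groups = defaultdict(list)
    for p in points:
        key, value = kmeans_map(p, centroids)
        groups[key].append(value)  # the "shuffle": group values by key
    updated = list(centroids)      # a centroid with no points stays where it is
    for key, values in groups.items():
        updated[key] = kmeans_reduce(values)
    return updated


def run_kmeans(points, initial_centroids, max_iterations=20, tolerance=1e-9):
    centroids = list(initial_centroids)
    for _ in range(max_iterations):
        updated = kmeans_iteration(points, centroids)
        shift = max(euclidean(a, b) for a, b in zip(centroids, updated))
        centroids = updated
        if shift < tolerance:
            break
    return centroids


# Hypothetical data; the initial centroids play the role of canopy centers.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (10.0, 10.0), (11.0, 10.0)]
canopy_centers = [(0.0, 0.0), (10.0, 10.0)]
centroids = run_kmeans(points, canopy_centers)
print(centroids)
```

Because the starting centroids already sit inside the two groups, this toy run settles after very few rounds. On a real Hadoop cluster each round would be a separate MapReduce job, so fewer rounds is where a good pre-clustering step would be expected to help.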