Human Genome Data Clustering Using K-Means Algorithm


In medical science, the K-means algorithm can be applied to form clusters so that data can be arranged in a specific format. This paper explores human genome data and the handling of this data. Data is collected from the Human Genome Project site, and model text is used. Data pre-processing steps from data mining are applied to remove inconsistencies in the obtained data. Using cluster analysis, the data is then grouped with a modified K-means algorithm. The resulting clusters are arranged in groups, which makes human genome information easier to access. Furthermore, outlier detection becomes possible, so genetic disorders can be identified.
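As an illustration of the clustering step, the following is a minimal sketch of the standard (Lloyd's) K-means procedure in plain Python. It is not the modified K-means algorithm used in the paper, whose details are not given here. The function name `kmeans`, the optional `init` parameter, and the sample values are assumptions made for this example only.

```python
import random


def kmeans(points, k, iterations=100, seed=0, init=None):
    """Standard Lloyd's K-means on a list of equal-length numeric tuples.

    `init` is a hypothetical convenience parameter: an optional list of k
    starting centroids. Without it, k points are sampled at random.
    """
    rng = random.Random(seed)
    centroids = list(init) if init is not None else rng.sample(points, k)
    assignments = [None] * len(points)

    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        new_assignments = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
            for p in points
        ]
        if new_assignments == assignments:
            break  # no point changed cluster, so the procedure has converged
        assignments = new_assignments

        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))

    return centroids, assignments


# Made-up records with two numeric attributes each (illustrative values only).
sample = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = kmeans(sample, k=2, init=[(1.0, 1.0), (8.0, 8.0)])
```

With these starting centroids, the first three records fall into one cluster and the last three into the other. Real genome attributes would normally be scaled before clustering, since K-means is sensitive to differences in attribute ranges.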

Published In : IJCAT Journal Volume 1, Issue 4

Date of Publication : 31 May 2014

Pages : 96 - 99

Figures : 06

Tables : 01

Publication Link : Human Genome Data Clustering Using K-Means Algorithm

Amrita A. Kulkarni : Department of C.S.E., GHRAET, Nagpur University, Nagpur, Maharashtra, India

Prof. Deepak Kapgate : Department of C.S.E., GHRAET, Nagpur University, Nagpur, Maharashtra, India

Human Genome

Clustering, K-means Algorithm

Pre-processing

Human genome data was analyzed in this paper. The format of the human genome data analysis result was described, and a few of the attributes were selected for processing based on domain knowledge. The KDD steps were explained and applied to the human genome data to convert the raw data into transformed data, which was then used to generate more knowledge from the system. Various clusters were formed based on the numerical attributes of the human genome data. When inspection of the data shows that an attribute's numerical value is absent, an outlier detection technique is applied. If the missing value is an error, a value can be inserted; however, if the value is genuinely not present, the missing value can be analyzed further and genetic disorders can be identified.
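The handling of absent attribute values described above can be sketched as follows. This is only an illustration under stated assumptions: the record layout, the attribute names `length` and `exons`, and the helper names `find_missing` and `impute_mean` are hypothetical, and mean imputation is one possible way to "insert" a value, not necessarily the one used in the paper.

```python
def find_missing(records, attributes):
    """Map each record index to the list of attributes whose value is absent.

    A value counts as absent if the key is missing or its value is None.
    """
    missing = {}
    for i, rec in enumerate(records):
        absent = [a for a in attributes if rec.get(a) is None]
        if absent:
            missing[i] = absent
    return missing


def impute_mean(records, attribute):
    """Return copies of the records with absent values of `attribute`
    replaced by the mean of the values that are present (an assumed strategy).
    """
    present = [r[attribute] for r in records if r.get(attribute) is not None]
    if not present:
        return [dict(r) for r in records]  # nothing to compute a mean from
    mean = sum(present) / len(present)
    return [
        {**r, attribute: mean} if r.get(attribute) is None else dict(r)
        for r in records
    ]


# Made-up records; attribute names and values are illustrative only.
records = [
    {"id": "g1", "length": 1000.0, "exons": 4},
    {"id": "g2", "length": None, "exons": 6},
    {"id": "g3", "length": 3000.0},
]

flagged = find_missing(records, ["length", "exons"])
repaired = impute_mean(records, "length")
```

Records listed in `flagged` whose absence is judged to be a recording error could be passed through `impute_mean`, while the remaining ones would be kept aside for further analysis, as the conclusion suggests.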
