Predicting the molecular class of an unknown
protein is highly relevant to research on disease
detection and the corresponding drug discovery
processes, and it is a challenging task. Several approaches have
been used in the past to increase the accuracy of Human Protein
Function (HPF) prediction. This research concentrates on
one such approach: HPF prediction from
sequence derived features (SDF) using decision trees and
their variants, implemented with the C5 algorithm. More
sequence derived features were identified and incorporated, and
the training data, drawn from the Human Protein Reference
Database (HPRD), was enhanced in terms of both the
number of sequences and the features used to relate a
sequence to a specific class. Multiple techniques were
tested for prediction accuracy, and a comprehensive
comparison was made among them and against the results of
previous research.
Sunny Sharma : Department of Computer Science,
Guru Nanak Dev University, Amritsar, India
Amritpal Singh : Department of Computer Science,
Guru Nanak Dev University, Amritsar, India
Dr. Rajinder Singh : Department of Computer Science,
Guru Nanak Dev University, Amritsar, India
HPF, C5, See5, Decision Tree, SDF
The present work focuses on the usability of the See5 tool in HPF
prediction and also demonstrates the impact of choosing the
right training data. The detailed analysis shows that
increasing the number of features (5 features) of the HPF data
increases the accuracy of the prediction process (about
16%) but does not necessarily involve
all parameters in the decision-making process. Some
parameters were more dominant than others (for example GRAVY 13%, Solubility 8%, Thr 4%), and hence they decide the course
of prediction. Activities like advanced pruning and
winnowing (17 attributes winnowed) help in minimizing
the computation time and also help in identifying the most
important parameters involved in the prediction process
(ExpAA came out as the most important parameter after
winnowing). In future, more features can be extracted from
more sequences and their relative impact on the prediction
process can be examined, which should lead to greater
precision in the HPF identification process. Including a
comparison feature in the See5 tool would be of great value,
as it would help researchers identify the correct ruleset
and the role of a newly incorporated feature in the HPF
prediction scenario.
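
As an illustration of what a sequence derived feature looks like in practice, the following minimal sketch computes two of the features named above, GRAVY and the threonine (Thr) percentage, from a raw amino acid sequence. It assumes GRAVY is the mean Kyte-Doolittle hydropathy value over the sequence, which is the usual definition; the function names, the example sequence and the class label are hypothetical and are not taken from the original experiments.

```python
# Kyte-Doolittle hydropathy value for each of the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def gravy(sequence: str) -> float:
    """Grand average of hydropathy: mean hydropathy over standard residues."""
    residues = [r for r in sequence.upper() if r in KYTE_DOOLITTLE]
    if not residues:
        raise ValueError("sequence contains no standard amino acids")
    return sum(KYTE_DOOLITTLE[r] for r in residues) / len(residues)


def residue_percent(sequence: str, residue: str) -> float:
    """Percentage of positions in the sequence occupied by one residue."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    return 100.0 * seq.count(residue.upper()) / len(seq)


def feature_row(sequence: str, label: str) -> dict:
    """Build one training record (hypothetical layout) for a classifier."""
    return {
        "GRAVY": round(gravy(sequence), 3),
        "Thr": round(residue_percent(sequence, "T"), 2),
        "class": label,
    }


if __name__ == "__main__":
    # Hypothetical sequence and class label, for illustration only.
    print(feature_row("MKTAYIAKQRTT", "example_class"))
```

Each record produced this way corresponds to one row of training data. Further features such as Solubility or ExpAA would be added as extra columns by the same pattern, although they require dedicated predictors rather than a simple lookup table.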