1. Department of Computer Engineering, Islamic Azad University, Mashhad branch
2. Department of Computer Science & Software Engineering, Faculty of Engineering and Computer Science, Concordia University
3. Department of Computer Engineering, Islamic Azad University, Mashhad branch
Abstract
Introduction
Exploring gene function is of great importance in health science research. Developing gene classifiers that predict accurately is therefore a crucial and desirable research goal.
Methods
In this paper, we introduce a hybrid model for classifying 24 genes related to infectious disease among many unrelated genes. This is a highly imbalanced dataset, in which the number of instances of one class is much lower than that of the other. When a dataset is imbalanced, minority-class samples tend to be misclassified because the classifier learns the true class boundaries incorrectly. Our model therefore applies clustering to undersample the negative genes and the SMOTE oversampling method to increase the number of positive gene samples. We select a decision tree model for classification and combine an ensemble of classifiers using a majority voting technique.
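As a minimal sketch of two of the steps named above, the following Python code illustrates SMOTE-style interpolation between minority-class neighbours and majority voting over classifier outputs. It is an illustration under stated assumptions, not the implementation used in this work: the function names, the parameter `k`, the Euclidean distance, and the toy points are all hypothetical, and the clustering-based undersampling and decision tree training are omitted.

```python
import math
import random
from collections import Counter


def smote_like_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each placed on the segment between a
    minority sample and one of its k nearest minority-class neighbours.
    Assumes at least two minority samples (hypothetical helper)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest neighbours of `base` within the minority class, excluding itself.
        others = [p for p in minority if p is not base]
        neighbours = sorted(others, key=lambda p: math.dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # relative position along base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


def majority_vote(predictions):
    """Combine one label list per classifier into one label per sample."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


# Toy data, made up for illustration only.
positives = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
new_points = smote_like_oversample(positives, n_new=5, k=2)

# Three hypothetical classifiers, each predicting labels for three samples.
labels = majority_vote([[1, 0, 0], [1, 0, 1], [0, 0, 1]])
```

Because each synthetic point is a convex combination of two existing positive samples, the new points stay inside the region already occupied by the minority class. Using an odd number of classifiers avoids ties in a two-class vote.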
Results
We succeeded in building a classifier that classifies a large, highly imbalanced dataset with 81.12% accuracy, 79% sensitivity and 89% specificity. Our model could also be applied to similar datasets.
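For reference, the three reported measures follow the standard confusion-matrix definitions, which can be sketched as below. The function name and the counts in the example are hypothetical and chosen only for illustration; they are not the counts from this study.

```python
def classification_metrics(tp, fn, tn, fp):
    """Compute accuracy, sensitivity and specificity from confusion-matrix
    counts (tp/fn refer to the positive class, tn/fp to the negative class)."""
    total = tp + fn + tn + fp
    return {
        "accuracy": (tp + tn) / total,    # all correct predictions
        "sensitivity": tp / (tp + fn),    # fraction of positives recovered
        "specificity": tn / (tn + fp),    # fraction of negatives recovered
    }


# Made-up counts for an imbalanced test set: 10 positives, 100 negatives.
metrics = classification_metrics(tp=8, fn=2, tn=90, fp=10)
```

On imbalanced data, accuracy is dominated by the majority class, which is why sensitivity and specificity are reported alongside it.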
Conclusion
According to our simulation study, the proposed approach improves classification performance compared with other similar approaches in the literature. Furthermore, the results indicate that the SMOTE method is suitable for reducing the error rate.
Keywords
Gene classification, imbalanced dataset, cluster-based undersampling, SMOTE, ensemble, decision tree