Derrick KR and Varayini Pankayatselvan
Background: Highly complex and computationally intensive methods based on the Synthetic Minority Over-sampling Technique (SMOTE) and, more recently, Learning Vector Quantization SMOTE (LVQ-SMOTE) have been proposed for classification of imbalanced biomedical data. This work presents a much simpler approach that is not computationally intensive and competes well with existing approaches. It uses principal component analysis (PCA) to generate a pseudo-variable as a linear combination of the features. From this single pseudo-variable, several classification methods are developed that classify cases directly from very simple statistics. One method, the Mean Method (MM), classifies each case according to which of the two class means, estimated from the training data, its pseudo-variable value is closer to. When the number of features is very large, a feature reduction (FR) procedure is proposed to reduce misclassifications. When the means of the two classes are similar but the spreads about those means differ, the Spread Method (SM) is proposed. A unique feature of this method is that the accuracy of classification can be traded off between the two classes by changing the width of the window used to allocate cases. The proposed methods are found to perform well without the use of over-sampling techniques or multiple-fold cross-validation.
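The Mean Method can be illustrated with a minimal sketch. The abstract does not give the implementation details, so the following are assumptions made only for illustration: the pseudo-variable is taken to be the first principal component of the pooled, mean-centered training features, and the function names `fit_mean_method` and `predict_mean_method` are hypothetical.

```python
import numpy as np

def fit_mean_method(X, y):
    """Fit a one-component PCA and the per-class means of the pseudo-variable.

    Assumption: the pseudo-variable is the score on the first principal
    component of the pooled, mean-centered training features.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    center = X.mean(axis=0)
    # The first right-singular vector of the centered data is the first
    # principal axis, i.e. the weights of the linear combination.
    _, _, vt = np.linalg.svd(X - center, full_matrices=False)
    weights = vt[0]
    z = (X - center) @ weights
    class_means = {label: z[y == label].mean() for label in np.unique(y)}
    return center, weights, class_means

def predict_mean_method(model, X):
    """Assign each case to the class whose pseudo-variable mean is closest."""
    center, weights, class_means = model
    z = (np.asarray(X, dtype=float) - center) @ weights
    labels = np.array(list(class_means.keys()))
    means = np.array([class_means[label] for label in labels])
    nearest = np.abs(z[:, None] - means[None, :]).argmin(axis=1)
    return labels[nearest]
```

Because the rule compares distances to the two class means rather than counting cases, the minority class is not penalized for its small sample size, which is consistent with the claim that no over-sampling is needed.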
Results: The MM, or the MM with FR, was compared directly to recently published results for LVQ-SMOTE on six (6) data sets and gave better or much better results in every case, as measured by the sum of the percent of true positives and the percent of true negatives. The SM was compared with LVQ-SMOTE on two (2) data sets, and operating window widths were found for which the SM gave much better results than LVQ-SMOTE.
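The comparison measure can be sketched as follows. This assumes that "percent of true positives" means the percentage of actual positive cases classified as positive, and likewise for negatives; the function name `tp_plus_tn_percent` and the `positive` parameter are hypothetical.

```python
def tp_plus_tn_percent(y_true, y_pred, positive=1):
    """Sum of percent true positives and percent true negatives (0 to 200).

    Assumption: each percentage is taken within its own actual class,
    so both classes contribute equally regardless of imbalance.
    """
    pos = [p for t, p in zip(y_true, y_pred) if t == positive]
    neg = [p for t, p in zip(y_true, y_pred) if t != positive]
    if not pos or not neg:
        raise ValueError("both classes must be present in y_true")
    tp_pct = 100.0 * sum(p == positive for p in pos) / len(pos)
    tn_pct = 100.0 * sum(p != positive for p in neg) / len(neg)
    return tp_pct + tn_pct
```

Under this reading, a classifier that labels every case as the majority class scores only 100 out of a possible 200, so the measure does not reward ignoring the minority class.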
Conclusion: Given the simplicity, strengths, and performance of the proposed approach in comparison to current methods, these methods and procedures are recommended for classifying imbalanced biomedical data.