
RANDSHUFF: an Algorithm to Handle Imbalance Class for Qualitative Data




DOI: https://doi.org/10.15866/irecos.v11i12.10956

Abstract


Class imbalance occurs when the proportion of training data differs markedly between classes: the larger class is known as the “major class” and the smaller as the “minor class”. Imbalance is believed to degrade the accuracy of data mining algorithms. Researchers currently distinguish three main factors of class imbalance that affect accuracy: overlap, small disjuncts, and outliers. The problem is generally addressed by modifications at either the data level or the algorithm level. To overcome imbalance problems, we propose a new algorithm called RANDSHUFF (Random Shuffle Oversampling Technique for Qualitative Data), which generates synthetic oversampled data for qualitative attributes. RANDSHUFF uses the concept of neighborhood with the IVDM (Interpolated Value Difference Metric) distance and crosses over the original attribute values with their neighbors’ attribute values using a random shuffle technique. Our experimental results show that RANDSHUFF, combined with the Borderline and ADASYN concepts, gives the best results on seven imbalanced public qualitative datasets (best minor-class Recall on the hepatitis, breast cancer, and German data, and best minor-class F-Measure on the hepatitis, abalone, and German data).
Copyright © 2016 Praise Worthy Prize - All rights reserved.
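The abstract outlines the core idea: pick a minority sample, find its nearest minority neighbors under a value-difference distance, and build a synthetic sample by randomly shuffling (crossing over) attribute values between the sample and a neighbor. The following is a minimal illustrative sketch only, not the paper's implementation: it substitutes a plain VDM for IVDM (which additionally interpolates over discretized continuous attributes), omits the Borderline/ADASYN weighting, and all function names are ours, not the authors'.

```python
import random
from collections import Counter, defaultdict

def fit_cond_probs(X, y):
    """Per attribute, estimate P(class | attribute value) from the data."""
    probs = []
    for i in range(len(X[0])):
        counts = defaultdict(Counter)
        for row, label in zip(X, y):
            counts[row[i]][label] += 1
        probs.append({v: {c: n / sum(cnt.values()) for c, n in cnt.items()}
                      for v, cnt in counts.items()})
    return probs

def vdm_distance(a, b, cond_probs, classes):
    """Simplified Value Difference Metric between two categorical rows."""
    d = 0.0
    for i, (va, vb) in enumerate(zip(a, b)):
        for c in classes:
            d += (cond_probs[i][va].get(c, 0.0)
                  - cond_probs[i][vb].get(c, 0.0)) ** 2
    return d

def randshuff_oversample(X, y, minority_label, n_new, k=3, seed=0):
    """Generate n_new synthetic minority rows by random-shuffle crossover."""
    rng = random.Random(seed)
    classes = sorted(set(y))
    probs = fit_cond_probs(X, y)
    minors = [x for x, lab in zip(X, y) if lab == minority_label]
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minors)
        # k nearest minority neighbors under the (simplified) VDM distance
        neighbors = sorted((m for m in minors if m is not base),
                           key=lambda m: vdm_distance(base, m, probs, classes))[:k]
        mate = rng.choice(neighbors)
        # crossover: for each attribute, randomly take the base's or mate's value
        child = [rng.choice([bv, mv]) for bv, mv in zip(base, mate)]
        synthetic.append(child)
    return synthetic
```

Because every synthetic value is drawn from an existing minority row, the generated rows stay within the observed qualitative value domain, which is the property that distinguishes this scheme from interpolation-based methods such as SMOTE on numeric data.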

Keywords


Class Imbalance; Oversampling; Synthetic Data; IVDM; RANDSHUFF; Qualitative Data



References


R. Longadge, S. Dongre, and M. Latesh, “Class Imbalance Problem in Data Mining: Review,” Int. J. Comput. Sci. Netw. (IJCSN), vol. 2, no. 1, 2013.
http://dx.doi.org/10.4018/978-1-5225-0489-4.ch013

P. Thanathamathee and C. Lursinsap, “Handling imbalanced data sets with synthetic boundary data generation using bootstrap re-sampling and AdaBoost techniques,” Pattern Recognit. Lett., vol. 34, no. 12, pp. 1339–1347, 2013.
http://dx.doi.org/10.1016/j.patrec.2013.04.019

K. Napierała, “Improving Rule Classifiers For Imbalanced Data,” Doctoral Dissertation, Faculty of Computer Science, Poznan University of Technology, Piotrowo 2, Poznan, Poland, 2012.
http://dx.doi.org/10.3311/ppci.7880

D. Tomar and S. Agarwal, “A Survey on Pre-processing and Post-processing Techniques in Data Mining,” International Journal of Database Theory and Application, vol. 7, no. 4, pp. 99–128, 2014.
http://dx.doi.org/10.14257/ijdta.2014.7.4.09

Y. Dong and X. Wang, “A new over-sampling approach: Random-SMOTE for learning from imbalanced data sets,” in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 7091 LNAI, Springer Berlin Heidelberg, 2011, pp. 343–352.
http://dx.doi.org/10.1007/978-3-642-25975-3_30

A. Fernández, V. López, M. Galar, M. José, and F. Herrera, “Knowledge-Based Systems Analysing the classification of imbalanced data-sets with multiple classes: Binarization techniques and ad-hoc approaches,” in Knowledge-Based Systems, vol. 42, 2013, pp. 97–110.
http://dx.doi.org/10.1016/j.knosys.2013.01.018

H. He and E. A. Garcia, “Learning from imbalanced data,” IEEE Trans. Knowl. Data Eng., vol. 21, no. 9, pp. 1263–1284, 2009.
http://dx.doi.org/10.1109/tkde.2008.239

N. Verbiest, E. Ramentol, C. Cornelis, and F. Herrera, “Preprocessing noisy imbalanced datasets using SMOTE enhanced with fuzzy rough prototype selection,” Appl. Soft Comput. J., vol. 22, pp. 511–517, 2014.
http://dx.doi.org/10.1016/j.asoc.2014.05.023

D. R. Wilson and T. R. Martinez, “Improved heterogeneous distance functions,” J. Artif. Intell. Res., vol. 6, pp. 1–34, 1997.
http://dx.doi.org/10.1142/s0218001412500127

A. M. Mahmood, “Class Imbalance Learning in Data Mining – A Survey,” Int. J. Commun. Technol. Soc. Netw. Serv., vol. 3, no. 2, pp. 17–36, 2015.
http://dx.doi.org/10.21742/ijctsns.2015.3.2.02

N. V. Chawla, K. W. Bowyer, L. O. Hall, and W. P. Kegelmeyer, “SMOTE: Synthetic Minority Over-sampling Technique,” Journal of Artificial Intelligence Research, vol. 16, pp. 321–357, 2002.
http://dx.doi.org/10.1613/jair.953

L. Abdi and S. Hashemi, “To Combat Multi-Class Imbalanced Problems by Means of Over-Sampling Techniques,” IEEE Trans. Knowl. Data Eng., vol. 28, no. 1, pp. 238–251, 2016.
http://dx.doi.org/10.1109/tkde.2015.2458858

C. Huang, Y. Li, C. C. Loy, and X. Tang, “Learning Deep Representation for Imbalanced Classification,” IEEE Conf. Comput. Vis. Pattern Recognit., 2016.
http://dx.doi.org/10.1109/cvpr.2016.580

N. V. Chawla, A. Lazarevic, L. O. Hall, and K. Bowyer, “SMOTEBoost: improving prediction of the minority class in boosting,” 7th European Conference on Principles and Practice of Knowledge Discovery in Databases (PKDD), PKDD-2003, pp. 107–119, 2003.
http://dx.doi.org/10.1007/978-3-540-39804-2_12

F. S. Hanifah, “SMOTE Bagging Algorithm for Imbalanced Dataset in Logistic Regression Analysis ( Case: Credit of Bank X ),” Appl. Math. Sci., vol. 9, no. 138, pp. 6857–6865, 2015.
http://dx.doi.org/10.12988/ams.2015.58562

T. R. Hoens and N. V. Chawla, “Generating diverse ensembles to counter the problem of class imbalance,” Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 6119 LNAI, no. PART 2, pp. 488–499, 2010.
http://dx.doi.org/10.1007/978-3-642-13672-6_46

E. Ramentol, Y. Caballero, R. Bello, and F. Herrera, “SMOTE-RSB *: A hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory,” Knowl. Inf. Syst., vol. 33, no. 2, pp. 245–265, 2012.
http://dx.doi.org/10.1007/s10115-011-0465-6

G. E. A. P. A. Batista, R. C. Prati, and M. C. Monard, “A Study of the Behavior of Several Methods for Balancing Machine Learning Training Data,” SIGKDD Explor. Newsl. (Special issue on learning from imbalanced datasets), vol. 6, no. 1, pp. 20–29, 2004.
http://dx.doi.org/10.1145/1007730.1007735

V. García, A. I. Marqués, and J. S. Sánchez, “Improving risk predictions by preprocessing imbalanced credit data,” Lect. Notes Comput. Sci. (including Subser. Lect. Notes Artif. Intell. Lect. Notes Bioinformatics), vol. 7664 LNCS, no. PART 2, pp. 68–75, 2012.
http://dx.doi.org/10.1007/978-3-642-34481-7_9

C. Bunkhumpornpat, K. Sinapiromsaran, and C. Lursinsap, “DBSMOTE: Density-based synthetic minority over-sampling technique,” Appl. Intell., vol. 36, no. 3, pp. 664–684, 2012.
http://dx.doi.org/10.1007/s10489-011-0287-y

M. Kubat and S. Matwin, “Addressing the Curse of Imbalanced Training Sets: One Sided Selection,” Proc. Fourteenth Int. Conf. Mach. Learn., vol. 4, no. 1, pp. 179–186, 1997.
http://dx.doi.org/10.1007/s10994-005-0464-5

C. Bunkhumpornpat, K. Sinapiromsaran, and C. Lursinsap, “Safe-Level-SMOTE: Safe-Level-Synthetic Minority Over-Sampling TEchnique,” Lect. Notes Comput. Sci., vol. 5476, pp. 475–482, Springer Berlin Heidelberg, 2009.
http://dx.doi.org/10.1007/978-3-642-01307-2_43

H. Han, W. Wang, and B. Mao, “Borderline-SMOTE: A New Over-Sampling Method in Imbalanced Data Sets Learning,” in International Conference on Intelligent Computing, ICIC 2005, Hefei, China, August 23-26, 2005, Proceedings, Part I, Springer Berlin Heidelberg, 2005, pp. 878–887.
http://dx.doi.org/10.1007/11538059_91

H. He, Y. Bai, E. A. Garcia, and S. Li, “ADASYN: Adaptive Synthetic Sampling Approach for Imbalanced Learning,” in Proc. IEEE International Joint Conference on Neural Networks (IJCNN 2008, IEEE World Congress on Computational Intelligence), 2008, pp. 1322–1328.
http://dx.doi.org/10.1109/ijcnn.2008.4633969

H. Guo and H. L. Viktor, “Learning from Imbalanced Data Sets with Boosting and Data Generation: The DataBoost-IM Approach,” ACM SIGKDD Explor. Newsl. (Special issue on learning from imbalanced datasets), vol. 6, no. 1, pp. 30–39, 2004.
http://dx.doi.org/10.1145/1007730.1007736

H. Wang and W. Dubitzky, “A flexible and robust similarity measure based on contextual probability,” IJCAI Int. Jt. Conf. Artif. Intell., pp. 27–32, 2005.
http://dx.doi.org/10.1142/s0218001412500097

J. H. Holland, “Genetic Algorithms,” Scientific American, vol. 267, no. 1, pp. 66–72, 1992.
http://dx.doi.org/10.1038/scientificamerican0792-66

W. Y. Lin, W. Y. Lee, and T. P. Hong, “Adapting crossover and mutation rates in genetic algorithms,” J. Inf. Sci. Eng., vol. 19, no. 5, pp. 889–903, 2003.
http://dx.doi.org/10.1109/icsmc.2010.5642209

J. Magalhães-Mendes, “A comparative study of crossover operators for genetic algorithms to solve the job shop scheduling problem,” WSEAS Trans. Comput., vol. 12, no. 4, pp. 164–173, 2013.
http://dx.doi.org/10.4236/ajibm.2016.66071

S. Visa and A. Ralescu, “Issues in mining imbalanced data sets - a review paper,” in Proceedings of the Sixteenth Midwest Artificial Intelligence and Cognitive Science Conference, 2005, pp. 67–73.
http://dx.doi.org/10.1109/fuzzy.2005.1452488

M. F. Naufal and S. Rochimah, “Software complexity metric-based defect classification using FARM with preprocessing step CFS and SMOTE a preliminary study,” in 2015 International Conference on Information Technology Systems and Innovation (ICITSI), 2015, pp. 1–6.
http://dx.doi.org/10.1109/icitsi.2015.7437685

J. A. Sáez and B. Krawczyk, “Analyzing the oversampling of different classes and types of examples in multi-class imbalanced datasets,” Pattern Recognit., vol. 57, pp. 164–178, 2016.
http://dx.doi.org/10.1016/j.patcog.2016.03.012




