Effect of Missing Data Treatment on the Predictive Accuracy of C4.5 Classifier

Saeed Shurrab(1), Rehab Duwairi(2*)

(1) Department of Computer Information Systems, Jordan University of Science and Technology, Irbid 22110, Jordan
(2) Department of Computer Information Systems, Jordan University of Science and Technology, Irbid 22110, Jordan
(*) Corresponding author


DOI: https://doi.org/10.15866/irecap.v11i3.19721

Abstract


Missing data is a common problem confronted by researchers in machine learning applications. Missing values affect both the performance of analysis tools and the quality of the decisions drawn from them. This research analyzes the impact of four missing data treatment methods on the predictive accuracy of the C4.5 decision tree algorithm. It also investigates the imputation accuracy of each imputation method using a single dataset with missing values present in a single variable. The work was performed under three missing data assumptions, namely Missing Completely At Random (MCAR), Missing At Random (MAR), and Missing Not At Random (MNAR), with three missingness rates: 5%, 10%, and 15%. The methods used to treat the missing data are list-wise deletion, mean/mode imputation, k-nearest neighbor imputation, and decision tree imputation. The experiments showed that the C4.5 classifier achieved better performance under the MCAR assumption, while mean/mode imputation yielded the highest C4.5 predictive accuracy under the MAR and MNAR assumptions. The k-nearest neighbor method obtained the most accurate imputation result under the MCAR assumption, whereas mean/mode imputation was the most accurate method under the MAR assumption. The lowest imputation accuracy levels were obtained under the MNAR assumption and were attributed to the mean/mode imputation method.
Copyright © 2021 Praise Worthy Prize - All rights reserved.
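
As an illustration of the setup described in the abstract, the following minimal Python sketch injects missing values into a single variable under each of the three assumptions and then applies three of the four treatments (list-wise deletion, mean imputation, and k-nearest neighbor imputation). It is not the authors' implementation. The synthetic data, the column names `x` and `y`, the function names, and the rule of drawing MAR and MNAR cases from the upper half of the driving variable are all assumptions made for this example. Decision tree imputation and the mode variant for categorical variables are omitted.

```python
import random
import statistics


def inject_missing(rows, target, rate, mechanism, driver=None, seed=0):
    """Return a copy of rows with `target` set to None in round(rate * n) rows.

    MCAR: rows are chosen uniformly at random.
    MAR:  rows are chosen among those with high values of an observed
          `driver` column (missingness depends on observed data only).
    MNAR: rows are chosen among those with high values of `target`
          itself (missingness depends on the unobserved value).
    The "upper half" rule is an illustrative assumption.
    """
    rng = random.Random(seed)
    n = len(rows)
    k = round(rate * n)
    if mechanism == "MCAR":
        pool = list(range(n))
    elif mechanism == "MAR":
        pool = sorted(range(n), key=lambda i: rows[i][driver])[n // 2:]
    elif mechanism == "MNAR":
        pool = sorted(range(n), key=lambda i: rows[i][target])[n // 2:]
    else:
        raise ValueError(f"unknown mechanism: {mechanism}")
    chosen = set(rng.sample(pool, k))
    return [dict(r, **{target: None}) if i in chosen else dict(r)
            for i, r in enumerate(rows)]


def listwise_delete(rows, target):
    """Drop every row whose target value is missing."""
    return [dict(r) for r in rows if r[target] is not None]


def mean_impute(rows, target):
    """Replace missing target values with the mean of the observed ones."""
    fill = statistics.mean(r[target] for r in rows if r[target] is not None)
    return [dict(r, **{target: fill}) if r[target] is None else dict(r)
            for r in rows]


def knn_impute(rows, target, features, k=3):
    """Replace each missing target with the mean target of the k complete
    rows closest to it in squared Euclidean distance over `features`."""
    complete = [r for r in rows if r[target] is not None]
    out = []
    for r in rows:
        if r[target] is None:
            nearest = sorted(
                complete,
                key=lambda c: sum((c[f] - r[f]) ** 2 for f in features),
            )[:k]
            r = dict(r, **{target: statistics.mean(c[target] for c in nearest)})
        out.append(dict(r))
    return out


def rmse(truth, imputed, holed, target):
    """Root mean squared error over the cells that were made missing."""
    errs = [(t[target] - i[target]) ** 2
            for t, i, h in zip(truth, imputed, holed) if h[target] is None]
    return (sum(errs) / len(errs)) ** 0.5


# Hypothetical synthetic data: y is correlated with x.
_rng = random.Random(1)
data = []
for _ in range(200):
    x = _rng.random()
    data.append({"x": x, "y": 2.0 * x + _rng.gauss(0.0, 0.1)})

for mechanism in ("MCAR", "MAR", "MNAR"):
    for rate in (0.05, 0.10, 0.15):
        holed = inject_missing(data, "y", rate, mechanism, driver="x")
        print(mechanism, rate,
              "mean RMSE:", rmse(data, mean_impute(holed, "y"), holed, "y"),
              "kNN RMSE:", rmse(data, knn_impute(holed, "y", ["x"]), holed, "y"))
```

RMSE is used here only as one possible measure of imputation accuracy for a numeric variable; the abstract does not state which measure the study used.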

Keywords


C4.5 Classifier; Decision Tree; Missing Values; Imputation; Treatment Methods
