
Analysis of Various Techniques for Solving the Problem of Big Data Classification

Mei Lu(1*)

(*) Corresponding author




DOI: https://doi.org/10.15866/iremos.v14i4.20987

Abstract


The purpose of this study is to assess the effectiveness of several algorithms for big data classification, namely partial least squares discriminant analysis (PLS-DA), Naive Bayes (NBC) and k-Nearest Neighbor (k-NN), based on the Hadoop MapReduce approach. The approaches are compared on the classification of big data sets of average shot lengths (CSV). As the data set size grows, the PLS-DA classification accuracy increases, reaching 82%, while the computation time rises to 45 seconds. The analysis of the classifiers shows that the high accuracy of PLS-DA results from a high percentage of correctly classified positive and negative cases, whereas the lower accuracy of k-NN and Naive Bayes is explained by a high percentage of false-positive and false-negative results. It is concluded that PLS-DA is the optimal classifier, as it classifies a large amount of data with high accuracy in a short time.
Copyright © 2021 Praise Worthy Prize - All rights reserved.
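
The abstract ties classifier accuracy to the share of correctly classified positive and negative cases versus false positives and false negatives. A minimal Python sketch of that relationship follows. The names `ConfusionCounts` and `count_outcomes` and the toy labels are assumptions made for illustration; they are not taken from the paper and do not reproduce its data or results.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ConfusionCounts:
    """Hypothetical container for binary confusion-matrix counts."""
    tp: int  # positives predicted as positive
    tn: int  # negatives predicted as negative
    fp: int  # negatives predicted as positive
    fn: int  # positives predicted as negative

    @property
    def total(self) -> int:
        return self.tp + self.tn + self.fp + self.fn

    @property
    def accuracy(self) -> float:
        # Accuracy is the fraction of all cases that were classified correctly.
        return (self.tp + self.tn) / self.total

    @property
    def error_rate(self) -> float:
        # False positives and false negatives are the only source of error.
        return (self.fp + self.fn) / self.total


def count_outcomes(y_true: Sequence[int], y_pred: Sequence[int],
                   positive: int = 1) -> ConfusionCounts:
    """Tally TP, TN, FP and FN for a binary problem (illustrative helper)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    tp = tn = fp = fn = 0
    for truth, guess in zip(y_true, y_pred):
        if truth == positive and guess == positive:
            tp += 1
        elif truth != positive and guess != positive:
            tn += 1
        elif truth != positive and guess == positive:
            fp += 1
        else:
            fn += 1
    return ConfusionCounts(tp, tn, fp, fn)


# Toy labels, invented for this sketch only.
y_true = [1, 1, 0, 0, 1, 0, 1, 0, 1, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 1, 1, 0]
counts = count_outcomes(y_true, y_pred)
print(counts, counts.accuracy, counts.error_rate)
```

Because accuracy and error rate sum to one, any rise in false positives or false negatives lowers accuracy by the same amount, which is the mechanism the abstract invokes when comparing the three classifiers.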

Keywords


Big Data; Classification Problem; k-Nearest Neighbor; Naive Bayes Classifier; PLS-DA







