Implementing and Validating a Data Mining System Based on Decision Trees

Authors

  • Qutaiba Humadi Mohammed Media Technology and Communications Engineering Department, College of Engineering, University of Information Technology and Communications, Baghdad, Iraq https://orcid.org/0000-0001-9837-3320
  • Chaitanya Konda Department of CSE, Dr.YSR ANU College of Engineering and Technology, Acharya Nagarjuna University, Nagarjuna Nagar, Guntur, Andhra Pradesh 522510, India https://orcid.org/0009-0005-5647-4517
  • Anupama Namburu School of Engineering, Jawaharlal Nehru University, New Delhi 110067, India https://orcid.org/0000-0002-7826-0158

DOI:

https://doi.org/10.31272/jeasd.2998

Keywords:

Data Mining, Decision Trees, Machine Learning, Statistical Analysis, Titanic Dataset

Abstract

Data mining is concerned with extracting valuable patterns and insights from large datasets, and machine learning plays a critical part in streamlining this effort. In supervised classification tasks, decision tree algorithms are particularly useful because they produce understandable models and can be used to evaluate the significance of features. This paper assesses YaDT, a decision tree algorithm, on the well-known Titanic dataset. A rigorous statistical analysis is provided to explore how the model behaves under different evaluation conditions. The findings reveal that YaDT builds a well-organized decision tree with strong generalization ability and high classification performance. However, predictive reliability dropped significantly for some classes with very limited data, which indicates that these classes are poorly represented in the dataset. Additional evaluation of model performance using ROC analysis, based on cross-validation results, demonstrated comparable discriminative ability. Overall, the findings support the hypothesis that YaDT is a resilient, interpretable classifier that can compete with alternative machine learning algorithms.
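The ROC analysis based on cross-validation results mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code or YaDT's interface: the function names `roc_auc` and `pooled_cv_auc`, the label convention (1 = survived), and the fold predictions are all hypothetical and made up for illustration. The sketch assumes each fold yields held-out labels and classifier scores, pools them, and computes the area under the ROC curve by pairwise comparison.

```python
from typing import List, Sequence, Tuple


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via pairwise comparison.

    Equals the fraction of (positive, negative) pairs in which the
    positive instance receives the higher score; ties count as half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


def pooled_cv_auc(folds: List[Tuple[Sequence[int], Sequence[float]]]) -> float:
    """Pool held-out (labels, scores) from every fold, then compute one AUC."""
    labels: List[int] = []
    scores: List[float] = []
    for fold_labels, fold_scores in folds:
        labels.extend(fold_labels)
        scores.extend(fold_scores)
    return roc_auc(labels, scores)


# Hypothetical held-out predictions for two folds (assumed: 1 = survived).
example_folds = [
    ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    ([1, 0, 1, 0], [0.9, 0.2, 0.6, 0.7]),
]
print(pooled_cv_auc(example_folds))
```

Pooling the folds before computing the curve is one possible design choice; averaging per-fold AUC values is a common alternative, and the two can differ when folds are small or class proportions vary between folds.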

Key Dates

Received

2024-09-14

Revised

2025-10-28

Accepted

2025-11-12

Published Online First

2025-12-23

Published

2026-01-01

How to Cite

Mohammed, Q. H., Konda, C., & Namburu, A. (2026). Implementing and Validating a Data Mining System Based on Decision Trees. Journal of Engineering and Sustainable Development, 30(1), 99-115. https://doi.org/10.31272/jeasd.2998
