TY - JOUR
T1 - An investigation of the imputation techniques for missing values in ordinal data enhancing clustering and classification analysis validity
AU - Alam, Shafiq
AU - Ayub, Muhammad Sohaib
AU - Arora, Sakshi
AU - Khan, Muhammad Asad
N1 - Publisher Copyright: © 2023 The Author(s)
PY - 2023/12
Y1 - 2023/12
AB - Missing data can significantly impact dataset integrity and suitability, leading to unreliable statistical results, distortions, and poor decisions. The presence of missing values in data introduces inaccuracies in clustering and classification and compromises the reliability and validity of such analyses. This study investigates multiple imputation techniques specifically designed for handling missing values in ordinal data commonly encountered in surveys and questionnaires. Quantitative approaches are used to evaluate different imputation methods on various datasets with varying missing value percentages. By comparing the performance of imputation techniques using clustering metrics and algorithms (e.g., k-means, Partitioning Around Medoids), the study provides valuable insights for selecting appropriate imputation methods for accurate data analysis. Furthermore, the study examines the impact of imputed values on classification algorithms, including k-Nearest Neighbors (kNN), Naive Bayes (NB), and Multilayer Perceptron (MLP). Results demonstrate that the decision tree method is the most effective approach, closely aligning with the original data and achieving high accuracy. In contrast, random number imputation performs poorly, indicating limited reliability. This study advances the understanding of handling missing values and emphasizes the need to address this issue to enhance data analysis integrity and validity.
KW - Classification
KW - Clustering
KW - Imputation
KW - Multilayer Perceptron
KW - Ordinal data
KW - Partitioning Around Medoids
UR - http://www.scopus.com/inward/record.url?scp=85174017072&partnerID=8YFLogxK
U2 - 10.1016/j.dajour.2023.100341
DO - 10.1016/j.dajour.2023.100341
M3 - Article
AN - SCOPUS:85174017072
SN - 2772-6622
VL - 9
JO - Decision Analytics Journal
JF - Decision Analytics Journal
M1 - 100341
ER -