TY - GEN
T1 - The Impact of Data Valuation on Feature Importance in Classification Models
AU - Ebiele, Malick
AU - Bendechache, Malika
AU - Ward, Marie
AU - Geary, Una
AU - Byrne, Declan
AU - Creagh, Donnacha
AU - Brennan, Rob
N1 - Publisher Copyright:
© The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd. 2024.
PY - 2024
Y1 - 2024
N2 - This paper investigates the impact of data valuation metrics (variability and coefficient of variation) on feature importance in classification models. Data valuation is an emerging topic in data science, accounting, data quality, and information economics, concerned with methods to calculate the value of data. Feature importance, or ranking, is important both for explaining how black-box machine learning models make predictions and for selecting the most predictive features while training these models. Existing feature importance algorithms are either computationally expensive (e.g. SHAP values) or biased (e.g. Gini importance in tree-based models). No previous investigation of the impact of data valuation metrics on feature importance has been conducted. Five popular machine learning models (eXtreme Gradient Boosting (XGB), Random Forest (RF), Logistic Regression (LR), Multi-Layer Perceptron (MLP), and Naive Bayes (NB)) and six widely implemented feature ranking techniques (Information Gain, Gini importance, Frequency Importance, Cover Importance, Permutation Importance, and SHAP values) were used to investigate the relationship between feature importance and data valuation metrics for a clinical use case. XGB outperforms the other models with a weighted F1-score of 79.72%. The findings suggest that features with variability greater than 0.4 or a coefficient of variation greater than 23.4 have little to no value; therefore, these features can be filtered out during feature selection. This result, if generalisable, could simplify feature selection and data preparation.
AB - This paper investigates the impact of data valuation metrics (variability and coefficient of variation) on feature importance in classification models. Data valuation is an emerging topic in data science, accounting, data quality, and information economics, concerned with methods to calculate the value of data. Feature importance, or ranking, is important both for explaining how black-box machine learning models make predictions and for selecting the most predictive features while training these models. Existing feature importance algorithms are either computationally expensive (e.g. SHAP values) or biased (e.g. Gini importance in tree-based models). No previous investigation of the impact of data valuation metrics on feature importance has been conducted. Five popular machine learning models (eXtreme Gradient Boosting (XGB), Random Forest (RF), Logistic Regression (LR), Multi-Layer Perceptron (MLP), and Naive Bayes (NB)) and six widely implemented feature ranking techniques (Information Gain, Gini importance, Frequency Importance, Cover Importance, Permutation Importance, and SHAP values) were used to investigate the relationship between feature importance and data valuation metrics for a clinical use case. XGB outperforms the other models with a weighted F1-score of 79.72%. The findings suggest that features with variability greater than 0.4 or a coefficient of variation greater than 23.4 have little to no value; therefore, these features can be filtered out during feature selection. This result, if generalisable, could simplify feature selection and data preparation.
KW - Data value
KW - Explainable AI
KW - Feature importance
KW - Feature selection
KW - Machine learning
UR - https://www.scopus.com/pages/publications/85201011272
U2 - 10.1007/978-981-97-0892-5_47
DO - 10.1007/978-981-97-0892-5_47
M3 - Conference Publication
AN - SCOPUS:85201011272
SN - 9789819708918
T3 - Lecture Notes in Networks and Systems
SP - 601
EP - 617
BT - Proceedings of 3rd International Conference on Computing and Communication Networks - ICCCN 2023
A2 - Fortino, Giancarlo
A2 - Kumar, Akshi
A2 - Swaroop, Abhishek
A2 - Shukla, Pancham
PB - Springer Science and Business Media Deutschland GmbH
T2 - 3rd International Conference on Computing and Communication Networks, ICCCN 2023
Y2 - 17 November 2023 through 18 November 2023
ER -