The Impact of Data Valuation on Feature Importance in Classification Models

  • Malick Ebiele
  • Malika Bendechache
  • Marie Ward
  • Una Geary
  • Declan Byrne
  • Donnacha Creagh
  • Rob Brennan

Research output: Chapter in Book or Conference Publication/Proceeding › Conference Publication › peer-review

2 Citations (Scopus)

Abstract

This paper investigates the impact of data valuation metrics (variability and coefficient of variation) on feature importance in classification models. Data valuation is an emerging topic in the fields of data science, accounting, data quality, and information economics concerned with methods to calculate the value of data. Feature importance, or feature ranking, is central both to explaining how black-box machine learning models make predictions and to selecting the most predictive features when training these models. Existing feature importance algorithms are either computationally expensive (e.g. SHAP values) or biased (e.g. Gini importance in tree-based models). No previous investigation of the impact of data valuation metrics on feature importance has been conducted. Five popular machine learning models (eXtreme Gradient Boosting (XGB), Random Forest (RF), Logistic Regression (LR), Multi-Layer Perceptron (MLP), and Naive Bayes (NB)) were used, along with six widely implemented feature ranking techniques (Information Gain, Gini importance, Frequency Importance, Cover Importance, Permutation Importance, and SHAP values), to investigate the relationship between feature importance and data valuation metrics for a clinical use case. XGB outperforms the other models with a weighted F1-score of 79.72%. The findings suggest that features with variability greater than 0.4 or a coefficient of variation greater than 23.4 have little to no value; therefore, these features can be filtered out during feature selection. This result, if generalisable, could simplify feature selection and data preparation.
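The filtering rule described in the abstract can be sketched in code. This is a minimal illustration, not the authors' implementation: the abstract does not define how their "variability" metric is computed, so the sketch covers only the coefficient-of-variation criterion (standard deviation divided by the mean). The assumption that the reported threshold of 23.4 is on a percentage scale, and the helper names `coefficient_of_variation` and `filter_features`, are hypothetical.

```python
import statistics

def coefficient_of_variation(values):
    """CoV = population standard deviation / |mean|, as a percentage.

    Assumption: the paper's threshold of 23.4 is on a percentage
    scale, as CoV is commonly reported.
    """
    mean = statistics.fmean(values)
    if mean == 0:
        return float("inf")  # undefined CoV; treat as maximally variable
    return 100.0 * statistics.pstdev(values) / abs(mean)

def filter_features(columns, cov_threshold=23.4):
    """Keep only features whose CoV is at or below the threshold.

    `columns` maps feature name -> list of numeric values
    (one column per candidate feature).
    """
    return {
        name: values
        for name, values in columns.items()
        if coefficient_of_variation(values) <= cov_threshold
    }

# Toy usage: a near-constant feature survives, a highly
# dispersed one is dropped before model training.
data = {
    "stable": [10, 10.5, 9.8, 10.2],  # CoV ≈ 2.6%
    "noisy": [1, 9, 2, 15],           # CoV ≈ 84%
}
kept = filter_features(data)
```

Under the paper's finding, such a filter would run once during data preparation, ahead of any model-specific importance computation.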

Original language: English
Title of host publication: Proceedings of 3rd International Conference on Computing and Communication Networks - ICCCN 2023
Editors: Giancarlo Fortino, Akshi Kumar, Abhishek Swaroop, Pancham Shukla
Publisher: Springer Science and Business Media Deutschland GmbH
Pages: 601-617
Number of pages: 17
ISBN (Print): 9789819708918
Publication status: Published - 2024
Event: 3rd International Conference on Computing and Communication Networks, ICCCN 2023 - Manchester, United Kingdom
Duration: 17 Nov 2023 - 18 Nov 2023

Publication series

Name: Lecture Notes in Networks and Systems
Volume: 917
ISSN (Print): 2367-3370
ISSN (Electronic): 2367-3389

Conference

Conference: 3rd International Conference on Computing and Communication Networks, ICCCN 2023
Country/Territory: United Kingdom
City: Manchester
Period: 17/11/23 - 18/11/23

Keywords

  • Data value
  • Explainable AI
  • Feature importance
  • Feature selection
  • Machine learning
