Selecting robust features for machine-learning applications using multidata causal discovery

Ganesh S., Saranya; Beucler, Tom; Tam, Frederick Iat-Hin; Gomez, Milton S.; Runge, Jakob; Gerhardus, Andreas

doi:10.1017/eds.2023.21

Details

Download: selecting-robust-features-for-machine-learning-applications-using-multidata-causal-discovery.pdf (2559.66 KB)
Status: Public
Version: Final published version
Licence: CC BY 4.0

Serval ID

serval:BIB_D0DCB42C4015

Type

Article: article from a journal or magazine.

Collection

Publications

Institution

UNIL/CHUV

Title

Selecting robust features for machine-learning applications using multidata causal discovery

Journal

Environmental Data Science

Authors

Ganesh S., Saranya; Beucler, Tom; Tam, Frederick Iat-Hin; Gomez, Milton S.; Runge, Jakob; Gerhardus, Andreas

ISSN

2634-4602

Publication status

Published

Publication date

2023

Volume

Language

English

Abstract

Robust feature selection is vital for creating reliable and interpretable machine-learning (ML) models. When designing statistical prediction models in cases where domain knowledge is limited and underlying interactions are unknown, choosing the optimal set of features is often difficult. To mitigate this issue, we introduce a multidata (M) causal feature selection approach that simultaneously processes an ensemble of time series datasets and produces a single set of causal drivers. This approach uses the causal discovery algorithms PC1 or PCMCI that are implemented in the Tigramite Python package. These algorithms utilize conditional independence tests to infer parts of the causal graph. Our causal feature selection approach filters out causally spurious links before passing the remaining causal features as inputs to ML models (multiple linear regression and random forest) that predict the targets. We apply our framework to the statistical intensity prediction of Western Pacific tropical cyclones (TCs), for which it is often difficult to accurately choose drivers and their dimensionality reduction (time lags, vertical levels, and area-averaging). Using more stringent significance thresholds in the conditional independence tests helps eliminate spurious causal relationships, thus helping the ML model generalize better to unseen TC cases. M-PC1 with a reduced number of features outperforms M-PCMCI, noncausal ML, and other feature selection methods (lagged correlation and random), even slightly outperforming feature selection based on explainable artificial intelligence. The optimal causal drivers obtained from our causal feature selection help improve our understanding of underlying relationships and suggest new potential drivers of TC intensification.
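
The selection step described in the abstract can be illustrated with a small, self-contained sketch. It is not the Tigramite implementation and not the authors' code: it assumes a simplified PC1-style loop, a partial-correlation test with a Fisher-z p-value, and lagged samples pooled across datasets so that one test uses all time series at once. All function names (`lagged_samples`, `parcorr_pvalue`, `pc1_parents`, `simulate`), the toy system, and the values of `tau_max` and `alpha` are hypothetical choices made for illustration.

```python
import math
import numpy as np

def lagged_samples(datasets, target, tau_max):
    """Pool (lagged candidates, target) samples over all datasets.
    Lags are built inside each dataset, so they never cross a dataset boundary."""
    n_vars = datasets[0].shape[1]
    candidates = [(v, lag) for v in range(n_vars) for lag in range(1, tau_max + 1)]
    X_parts, y_parts = [], []
    for data in datasets:
        T = data.shape[0]
        cols = [data[tau_max - lag:T - lag, v] for v, lag in candidates]
        X_parts.append(np.column_stack(cols))
        y_parts.append(data[tau_max:, target])
    return candidates, np.vstack(X_parts), np.concatenate(y_parts)

def parcorr_pvalue(x, y, Z):
    """Partial correlation of x and y given the columns of Z,
    with a two-sided p-value from the Fisher z-transform (normal approximation)."""
    n = len(x)
    if Z.shape[1] > 0:
        A = np.column_stack([np.ones(n), Z])
        x = x - A @ np.linalg.lstsq(A, x, rcond=None)[0]  # residual of x on Z
        y = y - A @ np.linalg.lstsq(A, y, rcond=None)[0]  # residual of y on Z
    r = float(np.clip(np.corrcoef(x, y)[0, 1], -0.999999, 0.999999))
    z = math.atanh(r) * math.sqrt(n - Z.shape[1] - 3)
    return r, math.erfc(abs(z) / math.sqrt(2))

def pc1_parents(datasets, target, tau_max=2, alpha=0.001):
    """Simplified PC1-style selection of lagged parents of `target`.
    At step p, each candidate is tested conditional on the p strongest other candidates."""
    candidates, X, y = lagged_samples(datasets, target, tau_max)
    parents = list(range(len(candidates)))  # column indices still under consideration
    strength = {}
    p = 0
    while len(parents) - 1 >= p:
        kept = []
        for j in parents:
            cond = [k for k in parents if k != j][:p]
            r, pval = parcorr_pvalue(X[:, j], y, X[:, cond])
            if pval <= alpha:  # keep only links that stay significant
                kept.append(j)
                strength[j] = abs(r)
        parents = sorted(kept, key=lambda k: -strength[k])  # strongest first
        p += 1
    return [candidates[j] for j in parents]

def simulate(rng, T=300):
    """Toy system (assumed): D drives S at lag 1 and Y at lag 2; S never drives Y."""
    data = np.zeros((T, 3))
    noise = rng.standard_normal((T, 3))
    for t in range(2, T):
        data[t, 0] = 0.6 * data[t - 1, 0] + noise[t, 0]  # D: autocorrelated driver
        data[t, 1] = 0.8 * data[t - 1, 0] + noise[t, 1]  # S: correlated with Y, not causal
        data[t, 2] = 0.7 * data[t - 2, 0] + noise[t, 2]  # Y: target
    return data

rng = np.random.default_rng(0)
datasets = [simulate(rng) for _ in range(8)]  # e.g. one time series per storm
selected = pc1_parents(datasets, target=2)    # list of (variable index, lag) pairs
print(selected)
```

In this toy setting, S at lag 1 is strongly correlated with Y because both depend on the same past value of D, so a lagged-correlation filter would retain it. Conditioning on the strongest candidate is what lets the test discard it, and a smaller `alpha` makes that filter stricter, in line with the abstract's remark on significance thresholds. The selected (variable, lag) pairs would then serve as inputs to a downstream regression model.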

Keywords

causal feature selection, machine learning, multivariate time series analysis, tropical cyclones

URN

urn:nbn:ch:serval-BIB_D0DCB42C40155

OAI-PMH

oai:serval.unil.ch:BIB_D0DCB42C4015

DOI

10.1017/eds.2023.21

Publisher's website

https://doi.org/10.1017/eds.2023.21

Open Access

Yes