TY - JOUR
T1 - Development of an explainable machine learning model for predicting poststroke anxiety
T2 - A multicenter study using Shapley Additive Explanations and nomogram visualization
AU - Lyu, Mengke
AU - Xie, Yanming
AU - Li, Min
AU - Hölscher, Christian
AU - Shen, Xiaoming
N1 - The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was financially supported by the National Natural Science Foundation of China (Grant Nos. 81303011 and 81973618).
Publisher Copyright:
© The Author(s) 2026. This article is distributed under the terms of the Creative Commons Attribution-NonCommercial 4.0 License (https://creativecommons.org/licenses/by-nc/4.0/) which permits non-commercial use, reproduction and distribution of the work without further permission provided the original work is attributed as specified on the SAGE and Open Access page (https://us.sagepub.com/en-us/nam/open-access-at-sage).
PY - 2026/1/8
Y1 - 2026/1/8
N2 - Objective: Neuropsychiatric complications following a stroke can impede recovery and reduce the quality of life. Current predictive methods for poststroke anxiety (PSA) are limited by inadequate feature selection and lack of interpretability. This study aimed to develop an interpretable machine learning model utilizing a wide range of clinical data to detect high-risk PSA patients early, enabling personalized interventions. Methods: This retrospective multicenter study included 238 stroke patients from 10 Chinese hospitals spanning from 1 January 2022 to 11 June 2025. Data encompassing demographic, clinical, biochemical, and psychosocial factors were gathered. Feature selection involved univariate analysis followed by least absolute shrinkage and selection operator (LASSO) regression. Seven machine learning models—logistic regression, extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), random forest, decision tree, K-nearest neighbors, and stacking—were constructed and assessed using cross-validation. Feature importance was determined using SHAP (Shapley Additive Explanations), and a nomogram was developed based on the final model. Results: Among the 238 patients, 109 were diagnosed with PSA. In the test set, the logistic regression model exhibited the best performance, achieving an area under the curve (AUC) of 0.981, accuracy of 0.917, sensitivity of 0.867, specificity of 0.952, and an F1 score of 0.897. SHAP analysis identified recurrent stroke, income level, payment type, occupational stress, overwork, sleep quality, continuous drinking history, history of hypertension, diabetes, hyperlipidemia, hyperhomocysteinemia, white blood cell (WBC) count, total cholesterol (TC), low-density lipoprotein (LDL), fibrinogen (FIB), activated partial thromboplastin time (APTT), National Institutes of Health Stroke Scale (NIHSS) score, and Barthel index as crucial predictors. A nomogram incorporating the top 10 SHAP-ranked features was devised to assist in clinical decision-making. Conclusion: The machine learning model demonstrated high accuracy and interpretability in predicting PSA risk. Through the integration of SHAP analysis and nomogram visualization, it offers a practical tool for clinicians to recognize high-risk PSA patients and customize management strategies to improve poststroke outcomes.
AB - Objective: Neuropsychiatric complications following a stroke can impede recovery and reduce the quality of life. Current predictive methods for poststroke anxiety (PSA) are limited by inadequate feature selection and lack of interpretability. This study aimed to develop an interpretable machine learning model utilizing a wide range of clinical data to detect high-risk PSA patients early, enabling personalized interventions. Methods: This retrospective multicenter study included 238 stroke patients from 10 Chinese hospitals spanning from 1 January 2022 to 11 June 2025. Data encompassing demographic, clinical, biochemical, and psychosocial factors were gathered. Feature selection involved univariate analysis followed by least absolute shrinkage and selection operator (LASSO) regression. Seven machine learning models—logistic regression, extreme gradient boosting (XGBoost), light gradient boosting machine (LightGBM), random forest, decision tree, K-nearest neighbors, and stacking—were constructed and assessed using cross-validation. Feature importance was determined using SHAP (Shapley Additive Explanations), and a nomogram was developed based on the final model. Results: Among the 238 patients, 109 were diagnosed with PSA. In the test set, the logistic regression model exhibited the best performance, achieving an area under the curve (AUC) of 0.981, accuracy of 0.917, sensitivity of 0.867, specificity of 0.952, and an F1 score of 0.897. SHAP analysis identified recurrent stroke, income level, payment type, occupational stress, overwork, sleep quality, continuous drinking history, history of hypertension, diabetes, hyperlipidemia, hyperhomocysteinemia, white blood cell (WBC) count, total cholesterol (TC), low-density lipoprotein (LDL), fibrinogen (FIB), activated partial thromboplastin time (APTT), National Institutes of Health Stroke Scale (NIHSS) score, and Barthel index as crucial predictors. A nomogram incorporating the top 10 SHAP-ranked features was devised to assist in clinical decision-making. Conclusion: The machine learning model demonstrated high accuracy and interpretability in predicting PSA risk. Through the integration of SHAP analysis and nomogram visualization, it offers a practical tool for clinicians to recognize high-risk PSA patients and customize management strategies to improve poststroke outcomes.
KW - explainable AI
KW - machine learning
KW - nomogram
KW - Poststroke anxiety
KW - risk prediction
KW - SHAP
KW - stroke rehabilitation
UR - https://www.scopus.com/pages/publications/105027008675
U2 - 10.1177/20552076251412575
DO - 10.1177/20552076251412575
M3 - Journal article
AN - SCOPUS:105027008675
SN - 2055-2076
VL - 12
JO - Digital Health
JF - Digital Health
ER -