TY - GEN
T1 - Breast Cancer Risk Prediction Using Electronic Health Records
AU - Wu, Yirong
AU - Burnside, Elizabeth S.
AU - Cox, Jennifer
AU - Fan, Jun
AU - Yuan, Ming
AU - Yin, Jie
AU - Peissig, Peggy
AU - Cobian, Alexander
AU - Page, David
AU - Craven, Mark
N1 - Funding Information:
The authors acknowledge the support of NIH grants U54AI117924 and K24CA194251 and the NIH NCATS grant UL1TR000427. We also acknowledge support from the University of Wisconsin-Madison Office of the Vice Chancellor for Research and Graduate Education, with funding from the Wisconsin Alumni Research Foundation, and from the University of Wisconsin Carbone Comprehensive Cancer Center (P30CA014520).
PY - 2017/9/8
Y1 - 2017/9/8
N2 - Electronic health records (EHRs) represent an underused data source that has great research and clinical potential. Our goal was to quantify the value of EHRs in breast cancer risk prediction. We conducted a retrospective case-control study, gathering patients' ICD-9 diagnosis codes from an existing EHR data repository. Based on the hierarchical structure of ICD-9 codes, which are composed of 3-5 digits, three levels of data representation were studied: level 0, using only the first 3 digits; level 1, using up to the first 4 digits; and level 2, using up to the full 5 digits of each code. We created two models, logistic regression (LR) and LASSO logistic regression (LR+Lasso), to predict breast cancer one year in advance from diagnosis codes at each of the three levels of data representation. Area under the ROC curve (AUC) was used to assess model performance. The LR+Lasso model demonstrated significantly higher predictive performance than the LR model when using the level 2 feature representation (AUC 0.648 vs 0.603, p=0.013). For the level 1 and level 0 representations, the difference between the LR+Lasso and LR models was not significant (0.634 vs 0.604, p=0.081, and 0.612 vs 0.603, p=0.523, respectively). For the LR model, predictive performance changed only modestly across the three levels. For the LR+Lasso model, predictive performance also changed modestly from the level 0 to the level 1 representation (p=0.168) and from the level 1 to the level 2 representation (p=0.374). However, the level 2 representation provided significantly higher predictive performance than the level 0 representation (p=0.034). The unabridged level 2 representation of the diagnosis codes contains the most valuable information that may contribute to breast cancer risk prediction. The performance of these models demonstrates that EHR data can be used to predict breast cancer risk, which opens the possibility of personalizing care in clinical practice. In the future, we will combine coded EHR data with demographic risk factors, genetic variants, and imaging features to improve breast cancer risk prediction.
AB - Electronic health records (EHRs) represent an underused data source that has great research and clinical potential. Our goal was to quantify the value of EHRs in breast cancer risk prediction. We conducted a retrospective case-control study, gathering patients' ICD-9 diagnosis codes from an existing EHR data repository. Based on the hierarchical structure of ICD-9 codes, which are composed of 3-5 digits, three levels of data representation were studied: level 0, using only the first 3 digits; level 1, using up to the first 4 digits; and level 2, using up to the full 5 digits of each code. We created two models, logistic regression (LR) and LASSO logistic regression (LR+Lasso), to predict breast cancer one year in advance from diagnosis codes at each of the three levels of data representation. Area under the ROC curve (AUC) was used to assess model performance. The LR+Lasso model demonstrated significantly higher predictive performance than the LR model when using the level 2 feature representation (AUC 0.648 vs 0.603, p=0.013). For the level 1 and level 0 representations, the difference between the LR+Lasso and LR models was not significant (0.634 vs 0.604, p=0.081, and 0.612 vs 0.603, p=0.523, respectively). For the LR model, predictive performance changed only modestly across the three levels. For the LR+Lasso model, predictive performance also changed modestly from the level 0 to the level 1 representation (p=0.168) and from the level 1 to the level 2 representation (p=0.374). However, the level 2 representation provided significantly higher predictive performance than the level 0 representation (p=0.034). The unabridged level 2 representation of the diagnosis codes contains the most valuable information that may contribute to breast cancer risk prediction. The performance of these models demonstrates that EHR data can be used to predict breast cancer risk, which opens the possibility of personalizing care in clinical practice. In the future, we will combine coded EHR data with demographic risk factors, genetic variants, and imaging features to improve breast cancer risk prediction.
KW - Breast cancer
KW - Electronic health record (EHR)
KW - International Classification of Disease (ICD)
KW - LASSO
KW - Risk prediction models
UR - http://www.scopus.com/inward/record.url?scp=85032366077&partnerID=8YFLogxK
U2 - 10.1109/ICHI.2017.62
DO - 10.1109/ICHI.2017.62
M3 - Conference proceeding
AN - SCOPUS:85032366077
T3 - Proceedings - 2017 IEEE International Conference on Healthcare Informatics, ICHI 2017
SP - 224
EP - 228
BT - Proceedings - 2017 IEEE International Conference on Healthcare Informatics, ICHI 2017
A2 - Cummins, Mollie
A2 - Facelli, Julio
A2 - Meixner, Gerrit
A2 - Giraud-Carrier, Christophe
A2 - Nakajima, Hiroshi
PB - IEEE
T2 - 5th IEEE International Conference on Healthcare Informatics, ICHI 2017
Y2 - 23 August 2017 through 26 August 2017
ER -