Breast Cancer Risk Prediction Using Electronic Health Records

Yirong Wu, Elizabeth S. Burnside, Jennifer Cox, Jun FAN, Ming Yuan, Jie Yin, Peggy Peissig, Alexander Cobian, David Page, Mark Craven

Research output: Chapter in book/report/conference proceeding, Conference contribution, peer-review

6 Citations (Scopus)

Abstract

Electronic health records (EHRs) represent an underused data source that has great research and clinical potential. Our goal was to quantify the value of EHRs in breast cancer risk prediction. We conducted a retrospective case-control study, gathering patients' ICD-9 diagnosis codes from an existing EHR data repository. Based on the hierarchical structure of ICD-9 codes, which are composed of 3-5 digits, three levels of data representation were studied: level 0, using only the first 3 digits; level 1, using up to the first 4 digits; and level 2, using up to the full 5 digits of each code. We created two models, logistic regression (LR) and LASSO logistic regression (LR+Lasso), to predict breast cancer one year in advance from diagnosis codes at each of the three levels of data representation. Area under the ROC curve (AUC) was used to assess model performance. The LR+Lasso model demonstrated significantly higher predictive performance than the LR model when using the level 2 representation (0.648 vs 0.603, p=0.013). For the level 1 and level 0 representations, the difference in predictive performance between the LR+Lasso and LR models was not significant (0.634 vs 0.604, p=0.081, and 0.612 vs 0.603, p=0.523, respectively). For the LR model, predictive performance changed modestly across the three levels. For the LR+Lasso model, predictive performance also changed modestly from the level 0 to the level 1 representation (p=0.168) and from the level 1 to the level 2 representation (p=0.374). However, the level 2 representation provided significantly higher predictive performance than the level 0 representation (p=0.034). The unabridged level 2 representation of the diagnosis codes contains the most valuable information that may contribute to breast cancer risk prediction. The performance of these models demonstrates that EHR data can be used to predict breast cancer risk, which provides the possibility to personalize care in clinical practice. In the future, we will combine coded EHR data with demographic risk factors, genetic variants, and imaging features to improve breast cancer risk prediction.
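
The three levels of data representation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' code: the abstract does not say how codes were parsed or encoded, so the function names `truncate_icd9` and `patient_features`, the handling of the decimal point, and the example codes are all assumptions made for illustration. The sketch also assumes purely numeric ICD-9 codes and does not treat V or E codes specially.

```python
def truncate_icd9(code: str, level: int) -> str:
    """Truncate an ICD-9 diagnosis code to representation level 0, 1, or 2.

    Assumption: the code is a string such as "250.01"; the decimal point
    is dropped and the first 3 + level characters are kept.
    """
    if level not in (0, 1, 2):
        raise ValueError("level must be 0, 1, or 2")
    digits = code.strip().replace(".", "")
    # Level 0 keeps 3 digits, level 1 up to 4, level 2 up to the full 5.
    return digits[: 3 + level]


def patient_features(codes, level):
    """Return the set of distinct truncated codes recorded for one patient."""
    return {truncate_icd9(c, level) for c in codes}


# Hypothetical codes for one patient (illustrative, not study data).
codes = ["174.9", "250.01", "401.1", "250.02"]
for level in (0, 1, 2):
    print(level, sorted(patient_features(codes, level)))
```

At coarser levels, distinct codes collapse into the same feature (here "250.01" and "250.02" both become "250" at level 0), which is one way to see why the level 2 representation can carry more information than level 0. Each distinct truncated code could then serve as a binary indicator feature for a model such as LR or LR+Lasso.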

Original language: English
Title of host publication: Proceedings - 2017 IEEE International Conference on Healthcare Informatics, ICHI 2017
Editors: Mollie Cummins, Julio Facelli, Gerrit Meixner, Christophe Giraud-Carrier, Hiroshi Nakajima
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 224-228
Number of pages: 5
ISBN (Electronic): 9781509048816
DOIs
Publication status: Published - 8 Sep 2017
Event: 5th IEEE International Conference on Healthcare Informatics, ICHI 2017 - Park City, United States
Duration: 23 Aug 2017 to 26 Aug 2017

Publication series

Name: Proceedings - 2017 IEEE International Conference on Healthcare Informatics, ICHI 2017

Conference

Conference: 5th IEEE International Conference on Healthcare Informatics, ICHI 2017
Country/Territory: United States
City: Park City
Period: 23/08/17 to 26/08/17

Scopus Subject Areas

  • Health Informatics

User-Defined Keywords

  • Breast cancer
  • Electronic health record (EHR)
  • International Classification of Disease (ICD)
  • LASSO
  • Risk prediction models
