Development and validation of machine learning algorithms based on electrocardiograms for cardiovascular diagnoses at the population level | npj Digital Medicine

npj Digital Medicine volume 7, Article number: 133 (2024)

Artificial intelligence-enabled electrocardiogram (ECG) algorithms are gaining prominence for the early detection of cardiovascular (CV) conditions, including those not traditionally associated with conventional ECG measures or expert interpretation. This study develops and validates such models for simultaneous prediction of 15 different common CV diagnoses at the population level. We conducted a retrospective study that included 1,605,268 ECGs of 244,077 adult patients presenting to 84 emergency departments or hospitals, who underwent at least one 12-lead ECG from February 2007 to April 2020 in Alberta, Canada, and considered 15 CV diagnoses, as identified by International Classification of Diseases, 10th revision (ICD-10) codes: atrial fibrillation (AF), supraventricular tachycardia (SVT), ventricular tachycardia (VT), cardiac arrest (CA), atrioventricular block (AVB), unstable angina (UA), ST-elevation myocardial infarction (STEMI), non-STEMI (NSTEMI), pulmonary embolism (PE), hypertrophic cardiomyopathy (HCM), aortic stenosis (AS), mitral valve prolapse (MVP), mitral valve stenosis (MS), pulmonary hypertension (PHTN), and heart failure (HF). We employed ResNet-based deep learning (DL) using ECG tracings and extreme gradient boosting (XGB) using ECG measurements. When evaluated on the first ECGs per episode of 97,631 holdout patients, the DL models had an area under the receiver operating characteristic curve (AUROC) of <80% for 3 CV conditions (PE, SVT, UA), 80–90% for 8 CV conditions (CA, NSTEMI, VT, MVP, PHTN, AS, AF, HF) and >90% for 4 diagnoses (AVB, HCM, MS, STEMI). DL models outperformed XGB models with about 5% higher AUROC on average. Overall, ECG-based prediction models demonstrated good-to-excellent prediction performance in diagnosing common CV conditions.

The 12-lead electrocardiogram (ECG) is the most common, low-cost, and accessible diagnostic tool for cardiovascular (CV) disease. It is performed on nearly all acute care visits and commonly more than once. In the US alone, over 100 million ECGs are obtained annually1. This is useful as the ECG contains a large amount of information that provides insight into underlying cardiac physiology since morphological and temporal features are produced from the electrical activity of the heart. However, standard techniques used by physicians and by computer algorithms to interpret ECGs are constrained, as many are rule-based and only consider a fraction of the total information available on the ECG. Manual or computerized approaches and even conventional statistical methods cannot account for high-level interactions between ECG signals from multiple leads or imperceptible, yet informative, changes that may signal early disease. The emergence of deep learning (DL) analyses offers an exciting opportunity to identify clinically relevant but ‘hidden’ patterns in ECG signals and simultaneously assess complex interactive relationships from routinely captured clinical data for diagnosis of various CV abnormalities2,3,4,5.

Prior investigations exploring machine-learned models for disease prediction using ECGs have predominantly focused on cardiac conditions that can be readily interpreted by physician experts based on morphological changes in ECG patterns—e.g., arrhythmias (atrial fibrillation [AF], ventricular tachycardia [VT], supraventricular tachycardia [SVT])6, ST-elevation myocardial infarction (STEMI) or non-STEMI (NSTEMI)7,8,9, or heart block conditions such as atrioventricular blocks or branch blocks including left-bundle branch block and right-bundle branch block10. While the number of machine learning (ML)-based models using ECG data to predict CV conditions beyond those traditionally associated with ECG patterns is currently limited, it is steadily increasing. These models focus on conditions such as mitral valve prolapse (MVP)1, cardiac arrest (CA)11,12, heart failure (HF)13,14, pulmonary embolism (PE)15,16, aortic stenosis (AS)17,18,19, mitral valve stenosis (MS)20, pulmonary hypertension (PHTN)21,22 and hypertrophic cardiomyopathy (HCM)18,23,24. Furthermore, while existing studies have mainly concentrated on individual labels, there hasn’t been any prior research developing a predictive system for the simultaneous detection of these specific conditions. The lack of large medical datasets that are clinically annotated with an extensive set of diagnostic labels available for supervised ML is a well-recognized problem, and large-scale validations at the population scale are critical to show trustworthiness for the successful adoption of prediction models into clinical practice, where early identification and treatment may potentially impact disease-related complications, healthcare use, and cost.

Accordingly, we used a large population-level cohort of patients, from a single-payer universal health system, to develop and validate DL models (based on 12-lead ECG tracings) as well as extreme gradient boosting (XGB) models (based on routinely collected ECG measurements) to simultaneously predict 15 common CV diagnoses through a unified prediction framework.

Baseline characteristics of the cohort have been described previously25. In brief, the average age of patients was 65.8 ± 17.3 years, and 56.7% were males (Supplementary Table 1). The models underwent training using ECGs from 146,446 patients and were subsequently evaluated on a holdout cohort of 97,631 patients (Fig. 1). The holdout dataset included 53,436 men and 44,195 women, used for sex-based performance evaluations. Additionally, 96,164 patients without pacemakers were evaluated separately to investigate the impact of pacemakers on model performance. Anticipating the implementation of our prediction system at the point of care, we assessed our models exclusively using the first ECG of each holdout patient in a specific episode.

We divided the entire ECG dataset, allocating 60% for model development (including fivefold internal cross-validation for training and fine-tuning) and setting aside 40% as a holdout set for final validation. For evaluation, we assessed our models using two approaches: first, exclusively on the first ECGs from each episode captured during an ED visit or hospitalization, reflecting the intended point-of-care deployment; second, on all ECGs from the holdout set. Additionally, we evaluated our models’ performance within specific patient subgroups categorized by sex and the presence of cardiac pacing or ventricular assist devices.

Frequency and percentage of ECGs with any of the selected CV conditions in full, development and holdout splits as well as among the first ECG per episode in the holdout set are presented in Table 1. The first ECG per episode in the holdout set (used for the final evaluations) had some differences in diagnostic labels compared to the full ECG data (e.g., frequency for HF: 9.3% vs 15.5%; AF: 11.5% vs 18.2%).

Comparison of model performances for DL and XGB models with ECG traces (with versus without age and sex features) and measurements (with age and sex features) for 15 CV conditions is presented in Fig. 2, Table 2 and Supplementary Table 2. The holdout validation of our main model (DL: ECG trace, age, sex) showed that our model for STEMI had the best performance, with an area under the receiver operating characteristic curve (AUROC) of 95.5%, and our model for pulmonary thromboembolism (PTE) had the worst performance, with an AUROC of 68.9%. The models for all diagnoses, except for PTE, had AUROCs above 76%: AUROCs <80% for two diagnoses (SVT, UA, in increasing order); AUROCs in the 80–90% range for eight diagnoses (cardiac arrest [CA], NSTEMI, VT, mitral valve prolapse [MVP], pulmonary hypertension [PHTN], aortic stenosis [AS], AF, HF, in increasing order); and AUROCs >90% for four diagnoses (atrioventricular block [AVB], hypertrophic cardiomyopathy [HCM], mitral valve stenosis [MS], STEMI, in increasing order). The model for AF had the highest area under the precision-recall curve (AUPRC) score of 59.2% (F1 score: 51.6%), followed by HF with 56.1% (F1 score: 46.6%) and STEMI with AUPRC of 54.3% (F1 score: 39.2%).

The height of the bars represents the performance in external holdout validation, and the crosses represent the performance in each fold of the fivefold cross-validation. For each condition, the models are ranked based on their performance (statistically similar performances are assigned tied ranking), and the model with the highest performance is indicated with a star. AF atrial fibrillation, AS aortic stenosis, AUROC area under the receiver operating characteristic curve, AVB atrioventricular block, DL deep learning (ResNet), ECG electrocardiogram, HCM hypertrophic cardiomyopathy, HF heart failure, MS mitral stenosis, MVP mitral valve prolapse, NSTEMI non-ST-elevation myocardial infarction, STEMI ST-elevation myocardial infarction, SVT supraventricular tachycardia, PHTN pulmonary hypertension, PTE pulmonary thromboembolism, UA unstable angina, VT ventricular tachycardia, XGB XGBoost.

The DL model with (ECG trace, age, sex) performed better than the XGB model with (ECG measurements, age, sex) for most diagnoses, except for AVB, where both models performed comparably. DL models outperformed XGB models with an average improvement in AUROC of 5.2%, with notable increases of 11.8% for MS, 8.6% for MVP, 7.3% for NSTEMI, and 7.1% for STEMI. Comparison of 95% confidence intervals from the bootstrap results showed that there were significant differences between DL model performances with versus without (age, sex) features for all diagnoses except PTE, suggesting that age and sex features can add small but significant improvements to diagnostic prediction. Similarly, bootstrap results showed that DL models with ECG traces alone outperformed XGB models with ECG measurements, age, and sex for diagnoses other than AVB, HCM, and VT.

We evaluated the DL model, which was trained using (ECG trace, age, sex), separately for males and females in the holdout set, and found similar results overall (Fig. 3, top panel). The models performed marginally better for men in 10 out of 15 conditions, with an average AUROC increase of 1.0%. Five of these (VT, STEMI, PTE, HF, and AS) showed significant differences. We found the highest difference in VT, where the model performed 6.4% better for men compared to women in terms of AUROC (Men: 85.1%, Women: 78.7%), and 14.8% better in terms of AUPRC (Men: 37.5%, Women: 22.6%). In contrast, prediction performance for AF was significantly better by 1.2% AUROC in females than in males.

Evaluations are performed separately for the males and females of the holdout patients (a) as well as ECGs without pacemakers and all ECGs in the holdout set (b). The height of bars represents the performance in external holdout validation and the models with statistically higher performance are indicated with a star.

Similarly, evaluating the DL (ECG trace, age, sex) models on the holdout ECGs, after excluding the ECGs of patients with pacemakers and other ICD devices, showed performance that is comparable with overall evaluation, with a very small average AUROC increase of 0.25% with vs without those ECGs (Fig. 3, bottom panel). Again, VT showed the highest difference, where performance dropped by 3.2% AUROC and 5.6% AUPRC when pacemaker ECGs were excluded. Another diagnosis that showed significant difference in the same direction was AVB (1.6% AUROC drop, 5.7% decrease in AUPRC).

Our study utilized ECGs sourced from 14 hospitals. Notably, two of these were tertiary care hospitals, contributing the highest ECG counts (487,042 and 453,085 ECGs, respectively). For each tertiary hospital (H1 and H2), we performed a leave-one-hospital-out validation, training on ECGs from all other hospitals, excluding the validation one (Supplementary Figure 1). The performance of the DL: ECG, Age, Sex model with leave-one-hospital-out validation was comparable to the results reported on the overall holdout set (Supplementary Table 3). In comparison to the primary validation outcomes, the average AUROC performance over 15 conditions exhibited a slight increase of 1.38% in H1 validation but a decrease of 1.34% in H2 validation.

As an added validation of our DL model, we evaluated its performance in all ECGs, rather than just the first ECG, acquired during each episode of care for patients in the holdout set. The results, as outlined in Supplementary Table 4, exhibit performance that is either superior or comparable to that achieved using only the initial ECGs.

The all-ECG evaluation revealed some fluctuations in AUROC and AUPRC scores. There was an overall decrease of 2.04% in the average AUROC (averaged across 15 conditions) and a concurrent increase of 2.15% in AUPRC for the all-ECG assessment. Notably, F1-scores for all labels in the all-ECG evaluation exhibited improvements ranging from 1.09% to 16.81%, with an average increase of 6.28%. Similarly, positive predictive values (PPV or precision) for all labels showed an increase ranging from 0.56% to 15.32%, with an average improvement of 4.98%. Therefore, these algorithms can be anticipated to exhibit comparable, if not superior, performance when applied to ECGs conducted at any point during the course of an episode of care.

The prevalence of several of the diagnoses of interest in our sample was low (e.g., 0.09% for MS among first ECGs), which is likely to impact PPV. We, therefore, explored an alternative evaluation scheme based on a composite label approach that has been previously employed for screening purposes to enhance diagnostic yield20. We created a composite label such that it was positive if any of our 15 conditions of interest were positive, and negative if all of the conditions were negative.
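The composite label described above can be sketched in a few lines. This is a minimal illustration, not the study's code: the function name `composite_label` and the toy label rows (truncated to four conditions for brevity) are assumptions made here for demonstration.

```python
def composite_label(labels):
    """Return 1 if any per-condition binary label is positive, else 0."""
    return int(any(labels))

# Hypothetical label rows for three ECGs, truncated to 4 of the 15 conditions.
rows = [
    [0, 0, 0, 0],  # negative for every condition -> composite negative
    [0, 1, 0, 0],  # positive for one condition   -> composite positive
    [1, 0, 1, 0],  # positive for two conditions  -> composite positive
]
composite = [composite_label(r) for r in rows]
```

Because the composite is positive whenever at least one condition is present, its prevalence is at least as high as that of the most common individual label, which is why it may be a more favorable target when PPV is the concern.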

We re-evaluated our multi-label DL model’s ability to predict if an ECG is positive for the composite label. Results showed PPV of 31.64% with an F1 score of 47.39%. We also trained a new model supervised with the composite label using the same model architecture and assessed its performance on the same holdout set. Results showed PPV of 57.9% with an F1 score of 63.03%. These results suggest that a higher PPV could be achieved when screening for the composite outcome.

Figure 4 depicts the results of GradCAM, highlighting areas of ECG with higher contribution and relevance towards the model’s prediction of different CV conditions (see Supplementary Figure 2 for a full list of all 15 diagnoses). Notably, the regions that contributed the most to the diagnosis were: PR intervals and QRS complexes in STEMI, T waves in NSTEMI, QRS complexes in PHTN, VT beats in patients with non-sustained VT, QRS complexes in AS, p waves in AVB, and ST segment region in HF. Figure 5 shows feature importance analyses of XGB model based on ECG measurements, depicting substantial information gain with P-duration for prediction of AF, heart rate for SVT, RR interval for UA, QRS duration for AVB, frontal T axis for HF, horizontal T axis for NSTEMI, Bazett’s rate-corrected QT interval for CA, etc.

Representative ECG traces were chosen for a selected group of diagnoses. GradCAM results do not extend to the entire population, but are indicative of the DL model’s prediction for a single representative case. The darker areas in each trace on GradCAM denote the areas with the most contribution to the DL model’s diagnostic prediction. PR intervals and QRS complexes in STEMI, T waves in NSTEMI, QRS complexes in PHTN, VT beats in patients with non-sustained VT, QRS complexes in AS, P waves in AVB, and ST segment region in HF contributed the most to the diagnosis of each condition. AS aortic stenosis, AVB atrioventricular block, DL deep learning, ECG electrocardiogram, HF heart failure, NSTEMI non-ST-elevation myocardial infarction, STEMI ST-elevation myocardial infarction, PHTN pulmonary hypertension, VT ventricular tachycardia.

Information gain-based feature importance for various cardiovascular conditions with XGBoost models based on ECG measurements showed substantial information gain with P-duration for prediction of AF, heart rate for SVT, RR interval for UA etc. ECG electrocardiogram. Abbreviations for ECG measurements and diseases are provided in Supplementary Tables 8 and 9.

In this large, population-level study with linked administrative health records including millions of ECGs, we developed and validated ML-based prediction models for diagnosing common CV conditions including those previously not explored in ECG-based prediction studies and found both DL and XGB models demonstrated good-to-excellent prediction performance and that DL models performed better than XGB models for most of the studied CV conditions.

Previous studies using AI-enabled ECG diagnosis have shown that ML and DL models can accurately recognize rhythm and morphological abnormalities in ECGs; however, they have not provided insights into performance for detecting cardiac conditions that are not routinely diagnosed via ECG26. Our study demonstrates how standard ML techniques can learn models that use the simple and easy-to-obtain 12-lead ECG to accurately predict not only CV conditions but also disorders not conventionally diagnosed using ECGs. These models are potential tools for early screening of CV conditions, particularly those that place considerable burden on the healthcare system, and may help more proximal identification of clinically important CV conditions27,28,29. More importantly, the use of these automated classification systems could enhance access to care in remote areas that have limited access to qualified medical and cardiology specialists. Further investigation is required to determine whether automatic ECG-based screening interfaces can be deployed for early management and prevention of disease progression and to provide cost-effective care.

DL models are complex algorithms with millions of parameters, and likely to overfit when trained on small datasets30,31,32. Even when large medical datasets are available, they are usually unlabeled or unannotated, which poses further challenges for supervised ML approaches33. Our large Alberta ECG dataset using the gold-standard 12 leads and its linkage to population-level data25,34 represents a naturalistic population laboratory with a wide array of demographic and clinical covariates, and hence, provides the ideal setting for developing ECG-based ML algorithms for the prediction of common CV conditions. Additionally, the current study focused on the very first ECG captured in a healthcare episode, to emulate real-life scenarios of a patient’s initial medical contact at the point of care.

Our DL model showed C-index or AUROC levels >80% for 12 out of 15 conditions, and >90% for 4 conditions (i.e., STEMI, MS, HCM, and AVB). The DL model of ECG tracing provided better prediction than the XGB model of ECG routine measurements for the prediction of all included conditions (except for atrioventricular block where both models had comparable excellent performance), with up to 11.8% improvement in performance for mitral valvulopathy and up to 7.3% improvement in performance in detecting myocardial infarction. Importantly, we further evaluated model robustness with respect to any potential biases towards sex groups and ECGs acquired in the presence of cardiac pacing or left ventricular assist device (LVAD), which can complicate ECG interpretation. We found that our DL model remains robust and appears to work equally well when evaluated on patients of either sex or pacing/LVAD. In fact, our models showed even better performance when ECGs with pacemakers were included in the testing data.

Our study has some limitations that require further discussion. First, all ECGs were generated by machines from the same manufacturer (Philips IntelliSpace ECG system), which might limit the generalizability and extrapolation of findings to ECGs from other systems. Second, ECG measurements used in the XGB models were provided through Philips machines, and were not core laboratory-read or human expert-curated. Third, our labels were derived from ICD codes recorded in the ED and hospitalization record. Owing to the nature of data collection in administrative medical records, the precise timing of a condition’s presentation during a healthcare episode cannot be definitively ascertained. Consequently, in rare instances where an acute condition emerges after the collection of the initial ECG, our prediction task can be interpreted as early detection rather than diagnostic prediction of an existing condition. This distinction is noteworthy, as early detection remains valuable in clinical management, offering insights into potential complications in the near future that can be equally beneficial for inpatient care. Fourth, internal testing, even on a substantial scale, may be considered secondary to external validations. This is primarily because biases inherent to a single health system can be perpetuated due to similarities in patient population, equipment, label generation procedures, and other factors. Unfortunately, we were unable to offer external validation for our multi-label models as there is no appropriate external ECG dataset linked to the selected 15 ICD-10-based diagnostic labels. However, the performance of leave-one-hospital-out validation of our DL: ECG, age, sex models demonstrates the robustness of our models across hospitals. Fifth, our study is based on a real-world cohort of patients presenting to emergency departments and hospitals with varying prevalence rates of the diagnoses of interest. The variation in the positive rate of the different labels could explain why predictions of some diagnoses were more accurate than others. We did not augment or manipulate the data as our goal is to eventually deploy these models within electronic medical record systems. Moreover, our training dataset of nearly a million ECGs had a sufficient number of positive cases to develop effective predictive models. However, some labels in our models may exhibit PPV that might be lower than optimal, necessitating careful consideration regarding their eligibility for clinical deployment. Sixth, we evaluated our models primarily using AUROC or C-index, a common metric in biomedical literature. However, it has limitations and does not consider misclassification costs of false positives or false negatives. For model deployment, custom evaluations that aim to minimize expected cost, and that consider misclassification costs and other resource allocation factors, should be prioritized35. Furthermore, despite the black-box nature of some of the ML approaches, we used techniques such as GradCAM analysis of DL models (respectively, SHAP analysis of XGB models) to find ECG patterns (or ECG measurements) that contribute to the diagnosis of common CV conditions.

In conclusion, we demonstrate, using comprehensive linked administrative databases at the population level, that ECG-based DL and XGB prediction models demonstrate good-to-excellent prediction performance in diagnosing common CV conditions. The DL models of ECG tracing provided better prediction accuracy among the studied conditions than the XGB models based on routine ECG measurements. Models performed comparably between different sex groups and in patients with and without pacing or LVAD. Future research is needed to determine how these models can be implemented in clinical practice for early diagnosis and risk stratification.

This study was performed in Alberta, Canada, where there is a single-payer healthcare system with universal access and 100% capture of all interactions with the healthcare system.

ECG data was linked with the following administrative health databases using a unique patient health number: (1) Discharge Abstract Database (DAD) containing data on inpatient hospitalizations; (2) National Ambulatory Care Reporting System (NACRS) database of all hospital-based outpatient clinic, and emergency department (ED) visits; and (3) Alberta Health Care Insurance Plan Registry (AHCIP), which provides demographic information.

We used standard 12-lead ECG traces (voltage-time series, sampled at 500 Hz for the duration of 10 seconds for each of 12 leads) and ECG measurements (automatically generated by Philips IntelliSpace ECG system’s built-in algorithm). The ECG measurements included atrial rate, heart rate, RR interval, P wave duration, frontal P axis, horizontal P axis, PR interval, QRS duration, frontal QRS axis in the initial 40 ms, frontal QRS axis in the terminal 40 ms, frontal QRS axis, horizontal QRS axis in the initial 40 ms, horizontal QRS axis in the terminal 40 ms, horizontal QRS axis, frontal ST wave axis (equivalent to ST deviation), frontal T axis, horizontal ST wave axis, horizontal T axis, Q wave onset, Fridericia rate-corrected QT interval, QT interval, and Bazett’s rate-corrected QT interval.

The study cohort has been described previously25. In brief, it comprises patients who were hospitalized at 14 sites between February 2007 and April 2020 in Alberta, Canada, and includes 2,015,808 ECGs from 3,336,091 ED visits and 1,071,576 hospitalizations of 260,065 patients. Concurrent healthcare encounters (ED visits and/or hospitalizations) that occurred for a patient within a 48-hour period of each other were considered to be transfers and part of the same healthcare episode. An ECG record was linked to a healthcare episode if the acquisition date was within the timeframe between the admission date and discharge date of an episode. After excluding the ECGs that could not be linked to any episode, ECGs of patients <18 years of age, as well as ECGs with poor signal quality (identified via warning flags generated by the ECG machine manufacturer’s built-in quality algorithm), our analysis cohort contained 1,605,268 ECGs from 748,773 episodes in 244,077 patients (Fig. 1).
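The episode construction and ECG linkage rules above could be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and it assumes that "within a 48-hour period of each other" means the next admission starts no more than 48 hours after the running episode's discharge, which the text does not spell out.

```python
from datetime import datetime, timedelta

def group_into_episodes(encounters, gap=timedelta(hours=48)):
    """Merge one patient's encounters into healthcare episodes.

    encounters: list of (admit, discharge) datetime pairs.
    Assumption: an encounter starting within `gap` of the current
    episode's end is a transfer and is merged into that episode.
    """
    episodes = []
    for admit, discharge in sorted(encounters):
        if episodes and admit - episodes[-1][1] <= gap:
            start, end = episodes[-1]
            episodes[-1] = (start, max(end, discharge))
        else:
            episodes.append((admit, discharge))
    return episodes

def link_ecg(acquired, episodes):
    """Return the index of the episode whose window contains the ECG, else None."""
    for i, (start, end) in enumerate(episodes):
        if start <= acquired <= end:
            return i
    return None

# Toy encounters for one hypothetical patient: an ED visit followed by a
# hospitalization 14 hours later, then an unrelated stay two months on.
encounters = [
    (datetime(2020, 1, 1, 8), datetime(2020, 1, 1, 20)),
    (datetime(2020, 1, 2, 10), datetime(2020, 1, 5, 9)),
    (datetime(2020, 3, 1, 0), datetime(2020, 3, 2, 0)),
]
episodes = group_into_episodes(encounters)
```

In this toy case the first two encounters collapse into one episode, and an ECG acquired outside every episode window returns `None`, mirroring the exclusion of ECGs that could not be linked to any episode.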

We developed and evaluated ECG-based models to predict the probability of a patient being diagnosed with any of 15 specific common CV conditions: AF, SVT, VT, CA, AVB, UA, NSTEMI, STEMI, PTE, HCM, AS, MVP, MS, PHTN, and HF. The conditions were identified based on the record of corresponding International Classification of Diseases, 10th revision (ICD-10) codes in the primary or in any one of 24 secondary diagnosis fields of a healthcare episode linked to a particular ECG (Supplementary Table 5). The validity of ICD coding in administrative health databases has been established previously36,37. If an ECG was performed during an ED or inpatient episode, it was considered positive for all diagnoses of interest that were recorded in the episode. Some diagnoses, such as AF, SVT, VT, STEMI, and AVB, which are typically identified through ECGs, were included in the study as positive controls to showcase the effectiveness of our models in detecting ECG-diagnosable conditions.

The goal of the prediction model was to output calibrated probabilities for each of the 15 selected conditions. These learned models could use ECGs that were acquired at any time point during a healthcare episode. Note that a single patient visit may involve multiple ECGs. When training the model, we used all ECGs (multiple ECGs belonging to the same episode were included) in the training/development set to maximize learning. However, to evaluate our models, we used only the earliest ECG in a given episode in the test/holdout set, with the goal of producing a prediction system that could be employed at the point of care, when the patient’s first ECG is acquired during an ED visit or hospitalization (see section ‘Evaluation’ below for more details).
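Selecting the earliest ECG per episode for evaluation might look like the sketch below. The record layout (dictionaries with `ecg_id`, `episode_id`, and an ISO-formatted `acquired` timestamp) is a hypothetical one chosen for illustration and is not taken from the study's data model.

```python
def first_ecg_per_episode(ecgs):
    """Keep only the earliest-acquired ECG within each episode.

    ecgs: iterable of dicts with 'episode_id' and a sortable 'acquired' value.
    """
    first = {}
    for ecg in ecgs:
        key = ecg["episode_id"]
        if key not in first or ecg["acquired"] < first[key]["acquired"]:
            first[key] = ecg
    return list(first.values())

# Hypothetical holdout records; ISO timestamps compare correctly as strings.
ecgs = [
    {"ecg_id": "e1", "episode_id": "A", "acquired": "2019-05-01T10:15"},
    {"ecg_id": "e2", "episode_id": "A", "acquired": "2019-05-01T08:40"},
    {"ecg_id": "e3", "episode_id": "B", "acquired": "2019-06-12T23:05"},
]
first_ecgs = first_ecg_per_episode(ecgs)
```

Training would use all three records, whereas the primary evaluation would keep only `e2` and `e3`.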

We used ResNet-based DL for the information-rich voltage-time series and gradient boosting-based XGB for the ECG measurements25. To determine whether demographic features (age and sex) add incremental predictive value to the performance of models trained on ECGs only, we developed and reported the models in the following manner: (a) ECG only (DL: ECG trace); (b) ECG + age, sex (DL: ECG trace, age, sex [which is the primary model presented in this study]); and (c) XGB: ECG measurement, age, sex.

We employed a multi-label classification methodology with binary labels (i.e., presence [yes] or absence [no] for each one of the 15 diagnoses) to estimate the probability of a new patient having each of these conditions. Since the input for the models that used ECG measurements was structured tabular data, we trained gradient-boosted tree ensembles (XGB)38 models, whereas we used deep convolutional neural networks for the models with ECG voltage-time series traces. For both XGB and DL models, we used 90% of training data to train the model, and used the remaining 10% as a tuning set to track the performance loss and to “early stop” the training process, to reduce the chance of overfitting39. For DL, we learned a single ResNet model for a multi-class multi-label task10, which mapped each ECG signal into 15 values, corresponding to the probabilities of presence of each of the 15 diagnoses. On the other hand, for gradient boosting, we learned 15 distinct binary XGB models, each mapping the ECG signal to the probability for one of the individual labels. The methodological details of our XGB and DL model implementations have been described previously25.

Evaluation design: We divided the overall ECG dataset into random splits of 60% for the model development (which used fivefold internal cross-validation for training and fine-tuning the final models) and the remaining 40% as the holdout set for final external validation. We ensured that ECGs from the same patient were not shared between development and evaluation data or between the train/test folds of internal cross-validation. As mentioned earlier, since we expect the deployment scenario of our prediction system to be at the point of care, we evaluated our models using only the patient’s first ECG in a given episode, which was captured during an ED visit or hospitalization. The number of ECGs, episodes, and patients used in overall data and in experimental splits are presented in Fig. 1 and Supplementary Table 5. In addition to the primary evaluation, we extended our testing to include all ECGs from the holdout set, to demonstrate the versatility of the DL model in handling ECGs captured at any point during an episode.
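A patient-level split of this kind can be sketched as below. The function name, the fixed seed, and the toy patient identifiers are assumptions for illustration; the study's actual splitting code is not shown in the text. Splitting over unique patients rather than over ECGs is what keeps all of a patient's ECGs on one side of the split, and it is consistent with the reported 146,446 development and 97,631 holdout patients (60% and 40% of 244,077).

```python
import random

def split_by_patient(patient_ids, holdout_fraction=0.4, seed=0):
    """Split ECG indices so that no patient appears in both sets.

    patient_ids: list giving the patient identifier of each ECG.
    Returns (development_indices, holdout_indices).
    """
    unique = sorted(set(patient_ids))
    rng = random.Random(seed)
    rng.shuffle(unique)
    n_holdout = round(len(unique) * holdout_fraction)
    holdout_patients = set(unique[:n_holdout])
    dev_idx = [i for i, p in enumerate(patient_ids) if p not in holdout_patients]
    holdout_idx = [i for i, p in enumerate(patient_ids) if p in holdout_patients]
    return dev_idx, holdout_idx

# Eight hypothetical ECGs belonging to five patients.
patient_ids = ["p1", "p1", "p2", "p3", "p3", "p3", "p4", "p5"]
dev_idx, holdout_idx = split_by_patient(patient_ids)
```

The same grouping idea would apply within the development set when forming the fivefold cross-validation folds.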

Furthermore, we performed ‘Leave-one-hospital-out validation’ using two large tertiary care hospitals to assess the robustness of our model with respect to distributional differences between the hospital sites. To guarantee complete separation between our training and testing sets, we omitted ECGs of patients admitted to both the training and testing hospitals during the study period, as illustrated in Supplementary Figure 1. Finally, to underscore the applicability of the DL model in screening scenarios, we present additional evaluations by consolidating the 15 disease labels into a composite prediction, thereby enhancing diagnostic yield20.
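The leave-one-hospital-out separation can be illustrated with a short sketch. The record layout of (patient, hospital) pairs and the identifiers below are hypothetical; only the rule of dropping patients seen at both the training and testing hospitals comes from the text.

```python
def leave_one_hospital_out(records, test_hospital):
    """Split ECG indices for leave-one-hospital-out validation.

    records: list of (patient_id, hospital_id), one entry per ECG.
    ECGs of patients seen at both the test hospital and any other
    hospital are dropped from both sets to keep them fully separate.
    """
    test_patients = {p for p, h in records if h == test_hospital}
    other_patients = {p for p, h in records if h != test_hospital}
    shared = test_patients & other_patients
    train_idx = [i for i, (p, h) in enumerate(records)
                 if h != test_hospital and p not in shared]
    test_idx = [i for i, (p, h) in enumerate(records)
                if h == test_hospital and p not in shared]
    return train_idx, test_idx

# Hypothetical ECG records; patient p1 was seen at both H1 and H2.
records = [
    ("p1", "H1"), ("p1", "H2"), ("p2", "H1"),
    ("p3", "H2"), ("p3", "H3"), ("p4", "H3"),
]
train_idx, test_idx = leave_one_hospital_out(records, "H1")
```

Here both ECGs of patient `p1` are omitted, so the model trained on hospitals H2 and H3 never sees a patient who also appears in the H1 test set.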

We reported area under the receiver operating characteristic curve (AUROC, equivalent to C-index) and area under the precision-recall curve (AUPRC). Also, we generated F1 Score, Specificity, Recall, Precision (equivalent to PPV) and Accuracy after binarizing the prediction probabilities into diagnosis/non-diagnosis classes using optimal cut-points derived from the training set Youden’s index40. We also used the calibration metric Brier Score41 (where a smaller score indicates better calibration) to evaluate whether predicted probabilities agree with observed proportions.
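As a rough illustration of two of these metrics, the sketch below picks a cut-point by Youden's index (sensitivity + specificity - 1) and computes a Brier score. The function names and toy data are assumptions; in the study the cut-point was derived from the training set and then applied to holdout predictions, which this toy example does not separate.

```python
def youden_cutpoint(y_true, y_prob):
    """Return (threshold, J) maximizing J = sensitivity + specificity - 1.

    A prediction counts as positive when its probability is >= threshold.
    """
    pos = sum(y_true)
    neg = len(y_true) - pos
    best_j, best_t = -1.0, None
    for t in sorted(set(y_prob)):
        tp = sum(1 for y, p in zip(y_true, y_prob) if y == 1 and p >= t)
        tn = sum(1 for y, p in zip(y_true, y_prob) if y == 0 and p < t)
        j = tp / pos + tn / neg - 1.0
        if j > best_j:
            best_j, best_t = j, t
    return best_t, best_j

def brier_score(y_true, y_prob):
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)

# Toy labels and predicted probabilities for a single condition.
y_true = [0, 0, 1, 0, 1, 1, 0, 1]
y_prob = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.5, 0.9]
cut, j = youden_cutpoint(y_true, y_prob)
brier = brier_score(y_true, y_prob)
```

Once a cut-point is fixed, F1 score, specificity, recall, precision, and accuracy all follow from the resulting confusion matrix.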

Sex and Pacemaker Subgroups: We investigated our models’ performance in specific patient subgroups, based on the patient’s sex. We also investigated any potential bias with ECGs captured in the presence of cardiac pacing (including pacemaker or implantable cardioverter-defibrillators [ICD]) or ventricular assist devices (VAD) since ECG interpretation can be difficult in these situations, by comparing the model performances in ECGs without pacemakers in the holdout set versus the overall holdout set (including ECGs both with or without pacemakers) (Fig. 1). The diagnosis and procedure codes used for identifying the presence of pacemakers are provided in the Supplementary Table 7.

Model comparisons: For each evaluation, we report the performance from the fivefold internal cross-validation as well as the final performance in the holdout set, using the same training and testing splits across the modeling scenarios. Performance was compared between models by sampling holdout instances with replacement in a pairwise manner to generate 10,000 bootstrap replicates of pairwise differences in AUROC (for example, each replicate comparing the evaluation without pacemakers against the original). A difference in model performance was considered statistically significant if the 95% confidence interval of the mean pairwise difference in AUROC did not include zero.
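
One way to read the paired bootstrap comparison is sketched below for two sets of scores on the same holdout instances. The percentile method for the 95% interval, the function names, and the toy scores are assumptions for illustration; the text does not state how the interval was constructed. The demo uses 2,000 replicates to keep it fast, whereas the study reports 10,000.

```python
import random

def auroc(y_true, scores):
    """AUROC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def paired_bootstrap_auroc_diff(y_true, scores_a, scores_b, n_boot=10_000, seed=0):
    """Resample instances with replacement; both models see the same resample."""
    rng = random.Random(seed)
    n = len(y_true)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [y_true[i] for i in idx]
        if 0 < sum(y) < n:  # AUROC needs both classes in the resample
            a = [scores_a[i] for i in idx]
            b = [scores_b[i] for i in idx]
            diffs.append(auroc(y, a) - auroc(y, b))
    diffs.sort()
    # Percentile interval (assumed): 2.5th and 97.5th percentiles of the differences.
    low = diffs[n_boot * 25 // 1000]
    high = diffs[n_boot * 975 // 1000 - 1]
    return sum(diffs) / n_boot, (low, high)

# Toy holdout set: model A separates the classes, model B is uninformative.
y_true = [0] * 20 + [1] * 20
scores_a = [0.3 + 0.4 * y + 0.005 * i for i, y in enumerate(y_true)]
scores_b = [(7 * i % 10) / 10 for i in range(40)]
mean_diff, (ci_low, ci_high) = paired_bootstrap_auroc_diff(
    y_true, scores_a, scores_b, n_boot=2000
)
significant = not (ci_low <= 0.0 <= ci_high)
```

Because each replicate scores both models on the same resampled instances, the variability shared between the two models cancels in the difference.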

Visualizations: We used feature importance values based on information gain to identify the ECG measurements that were key contributors to the diagnosis prediction in the XGB models. For the DL models, we visualized the gradient activation maps that contributed to the model’s prediction of each diagnosis using Gradient-weighted Class Activation Mapping (GradCAM)42 on the last convolutional layer.
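
The Grad-CAM arithmetic for a one-dimensional convolutional layer can be illustrated as below. This is a sketch only: it assumes the activations of the last convolutional layer and the gradients of the class score with respect to them have already been extracted (in practice through a deep learning framework's automatic differentiation, omitted here), and the function name and toy numbers are hypothetical.

```python
def grad_cam_1d(activations, gradients):
    """Grad-CAM over time for one class.

    activations[k][t]: output of channel k at time step t (last conv layer).
    gradients[k][t]:   d(class score) / d(activations[k][t]).
    """
    n_steps = len(activations[0])
    # Channel weights: gradients averaged over the time axis.
    alphas = [sum(g) / len(g) for g in gradients]
    # Weighted sum of channels, then ReLU to keep only positive evidence.
    return [
        max(0.0, sum(a_k * act[t] for a_k, act in zip(alphas, activations)))
        for t in range(n_steps)
    ]

# Toy layer with 2 channels and 3 time steps (hypothetical values).
activations = [[1.0, 2.0, 0.0], [0.0, 1.0, 3.0]]
gradients = [[0.5, 0.5, 0.5], [-1.0, -1.0, -1.0]]
cam = grad_cam_1d(activations, gradients)
```

The resulting map has one value per time step of the convolutional output and would typically be upsampled to the ECG's sampling rate before being overlaid on the tracing.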

The data underlying this article were provided by Alberta Health Services under the terms of a research agreement. Inquiries regarding access to the data can be made directly to them. We have included an artificially generated ECG dataset for the purpose of code demonstration only; it is not expected to accurately represent real ECG signals or the label distributions. The demo dataset is openly available and can be downloaded at https://figshare.com/s/b593e8d7bfe7cd8500b1.

The code base for training the deep learning models used in this study is available at: https://figshare.com/s/b593e8d7bfe7cd8500b1.

Tison, G. H., Zhang, J., Delling, F. N. & Deo, R. C. Automated and interpretable patient ECG profiles for disease detection, tracking, and discovery. Circ. Cardiovasc. Qual. Outcomes 12, e005289 (2019).

Attia, Z. I. et al. Age and sex estimation using artificial intelligence from standard 12-lead ECGs. Circ. Arrhythm. Electrophysiol. 12, e007284 (2019).

Attia, Z. I. et al. Screening for cardiac contractile dysfunction using an artificial intelligence-enabled electrocardiogram. Nat. Med. 25, 70–74 (2019).

Attia, Z. I. et al. An artificial intelligence-enabled ECG algorithm for the identification of patients with atrial fibrillation during sinus rhythm: a retrospective analysis of outcome prediction. Lancet 394, 861–867 (2019).

Kwon, J.-M. et al. Comparing the performance of artificial intelligence and conventional diagnosis criteria for detecting left ventricular hypertrophy using electrocardiography. Europace 22, 412–419 (2020).

Sraitih, M., Jabrane, Y. & Hajjam El Hassani, A. An automated system for ECG arrhythmia detection using machine learning techniques. J. Clin. Med. Res. 10, 5450 (2021).

Gustafsson, S. et al. Development and validation of deep learning ECG-based prediction of myocardial infarction in emergency department patients. Sci. Rep. 12, 19615 (2022).

Wu, L. et al. Deep learning networks accurately detect ST-segment elevation myocardial infarction and culprit vessel. Front. Cardiovasc. Med. 9, 797207 (2022).

Al-Zaiti, S. S. et al. Machine learning for ECG diagnosis and risk stratification of occlusion myocardial infarction. Nat. Med. 29, 1804–1813 (2023).

Ribeiro, A. H. et al. Automatic diagnosis of the 12-lead ECG using a deep neural network. Nat. Commun. 11, 1760 (2020).

Isasi, I. et al. A robust machine learning architecture for a reliable ECG rhythm analysis during CPR. Conf. Proc. IEEE Eng. Med. Biol. Soc. 2019, 1903–1907 (2019).

Elola, A. et al. Deep neural networks for ECG-based pulse detection during out-of-hospital cardiac arrest. Entropy 21, 305 (2019).

Choi, J. et al. Deep learning of ECG waveforms for diagnosis of heart failure with a reduced left ventricular ejection fraction. Sci. Rep. 12, 14235 (2022).

Raghu, A. et al. ECG-guided non-invasive estimation of pulmonary congestion in patients with heart failure. Sci. Rep. 13, 3923 (2023).

Somani, S. S. et al. Development of a machine learning model using electrocardiogram signals to improve acute pulmonary embolism screening. Eur. Heart J. Digit Health 3, 56–66 (2022).

Valente Silva, B., Marques, J., Nobre Menezes, M., Oliveira, A. L. & Pinto, F. J. Artificial intelligence-based diagnosis of acute pulmonary embolism: Development of a machine learning model using 12-lead electrocardiogram. Rev. Port. Cardiol. 42, 643–651 (2023).

Hata, E. et al. Classification of aortic stenosis using ECG by deep learning and its analysis using grad-CAM. Conf. Proc. IEEE Eng. Med. Biol. Soc. 2020, 1548–1551 (2020).

Goto, S. et al. Multinational federated learning approach to train ECG and echocardiogram models for hypertrophic cardiomyopathy detection. Circulation 146, 755–769 (2022).

Cohen-Shelly, M. et al. Electrocardiogram screening for aortic valve stenosis using artificial intelligence. Eur. Heart J. 42, 2885–2896 (2021).

Ulloa-Cerna, A. E. et al. rECHOmmend: an ECG-based machine learning approach for identifying patients at increased risk of undiagnosed structural heart disease detectable by echocardiography. Circulation 146, 36–47 (2022).

Aras, M. A. et al. Electrocardiogram detection of pulmonary hypertension using deep learning. J. Card. Fail. 29, 1017–1028 (2023).

Liu, C.-M. et al. Artificial intelligence-enabled electrocardiogram improves the diagnosis and prediction of mortality in patients with pulmonary hypertension. JACC Asia 2, 258–270 (2022).

Chen, L., Fu, G. & Jiang, C. Deep learning-derived 12-lead electrocardiogram-based genotype prediction for hypertrophic cardiomyopathy: a pilot study. Ann. Med. 55, 2235564 (2023).

Ko, W.-Y. et al. Detection of hypertrophic cardiomyopathy using a convolutional neural network-enabled electrocardiogram. J. Am. Coll. Cardiol. 75, 722–733 (2020).

Sun, W. et al. Towards artificial intelligence-based learning health system for population-level mortality prediction using electrocardiograms. NPJ Digit. Med. 6, 21 (2023).

Liu, X., Wang, H., Li, Z. & Qin, L. Deep learning in ECG diagnosis: a review. Knowl.-Based Syst. 227, 107187 (2021).

Mant, J. et al. Accuracy of diagnosing atrial fibrillation on electrocardiogram by primary care practitioners and interpretative diagnostic software: analysis of data from screening for atrial fibrillation in the elderly (SAFE) trial. BMJ 335, 380 (2007).

Veronese, G. et al. Emergency physician accuracy in interpreting electrocardiograms with potential ST-segment elevation myocardial infarction: is it enough? Acute Card. Care 18, 7–10 (2016).

Tran, D. T. et al. The current and future financial burden of hospital admissions for heart failure in Canada: a cost analysis. CMAJ Open 4, E365–E370 (2016).

Somani, S. et al. Deep learning and the electrocardiogram: review of the current state-of-the-art. Europace 23, 1179–1191 (2021).

Clifford, G. D. et al. AF classification from a short single lead ECG recording: the PhysioNet/computing in cardiology challenge 2017. Comput. Cardiol. 44, https://doi.org/10.22489/CinC.2017.065-469 (2017).

Hannun, A. Y. et al. Cardiologist-level arrhythmia detection and classification in ambulatory electrocardiograms using a deep neural network. Nat. Med. 25, 65–69 (2019).

Sun, W. et al. Improving ECG-based COVID-19 diagnosis and mortality predictions using pre-pandemic medical records at population-scale. In: Time series for health at NeurIPS. https://doi.org/10.48550/arXiv.2211.10431. (2022).

Sun, W. et al. ECG for high-throughput screening of multiple diseases: Proof-of-concept using multi-diagnosis deep learning from population-based datasets. In: Medical imaging meets NeurIPS. https://doi.org/10.48550/arXiv.2210.06291. (2022).

Drummond, C. & Holte, R. C. Cost curves: an improved method for visualizing classifier performance. Mach. Learn. 65, 95–130 (2006).

Quan, H. et al. Assessing validity of ICD-9-CM and ICD-10 administrative data in recording clinical conditions in a unique dually coded database. Health Serv. Res. 43, 1424–1441 (2008).

Quan, H. et al. Coding algorithms for defining comorbidities in ICD-9-CM and ICD-10 administrative data. Med. Care 43, 1130–1139 (2005).

Chen, T. & Guestrin, C. XGBoost: a scalable tree boosting system. In: Proceedings of the 22nd ACM SIGKDD international conference on knowledge discovery and data mining, 785–794 (Association for Computing Machinery, 2016).

Prechelt, L. Early stopping — but when? In: neural networks: tricks of the trade: Second Edition (eds. Montavon, G., Orr, G. B. & Müller, K.-R.) 53–67 (Springer Berlin Heidelberg, 2012).

Youden, W. J. Index for rating diagnostic tests. Cancer 3, 32–35 (1950).

Brier, G. W. Verification of forecasts expressed in terms of probability. Mon. Weather Rev. 78, 1–3 (1950).

Selvaraju, R. R. et al. Grad-CAM: visual explanations from deep networks via gradient-based localization. In: 2017 IEEE International Conference on Computer Vision (ICCV) 618–626 (2017).

Moons, K. G. M. et al. Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD): explanation and elaboration. Ann. Intern. Med. 162, W1–W73 (2015).

The study was funded by an operating grant from the Canadian Institutes of Health Research (Grant # PJT-178158). Dr. Kaul holds the Canadian Institutes of Health Research Sex and Gender Science Chair and a Heart & Stroke Chair in Cardiovascular Research. Data was extracted from the Alberta Health Services Enterprise Data Warehouse with support provided by AbSPORU Data and Research Services platform, which is funded by CIHR, Alberta Innovates, University Hospital Foundation, University of Alberta, University of Calgary, and Alberta Health Services. The interpretation and conclusions contained herein are those of the researchers and do not necessarily represent the views of Alberta Health Services or any of the funders.

These authors contributed equally: Sunil Vasu Kalmady, Amir Salimi.

Department of Computing Science, University of Alberta, Edmonton, AB, Canada

Sunil Vasu Kalmady, Amir Salimi, Weijie Sun, Yousef Nademi, Abram Hindle & Russel Greiner

Canadian VIGOUR Centre, Department of Medicine, University of Alberta, Edmonton, AB, Canada

Sunil Vasu Kalmady, Nariman Sepehrvand, Kevin Bainey, Justin Ezekowitz, Finlay McAlister, Roopinder Sandhu & Padma Kaul

Department of Medicine, University of Alberta, Edmonton, AB, Canada

Sunil Vasu Kalmady, Kevin Bainey, Justin Ezekowitz, Finlay McAlister & Padma Kaul

Department of Medicine, University of Calgary, Calgary, AB, Canada

Nariman Sepehrvand

Smidt Heart Institute, Cedars-Sinai Medical Center Hospital System, Los Angeles, CA, USA

Roopinder Sandhu

P.K. and S.V.K. conceived of the study, acquired funding, and were responsible for the overall study; A.S., W.S., S.V.K., and Y.N. conducted the analyses; S.V.K., P.K., A.S., and N.S. drafted the manuscript; all other authors critically reviewed and commented on the analyses and manuscript.

Correspondence to Padma Kaul.

The authors declare no competing interests.

This study was approved by the University of Alberta Research Ethics Board (Pro00120852), including waiving the need for individual patient informed consent. The report has been structured according to the “Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis based on artificial intelligence” (TRIPOD-AI) guidelines43.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.

Kalmady, S.V., Salimi, A., Sun, W. et al. Development and validation of machine learning algorithms based on electrocardiograms for cardiovascular diagnoses at the population level. npj Digit. Med. 7, 133 (2024). https://doi.org/10.1038/s41746-024-01130-8

Received: 16 August 2023

Accepted: 26 April 2024

Published: 18 May 2024

DOI: https://doi.org/10.1038/s41746-024-01130-8
