Can easily obtainable personal indicators tell you whether you are at risk of heart disease?
In 2020, the U.S. Centers for Disease Control and Prevention (CDC) compiled a 319,795-by-18 tabular dataset for heart disease prediction. With the boolean-valued column “heart_disease” as the dependent variable and all other columns as predictors, the best initial screening model is picked from five candidate machine learning (ML) models. Initial screening requires high recall in particular, so that as few true positives as possible are missed. The training and test results show that the gradient-boosted tree is the best model for this purpose, achieving a recall of 78.5%.
According to the CDC, heart disease is a leading natural cause of death in the U.S. Finding important indicators for heart disease is an ongoing and valuable inquiry in the medical community. Because interpretability matters for this purpose, highly flexible but hard-to-interpret models such as deep neural networks are avoided. Decision tree-based models, namely a gradient-boosted tree and a random forest, are chosen both for the ease of interpreting the relative importance of input features and for their robustness in learning complex data patterns. Methods based on the maximum likelihood principle, namely a naive Bayes classifier, logistic regression, and linear discriminant analysis (LDA), are also implemented so that performance can be compared against an approach different from tree-based models. In order of importance, the metrics used for evaluating model effectiveness are recall, area under the receiver operating characteristic curve (AUC-ROC), precision, and accuracy.
The original dataset is split into an 80% training set and a 20% test set. In addition, 10-fold cross-validation is carried out on the training set. All splits are stratified on the response.
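The stratified splitting described above can be illustrated with a minimal sketch in plain Python. The function names `stratified_split` and `stratified_folds` and the toy labels are hypothetical and only stand in for whatever tooling the project actually used; the point is that each class is shuffled and divided separately, so the positive rate is preserved in every subset.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=0):
    """Return (train_idx, test_idx), splitting each class in the same ratio."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    train_idx, test_idx = [], []
    for idx in by_class.values():
        rng.shuffle(idx)
        n_test = round(len(idx) * test_fraction)
        test_idx.extend(idx[:n_test])
        train_idx.extend(idx[n_test:])
    return sorted(train_idx), sorted(test_idx)


def stratified_folds(labels, indices, k=10, seed=0):
    """Assign the given indices to k folds, dealing each class out round-robin."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i in indices:
        by_class[labels[i]].append(i)
    folds = [[] for _ in range(k)]
    for idx in by_class.values():
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


# Toy labels with 10% positives (hypothetical, not the CDC data).
labels = [1] * 100 + [0] * 900
train_idx, test_idx = stratified_split(labels, test_fraction=0.2)
folds = stratified_folds(labels, train_idx, k=10)
```

With these toy labels, the test set and every cross-validation fold keep the same 10% positive rate as the full data, which is what stratifying on the response is meant to guarantee.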
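The five candidate models in this project are three baseline models (a naive Bayes classifier, logistic regression, and LDA) and two tree-based models (a random forest and a gradient-boosted tree, XGBoost).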
To counter the class imbalance in the training set, the baseline models use ROSE (Random Over-Sampling Examples) to oversample the underrepresented positive class. For the tree models, the imbalance is instead countered by tuning the class weights upward in favor of positive samples.
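To make the effect of class weighting concrete, here is a small sketch under stated assumptions. The negative-to-positive ratio used below is only a common starting point for the positive-class weight (in XGBoost, such a value is typically passed as `scale_pos_weight`); the weights in this project were tuned, and the helper names and toy labels are hypothetical.

```python
def positive_class_weight(labels):
    """Ratio of negative to positive samples, a common starting weight."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = sum(1 for y in labels if y == 0)
    if n_pos == 0:
        raise ValueError("no positive samples")
    return n_neg / n_pos


def weighted_error(y_true, y_pred, pos_weight):
    """Misclassification rate in which each positive sample counts pos_weight times."""
    total = 0.0
    wrong = 0.0
    for y, p in zip(y_true, y_pred):
        w = pos_weight if y == 1 else 1.0
        total += w
        if y != p:
            wrong += w
    return wrong / total


# Toy labels with 10% positives (hypothetical, not the CDC data).
labels = [1] * 10 + [0] * 90
weight = positive_class_weight(labels)

# A useless model that always predicts "no heart disease".
always_negative = [0] * len(labels)
unweighted = weighted_error(labels, always_negative, pos_weight=1.0)
weighted = weighted_error(labels, always_negative, pos_weight=weight)
```

Without weighting, the always-negative model looks acceptable because it is wrong on only a small share of samples; with the positive class weighted up, its error rises sharply, which pushes a learner toward actually finding positives.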
For each tree-based model, hyperparameters are tuned by exploring a large grid of hyperparameter combinations, and the final model is the one with the highest AUC-ROC. Once the optimal hyperparameters are found, the test dataset is used to check for overfitting. Finally, recall, AUC-ROC, precision, and accuracy are used for evaluation.
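As a rough illustration of selecting a hyperparameter setting by its ROC AUC, the sketch below computes the AUC as the probability that a randomly chosen positive receives a higher score than a randomly chosen negative, then keeps the best setting. The setting names, labels, and scores are made up for illustration; in practice the scores would come from cross-validated model predictions.

```python
def roc_auc(y_true, scores):
    """Probability that a random positive outranks a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Hypothetical validation labels and predicted scores for two grid points.
y_val = [1, 0, 1, 0, 0, 1]
candidates = {
    "max_depth=3": [0.9, 0.2, 0.6, 0.4, 0.1, 0.7],
    "max_depth=6": [0.8, 0.6, 0.5, 0.4, 0.3, 0.7],
}

auc_by_setting = {name: roc_auc(y_val, s) for name, s in candidates.items()}
best_setting = max(auc_by_setting, key=auc_by_setting.get)
```

The pairwise loop is quadratic and only suitable for small examples; a rank-based formulation gives the same value more efficiently on data of realistic size.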
Since the goal of these models is to detect potential heart disease at an early stage, keeping the false-negative rate low is an essential priority. Therefore, classification metrics such as recall, J index, and ROC AUC matter more than accuracy, and they are included in Table 1 to the right.
*J index: Youden’s J statistic, Kappa: Cohen’s kappa coefficient, PR AUC: area under the precision-recall curve, ROC AUC: area under the ROC curve
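For readers unfamiliar with these metrics, the following sketch shows how recall, precision, accuracy, and Youden's J statistic follow from the four confusion-matrix counts. The function name and the counts are hypothetical and do not reproduce the values in Table 1.

```python
def screening_metrics(tp, fp, fn, tn):
    """Compute screening metrics from confusion-matrix counts."""
    recall = tp / (tp + fn)        # share of true cases that are flagged
    specificity = tn / (tn + fp)   # share of healthy cases left unflagged
    precision = tp / (tp + fp)     # share of flagged cases that are true
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "accuracy": accuracy,
        # Youden's J statistic: recall + specificity - 1
        "j_index": recall + specificity - 1,
    }


# Hypothetical counts for an imbalanced screening problem.
metrics = screening_metrics(tp=80, fp=300, fn=20, tn=600)
```

In this made-up case recall is high while precision and accuracy are modest, which is the trade-off an initial screening model accepts in order to miss few true cases.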
Logistic regression, LDA, and XGBoost are the best models in terms of overall performance. The XGBoost model has the highest recall, J index, and PR AUC. Examining the ROC and precision-recall curves of all five models likewise indicates that XGBoost is the best option overall.
Additionally, the relative feature importance in logistic regression and XGBoost is analyzed to determine which factors are related to heart disease. The most relevant variables, covering both positive and negative associations, are listed below in descending order of importance:
older than 70 | stroke | good general health | younger than 45 | difficulty walking | having smoked more than 100 cigarettes in one's lifetime | diabetes | asthma | _heavy alcohol drinking*_ | regular exercise
*This result may not hold, owing to bias in the data.
Key takeaways of this study can be summarized as follows:
For attribution, please cite this work as
Chen, et al. (2022, April 25). CeleritasML: ML-powered heart disease screening. Retrieved from https://celeritasml.netlify.app/posts/2022-04-25-ml-powered-heart-disease-screening/
BibTeX citation
@misc{chen2022ml-powered,
  author = {Chen, Yongrui and Gao, Jingsong and Luo, Ercong and Qiu, Rui and Zhang, Lu},
  title = {CeleritasML: ML-powered heart disease screening},
  url = {https://celeritasml.netlify.app/posts/2022-04-25-ml-powered-heart-disease-screening/},
  year = {2022}
}