Heart Disease UCI Logistic Regression In R

Halima Tusyakdiah
6 min read · Jan 6, 2020
Image source: https://pressroom.today/2019/07/27/amri-hospitals-introduces-special-cardiology-packages-to-save-precious-lives/

Introduction:

Logistic regression is a statistical method for analyzing a dataset in which there are one or more independent variables that determine an outcome. The outcome is measured with a dichotomous variable (in which there are only two possible outcomes).

In logistic regression, the dependent variable is binary or dichotomous, i.e. it only contains data coded as 1 (TRUE, success, pregnant, etc.) or 0 (FALSE, failure, non-pregnant, etc.).

The goal of logistic regression is to find the best-fitting (yet biologically reasonable) model to describe the relationship between the dichotomous characteristic of interest (dependent variable = response or outcome variable) and a set of independent (predictor or explanatory) variables. Logistic regression generates the coefficients (and their standard errors and significance levels) of a formula to predict a logit transformation of the probability of presence of the characteristic of interest:

\mathrm{logit}(p) = b_0 + b_1 x_1 + b_2 x_2 + \dots + b_k x_k

where p is the probability of presence of the characteristic of interest. The logit transformation is defined as the logged odds:

\mathrm{odds} = \frac{p}{1 - p}

and

\mathrm{logit}(p) = \ln\left(\frac{p}{1 - p}\right)

Rather than choosing parameters that minimize the sum of squared errors (like in ordinary regression), estimation in logistic regression chooses parameters that maximize the likelihood of observing the sample values. (source: https://www.medcalc.org/manual/logistic_regression.php)
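Concretely, maximum likelihood picks the coefficients b under which the observed 0/1 outcomes are most probable. For n observations with outcomes y_i, the log-likelihood being maximized is (a standard formula, stated here for reference, in the same notation as above):

\ell(b) = \sum_{i=1}^{n} \left[ y_i \ln p_i + (1 - y_i) \ln(1 - p_i) \right], \qquad p_i = \frac{1}{1 + e^{-(b_0 + b_1 x_{i1} + \dots + b_k x_{ik})}}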

The coefficients (the beta values, b) of the logistic regression algorithm must be estimated from your training data.

Data:

You can download the data from the link → https://www.kaggle.com/ronitf/heart-disease-uci
The dataset contains a number of medical indicators. The original UCI database contains 76 attributes, but this Kaggle version is a 14-attribute subset of the Cleveland database, with the medical histories of 303 patients.

Attribute Information:

  1. age: age in years
  2. sex: (1 = male; 0 = female)
  3. cp: the chest pain experienced (value 1: typical angina, value 2: atypical angina, value 3: non-anginal pain, value 4: asymptomatic)
  4. trestbps: resting blood pressure (in mm Hg on admission to the hospital)
  5. chol: serum cholesterol in mg/dl
  6. fbs: fasting blood sugar > 120 mg/dl (1 = true; 0 = false)
  7. restecg: resting electrocardiographic measurement (0 = normal, 1 = having ST-T wave abnormality, 2 = showing probable or definite left ventricular hypertrophy by Estes’ criteria)
  8. thalach: maximum heart rate achieved
  9. exang: exercise-induced angina (1 = yes; 0 = no)
  10. oldpeak: ST depression induced by exercise relative to rest
  11. slope: the slope of the peak exercise ST segment (value 1: upsloping, value 2: flat, value 3: downsloping)
  12. ca: number of major vessels (0–3) colored by fluoroscopy
  13. thal: a blood disorder called thalassemia (3 = normal; 6 = fixed defect; 7 = reversible defect)
  14. target: heart disease (0 = no, 1 = yes)

Problem:
In this study, the aim was to predict whether a person has heart disease based on attributes such as blood pressure, heart rate, exang, fbs, and others.

Step 1: import dataset

#import the CSV (path as in the original post)
dataset = read.csv(file = "D:\\KULIAH\\BIML\\heart.csv", sep = ",")
View(dataset)

The number of rows in the dataset:

nrow(dataset)
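Besides the row count, a quick structural check (a small addition, not in the original post) shows each column's type and a few example values:

#compact overview of column types and sample values
str(dataset)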

Step 2: Heart Disease Exploration

library(ggplot2)

#summary statistics and boxplots of the main numeric variables
summary(dataset)
databox = data.frame(dataset$age, dataset$trestbps, dataset$chol, dataset$thalach)
boxplot(databox)

# Scatterplot of resting blood pressure vs serum cholesterol
gg <- ggplot(dataset, aes(x = chol, y = trestbps)) +
geom_point(aes(col = target, size = oldpeak)) +
geom_smooth(method = "loess", se = FALSE) +
xlim(c(100, 430)) +
ylim(c(75, 200)) +
labs(subtitle = "trestbps vs chol",
y = "trestbps",
x = "chol",
title = "Scatterplot",
caption = "Source: Heart Disease UCI")
plot(gg)

OK, we will try logistic regression to predict which patients have heart disease.

Step 3: Calculating the baseline model

#1 = has heart disease, 0 = does not have heart disease
#baseline model
table(dataset$target)

Then calculate the probability of heart disease: the number of patients coded 1 divided by the total number of observations.

165/303

The baseline value of 0.545 means that approximately 54% of the patients suffer from heart disease.
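As a small sketch (not in the original post), the same baseline proportion can be computed without hard-coding the counts:

#baseline probability of heart disease, computed directly
mean(dataset$target == 1)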

Step 4: Splitting Dataset into Train and Test set

To implement this model, we need to separate the dependent and independent variables within our dataset and divide the dataset into a training set and a testing set for evaluating the model.

The data is divided as follows:
training data: 0.75
testing data: 0.25
Then a random seed of 123 is set so the split is reproducible.

###training and testing data###
library(caTools)
#randomly split data
set.seed(123)
split=sample.split(dataset$target, SplitRatio = 0.75)
split
qualityTrain=subset(dataset,split == TRUE)
qualityTest=subset(dataset,split == FALSE)
nrow(qualityTrain)
nrow(qualityTest)

So we have 228 observations in the training set and 75 in the testing set.
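Note that sample.split stratifies on the label, so the class balance of target is preserved in both subsets. A quick check (a small sketch using the objects created above):

#class proportions should be roughly equal in both subsets
prop.table(table(qualityTrain$target))
prop.table(table(qualityTest$target))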

Step 5: Building the Model

The dependent variable is target; the independent variables are age, sex, cp, trestbps, chol, fbs, restecg, thalach, exang, oldpeak, slope, ca, and thal.

#logistic regression model (target ~ . uses every other column as a predictor)
datasetlog = glm(target ~ ., data = qualityTrain, family = binomial)
summary(datasetlog)

Let’s break down what the code means. glm fits the generalized linear model we will be using. target ~ . means that we want to model target using (~) every available feature (.). family = binomial is used because we are predicting a binary outcome, 0 or 1.

Then we get the result.

A lot of the variables are not significant.
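As a side note (a small sketch, not in the original post), the fitted coefficients are on the log-odds scale; exponentiating them gives odds ratios, which are often easier to interpret:

#coefficients are log-odds; exponentiate to get odds ratios
exp(coef(datasetlog))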

Now we will remove variables based on their significance level, using backward elimination.

#model 2: restecg removed
datasetlog2 = glm(target ~ age+sex+cp+trestbps+chol+fbs+thalach+exang+oldpeak+slope+ca+thal, data = qualityTrain, family = binomial)
summary(datasetlog2)

After removing five variables, in the following order:
1. remove restecg

2. remove fbs

3. remove slope

4. remove exang

5. remove age

we get model 6,

#model 6: the final model after backward elimination
datasetlog6 = glm(target ~ sex+cp+trestbps+chol+thalach+oldpeak+ca+thal, data = qualityTrain, family = binomial)
summary(datasetlog6)

Applying the model after removing the least significant variables.

A general rule in machine learning is that the more features you have, the more likely your model is to suffer from overfitting.
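One way to sanity-check the pruning (a small sketch, not in the original post) is to compare the full and the reduced model by AIC, which penalizes extra parameters; lower is better:

#compare full and reduced models; lower AIC is better
AIC(datasetlog, datasetlog6)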

Making predictions on the training set using datasetlog6:

#predict on the training data using model datasetlog6
predictTrain = predict(datasetlog6, type = "response")
predictTrain

Step 6: Plotting the ROC curve

#threshold selection
library(ROCR)
ROCRpred = prediction(predictTrain, qualityTrain$target)
ROCRperf = performance(ROCRpred, 'tpr', 'fpr')
plot(ROCRperf)
plot(ROCRperf, colorize = TRUE)
plot(ROCRperf, colorize = TRUE, print.cutoffs.at = seq(0, 1, by = 0.1),
text.adj = c(-0.2, 1.7))

From the ROC curve, a threshold of 0.7 seems acceptable: it maximizes true positives so that as few patients with heart disease as possible are identified as healthy.
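To see what this threshold does on the training data (a small sketch, not in the original post), cross-tabulate the actual outcomes against the predictions at 0.7:

#training-set confusion matrix at threshold 0.7
table(qualityTrain$target, predictTrain >= 0.7)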

Then we can look at the AUC value. The higher the AUC, the better the model is at distinguishing between patients with the disease and patients without it.

#Area under the curve
auc = as.numeric(performance(ROCRpred, 'auc')@y.values)
auc

The AUC value is 0.92, which means our model can distinguish between patients with the disease and patients without it with a probability of 0.92, so it is a good value.

Step 7: Accuracy

#accuracy using a threshold of 0.7
predictTest = predict(datasetlog6, newdata = qualityTest, type = "response")
#confusion matrix: actual target vs predicted probability >= 0.7
table(qualityTest$target, predictTest >= 0.7)
#accuracy = (correct negatives + correct positives) / total
(29+28)/75
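As a small convenience sketch (not in the original post), the same accuracy can be computed directly from the predictions, without copying counts from the table:

#accuracy at threshold 0.7, computed directly
mean((predictTest >= 0.7) == (qualityTest$target == 1))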

Both the logistic regression model with all the variables and the model after removing the less significant attributes performed well, with a testing accuracy of 76%.

THX :)

Reference:

https://www.medcalc.org/manual/logistic_regression.php
