Diabetes prediction using Pima Indians dataset
The Pima Indians Diabetes dataset is taken from the Kaggle website.
Diabetes is a disease in which the human body cannot produce enough insulin, or cannot use it effectively, to clear glucose from the bloodstream, leading to a cascade of complications such as kidney failure, blindness, and more. More than 37 million US adults have diabetes, and 1 in 5 of them do not know they have it. It is the seventh leading cause of death in the US. Source: CDC.
In this workbook we apply machine learning techniques to see how accurately they can predict the chance of diabetes in the Pima Indian population.
Exploratory Analysis of the Pima Indians Diabetes Dataset:
First let's load the necessary modules:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Data Visualization and Statistics
all_data = pd.read_csv("diabetes.csv")
print("Data Shape:", all_data.shape)
all_data
Data Shape: (768, 9)
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 | 0 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |
| 767 | 1 | 93 | 70 | 31 | 0 | 30.4 | 0.315 | 23 | 0 |
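Before going further, it is worth checking how balanced the Outcome label is. A minimal sketch, assuming all_data is the DataFrame loaded above:
# Count the two Outcome classes (0 = no diabetes, 1 = diabetes)
print(all_data["Outcome"].value_counts())
# As fractions of the dataset (roughly 65% negative / 35% positive, consistent with the mean Outcome below)
print(all_data["Outcome"].value_counts(normalize=True))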
Print the Descriptive Statistics:
all_data.describe()
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.884160 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.078000 | 21.000000 | 0.000000 |
| 25% | 1.000000 | 99.000000 | 62.000000 | 0.000000 | 0.000000 | 27.300000 | 0.243750 | 24.000000 | 0.000000 |
| 50% | 3.000000 | 117.000000 | 72.000000 | 23.000000 | 30.500000 | 32.000000 | 0.372500 | 29.000000 | 0.000000 |
| 75% | 6.000000 | 140.250000 | 80.000000 | 32.000000 | 127.250000 | 36.600000 | 0.626250 | 41.000000 | 1.000000 |
| max | 17.000000 | 199.000000 | 122.000000 | 99.000000 | 846.000000 | 67.100000 | 2.420000 | 81.000000 | 1.000000 |
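Note that the minimum value is 0 for Glucose, BloodPressure, SkinThickness, Insulin, and BMI, which is physiologically implausible and usually indicates missing measurements in this dataset. A minimal sketch to count these suspect zeros, assuming all_data as above:
# Number of zero entries per column; these are likely missing values rather than true zeros
cols_with_suspect_zeros = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
print((all_data[cols_with_suspect_zeros] == 0).sum())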
Data Rescaling (min-max normalization to the [0, 1] range):
# Min-max normalize every column (Outcome is already 0/1, so it is unchanged)
all_data = (all_data-all_data.min())/(all_data.max()-all_data.min())
all_data
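As an aside, scikit-learn's MinMaxScaler performs the same [0, 1] rescaling. A minimal sketch (the names raw, n_tr, train_scaled, and test_scaled are illustrative only) in which the scaler is fit on the training rows alone, so that test-set minima/maxima do not leak into the preprocessing; the notebook above instead rescales the full DataFrame before splitting, which is simpler but shares those statistics:
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
# Re-load a raw copy just for this illustration
raw = pd.read_csv("diabetes.csv")
n_tr = int(0.7 * len(raw))
scaler = MinMaxScaler()
# Fit on the first 70% of rows only, then apply the same scaling to the remaining rows
train_scaled = scaler.fit_transform(raw.iloc[:n_tr])
test_scaled = scaler.transform(raw.iloc[n_tr:])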
Training and Test Set (70% and 30%)
n_features = all_data.shape[1]-1
ndata = len(all_data)
ntrain = int(0.7*ndata)
ntest = ndata - ntrain
training_data = all_data[0:ntrain]
test_data = all_data[ntrain:ndata]
training_data.shape, test_data.shape
((537, 9), (231, 9))
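The split above simply takes the first 70% of rows for training and the last 30% for testing. A minimal alternative sketch using scikit-learn's train_test_split, which shuffles the rows and can stratify on the label (the names train_df and test_df are illustrative only):
from sklearn.model_selection import train_test_split
# Shuffled, stratified 70/30 split; stratify keeps the 0/1 Outcome ratio similar in both parts
train_df, test_df = train_test_split(all_data, test_size=0.3, random_state=42,
                                     stratify=all_data["Outcome"])
print(train_df.shape, test_df.shape)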
Separating the features (X) from the labels (y):
X_train = training_data.to_numpy()[:,0:n_features]
y_train = training_data.to_numpy()[:,-1]
X_test = test_data.to_numpy()[:,0:n_features]
y_test = test_data.to_numpy()[:,-1]
print ("Training:", X_train.shape, y_train.shape)
print ("Test:", X_test.shape, y_test.shape)
Training: (537, 8) (537,)
Test: (231, 8) (231,)
Binary Logistic Regression
There are 8 features: \(\vec{X}=[X_1, X_2, ..., X_8]\).
The weights are \(\vec{w}=[w_1, w_2, ..., w_8]\) and the bias is \(b\).
The linear regression function is \(z=\vec{w}\cdot\vec{X} + b\).
Logistic regression function: \(f_{\vec{w},b}(\vec{X})=g(z) = \frac{1}{1+\exp(-z)} = \frac{1}{1+\exp(-(\vec{w}\cdot\vec{X} + b))}\)
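A quick numeric sanity check of the sigmoid \(g(z)\), just to illustrate that it maps any real \(z\) into (0, 1) with \(g(0)=0.5\):
# Sigmoid values for a few sample z; g(0) = 0.5, large |z| saturates toward 0 or 1
for z in [-5, -1, 0, 1, 5]:
    print(z, 1.0/(1.0 + np.exp(-z)))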
# Initial weights and bias
w = np.ones(n_features)
b = 0.5
# Learning Rate
alpha = 0.01
Defining the Logistic and Cost Function:
Logistic Cost Function:
\(J(\vec{w},b) = \frac{1}{m}\sum \limits_{i=1}^{m} L\left(f_{\vec{w},b}(\vec{X}^{(i)}), y^{(i)}\right),\) where \(m\) is the number of training examples,
and the loss is the cross-entropy loss \(L\left(f_{\vec{w},b}(\vec{X}^{(i)}), y^{(i)}\right) = -y^{(i)}\log\left(f_{\vec{w},b}(\vec{X}^{(i)})\right) - \left(1-y^{(i)}\right)\log\left(1-f_{\vec{w},b}(\vec{X}^{(i)})\right)\), \(i \in [1, m]\).
Gradient Descent Algorithm (With w regularization):
\(\frac{\partial J(\vec{w},b)}{\partial w_j} = \frac{1}{m} \sum \limits_{i=1}^{m} \left( f_{\vec{w},b}(\vec{X}^{(i)})-y^{(i)}\right ) X_j^{(i)} + \frac{\lambda}{m}w_j\)
\(\frac{\partial J(\vec{w},b)}{\partial b} = \frac{1}{m} \sum \limits_{i=1}^{m} \left( f_{\vec{w},b}(\vec{X}^{(i)})-y^{(i)}\right )\)
def f_logistic(X,w,b):
    # Logistic (sigmoid) function applied to the linear model z = w.X + b
    return 1./(1+np.exp(-np.dot(w,X)-b))

def dJ_dw(X,y,w,b):
    # Gradient of the cost with respect to the weights w (no regularization)
    m = len(X)
    Loss_Sum = 0
    for i in range(m):
        Loss_Sum += (f_logistic(X[i],w,b) - y[i])*X[i]
    return (1/m)*Loss_Sum

def dJ_dw_L2_regularization(X,y,w,b,lamda):
    # Gradient of the cost with respect to w, with an L2 penalty term (lambda/m)*w
    m = len(X)
    Loss_Sum = 0
    for i in range(m):
        Loss_Sum += (f_logistic(X[i],w,b) - y[i])*X[i]
    return (1/m)*Loss_Sum + (lamda/m)*w

def dJ_db(X,y,w,b):
    # Gradient of the cost with respect to the bias b
    m = len(X)
    Loss_Sum = 0
    for i in range(m):
        Loss_Sum += (f_logistic(X[i],w,b) - y[i])
    return (1/m)*Loss_Sum
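The loops above are easy to read but slow in pure Python. A minimal vectorized sketch of the same gradients using NumPy matrix operations (the function names are illustrative, not part of the original notebook):
def f_logistic_vec(X, w, b):
    # Sigmoid of the linear model for all rows of X at once
    return 1.0/(1.0 + np.exp(-(X @ w + b)))

def gradients_vec(X, y, w, b, lamda=0.0):
    # Returns (dJ/dw, dJ/db) for the whole batch, with an optional L2 term on w
    m = len(X)
    err = f_logistic_vec(X, w, b) - y      # shape (m,)
    dw = (X.T @ err)/m + (lamda/m)*w       # shape (n_features,)
    db = np.sum(err)/m
    return dw, db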
Update w and b over the training epochs (simultaneous update: \(w_j \leftarrow w_j - \alpha\,\frac{\partial J}{\partial w_j}\), \(b \leftarrow b - \alpha\,\frac{\partial J}{\partial b}\)):
n_epoch = 2000
for i in range(n_epoch):
    #tmp_w = w - alpha*dJ_dw(X_train,y_train,w,b)
    tmp_w = w - alpha*dJ_dw_L2_regularization(X_train,y_train,w,b,lamda=0.1) # With Regularization
    tmp_b = b - alpha*dJ_db(X_train,y_train,w,b)
    w = tmp_w
    b = tmp_b
    #print (i,w,b)
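To check that gradient descent is actually converging, the cross-entropy cost J can be tracked during training. A minimal sketch (the helper cost_J is illustrative and assumes the X_train, y_train, w, and b defined above):
def cost_J(X, y, w, b):
    # Mean cross-entropy loss over the examples in X
    m = len(X)
    J = 0.0
    for i in range(m):
        f = f_logistic(X[i], w, b)
        J += -y[i]*np.log(f) - (1 - y[i])*np.log(1 - f)
    return J/m

# e.g. inside the training loop: if i % 200 == 0: print(i, cost_J(X_train, y_train, w, b))
print("Final training cost:", cost_J(X_train, y_train, w, b))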
Prediction with the trained parameters:
print ("Final Weights(w):", w)
print ("Final Bias(b):", b)
y_pred = []
for i in range(ntest):
    pred_result = 0
    y = f_logistic(X_test[i],w,b)
    if (y > 0.5):
        pred_result = 1
    y_pred.append(pred_result)
y_pred = np.array(y_pred)
y_test, y_pred
Final Weights(w): [0.77691792, 0.33937085, -0.13957614, 0.57400966, 0.86465804, 0.33027304, 0.7857026, 0.78044359]
Final Bias(b): -1.51501363903071
y_test: [0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0]
y_pred: [0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
Accuracy of the binary logistic regression prediction:
accuracy = 100*(1-np.sum(abs(y_test-y_pred))/ntest)
print("Accuracy: %0.2f%%"%accuracy)
Accuracy: 71.00%
Use Scikit-Learn for Logistic Regression
from sklearn.linear_model import LogisticRegression
logisticRegr = LogisticRegression(penalty=None, max_iter=200)
logisticRegr.fit(X_train, y_train)
y_pred_sklearn = logisticRegr.predict(X_test)
accuracy = 100*(1-np.sum(abs(y_test-y_pred_sklearn))/ntest)
print("Accuracy: %0.2f%%"%accuracy)
Accuracy: 79.22%
Neural Network (3 layers)
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import BinaryCrossentropy
model = Sequential([
    Dense(units=35, activation='relu'),
    Dense(units=10, activation='relu'),
    Dense(units=1, activation='linear')
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001) ,loss=BinaryCrossentropy(from_logits=True))
model.fit(X_train, y_train, epochs=200)
Accuracy of the Neural Network prediction:
model.summary()
#y_pred_continuous = model.predict(X_test)
logit = model(X_test)
y_pred_continuous = tf.nn.sigmoid(logit)
y_pred_tf = np.where(y_pred_continuous > 0.5 , 1, 0).T[0]
accuracy = 100*(1-np.sum(abs(y_test-y_pred_tf))/ntest)
print("Accuracy: %0.2f%%"%accuracy)
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 dense (Dense)               (None, 35)                315
 dense_1 (Dense)             (None, 10)                360
 dense_2 (Dense)             (None, 1)                 11
=================================================================
Total params: 686
Trainable params: 686
Non-trainable params: 0
_________________________________________________________________
Accuracy: 80.95%
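Since only about 35% of the samples are positive, accuracy alone can hide how each class is handled. A minimal sketch comparing predictions with scikit-learn's metrics (assuming the y_test and y_pred_tf arrays built above):
from sklearn.metrics import confusion_matrix, classification_report
# Cast the (min-max scaled, still 0/1) test labels back to integers for the report
y_test_int = y_test.astype(int)
# Rows: true class, columns: predicted class
print(confusion_matrix(y_test_int, y_pred_tf))
# Per-class precision, recall and F1 score
print(classification_report(y_test_int, y_pred_tf, digits=3))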
Feature Selection (chi2):
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
bestfeatures = SelectKBest(score_func=chi2, k=4)
fit = bestfeatures.fit(X_train, y_train)
print("Fit Scores:", fit.scores_)
top4_features = np.argpartition(fit.scores_, -4)[-4:]
print("Top 4 features:", top4_features)
X_train_top4 = fit.transform(X_train)
X_test_top4 = fit.transform(X_test)
Fit Scores: [4.53678678 4.54641459 0.04086338 0.28601954 2.00041986 1.61708541 1.99254061 4.31147926]
Top 4 features: [4 7 1 0]
The most important features identified by the SelectKBest (chi2) analysis are Insulin, Age, Glucose, and Pregnancies.
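A small sketch to map the selected column indices back to feature names directly (assuming all_data still holds the named columns):
# Translate the SelectKBest indices into human-readable column names
feature_names = all_data.columns[:n_features]
print([feature_names[i] for i in top4_features])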
Logistic Regression with the most important 4 features
logisticRegr = LogisticRegression(penalty=None, max_iter=200)
logisticRegr.fit(X_train_top4, y_train)
y_pred_sklearn = logisticRegr.predict(X_test_top4)
accuracy = 100*(1-np.sum(abs(y_test-y_pred_sklearn))/ntest)
print("Accuracy: %0.2f%%"%accuracy)
Accuracy: 77.92%
Neural Network (3 layers) with the most important 4 features
model = Sequential([
    Dense(units=15, activation='relu'),
    Dense(units=8, activation='relu'),
    Dense(units=1, activation='linear')
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss=BinaryCrossentropy(from_logits=True))
model.fit(X_train_top4, y_train, epochs=200)
model.summary()
logit = model(X_test_top4)
y_pred_continuous = tf.nn.sigmoid(logit)
y_pred_tf = np.where(y_pred_continuous > 0.5 , 1, 0).T[0]
accuracy = 100*(1-np.sum(abs(y_test-y_pred_tf))/ntest)
print("Accuracy: %0.2f%%"%accuracy)
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 dense_3 (Dense)             (None, 15)                135
 dense_4 (Dense)             (None, 8)                 128
 dense_5 (Dense)             (None, 1)                 9
=================================================================
Total params: 272
Trainable params: 272
Non-trainable params: 0
_________________________________________________________________
Accuracy: 80.52%
Conclusions of the ML study:
(a) Simple logistic regression performs nearly as well as the neural network in terms of accuracy.
(b) Using only the best 4 features (Insulin, Age, Glucose, Pregnancies) gives comparable accuracy while reducing computation time, and points to a rational set of risk factors for diabetes.