Diabetes prediction using Pima Indians dataset
The Pima Indians Diabetes dataset is taken from the Kaggle website.
Diabetes is a disease in which the human body cannot produce enough insulin, or cannot use it effectively, to clear glucose from the bloodstream, leading to a cascade of complications such as kidney failure, blindness, and more. More than 37 million US adults have diabetes, and 1 in 5 of them do not know they have it. It is the seventh leading cause of death in the US. Source: CDC.
In this workbook we apply machine learning techniques to see how accurately they can predict the chance of diabetes in the Pima Indian population.
Exploratory Analysis of the Pima Indians Diabetes Dataset:
First let's load the necessary modules:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Data Visualization and Statistics
all_data = pd.read_csv("diabetes.csv")
print("Data Shape:", all_data.shape)
all_data
Data Shape: (768, 9)
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 6 | 148 | 72 | 35 | 0 | 33.6 | 0.627 | 50 | 1 |
| 1 | 1 | 85 | 66 | 29 | 0 | 26.6 | 0.351 | 31 | 0 |
| 2 | 8 | 183 | 64 | 0 | 0 | 23.3 | 0.672 | 32 | 1 |
| 3 | 1 | 89 | 66 | 23 | 94 | 28.1 | 0.167 | 21 | 0 |
| 4 | 0 | 137 | 40 | 35 | 168 | 43.1 | 2.288 | 33 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 763 | 10 | 101 | 76 | 48 | 180 | 32.9 | 0.171 | 63 | 0 |
| 764 | 2 | 122 | 70 | 27 | 0 | 36.8 | 0.340 | 27 | 0 |
| 765 | 5 | 121 | 72 | 23 | 112 | 26.2 | 0.245 | 30 | 0 |
| 766 | 1 | 126 | 60 | 0 | 0 | 30.1 | 0.349 | 47 | 1 |
| 767 | 1 | 93 | 70 | 31 | 0 | 30.4 | 0.315 | 23 | 0 |
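Before going further, it is worth checking how balanced the Outcome label is. A minimal sketch, assuming all_data is the DataFrame loaded above:
# Count the two Outcome classes (0 = no diabetes, 1 = diabetes)
print(all_data["Outcome"].value_counts())
# As fractions of the dataset (roughly 65% negative / 35% positive, consistent with the mean Outcome below)
print(all_data["Outcome"].value_counts(normalize=True))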
Print the Descriptive Statistics:
all_data.describe()
| | Pregnancies | Glucose | BloodPressure | SkinThickness | Insulin | BMI | DiabetesPedigreeFunction | Age | Outcome |
|---|---|---|---|---|---|---|---|---|---|
| count | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 | 768.000000 |
| mean | 3.845052 | 120.894531 | 69.105469 | 20.536458 | 79.799479 | 31.992578 | 0.471876 | 33.240885 | 0.348958 |
| std | 3.369578 | 31.972618 | 19.355807 | 15.952218 | 115.244002 | 7.884160 | 0.331329 | 11.760232 | 0.476951 |
| min | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.078000 | 21.000000 | 0.000000 |
| 25% | 1.000000 | 99.000000 | 62.000000 | 0.000000 | 0.000000 | 27.300000 | 0.243750 | 24.000000 | 0.000000 |
| 50% | 3.000000 | 117.000000 | 72.000000 | 23.000000 | 30.500000 | 32.000000 | 0.372500 | 29.000000 | 0.000000 |
| 75% | 6.000000 | 140.250000 | 80.000000 | 32.000000 | 127.250000 | 36.600000 | 0.626250 | 41.000000 | 1.000000 |
| max | 17.000000 | 199.000000 | 122.000000 | 99.000000 | 846.000000 | 67.100000 | 2.420000 | 81.000000 | 1.000000 |
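Note that the minimum value is 0 for Glucose, BloodPressure, SkinThickness, Insulin, and BMI, which is physiologically implausible and usually indicates missing measurements in this dataset. A minimal sketch to count these suspect zeros, assuming all_data as above:
# Number of zero entries per column; these are likely missing values rather than true zeros
cols_with_suspect_zeros = ["Glucose", "BloodPressure", "SkinThickness", "Insulin", "BMI"]
print((all_data[cols_with_suspect_zeros] == 0).sum())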
Data Rescaling (min-max normalization to the [0, 1] range):
# Min-max normalize every column (Outcome is already 0/1, so it is unchanged)
all_data = (all_data-all_data.min())/(all_data.max()-all_data.min())
all_data
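As an aside, scikit-learn's MinMaxScaler performs the same [0, 1] rescaling. A minimal sketch (the names raw, n_tr, train_scaled, and test_scaled are illustrative only) in which the scaler is fit on the training rows alone, so that test-set minima/maxima do not leak into the preprocessing; the notebook above instead rescales the full DataFrame before splitting, which is simpler but shares those statistics:
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
# Re-load a raw copy just for this illustration
raw = pd.read_csv("diabetes.csv")
n_tr = int(0.7 * len(raw))
scaler = MinMaxScaler()
# Fit on the first 70% of rows only, then apply the same scaling to the remaining rows
train_scaled = scaler.fit_transform(raw.iloc[:n_tr])
test_scaled = scaler.transform(raw.iloc[n_tr:])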
Training and Test Set (70% and 30%)
n_features = all_data.shape[1]-1
ndata = len(all_data)
ntrain = int(0.7*ndata)
ntest = ndata - ntrain
training_data = all_data[0:ntrain]
test_data = all_data[ntrain:ndata]
training_data.shape, test_data.shape
((537, 9), (231, 9))
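The split above simply takes the first 70% of rows for training and the last 30% for testing. A minimal alternative sketch using scikit-learn's train_test_split, which shuffles the rows and can stratify on the label (the names train_df and test_df are illustrative only):
from sklearn.model_selection import train_test_split
# Shuffled, stratified 70/30 split; stratify keeps the 0/1 Outcome ratio similar in both parts
train_df, test_df = train_test_split(all_data, test_size=0.3, random_state=42,
                                     stratify=all_data["Outcome"])
print(train_df.shape, test_df.shape)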
Separating the features (X) from the labels (y):
X_train = training_data.to_numpy()[:,0:n_features]
y_train = training_data.to_numpy()[:,-1]
X_test = test_data.to_numpy()[:,0:n_features]
y_test = test_data.to_numpy()[:,-1]
print ("Training:", X_train.shape, y_train.shape)
print ("Test:", X_test.shape, y_test.shape)
Training: (537, 8) (537,)
Test: (231, 8) (231,)
Binary Logistic Regression
There are 8 features: \(\vec{X}=[X_1, X_2, ..., X_8]\).
The weights are \(\vec{w}=[w_1, w_2, ..., w_8]\) and the bias is \(b\).
The linear regression function is \(z=\vec{w}\cdot\vec{X} + b\).
Logistic regression function: \(f_{\vec{w},b}(\vec{X})=g(z) = \frac{1}{1+\exp(-z)} = \frac{1}{1+\exp(-(\vec{w}\cdot\vec{X} + b))}\)
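A quick numeric sanity check of the sigmoid \(g(z)\), just to illustrate that it maps any real \(z\) into (0, 1) with \(g(0)=0.5\):
# Sigmoid values for a few sample z; g(0) = 0.5, large |z| saturates toward 0 or 1
for z in [-5, -1, 0, 1, 5]:
    print(z, 1.0/(1.0 + np.exp(-z)))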
# Initial weights and bias
w = np.ones(n_features)
b = 0.5
# Learning Rate
alpha = 0.01
Defining the Logistic and Cost Function:
Logistic Cost Function:
\(J(\vec{w},b) = \frac{1}{m}\sum \limits_{i=1}^{m} L\left(f_{\vec{w},b}(\vec{X}^{(i)}), y^{(i)}\right),\) where \(m\) is the number of training examples,
and the loss is the cross-entropy loss \(L\left(f_{\vec{w},b}(\vec{X}^{(i)}), y^{(i)}\right) = -y^{(i)}\log\left(f_{\vec{w},b}(\vec{X}^{(i)})\right) - \left(1-y^{(i)}\right)\log\left(1-f_{\vec{w},b}(\vec{X}^{(i)})\right)\), \(i \in [1, m]\).
Gradient Descent Algorithm (With w regularization):
\(\frac{\partial J(\vec{w},b)}{\partial w_j} = \frac{1}{m} \sum \limits_{i=1}^{m} \left( f_{\vec{w},b}(\vec{X}^{(i)})-y^{(i)}\right ) X_j^{(i)} + \frac{\lambda}{m}w_j\)
\(\frac{\partial J(\vec{w},b)}{\partial b} = \frac{1}{m} \sum \limits_{i=1}^{m} \left( f_{\vec{w},b}(\vec{X}^{(i)})-y^{(i)}\right )\)
def f_logistic(X,w,b):
    # Logistic (sigmoid) function applied to the linear model z = w.X + b
    return 1./(1+np.exp(-np.dot(w,X)-b))

def dJ_dw(X,y,w,b):
    # Gradient of the cost with respect to the weights w (no regularization)
    m = len(X)
    Loss_Sum = 0
    for i in range(m):
        Loss_Sum += (f_logistic(X[i],w,b) - y[i])*X[i]
    return (1/m)*Loss_Sum

def dJ_dw_L2_regularization(X,y,w,b,lamda):
    # Gradient of the cost with respect to w, with an L2 penalty term (lambda/m)*w
    m = len(X)
    Loss_Sum = 0
    for i in range(m):
        Loss_Sum += (f_logistic(X[i],w,b) - y[i])*X[i]
    return (1/m)*Loss_Sum + (lamda/m)*w

def dJ_db(X,y,w,b):
    # Gradient of the cost with respect to the bias b
    m = len(X)
    Loss_Sum = 0
    for i in range(m):
        Loss_Sum += (f_logistic(X[i],w,b) - y[i])
    return (1/m)*Loss_Sum
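The loops above are easy to read but slow in pure Python. A minimal vectorized sketch of the same gradients using NumPy matrix operations (the function names are illustrative, not part of the original notebook):
def f_logistic_vec(X, w, b):
    # Sigmoid of the linear model for all rows of X at once
    return 1.0/(1.0 + np.exp(-(X @ w + b)))

def gradients_vec(X, y, w, b, lamda=0.0):
    # Returns (dJ/dw, dJ/db) for the whole batch, with an optional L2 term on w
    m = len(X)
    err = f_logistic_vec(X, w, b) - y      # shape (m,)
    dw = (X.T @ err)/m + (lamda/m)*w       # shape (n_features,)
    db = np.sum(err)/m
    return dw, db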
Update w and b over the training epochs (simultaneous update: \(w_j \leftarrow w_j - \alpha\,\frac{\partial J}{\partial w_j}\), \(b \leftarrow b - \alpha\,\frac{\partial J}{\partial b}\)):
n_epoch = 2000
for i in range(n_epoch):
    #tmp_w = w - alpha*dJ_dw(X_train,y_train,w,b)
    tmp_w = w - alpha*dJ_dw_L2_regularization(X_train,y_train,w,b,lamda=0.1) # With Regularization
    tmp_b = b - alpha*dJ_db(X_train,y_train,w,b)
    w = tmp_w
    b = tmp_b
    #print (i,w,b)
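To check that gradient descent is actually converging, the cross-entropy cost J can be tracked during training. A minimal sketch (the helper cost_J is illustrative and assumes the X_train, y_train, w, and b defined above):
def cost_J(X, y, w, b):
    # Mean cross-entropy loss over the examples in X
    m = len(X)
    J = 0.0
    for i in range(m):
        f = f_logistic(X[i], w, b)
        J += -y[i]*np.log(f) - (1 - y[i])*np.log(1 - f)
    return J/m

# e.g. inside the training loop: if i % 200 == 0: print(i, cost_J(X_train, y_train, w, b))
print("Final training cost:", cost_J(X_train, y_train, w, b))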
Prediction with the trained parameters:
print ("Final Weights(w):", w)
print ("Final Bias(b):", b)
y_pred = []
for i in range(ntest):
    pred_result = 0
    y = f_logistic(X_test[i],w,b)
    if (y > 0.5):
        pred_result = 1
    y_pred.append(pred_result)
y_pred = np.array(y_pred)
y_test, y_pred
Final Weights(w): [0.77691792, 0.33937085, -0.13957614, 0.57400966, 0.86465804, 0.33027304, 0.7857026, 0.78044359]
Final Bias(b): -1.51501363903071
y_test: [0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 0, 0, 1, 0]
y_pred: [0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
Accuracy of the binary logistic regression prediction:
accuracy = 100*(1-np.sum(abs(y_test-y_pred))/ntest)
print("Accuracy: %0.2f%%"%accuracy)
Accuracy: 71.00%
Use Scikit-Learn for Logistic Regression
from sklearn.linear_model import LogisticRegression
logisticRegr = LogisticRegression(penalty=None, max_iter=200)
logisticRegr.fit(X_train, y_train)
y_pred_sklearn = logisticRegr.predict(X_test)
accuracy = 100*(1-np.sum(abs(y_test-y_pred_sklearn))/ntest)
print("Accuracy: %0.2f%%"%accuracy)
Accuracy: 79.22%
Neural Network (3 layers)
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.losses import BinaryCrossentropy
model = Sequential([
    Dense(units=35, activation='relu'),
    Dense(units=10, activation='relu'),
    Dense(units=1, activation='linear')
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001) ,loss=BinaryCrossentropy(from_logits=True))
model.fit(X_train, y_train, epochs=200)
Accuracy of the Neural Network prediction:
model.summary()
#y_pred_continuous = model.predict(X_test)
logit = model(X_test)
y_pred_continuous = tf.nn.sigmoid(logit)
y_pred_tf = np.where(y_pred_continuous > 0.5 , 1, 0).T[0]
accuracy = 100*(1-np.sum(abs(y_test-y_pred_tf))/ntest)
print("Accuracy: %0.2f%%"%accuracy)
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 dense (Dense)               (None, 35)                315
 dense_1 (Dense)             (None, 10)                360
 dense_2 (Dense)             (None, 1)                 11
=================================================================
Total params: 686
Trainable params: 686
Non-trainable params: 0
_________________________________________________________________
Accuracy: 80.95%
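Since only about 35% of the samples are positive, accuracy alone can hide how each class is handled. A minimal sketch comparing predictions with scikit-learn's metrics (assuming the y_test and y_pred_tf arrays built above):
from sklearn.metrics import confusion_matrix, classification_report
# Cast the (min-max scaled, still 0/1) test labels back to integers for the report
y_test_int = y_test.astype(int)
# Rows: true class, columns: predicted class
print(confusion_matrix(y_test_int, y_pred_tf))
# Per-class precision, recall and F1 score
print(classification_report(y_test_int, y_pred_tf, digits=3))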
Feature Selection (chi2):
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
bestfeatures = SelectKBest(score_func=chi2, k=4)
fit = bestfeatures.fit(X_train, y_train)
print("Fit Scores:", fit.scores_)
top4_features = np.argpartition(fit.scores_, -4)[-4:]
print("Top 4 features:", top4_features)
X_train_top4 = fit.transform(X_train)
X_test_top4 = fit.transform(X_test)
Fit Scores: [4.53678678 4.54641459 0.04086338 0.28601954 2.00041986 1.61708541 1.99254061 4.31147926]
Top 4 features: [4 7 1 0]
The most important features identified by the SelectKBest (chi2) analysis are Insulin, Age, Glucose, and Pregnancies.
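A small sketch to map the selected column indices back to feature names directly (assuming all_data still holds the named columns):
# Translate the SelectKBest indices into human-readable column names
feature_names = all_data.columns[:n_features]
print([feature_names[i] for i in top4_features])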
Logistic Regression with the most important 4 features
logisticRegr = LogisticRegression(penalty=None, max_iter=200)
logisticRegr.fit(X_train_top4, y_train)
y_pred_sklearn = logisticRegr.predict(X_test_top4)
accuracy = 100*(1-np.sum(abs(y_test-y_pred_sklearn))/ntest)
print("Accuracy: %0.2f%%"%accuracy)
Accuracy: 77.92%
Neural Network (3 layers) with the most important 4 features
model = Sequential([
    Dense(units=15, activation='relu'),
    Dense(units=8, activation='relu'),
    Dense(units=1, activation='linear')
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001), loss=BinaryCrossentropy(from_logits=True))
model.fit(X_train_top4, y_train, epochs=200)
model.summary()
logit = model(X_test_top4)
y_pred_continuous = tf.nn.sigmoid(logit)
y_pred_tf = np.where(y_pred_continuous > 0.5 , 1, 0).T[0]
accuracy = 100*(1-np.sum(abs(y_test-y_pred_tf))/ntest)
print("Accuracy: %0.2f%%"%accuracy)
Model: "sequential_1"
_________________________________________________________________
 Layer (type)                Output Shape              Param #
=================================================================
 dense_3 (Dense)             (None, 15)                135
 dense_4 (Dense)             (None, 8)                 128
 dense_5 (Dense)             (None, 1)                 9
=================================================================
Total params: 272
Trainable params: 272
Non-trainable params: 0
_________________________________________________________________
Accuracy: 80.52%
Conclusions of the ML study:
(a) Simple logistic regression performs nearly as well as the neural network in terms of accuracy.
(b) Using only the best 4 features (Insulin, Age, Glucose, Pregnancies) gives comparable accuracy while reducing computation time, and points to a rational set of risk factors for diabetes.