Indian diabetes database modeling using Naive Bayes¶

Coding from scratch. To better understand the process of Naive Bayes model training.

And in the end, we make another Naive Bayes Classifier through normal workflow using sklearn.

Pros and cons of Naive Bayes Classifiers¶

Pros:¶

Computationally fast
Simple to implement
Works well with small datasets
Works well with high dimensions
Perform well even if the Naive Assumption is not perfectly met. In many cases, the approximation is enough to build a good classifier

Cons:¶

Require to remove correlated features because they are voted twice in the model and it can lead to over inflating importance.
If a categorical variable has a category in test data set which was not observed in training data set, then the model will assign a zero probability. It will not be able to make a prediction. This is often known as “Zero Frequency”. To solve this, we can use the smoothing technique. One of the simplest smoothing techniques is called Laplace estimation. Sklearn applies Laplace smoothing by default when you train a Naive Bayes classifier.

In [1]:

#import libraries and basic settings

import matplotlib.pyplot as plt
plt.style.use('classic')
import numpy as np
import pandas as pd
import random
import math
from IPython.display import display

pi = math.pi

Exploratory analysis¶

Check out the data to get a brief understanding of the dataset, including data integrity, size of the dataset, any missing values.

In [2]:

full_catalog = pd.read_csv('data\indian-diabetes-database.csv')
print(full_catalog.columns)
print('Size of the catalogue: {}'.format(len(full_catalog)))
print('Is there any missing value?: {}'.format(full_catalog.isnull().any().any()))
full_catalog.head()

Index(['Pregnancies', 'Glucose', 'BloodPressure', 'SkinThickness', 'Insulin',
       'BMI', 'DiabetesPedigreeFunction', 'Age', 'Outcome'],
      dtype='object')
Size of the catalogue: 768
Is there any missing value?: False

Out[2]:

	Pregnancies	Glucose	BloodPressure	SkinThickness	Insulin	BMI	DiabetesPedigreeFunction	Age	Outcome
0	6	148	72	35	0	33.6	0.627	50	1
1	1	85	66	29	0	26.6	0.351	31	0
2	8	183	64	0	0	23.3	0.672	32	1
3	1	89	66	23	94	28.1	0.167	21	0
4	0	137	40	35	168	43.1	2.288	33	1

No missing values are found in our dataset, which is good. Because the dataset is pretty clean we don't need to do too much data cleansing.

The Outcome column is our target column. Value 1 stands positive diagnosis of diabetes and value 0 represents negative. Let's see how many people are positive and/or negative in our database.

In [3]:

# diabetes positive
positive = full_catalog[full_catalog['Outcome'] == 1]
print('The number of patients with diabetes: ', len(positive))

# diabetes negative
negative = full_catalog[full_catalog['Outcome'] == 0]
print('The number of healthy patients: ', len(negative))

The number of patients with diabetes:  268
The number of healthy patients:  500

Deeper into the data: let's see which features (attributes) can better explain our data according to the Outcome.

Scatter matrix of all data¶

First let's look at the scatter matrix of attributes one against another. Blue represents negative patients (healthy), and red represent positive patients (with diabetes).

In [4]:

# According to the color map, blue = 0, red = 1
df = pd.DataFrame(full_catalog, columns = full_catalog.columns.drop('Outcome'))
pd.plotting.scatter_matrix(df, c = full_catalog['Outcome'].values, figsize = (15, 15),
                          marker = 'o', hist_kwds = {'bins': 10, 'color':'green'},
                          s = 10, alpha = 0.2, cmap = plt.get_cmap('bwr'))
plt.show()

Scatter matrix of diabetes patients (positive)¶

In [5]:

# Scatter of patients with diabetes
df = pd.DataFrame(positive, columns = positive.columns.drop('Outcome'))
pd.plotting.scatter_matrix(df, c = 'red', figsize = (15, 15), marker = 'o',
                          hist_kwds = {'bins': 10, 'color': 'red'}, s = 10,
                          alpha = 0.2)
plt.show()

Scatter matrix of heathy patients (negative outcomes)¶

In [6]:

# Scatter of patients with diabetes
df = pd.DataFrame(negative, columns = negative.columns.drop('Outcome'))
pd.plotting.scatter_matrix(df, c = 'blue', figsize = (15, 15), marker = 'o',
                          hist_kwds = {'bins': 10, 'color': 'blue'}, s = 10,
                          alpha = 0.2)
plt.show()

Training set and test set preparation¶

Of course we can use train_test_split from sklearn. But here, we use a scratch function to realize it.

In [7]:

'''
Function create_training_test divide the whole dataset into training set and test set.
Parameters:
    dataset: the original dataset to be split
    fraction_training: fraction of training set, number in [0, 1]
    msg: debug flag. If True, display message of current process
Output:
    training_set
    test_set
'''

def create_training_test(dataset, fraction_training, msg):
    # define size of training and test sets
    size_dataset = len(dataset)
    size_training = round(size_dataset * fraction_training)
    size_test = size_dataset - size_training
    
    # initializing both the training and test set using the whole dataset
    training_set = dataset.copy()
    test_set = dataset.copy()
    
    #index of the dataset dataframe
    total_idx_list = list(dataset.index.values)
    
    #index of the test set. Randomly selected from total_idx_list
    test_idx_list = random.sample(list(dataset.index.values), size_test)
    test_idx_list.sort()
    
    #index of the training set
    training_idx_list = list(set(total_idx_list) - set(test_idx_list))
    
    #drop the corresponding rows from the training and test dataframe
    training_set.drop(training_set.index[test_idx_list], inplace = True)
    test_set.drop(test_set.index[training_idx_list], inplace = True)
    
    if msg == True:
        training_positive = training_set[training_set['Outcome'] == 1]
        training_negative = training_set[training_set['Outcome'] == 0]
        print("Size of the dataset         : {}".format(size_dataset))
        print("Size of the training set    : {} samples ({} of the whole dataset)".format(
        len(training_set), fraction_training))
        print("\tPositive cases in the training set: {}".format(len(training_positive)))
        print("\tNegative cases in the training set: {}".format(len(training_negative)))
        print("Size of the test set        : {}".format(len(test_set)))
        
    return training_set, test_set

In [8]:

'''
Function get_parameters create a dictionary that contain the mean and standard deviation of each column in dataset
Input:
    dataset: input dataset frame
    msg: debug flag
Output:
    dict_parameters:  a dictionary that contain the mean and standard deviation of each attribute in dataset
'''

def get_parameters(dataset, msg):
    features = dataset.columns.values
    nbins = 10
    dict_parameters = {}
    
    # exclude the 'Outcome' column from the loop
    for i in range(0, len(features)-1):
        #we single out the column 'feature[i]' from dataset
        aux_df = pd.DataFrame(dataset[features[i]])
        #Here we make partition into nbins. 
        #aux_df has an extra column indicating to which bin each instance belongs to
        aux_df['bin'] = pd.cut(aux_df[features[i]], nbins)
        
        # 'counts' is a searies whose index is the bin interval and the values are the
        #number of counts in each bin
        counts = pd.value_counts(aux_df['bin'])
        
        points_X = np.zeros(nbins)
        points_Y = np.zeros(nbins)
        
        for j in range(0, nbins):
            points_X[j] = counts.index[j].mid # the mid point of each bin
            points_Y[j] = counts.iloc[j]      # the number of counts
            
        total_Y = np.sum(points_Y)
        # we compute the mean and std and store them in dict_parameters
        # whose keys are column names and values are (mu, sigma) tuple
        mu = np.sum(points_X * points_Y)/total_Y
        sigma2 = np.sum((points_X - mu)**2*points_Y)/(total_Y - 1)
        sigma = math.sqrt(sigma2)
        
        dict_parameters[features[i]] = (mu, sigma)
        
        if msg == True:
            print('\t\tfeature: {}, mean: {}, standard deviation: {}'.format(
            features[i], mu, sigma))
        
    return dict_parameters

In [9]:

'''
Function likelihood calculates the probability density of each column
Input:
    instance: series, can be considered as each row in our dataset, the index of the series is the columns of our dataset
    dictionary
    
Output:
    dict_likelihood: dictionary contains probability density
'''

def likelihood(instance, dictionary):
    
    instance = instance[instance.index != 'Outcome']
    dict_likelihood = {}
    for feature in instance.index:
        mu = dictionary[feature][0]
        sigma = dictionary[feature][1]
        measurement = instance[feature]
        
        if feature in ['Pregnancies', 'Insulin', 'DiabetesPedigreeFunction', 'Age']:
            # We use exponential distribution for 
            # columns ['Pregnancies', 'Insulin', 'DiabetesPedigreeFunction', 'Age']
            dict_likelihood[feature] = 1./mu*math.exp(-measurement/mu)
        elif feature in ['Glucose', 'BloodPressure', 'SkinThickness', 'BMI']:
            # We use Gaussian distribution for columns['Glucose', 'BloodPressure', 'SkinThickness', 'BMI']
            dict_likelihood[feature] = 1./(math.sqrt(2*pi)*sigma)*math.exp(-(measurement - mu)**2/(2.*sigma**2))
            
    return dict_likelihood

In [10]:

'''
Function bayes classify the input instances according to Bayes Theory

Input: 
    lkh_positive: for each feature, P(features|outcome == 1)
    lkh_negative: for each feature, P(features|outcome == 0)
    prob_positive: the probability of patients with diabetes

Output: predictions, 1-positive, 0-negative
'''

def bayes(lkh_positive, lkh_negative, prob_positive):
    logPositive = 0
    logNegative = 0
    
    for feature in lkh_positive:
        logPositive += math.log(lkh_positive[feature])
        logNegative += math.log(lkh_negative[feature])
    
    logPositive = logPositive + math.log(prob_positive)
    logNegative = logNegative + math.log(1. - prob_positive)
    
    if logPositive > logNegative:
        return 1
    else:
        return 0

In [11]:

# funciton to run the test

def pima_diabetes_NBClassifier(training_fraction, msg):
    
    dataset = pd.read_csv('data\indian-diabetes-database.csv')
    
    training, test = create_training_test (dataset, training_fraction, msg)
    
    #split the training set into training_positive and training_negative according to Outcome
    training_positive = training[training['Outcome'] == 1]
    training_negative = training[training['Outcome'] == 0]
    prob_positive = len(training_positive)/(len(training))
    
    if msg ==True:
        print('Getting the parameters for the training set...')
        print('\tPositive cases subsample')
        
    param_positive = get_parameters(training_positive, msg)
    
    if msg ==True:
        print('\tNegative cases subsample')
        
    param_negative = get_parameters(training_negative, msg)
    
    if msg ==True:
        print('\tProbability of finding a positive case: {}'.format(prob_positive))
        print('Analyzing the test set...')
        
    # Here we compute the accuracy of the classifier by looping over the instances of the test set
    error_count = 0
    
    for idx in test.index.values:
        instance = test.loc[idx]
        likelihood_positive = likelihood(instance, param_positive)
        likelihood_negative = likelihood(instance, param_negative)
        
        prediction = bayes(likelihood_positive, likelihood_negative, prob_positive)
        answer = int(instance['Outcome'])
        
        if prediction != answer:
            error_count += 1
            
        
        error_rate = float(error_count)/len(test)
        
    if msg ==True:
        print('Results for this implementation:')
        print('\t Error rate:                           : {}'.format(error_rate))
        print('\tSuccessful classification rate         : {}'.format(1.-error_rate))
            
        
    return error_rate

Single implementation¶

In [12]:

training_fraction = 0.75
msg = True
pima_diabetes_NBClassifier(training_fraction, msg)

Size of the dataset         : 768
Size of the training set    : 576 samples (0.75 of the whole dataset)
	Positive cases in the training set: 203
	Negative cases in the training set: 373
Size of the test set        : 192
Getting the parameters for the training set...
	Positive cases subsample
		feature: Pregnancies, mean: 5.06026354679803, standard deviation: 3.659147466140387
		feature: Glucose, mean: 139.54409359605913, standard deviation: 31.20971649275315
		feature: BloodPressure, mean: 70.25017241379311, standard deviation: 20.63956588198645
		feature: SkinThickness, mean: 23.708549261083746, standard deviation: 16.27891796774379
		feature: Insulin, mean: 109.93596059113301, standard deviation: 116.28083680196309
		feature: BMI, mean: 35.053634975369455, standard deviation: 7.323798557556501
		feature: DiabetesPedigreeFunction, mean: 0.5482635467980296, standard deviation: 0.35793174891647633
		feature: Age, mean: 36.80145320197044, standard deviation: 10.80830792627758
	Negative cases subsample
		feature: Pregnancies, mean: 3.3208900804289545, standard deviation: 2.885346134494473
		feature: Glucose, mean: 110.5148873994638, standard deviation: 25.689628327169185
		feature: BloodPressure, mean: 68.60210455764074, standard deviation: 18.27917311567462
		feature: SkinThickness, mean: 20.48474798927614, standard deviation: 14.190959528879588
		feature: Insulin, mean: 86.43265415549598, standard deviation: 88.69429208023446
		feature: BMI, mean: 30.5620226541555, standard deviation: 7.792696402992349
		feature: DiabetesPedigreeFunction, mean: 0.43494249329758705, standard deviation: 0.3122588634553088
		feature: Age, mean: 31.4157908847185, standard deviation: 11.052104394041816
	Probability of finding a positive case: 0.3524305555555556
Analyzing the test set...
Results for this implementation:
	 Error rate:                           : 0.19791666666666666
	Successful classification rate         : 0.8020833333333334

Out[12]:

0.19791666666666666

Multiple implementations¶

Let's run the NBClassifier multiple times to get an mean and std values of the error rate and successful rate.

In [13]:

training_fraction = 0.75
nrealizations = 500
msg = False

error_rate = np.zeros(nrealizations)
sucess_rate = np.zeros(nrealizations)

for i in range(0, nrealizations):
    aux = pima_diabetes_NBClassifier(training_fraction, msg)
    
    error_rate[i] = aux
    success_rate = 1.-aux

print('Results after {} realizations and training the classifier wiht {} of the wholesamples...'.format(
    nrealizations, training_fraction))
print('error rate mean: {}, std: {}'.format(np.mean(error_rate), 
                                           np.std(error_rate)))
print('success rate mean: {}, std: {}'.format(np.mean(success_rate), 
                                               np.std(success_rate)))

Results after 500 realizations and training the classifier wiht 0.75 of the wholesamples...
error rate mean: 0.232875, std: 0.02677764991036783
success rate mean: 0.734375, std: 0.0

Naive Bayes using sklearn: a comparison¶

Now let's look at the normal workflow using the library of sklearn.

In [14]:

from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB

#importing dataset
data = pd.read_csv('data\indian-diabetes-database.csv')
data.dropna(axis = 0, how = 'any', inplace = True)

X = data[['Pregnancies',
          'Glucose',
          'BloodPressure',
          'SkinThickness',
          'Insulin',
          'BMI',
          'DiabetesPedigreeFunction',
          'Age']]
y = data['Outcome']

#split dataset into training and test sets

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size = 0.25, random_state = 42)

In [15]:

#initailizing the classifier, we use the GaussianNB
gnb = GaussianNB()

# train the classifier using training set
gnb.fit(X_train, y_train)

y_pred = gnb.predict(X_test)

# print results
print("Number of misclassified samples out of a total {} samples: {}, performance {:05.2f}%"
     .format(X_test.shape[0],
            (y_test != y_pred).sum(),
            100*(1-(y_test != y_pred).sum()/X_test.shape[0])))

Number of misclassified samples out of a total 192 samples: 51, performance 73.44%