Indian diabetes database modeling using Naive Bayes¶
We code a Naive Bayes classifier from scratch to better understand how the model is trained.
At the end, we build another Naive Bayes classifier through the usual sklearn workflow for comparison.
Pros and cons of Naive Bayes Classifiers¶
Pros:¶
- Computationally fast
- Simple to implement
- Works well with small datasets
- Works well with high dimensions
- Performs well even if the naive assumption (features are conditionally independent given the class) is not perfectly met. In many cases, the approximation is enough to build a good classifier
Cons:¶
- Correlated features should be removed: each one contributes its own likelihood term, so the same evidence is counted twice and its importance is over-inflated.
- If a categorical variable has a category in the test data set that was not observed in the training data set, the model assigns it a zero probability and cannot make a meaningful prediction. This is often known as “Zero Frequency”. To solve this, we can use a smoothing technique. One of the simplest is Laplace (add-one) estimation. Sklearn's count-based Naive Bayes classifiers, such as MultinomialNB and CategoricalNB, apply Laplace smoothing by default (alpha = 1.0); GaussianNB, used later in this notebook, instead adds a small variance term (var_smoothing).
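The zero-frequency problem and its fix can be illustrated with a minimal sketch. The feature values, the category list, and the helper name `category_probabilities` are all hypothetical and only serve the illustration; this is not how sklearn implements smoothing internally.

```python
from collections import Counter

def category_probabilities(observations, categories, alpha=0.0):
    """Estimate P(category) from observed values with additive smoothing.

    alpha = 0 gives the raw relative frequencies;
    alpha = 1 is Laplace (add-one) smoothing.
    """
    counts = Counter(observations)
    total = len(observations) + alpha * len(categories)
    return {c: (counts[c] + alpha) / total for c in categories}

# Hypothetical training values of one categorical feature within one class
train_values = ["low", "low", "medium", "low", "medium"]
# "high" never appears in training, only at prediction time
categories = ["low", "medium", "high"]

raw = category_probabilities(train_values, categories, alpha=0.0)
smoothed = category_probabilities(train_values, categories, alpha=1.0)

print(raw["high"])       # 0.0: a single unseen category zeroes the whole product
print(smoothed["high"])  # 0.125, i.e. (0 + 1) / (5 + 3)
```

With smoothing, the unseen category gets a small but non-zero probability, and the estimates still sum to one.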
#import libraries and basic settings
import matplotlib.pyplot as plt
plt.style.use('classic')
import numpy as np
import pandas as pd
import random
import math
from IPython.display import display
pi = math.pi
Exploratory analysis¶
Check out the data to get a brief understanding of the dataset, including data integrity, size of the dataset, any missing values.
# a forward slash avoids the invalid '\i' escape sequence and works on every OS
full_catalog = pd.read_csv('data/indian-diabetes-database.csv')
print(full_catalog.columns)
print('Size of the catalogue: {}'.format(len(full_catalog)))
print('Is there any missing value?: {}'.format(full_catalog.isnull().any().any()))
full_catalog.head()
No missing values are found in our dataset, which is good. Because the dataset is already clean, little data cleansing is needed.
The Outcome column is our target column. Value 1 stands for a positive diagnosis of diabetes and value 0 for a negative one. Let's see how many people are positive and how many are negative in our database.
# diabetes positive
positive = full_catalog[full_catalog['Outcome'] == 1]
print('The number of patients with diabetes: ', len(positive))
# diabetes negative
negative = full_catalog[full_catalog['Outcome'] == 0]
print('The number of healthy patients: ', len(negative))
Deeper into the data: let's see which features (attributes) best separate the two Outcome classes.
Scatter matrix of all data¶
First, let's look at the scatter matrix of the attributes, one against another. Blue represents negative patients (healthy), and red represents positive patients (with diabetes).
# According to the color map, blue = 0, red = 1
df = pd.DataFrame(full_catalog, columns = full_catalog.columns.drop('Outcome'))
pd.plotting.scatter_matrix(df, c = full_catalog['Outcome'].values, figsize = (15, 15),
marker = 'o', hist_kwds = {'bins': 10, 'color':'green'},
s = 10, alpha = 0.2, cmap = plt.get_cmap('bwr'))
plt.show()
Scatter matrix of diabetes patients (positive)¶
# Scatter of patients with diabetes
df = pd.DataFrame(positive, columns = positive.columns.drop('Outcome'))
pd.plotting.scatter_matrix(df, c = 'red', figsize = (15, 15), marker = 'o',
hist_kwds = {'bins': 10, 'color': 'red'}, s = 10,
alpha = 0.2)
plt.show()
Scatter matrix of healthy patients (negative outcomes)¶
# Scatter of healthy patients
df = pd.DataFrame(negative, columns = negative.columns.drop('Outcome'))
pd.plotting.scatter_matrix(df, c = 'blue', figsize = (15, 15), marker = 'o',
hist_kwds = {'bins': 10, 'color': 'blue'}, s = 10,
alpha = 0.2)
plt.show()
Training set and test set preparation¶
Of course, we could use train_test_split from sklearn. But here, we write the split from scratch.
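Before the full function, the core idea of a random split can be sketched on bare row positions. This is a minimal sketch using only the standard library; the helper name `split_indices` and its optional `seed` argument are assumptions made for illustration and are not part of the notebook's own function.

```python
import random

def split_indices(n_rows, fraction_training, seed=None):
    """Return (training_idx, test_idx): sorted, disjoint row positions."""
    rng = random.Random(seed)  # local generator, seeding is optional
    size_training = round(n_rows * fraction_training)
    size_test = n_rows - size_training
    # draw the test rows without replacement; the rest is the training set
    test_idx = sorted(rng.sample(range(n_rows), size_test))
    test_lookup = set(test_idx)
    training_idx = [i for i in range(n_rows) if i not in test_lookup]
    return training_idx, test_idx

training_idx, test_idx = split_indices(n_rows=20, fraction_training=0.75, seed=0)
print(len(training_idx), len(test_idx))  # 15 5
```

Passing a fixed seed makes a single split reproducible, while leaving it as None gives a different split on every call, which is what repeated evaluation runs need.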
'''
Function create_training_test divides the whole dataset into a training set and a test set.
Parameters:
    dataset: the original dataset to be split
    fraction_training: fraction of the data used for training, a number in [0, 1]
    msg: debug flag. If True, display messages about the current process
Output:
    training_set
    test_set
'''
def create_training_test(dataset, fraction_training, msg):
    # define the sizes of the training and test sets
    size_dataset = len(dataset)
    size_training = round(size_dataset * fraction_training)
    size_test = size_dataset - size_training
    # index labels of the test set, randomly selected without replacement
    test_idx_list = random.sample(list(dataset.index.values), size_test)
    test_idx_list.sort()
    # select and drop rows by label rather than by position, so the split
    # also works when the index is not the default 0..n-1 range
    training_set = dataset.drop(index=test_idx_list)
    test_set = dataset.loc[test_idx_list]
    if msg:
        training_positive = training_set[training_set['Outcome'] == 1]
        training_negative = training_set[training_set['Outcome'] == 0]
        print("Size of the dataset : {}".format(size_dataset))
        print("Size of the training set : {} samples ({} of the whole dataset)".format(
            len(training_set), fraction_training))
        print("\tPositive cases in the training set: {}".format(len(training_positive)))
        print("\tNegative cases in the training set: {}".format(len(training_negative)))
        print("Size of the test set : {}".format(len(test_set)))
    return training_set, test_set
'''
Function get_parameters creates a dictionary that contains the mean and standard deviation of each column in dataset
Input:
    dataset: input data frame
    msg: debug flag
Output:
    dict_parameters: a dictionary that contains the mean and standard deviation of each attribute in dataset
'''
def get_parameters(dataset, msg):
    features = dataset.columns.values
    nbins = 10
    dict_parameters = {}
    # exclude the 'Outcome' column (assumed to be the last one) from the loop
    for i in range(0, len(features)-1):
        # single out the column features[i] from dataset
        aux_df = pd.DataFrame(dataset[features[i]])
        # partition the values into nbins equal-width bins;
        # the extra column 'bin' records which bin each instance belongs to
        aux_df['bin'] = pd.cut(aux_df[features[i]], nbins)
        # 'counts' is a series whose index is the bin interval and whose values
        # are the number of instances in each bin
        counts = aux_df['bin'].value_counts()
        points_X = np.zeros(nbins)
        points_Y = np.zeros(nbins)
        for j in range(0, nbins):
            points_X[j] = counts.index[j].mid  # the mid point of each bin
            points_Y[j] = counts.iloc[j]       # the number of counts
        total_Y = np.sum(points_Y)
        # compute the mean and std from the binned data and store them in
        # dict_parameters, whose keys are column names and values are (mu, sigma) tuples
        mu = np.sum(points_X * points_Y)/total_Y
        sigma2 = np.sum((points_X - mu)**2*points_Y)/(total_Y - 1)
        sigma = math.sqrt(sigma2)
        dict_parameters[features[i]] = (mu, sigma)
        if msg:
            print('\t\tfeature: {}, mean: {}, standard deviation: {}'.format(
                features[i], mu, sigma))
    return dict_parameters
'''
Function likelihood calculates the probability density of each feature value of an instance
Input:
    instance: series, one row of our dataset; its index holds the column names of the dataset
    dictionary: per-feature (mu, sigma) tuples, as returned by get_parameters
Output:
    dict_likelihood: dictionary that contains the probability density of each feature
'''
def likelihood(instance, dictionary):
    instance = instance[instance.index != 'Outcome']
    dict_likelihood = {}
    for feature in instance.index:
        mu = dictionary[feature][0]
        sigma = dictionary[feature][1]
        measurement = instance[feature]
        if feature in ['Pregnancies', 'Insulin', 'DiabetesPedigreeFunction', 'Age']:
            # We use an exponential distribution for
            # columns ['Pregnancies', 'Insulin', 'DiabetesPedigreeFunction', 'Age']
            dict_likelihood[feature] = 1./mu*math.exp(-measurement/mu)
        elif feature in ['Glucose', 'BloodPressure', 'SkinThickness', 'BMI']:
            # We use a Gaussian distribution for
            # columns ['Glucose', 'BloodPressure', 'SkinThickness', 'BMI']
            dict_likelihood[feature] = (1./(math.sqrt(2*pi)*sigma)
                                        * math.exp(-(measurement - mu)**2/(2.*sigma**2)))
    return dict_likelihood
'''
Function bayes classifies the input instance according to Bayes' theorem
Input:
    lkh_positive: for each feature, P(feature|outcome == 1)
    lkh_negative: for each feature, P(feature|outcome == 0)
    prob_positive: the prior probability of a patient having diabetes
Output: prediction, 1-positive, 0-negative
'''
def bayes(lkh_positive, lkh_negative, prob_positive):
    # sum logarithms instead of multiplying the densities directly
    logPositive = 0
    logNegative = 0
    for feature in lkh_positive:
        logPositive += math.log(lkh_positive[feature])
        logNegative += math.log(lkh_negative[feature])
    logPositive = logPositive + math.log(prob_positive)
    logNegative = logNegative + math.log(1. - prob_positive)
    if logPositive > logNegative:
        return 1
    else:
        return 0
# function to run the test
def pima_diabetes_NBClassifier(training_fraction, msg):
    dataset = pd.read_csv('data/indian-diabetes-database.csv')
    training, test = create_training_test(dataset, training_fraction, msg)
    # split the training set into training_positive and training_negative according to Outcome
    training_positive = training[training['Outcome'] == 1]
    training_negative = training[training['Outcome'] == 0]
    # prior probability of a positive case, estimated from the training set
    prob_positive = len(training_positive)/(len(training))
    if msg:
        print('Getting the parameters for the training set...')
        print('\tPositive cases subsample')
    param_positive = get_parameters(training_positive, msg)
    if msg:
        print('\tNegative cases subsample')
    param_negative = get_parameters(training_negative, msg)
    if msg:
        print('\tProbability of finding a positive case: {}'.format(prob_positive))
        print('Analyzing the test set...')
    # Here we compute the error rate of the classifier by looping over the instances of the test set
    error_count = 0
    for idx in test.index.values:
        instance = test.loc[idx]
        likelihood_positive = likelihood(instance, param_positive)
        likelihood_negative = likelihood(instance, param_negative)
        prediction = bayes(likelihood_positive, likelihood_negative, prob_positive)
        answer = int(instance['Outcome'])
        if prediction != answer:
            error_count += 1
    error_rate = float(error_count)/len(test)
    if msg:
        print('Results for this implementation:')
        print('\tError rate : {}'.format(error_rate))
        print('\tSuccessful classification rate : {}'.format(1.-error_rate))
    return error_rate
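Read from the likelihood and bayes functions above, the decision rule of this classifier can be written compactly. The notation is introduced here for illustration: $x_i$ is the value of feature $i$, $c \in \{0, 1\}$ is the Outcome class, and $\mu_{i,c}$, $\sigma_{i,c}$ are the per-class parameters returned by get_parameters.

```latex
% Predict positive when the log-posterior score of class 1 exceeds that of class 0
\hat{y} =
\begin{cases}
1 & \text{if } \log P(c=1) + \sum_i \log p(x_i \mid c=1)
      > \log P(c=0) + \sum_i \log p(x_i \mid c=0) \\
0 & \text{otherwise}
\end{cases}

% Gaussian density, used for Glucose, BloodPressure, SkinThickness, BMI
p(x_i \mid c) = \frac{1}{\sqrt{2\pi}\,\sigma_{i,c}}
                \exp\!\left(-\frac{(x_i - \mu_{i,c})^2}{2\sigma_{i,c}^2}\right)

% Exponential density, used for Pregnancies, Insulin, DiabetesPedigreeFunction, Age
p(x_i \mid c) = \frac{1}{\mu_{i,c}} \exp\!\left(-\frac{x_i}{\mu_{i,c}}\right)
```

The prior $P(c=1)$ corresponds to prob_positive, the fraction of positive cases in the training set, and $P(c=0) = 1 - P(c=1)$.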
Single implementation¶
training_fraction = 0.75
msg = True
pima_diabetes_NBClassifier(training_fraction, msg)
Multiple implementations¶
Let's run the NBClassifier multiple times to get the mean and standard deviation of the error rate and the success rate.
training_fraction = 0.75
nrealizations = 500
msg = False
error_rate = np.zeros(nrealizations)
success_rate = np.zeros(nrealizations)
for i in range(0, nrealizations):
    aux = pima_diabetes_NBClassifier(training_fraction, msg)
    error_rate[i] = aux
    # store one value per realization so that mean and std are meaningful
    success_rate[i] = 1. - aux
print('Results after {} realizations and training the classifier with {} of the whole sample...'.format(
    nrealizations, training_fraction))
print('error rate mean: {}, std: {}'.format(np.mean(error_rate),
                                            np.std(error_rate)))
print('success rate mean: {}, std: {}'.format(np.mean(success_rate),
                                              np.std(success_rate)))
Naive Bayes using sklearn: a comparison¶
Now let's look at the usual workflow with the sklearn library.
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
# import the dataset (a forward slash avoids the invalid '\i' escape sequence)
data = pd.read_csv('data/indian-diabetes-database.csv')
data.dropna(axis = 0, how = 'any', inplace = True)
X = data[['Pregnancies',
'Glucose',
'BloodPressure',
'SkinThickness',
'Insulin',
'BMI',
'DiabetesPedigreeFunction',
'Age']]
y = data['Outcome']
#split dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size = 0.25, random_state = 42)
# initialize the classifier; we use GaussianNB
gnb = GaussianNB()
# train the classifier using training set
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
# print results
print("Number of misclassified samples out of a total {} samples: {}, performance {:05.2f}%"
.format(X_test.shape[0],
(y_test != y_pred).sum(),
100*(1-(y_test != y_pred).sum()/X_test.shape[0])))
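Because the two Outcome classes are not equally frequent, a single accuracy figure can hide how the errors are distributed. The following minimal sketch counts the four confusion-matrix cells by hand. The label lists are hypothetical and the helper name `confusion_counts` is an assumption for illustration; in the notebook, y_test and y_pred would take their place.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 means positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn

# Hypothetical true labels and predictions, for illustration only
y_true_demo = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred_demo = [1, 0, 0, 0, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_counts(y_true_demo, y_pred_demo)
accuracy = (tp + tn) / len(y_true_demo)
recall = tp / (tp + fn)      # share of diabetic patients that were detected
precision = tp / (tp + fp)   # share of positive predictions that were correct

print(tp, fp, fn, tn)  # 2 1 1 4
print(accuracy)        # 0.75
```

In this toy example the accuracy is 0.75, yet one of the three positive cases is missed, which recall makes visible.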