Naive Bayes Multinomial Event Model
For this experiment, we obtain a set of spam emails and a set of genuine newsgroup messages. Using only the subject line and body of each message, we'll learn to distinguish between spam and non-spam.
(a)
Implement a Naive Bayes classifier for spam classification, using the multinomial event model and Laplace smoothing.
If you're not familiar with Naive Bayes or the multinomial event model, you may want to check this first.
Remember that in the multinomial event model we take into account how many times each word appears, since intuitively the frequency of a word has an effect on the category of the email.
The parameters are

$$\phi_{j|y=1} = P(x = j \mid y = 1) = \frac{1 + \sum_{i=1}^{m} 1\{y^{(i)} = 1\}\, x_j^{(i)}}{N + \sum_{i=1}^{m} 1\{y^{(i)} = 1\}\, n^{(i)}}$$

$$\phi_{j|y=0} = P(x = j \mid y = 0) = \frac{1 + \sum_{i=1}^{m} 1\{y^{(i)} = 0\}\, x_j^{(i)}}{N + \sum_{i=1}^{m} 1\{y^{(i)} = 0\}\, n^{(i)}}$$

and the prior distribution

$$\phi_y = P(y = 1) = \frac{\sum_{i=1}^{m} 1\{y^{(i)} = 1\}}{m},$$

where $x_j^{(i)}$ is the number of times word $j$ appears in the $i$-th message, $n^{(i)} = \sum_j x_j^{(i)}$ is the length of that message, $N$ is the vocabulary size, and the $+1$ and $+N$ terms come from Laplace smoothing.
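As a quick worked example (toy numbers, not from the dataset): with a vocabulary $\{a, b, c\}$ ($N = 3$) and a single spam message "a a c", Laplace smoothing gives

$$\phi_{a|y=1} = \frac{2+1}{3+3} = \frac{1}{2}, \qquad \phi_{b|y=1} = \frac{0+1}{3+3} = \frac{1}{6}, \qquad \phi_{c|y=1} = \frac{1+1}{3+3} = \frac{1}{3},$$

so no word gets probability zero, even though $b$ never appears in spam.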
Some functions are already provided in nb.py.
import numpy as np

def readMatrix(file):
    fd = open(file, 'r')
    hdr = fd.readline()  # header line, unused
    rows, cols = [int(s) for s in fd.readline().strip().split()]
    tokens = fd.readline().strip().split()
    matrix = np.zeros((rows, cols))
    Y = []
    for i, line in enumerate(fd):
        nums = [int(x) for x in line.strip().split()]
        Y.append(nums[0])  # first number on each line is the label
        kv = np.array(nums[1:])
        # The rest of the line is (offset, count) pairs; the cumulative
        # sum of the offsets gives the column index of each count.
        k = np.cumsum(kv[:-1:2])
        v = kv[1::2]
        matrix[i, k] = v
    return matrix, tokens, np.array(Y)
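From the code above we can infer the on-disk format (an inference from readMatrix, not a documented spec): a header line, a line with the row and column counts, a line of tokens, then one line per message consisting of the label followed by (offset, count) pairs, where the running sum of offsets gives the column index. A minimal sketch with a made-up file, assuming readMatrix as defined above:

import numpy as np

# Hypothetical 2x4 matrix file, written in the format readMatrix appears
# to expect; the filename and contents are invented for illustration.
toy_file = 'TOY.MATRIX'
with open(toy_file, 'w') as f:
    f.write('toy header\n')
    f.write('2 4\n')
    f.write('alpha beta gamma delta\n')
    f.write('1 0 2 3 1\n')  # label 1; column 0 -> 2, column 0+3=3 -> 1
    f.write('0 1 5\n')      # label 0; column 1 -> 5

matrix, tokens, Y = readMatrix(toy_file)
print(matrix)  # [[2. 0. 0. 1.], [0. 5. 0. 0.]]
print(Y)       # [1 0]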
def nb_train(matrix, category):
    state = {}
    N = matrix.shape[1]
    ###################
    ###################
    return state
def nb_test(matrix, state):
    output = np.zeros(matrix.shape[0])
    ###################
    ###################
    return output
def evaluate(output, label):
    error = (output != label).sum() * 1. / len(output)
    print('Error: %1.4f' % error)
def main():
    trainMatrix, tokenlist, trainCategory = readMatrix('MATRIX.TRAIN')
    testMatrix, tokenlist, testCategory = readMatrix('MATRIX.TEST')
    state = nb_train(trainMatrix, trainCategory)
    output = nb_test(testMatrix, state)
    evaluate(output, testCategory)

if __name__ == '__main__':
    main()
First, let's complete the function nb_train. Here, matrix is a batch of training data of shape $(m, N)$, where $m$ is the number of training examples, $N$ is the vocabulary size, and entry $x_j^{(i)}$ is how many times the $j$-th word appears in the $i$-th message.
Remember the prediction formula

$$p(y = 1 \mid x) = \frac{\phi_y \prod_{j=1}^{N} \phi_{j|y=1}^{x_j}}{\phi_y \prod_{j=1}^{N} \phi_{j|y=1}^{x_j} + (1 - \phi_y) \prod_{j=1}^{N} \phi_{j|y=0}^{x_j}}.$$

When the $\phi_{j|y}$ are small, the products can underflow, so we use logarithms.
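Here is a tiny sketch of the issue (illustrative numbers only): multiplying a couple of thousand per-word probabilities of 0.01 underflows float64 to exactly 0.0, while the corresponding sum of logs is perfectly representable.

import numpy as np

p = np.full(2000, 0.01)   # 2000 word probabilities of 0.01 each
print(np.prod(p))         # 0.0 -- the product underflows float64
print(np.sum(np.log(p)))  # about -9210.34 -- the log-sum is fine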
def nb_train(matrix, category):
    state = {}
    N = matrix.shape[1]  # vocabulary size
    ###################
    # prior distribution P(y = 1)
    mat_spam = matrix[category == 1, :]
    mat_non = matrix[category == 0, :]
    p_spam = mat_spam.shape[0] / float(matrix.shape[0])
    p_non_spam = 1. - p_spam
    state['phi_y_1'] = p_spam
    state['phi_y_0'] = p_non_spam
    # spam: conditional distribution P(x = j | y = 1), Laplace smoothed
    time_words_spam = np.sum(mat_spam, axis=0)   # per-word counts over all spam
    num_all_word_spam = np.sum(time_words_spam)  # total word count in all spam
    phi_j_y_1 = (time_words_spam + 1) / float(num_all_word_spam + N)
    state['phi_j_y_1'] = phi_j_y_1
    # non-spam: conditional distribution P(x = j | y = 0), Laplace smoothed
    time_words_non_spam = np.sum(mat_non, axis=0)
    num_all_word_non_spam = np.sum(time_words_non_spam)
    phi_j_y_0 = (time_words_non_spam + 1) / float(num_all_word_non_spam + N)
    state['phi_j_y_0'] = phi_j_y_0
    ###################
    return state
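As a quick sanity check (toy data made up for illustration), the fitted conditionals should each sum to 1 and the prior should match the label frequencies:

import numpy as np

toy_matrix = np.array([[2., 0., 1.],   # spam
                       [0., 3., 0.],   # non-spam
                       [1., 1., 1.]])  # spam
toy_labels = np.array([1, 0, 1])
toy_state = nb_train(toy_matrix, toy_labels)
print(toy_state['phi_y_1'])          # 2/3 -- fraction of spam examples
print(toy_state['phi_j_y_1'].sum())  # 1.0 -- a proper distribution
print(toy_state['phi_j_y_0'].sum())  # 1.0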
And the function nb_test. Since products of many small probabilities underflow, we compare the log-posteriors instead, predicting $y = 1$ whenever

$$\log \phi_y + \sum_{j=1}^{N} x_j \log \phi_{j|y=1} > \log(1 - \phi_y) + \sum_{j=1}^{N} x_j \log \phi_{j|y=0}.$$
def nb_test(matrix, state):
    output = np.zeros(matrix.shape[0])
    ###################
    phi_y_1 = state['phi_y_1']
    phi_y_0 = state['phi_y_0']
    phi_j_y_0 = state['phi_j_y_0']
    phi_j_y_1 = state['phi_j_y_1']
    # log-posterior (up to a shared constant) for each class, one per message
    log_phi_y_zero = np.sum(np.log(phi_j_y_0) * matrix, axis=1) + np.log(phi_y_0)
    log_phi_y_one = np.sum(np.log(phi_j_y_1) * matrix, axis=1) + np.log(phi_y_1)
    # predict spam wherever the spam log-posterior is larger
    output[log_phi_y_one > log_phi_y_zero] = 1
    ###################
    return output
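On the same toy data defined in the sketch above, nb_test recovers the training labels (a minimal end-to-end check, not a substitute for the real test set):

toy_output = nb_test(toy_matrix, toy_state)
print(toy_output)                 # [1. 0. 1.] -- matches toy_labels
evaluate(toy_output, toy_labels)  # Error: 0.0000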
Run nb.py, and the output is
Error: 0.0163