Naive Bayes Multinomial Event Model

For this experiment, we obtain a set of spam emails, and a set of genuine newsgroup message. Using only the subject line and body of each message, we'll learn to dustinguish between the spam and non-spam.

(a)

Implement a Naive Bayes classifier for spam classification, using the multinomial event model and Laplace Smoothing.

If you're not familiar with Naive Bayes or Multinomial Event Model, you may want to check this first.

Remeber that in multinomial event model, we take how many times a word appears in consideration, since intuitively have an effect on the email category.

The parameters are

Prior distribution

φ_{y=1}= \frac {\sum_{i=1}^m1\{y^{(i)}=1\}+1} {m + 2}

φ_{k|y=1} = \frac {\sum_{i=1}^m [1\{y^{(i)}=1\}\sum_{j=1}^{n_i}1\{x^{(i)}_j=k\}] + 1 } {\sum_{i=1}^m 1\{y^{(i)}=1\}n_i + |V|}

In nb.py some functions are already provided.

import numpy as np

def readMatrix(file):
    fd = open(file, 'r')
    hdr = fd.readline()
    rows, cols = [int(s) for s in fd.readline().strip().split()]
    tokens = fd.readline().strip().split()
    matrix = np.zeros((rows, cols))
    Y = []
    for i, line in enumerate(fd):
        nums = [int(x) for x in line.strip().split()]
        Y.append(nums[0])
        kv = np.array(nums[1:])
        k = np.cumsum(kv[:-1:2])
        v = kv[1::2]
        matrix[i, k] = v
    return matrix, tokens, np.array(Y)

def nb_train(matrix, category):
    state = {}
    N = matrix.shape[1]
    ###################

    ###################
    return state

def nb_test(matrix, state):
    output = np.zeros(matrix.shape[0])
    ###################

    ###################
    return output

def evaluate(output, label):
    error = (output != label).sum() * 1. / len(output)
    print 'Error: %1.4f' % error

def main():
    trainMatrix, tokenlist, trainCategory = readMatrix('MATRIX.TRAIN')
    testMatrix, tokenlist, testCategory = readMatrix('MATRIX.TEST')

    state = nb_train(trainMatrix, trainCategory)
    output = nb_test(testMatrix, state)

    evaluate(output, testCategory)
    return

if __name__ == '__main__':
    main()

First, let's complete the function nb_train. Here, the matrix is a batch of training data, of shape $m*n$ , where $m$ is number of training data, and $n$ is a vocabulary size, $x^{(m)}_j$ $(0<j<n)$ is how many times the $j$ th word appears in the $m$ th data.

Remeber that the prediction formula

y = argmax_y \prod_{j=1}^n P(X_j=x_j|Y=y) P(Y=y)

When $P(X_j=x_j|Y=y)$ are small, it may cause underflow, so we use logarithm.

def nb_train(matrix, category):
    state = {}
    N = matrix.shape[1] # Vocabulary size
    ###################
    
    # prior distribution P(Y=1)
    mat_spam = matrix[category==1,:]
    mat_non  = matrix[category==0,:]
    p_spam = float(len(mat_spam)/float(matrix.shape[0]))
    p_non_spam = 1. - p_spam
    state['phi_y_1'] = p_spam
    state['phi_y_0'] = p_non_spam
    
    # spam, conditional distribution P(X_i|Y=1)
    time_words_spam = np.sum(mat_spam,axis=0)
    num_all_word_spam = np.sum(time_words_spam)
    phi_j_y_1 = (time_words_spam + 1) / float(num_all_word_spam + N)
    state['phi_j_y_1'] = phi_j_y_1
    
    # non_spam, conditional distribution P(X_i|Y=0)
    time_words_non_spam = np.sum(mat_non, axis=0)
    num_all_word_non_spam = np.sum(time_words_non_spam)
    phi_j_y_0 = (time_words_non_spam + 1) / float(num_all_word_non_spam + N)
    state['phi_j_y_0'] = phi_j_y_0
    
    ###################
    return state

And the function nb_test. Since we use the logarithm in the parameters fitting, we change the prediction formula to

y = argmax_y \prod_{j=1}^n P(X_j=x_j|Y=y) P(Y=y)

= argmax_y \sum_{i=1}^n logP(X_j=x_j|Y=y) + logP(Y=y)

def nb_test(matrix, state):
   output = np.zeros(matrix.shape[0])
   ###################
   phi_y_1 = state['phi_y_1']
   phi_y_0 = state['phi_y_0']
   phi_j_y_0 = state['phi_j_y_0']
   phi_j_y_1 = state['phi_j_y_1']

   log_phi_y_zero = np.sum(np.log(phi_j_y_0) * matrix,axis=1) + np.log(phi_y_0)
   log_phi_y_one  = np.sum(np.log(phi_j_y_1) * matrix,axis=1) + np.log(phi_y_1)

   output[log_phi_y_one > log_phi_y_zero] = 1
   ###################
   return output

Runnb.pythe output is

Error: 0.0163