Naive Bayes

Basic Principle

Naive Bayes learns the prior distribution $P(Y)$ and the conditional distribution $P(X|Y)$ from the training dataset, and thereby learns the joint distribution $P(X,Y)$.


Strategy

Suppose we have a training dataset

$$T = \{(x^{(i)}, y^{(i)})\}_{i=1}^m$$

Naive Bayes assumes all the $x_i$'s are independent (more precisely, conditionally independent given the label $Y$, which is how the assumption is used below); formally,

$$P(X=x) = P(X_1=x_1, X_2=x_2, \dots) = \prod_{i=1}^n P(X_i=x_i)$$

and that's why it's called "Naive" Bayes. In practice the assumption is often unrealistic. For example, in spam email classification where $X_i \in \{0,1\}$ indicates whether a word appears in the email, $P(X_{Naive})$ and $P(X_{Bayes})$ may be correlated to some degree: if the word "Naive" appears, it's more likely that the word "Bayes" appears too, so $P(X_{Naive}=1, X_{Bayes}=1)$ may be larger than $P(X_{Naive}=1) \cdot P(X_{Bayes}=1)$.
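To make this concrete with hypothetical numbers: if $P(X_{Naive}=1) = 0.1$ and $P(X_{Bayes}=1) = 0.1$, but "Bayes" appears in 80% of the emails that contain "Naive", then $P(X_{Naive}=1, X_{Bayes}=1) = 0.1 \times 0.8 = 0.08$, far larger than the $0.1 \times 0.1 = 0.01$ that independence would predict.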

Although the assumption rarely holds exactly, the Naive Bayes algorithm is remarkably effective in practice.

Naive Bayes learns the conditional distribution $P(X|Y)$, intuitively, what an example looks like given its label, i.e. the mechanism by which a sample is generated, so it is a generative model.

Naive Bayes learns $P(Y)$ and $P(X|Y)$ directly from the training data, and at prediction time we compute the posterior distribution using Bayes' theorem.

$$
\begin{aligned}
P(Y=y|X=x) &= \frac{P(X=x, Y=y)}{P(X=x)} \\
&= \frac{P(X=x|Y=y)\,P(Y=y)}{\sum_{y'} P(X=x|Y=y')\,P(Y=y')} \\
&= \frac{\prod_{i=1}^n P(X_i=x_i|Y=y)\,P(Y=y)}{\sum_{y'} \prod_{i=1}^n P(X_i=x_i|Y=y')\,P(Y=y')}
\end{aligned}
$$

When doing prediction, the strategy is

$$
\begin{aligned}
y &= \arg\max_y P(Y=y|X=x) \\
&= \arg\max_y \frac{\prod_{i=1}^n P(X_i=x_i|Y=y)\,P(Y=y)}{\sum_{y'} \prod_{i=1}^n P(X_i=x_i|Y=y')\,P(Y=y')} \\
&= \arg\max_y \prod_{i=1}^n P(X_i=x_i|Y=y)\,P(Y=y)
\end{aligned}
$$

And the model training can be expressed on the training set as

$$
\begin{aligned}
&\arg\max P(Y=y|X=x) \\
&= \arg\max \frac{\prod_{i=1}^n P(X_i=x_i|Y=y)\,P(Y=y)}{\sum_{y'} \prod_{i=1}^n P(X_i=x_i|Y=y')\,P(Y=y')} \\
&= \arg\max \prod_{i=1}^n P(X_i=x_i|Y=y)\,P(Y=y) \\
&= \arg\max \prod_{i=1}^n P(X_i=x_i, Y=y)
\end{aligned}
$$

This leads to the joint likelihood form. On the whole training data, the joint likelihood is

$$
\begin{aligned}
l &= \prod_{i=1}^m P(x^{(i)}, y^{(i)}) \\
&= \prod_{i=1}^m P(x^{(i)}|y^{(i)})\,P(y^{(i)})
\end{aligned}
$$

Parameters (Maximum Likelihood Estimation)

To make predictions, we need parameter estimates that maximize the joint likelihood. Assume we have a training set of size $m$, that the label $y$ takes one of $k$ values, and that each feature $x$ takes one of $a$ possible values.

$$
φ_{y=k} = P(y=k) = \frac{\sum_{i=1}^m 1\{y^{(i)}=k\} + 1}{m + k}
$$
$$
φ_{jl|y=k} = P(x_j^{(i)}=l \,|\, y=k) = \frac{\sum_{i=1}^m 1\{x_j^{(i)}=l,\, y^{(i)}=k\} + 1}{\sum_{i=1}^m 1\{y^{(i)}=k\} + a}
$$
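As a minimal sketch of how these counting formulas translate to code (the toy dataset, variable names, and array layout below are made up for illustration and are separate from the implementation later in this post):

import numpy as np

# toy categorical dataset (hypothetical): 6 samples, 2 features, labels in {0, 1}
X = np.array([[0, 2],
              [1, 0],
              [0, 1],
              [2, 2],
              [1, 1],
              [0, 2]])
y = np.array([1, 0, 1, 0, 0, 1])

m, n = X.shape             # m = 6 samples, n = 2 features
k = 2                      # number of label values
a = 3                      # each feature takes values in {0, 1, 2}

# φ_{y=c} = (count(y=c) + 1) / (m + k)
phi_y = np.array([(np.sum(y == c) + 1) / (m + k) for c in range(k)])

# φ_{jl|y=c} = (count(x_j=l, y=c) + 1) / (count(y=c) + a)
phi_x = np.zeros((k, n, a))
for c in range(k):
    Xc = X[y == c]
    for j in range(n):
        for l in range(a):
            phi_x[c, j, l] = (np.sum(Xc[:, j] == l) + 1) / (len(Xc) + a)

print(phi_y)               # smoothed priors, sum to 1
print(phi_x[1, 0])         # P(x_1 = l | y = 1) for l = 0, 1, 2; each row sums to 1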

The log joint likelihood is

$$
\begin{aligned}
l(φ_{y=k}, φ_{jl|y=k}) &= \log \prod_{i=1}^m P(x^{(i)}, y^{(i)};\, φ_{y=k}, φ_{jl|y=k}) \\
&= \log \prod_{i=1}^m P(x^{(i)}|y^{(i)};\, φ_{jl|y=k})\,P(y^{(i)};\, φ_{y=k})
\end{aligned}
$$

And the model training is

$$
\arg\max_{φ_{y=k},\, φ_{jl|y=k}} \log \prod_{i=1}^m P(x^{(i)}|y^{(i)};\, φ_{jl|y=k})\,P(y^{(i)};\, φ_{y=k})
$$

It should be fairly straightforward to see why these estimates maximize the likelihood; for a formal proof, check here.

Maybe you're wondering what the $+k$, $+a$, and $+1$ terms are for. Let's find out.

Laplace Smoothing

When doing prediction,

$$
\begin{aligned}
P(y=k|x) &= \frac{P(y=k, x)}{P(x)} \\
&= \frac{P(x|y=k)\,P(y=k)}{P(x)} \\
&= \frac{P(y=k)\prod_{i=1}^n P(x_i|y=k)}{\sum_{j=1}^k P(y=j)\prod_{i=1}^n P(x_i|y=j)}
\end{aligned}
$$

If for some $x_i$ we have $P(x_i|y=k)=0$, then both the numerator and denominator are $0$ and the posterior is undefined. That's a trap, so we need to smooth the estimates. Take $φ_{y=k}$ as an example: it's unfair to conclude that $P(y=k)=0$, i.e. that $y$ can never take the label $k$, just because we haven't seen $y=k$ in our finite training sample. Laplace smoothing adds $1$ to every count in the numerator and the number of possible values ($k$ for the label, $a$ for a feature) to the denominator. Notice that it still holds that

$$\sum_{j=1}^k P(y=j) = 1$$

which is a desired property.
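For a concrete (hypothetical) check: with $m=10$ samples, $k=3$ classes, and label counts $7, 3, 0$, the smoothed estimates are $\frac{7+1}{10+3}=\frac{8}{13}$, $\frac{3+1}{10+3}=\frac{4}{13}$, and $\frac{0+1}{10+3}=\frac{1}{13}$: no class is assigned probability zero, and the three estimates still sum to $1$.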


Algorithm

Suppose we have a training set

$$T = \{(x^{(i)}, y^{(i)})\}_{i=1}^m$$

where

$$x^{(i)} = (x^{(i)}_1, x^{(i)}_2, \dots, x^{(i)}_n)$$

and $x^{(i)}_j$ has $a_j$ possible values.

(1) Compute the prior distribution $P(Y)$ and the conditional distribution $P(X|Y)$:

$$
φ_{y=k} = P(y=k) = \frac{\sum_{i=1}^m 1\{y^{(i)}=k\} + 1}{m + k}
$$
$$
φ_{jl|y=k} = P(x^{(i)}_j=l \,|\, y=k) = \frac{\sum_{i=1}^m 1\{y^{(i)}=k,\, x^{(i)}_j=l\} + 1}{\sum_{i=1}^m 1\{y^{(i)}=k\} + a_j}
$$

(2) For a given $x=(x_1, x_2, \dots, x_n)$, compute

$$
\begin{aligned}
y &= \arg\max_y P(Y=y)\,P(X=x|Y=y) \\
&= \arg\max_y P(Y=y)\prod_{j=1}^n P(X_j=x_j|Y=y)
\end{aligned}
$$
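Continuing the hypothetical numpy sketch from the parameter section above (it reuses phi_y, phi_x, n, and k defined there), step (2) is just an argmax over the per-class scores:

x_new = np.array([0, 2])
# score(c) = P(y=c) * prod_j P(x_j = x_new[j] | y=c)
scores = [phi_y[c] * np.prod([phi_x[c, j, x_new[j]] for j in range(n)])
          for c in range(k)]
print(scores, int(np.argmax(scores)))   # the toy sample above is assigned class 1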

Naive Bayes Variation: Multinomial Event Model

Take spam email classification as an example. In the Naive Bayes model above, suppose we consider 50000 words; then each training sample is

$$x^{(i)} = (x^{(i)}_1, x^{(i)}_2, \dots, x^{(i)}_{50000})$$

$x_j$ indicates whether the $j$th word appears or not, but in practice, how often a word appears also matters. The Multinomial Event Model takes this into account.

Now $x$ is

$$x = (4, 140, 65, \dots)$$

where $x_j$ is the index (in the vocabulary) of the $j$th word of the email, $n_i$ is the total number of words in the $i$th sample, and $|V|$ is the number of words we take into consideration, i.e. the vocabulary size.

And the conditional distribution parameter $φ_{k|y=1}$ measures how often the $k$th word of the vocabulary appears across all spam emails.

$$
φ_{k|y=1} = \frac{\sum_{i=1}^m \left[ 1\{y^{(i)}=1\} \sum_{j=1}^{n_i} 1\{x^{(i)}_j=k\} \right] + 1}{\sum_{i=1}^m 1\{y^{(i)}=1\}\, n_i + |V|}
$$

The numerator counts how many times the $k$th word appears across all spam emails, and the $+1$ is for Laplace smoothing.

And the prior distribution is pretty much the same.

$$
φ_{y=1} = \frac{\sum_{i=1}^m 1\{y^{(i)}=1\} + 1}{m + 2}
$$
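Here is a minimal, hypothetical sketch of these multinomial estimates; the token-index lists and variable names are invented for illustration and are separate from the set-of-words implementation below:

import numpy as np

V = 6                                   # hypothetical vocabulary size |V|
# each sample is a list of word indices; labels: 1 = spam, 0 = not spam
docs = [[0, 2, 2, 5], [1, 3], [2, 5, 5], [0, 1, 4]]
labels = np.array([1, 0, 1, 0])

m = len(docs)
spam_docs = [d for d, y in zip(docs, labels) if y == 1]

# φ_{k|y=1}: (count of word k over all spam tokens + 1) / (total spam tokens + |V|)
spam_token_counts = np.ones(V)          # the +1 Laplace term for every word
total_spam_tokens = 0
for d in spam_docs:
    for token in d:
        spam_token_counts[token] += 1
    total_spam_tokens += len(d)
phi_k_spam = spam_token_counts / (total_spam_tokens + V)

# φ_{y=1} = (count(y=1) + 1) / (m + 2)
phi_spam = (np.sum(labels == 1) + 1) / (m + 2)

print(phi_k_spam, phi_k_spam.sum())     # the per-word estimates sum to 1
print(phi_spam)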

Implementation (Naive Bayes)

Suppose we have some postings on a forum, where the trailing number on each line indicates whether the post is abusive. There are only a few words, so you can type the data by hand.

First, we need to preprocess the data.

import numpy as np

def loadDataSet(filename):
    # each line of the file holds the words of a posting followed by its 0/1 label
    postings = []
    labels = []
    for line in open(filename):
        words = line.split()
        posting = words[:-1]
        label = int(words[-1])
        postings.append(posting)
        labels.append(label)
    return postings, np.array(labels)

def createVocabularyList(dataset):
    # collect every distinct word that appears in the postings
    wordsSet = set()
    for data in dataset:
        for word in data:
            wordsSet.add(word)
    return list(wordsSet)

def setOfWordsToVec(vocabularyList, dataset):
    # turn each posting into a 0/1 vector over the vocabulary
    ret = []
    for data in dataset:
        returnVec = [0]*len(vocabularyList)
        for word in data:
            if word in vocabularyList:
                returnVec[vocabularyList.index(word)] = 1
            else:
                print('word "{}" is not in the vocabulary list'.format(word))
        ret.append(returnVec)
    return np.array(ret)

In the function setOfWordsToVec, each sample is mapped to a feature vector of the same length as vocabularyList, where the $i$th entry indicates whether the $i$th word of vocabularyList appears in the sample.

postings, labels = loadDataSet('postings.txt')
vocabularyList = createVocabularyList(postings)
vecs = setOfWordsToVec(vocabularyList, postings)

for vec in vecs:
    for i in range(len(vec)):
        if vec[i] == 1:
            print(vocabularyList[i],end=' ')
    print()

The output is

has flea problems help my dog please 
him maybe park not take dog to stupid 
him is so I cute dalmation love my 
worthless garbage stop posting stupid 
steak him is bosh how mr my eating to stop 

The words may appear in a different order, because we build the vocabulary from a Python set in createVocabularyList (sets are unordered), but these are the same words.

And now we need a function to fit the parameters (with Laplace smoothing) according to the formulas:

$$
φ_{y=k} = P(y=k) = \frac{\sum_{i=1}^m 1\{y^{(i)}=k\} + 1}{m + k}
$$
$$
φ_{jl|y=k} = P(x^{(i)}_j=l \,|\, y=k) = \frac{\sum_{i=1}^m 1\{y^{(i)}=k,\, x^{(i)}_j=l\} + 1}{\sum_{i=1}^m 1\{y^{(i)}=k\} + a}
$$

In this function the inputs are the processed word vectors, whose entries are in a fixed order, the same order as the vocabulary list; otherwise the per-word probabilities would get mixed up.


def trainNaiveBayes(trainSamples, trainLabels):
    num_train = len(trainSamples)
    numwords = len(trainSamples[0])

    # prior distribution φ_{y=1} = P(y=1)
    num_y_1 = len(np.flatnonzero(trainLabels))
    phi_y_1 = float(num_y_1)/num_train
    phi_y_0 = 1. - phi_y_1

    # conditional distributions
    # φ_{j|y=1} = P(x_j=1|y=1)
    # φ_{j|y=0} = P(x_j=1|y=0)
    samples_y_1 = trainSamples[trainLabels==1]
    samples_y_0 = trainSamples[trainLabels==0]

    # store log(phi_j) to avoid numerical underflow when many terms are multiplied;
    # the +1 / +numwords terms are the Laplace smoothing counts
    phi_j_y_1 = np.log(
        (np.sum(samples_y_1,axis=0) + 1)/(len(samples_y_1) + numwords)
    )
    phi_j_y_0 = np.log(
        (np.sum(samples_y_0,axis=0) + 1)/(len(samples_y_0) + numwords)
    )

    return {
            'phi_y_1':phi_y_1,
            'phi_y_0':phi_y_0,
            'phi_j_y_1':phi_j_y_1,
            'phi_j_y_0':phi_j_y_0
            }

Now we need to do prediction with the fitted Naive Bayes model. Remember the formula

$$
\begin{aligned}
P(Y=y|X=x) &= \frac{P(X=x, Y=y)}{P(X=x)} \\
&= \frac{P(X=x|Y=y)\,P(Y=y)}{\sum_{y'} P(X=x|Y=y')\,P(Y=y')} \\
&= \frac{\prod_{i=1}^n P(X_i=x_i|Y=y)\,P(Y=y)}{\sum_{y'} \prod_{i=1}^n P(X_i=x_i|Y=y')\,P(Y=y')}
\end{aligned}
$$

Our strategy when doing prediction is

$$
\begin{aligned}
y &= \arg\max_y P(Y=y|X=x) \\
&= \arg\max_y \frac{\prod_{i=1}^n P(X_i=x_i|Y=y)\,P(Y=y)}{\sum_{y'} \prod_{i=1}^n P(X_i=x_i|Y=y')\,P(Y=y')}
\end{aligned}
$$

The denominator is a sum over $y$ and is the same for every candidate label, so we just need

$$y = \arg\max_y \prod_{i=1}^n P(X_i=x_i|Y=y)\,P(Y=y)$$

Instead of using this formula directly, we compute the log-likelihood, which gives the same answer:

$$
\begin{aligned}
y &= \arg\max_y \log\left( \prod_{i=1}^n P(X_i=x_i|Y=y)\,P(Y=y) \right) \\
&= \arg\max_y \sum_{i=1}^n \log P(X_i=x_i|Y=y) + \log P(Y=y)
\end{aligned}
$$
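Working in log space also avoids numerical underflow: multiplying many small probabilities quickly rounds to zero in floating point, while summing their logs stays well-scaled. A tiny illustration with made-up numbers:

import numpy as np

probs = np.full(300, 0.01)      # 300 hypothetical per-word probabilities
print(np.prod(probs))           # 0.0 -- the product underflows in float64
print(np.sum(np.log(probs)))    # about -1381.6 -- easily representable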

Remember that trainNaiveBayes already stored $\log$ of phi_j_y_1 and phi_j_y_0, so here we use those log terms directly.

def classifyNaiveBayes(vecs,parameters):

    phi_y_1 = parameters['phi_y_1']
    phi_y_0 = parameters['phi_y_0']
    phi_j_y_1 = parameters['phi_j_y_1']
    phi_j_y_0 = parameters['phi_j_y_0']

    ret = []
    for vec in vecs:
        # log-posterior up to the shared denominator:
        # sum_j x_j * log P(x_j=1|y) + log P(y)
        p_y_1 = np.sum(vec*phi_j_y_1) + np.log(phi_y_1)
        p_y_0 = np.sum(vec*phi_j_y_0) + np.log(phi_y_0)
        print('p_y_1:', p_y_1)
        print('p_y_0:', p_y_0)
        print()
        if p_y_1 > p_y_0:
            ret.append(1)
        else:
            ret.append(0)

    return np.array(ret)

Now we test how the Naive Bayes Model performs.

def loadTestingData():
    return [
        ['you','stupid','thing'],
        ['I','love','him'],
        ['he','is','a','stupid','garbage']
    ]

Clearly the first and third sentences are abusive.

postings, labels = loadDataSet('postings.txt')
vocabularyList = createVocabularyList(postings)
wordVecs = setOfWordsToVec(vocabularyList, postings)
state = trainNaiveBayes(wordVecs, labels)


testData = loadTestingData()
testVecs = setOfWordsToVec(vocabularyList, testData)
testLabels = classifyNaiveBayes(testVecs, state)
print(testLabels)

The output is

word "you" is not in the vocabulary list
word "thing" is not in the vocabulary list
word "he" is not in the vocabulary list
word "a" is not in the vocabulary list

p_y_1: -3.251665647691192
p_y_0: -3.9765615265657175

p_y_1: -10.525105164769647
p_y_0: -8.42312668237717

p_y_1: -9.426492876101538
p_y_0: -9.80942104349706

[1 0 1]

You can see that the model is not very confident on the first and third samples (their two log-scores are close); that's because our training data is small, so for some words the model doesn't "know" whether they signal abuse or not.

For the Multinomial Event Model, you may check the CS229 assignment.