Perceptron

Perceptron is a linear classification model, its input is a feature vector, and output predicted label of input vector {1,-1}.


Strategy

Supoose the training dataset is linear separatable, the goal of perceptron is to find a hyperplaney=wx+by=w*x+b which can separate positive samples form negative ones. The risk function is sum of distance from each misclassified sample to the hyperplane. Distance from point x0x_0 to hyper plane SS is

1w(wx+b)\frac{1}{||w||}(w*x+b)

w||w|| is the l2 norm of ww.

For each misclassified sample x(i)x^{(i)},

y(i)(wx(i)+b)>0-y^{(i)}(w*x^{(i)}+b)>0

Suppose MM is the set of all misclassified samples, sum of distance from each misclassified sample to the hyperplane is

1wx(i)My(i)(wx(i)+b)-\frac{1}{||w||} \sum_{x^{(i)}∈M} y^{(i)}(w*x^{(i)}+b)

Leave out 1w\frac{1}{||w||}, we get the loss function of perceptron.

More formally, given training dataset

T={(x(i),y(i));i=1,2,...,n}T = \{(x^{(i)},y^{(i)}); i = 1,2,...,n\}

x(i)Rnx^{(i)}∈R^n, y(i)1,+1y^{(i)}∈{-1,+1}, loss function of perceptron sign(wx+b)sign(w*x+b) is

L(w,b)=x(i)My(i)(wx(i)+b)L(w,b) = -\sum_{x^{(i)}∈M} y^{(i)}(w*x^{(i)}+b)

Apparently, the loss function is non-negative, the less misclassfied samples, the closer misclassified samples are to the hyperplane, the smaller loss function is. For a single sample, the loss function is linear to ww and bb, and 00 when correctly classified, so the loss function is continously differentiable.


Algorithm

(Original Algorithm)

The strategy of perceptron is

minw,bL(w,b)=x(i)My(i)(wx(i)+b)min_{w,b}L(w,b) = -\sum_{x^{(i)}∈M} y^{(i)}(w*x^{(i)}+b)

We can use stochastic gradient descent to do minimization. First, randomly choose a hyperplane, and each time choose a misclassified point to do gradient descent.

wL(w,b)=x(i)My(i)x(i)∇_wL(w,b) = -\sum_{x^{(i)}∈M} y^{(i)}x^{(i)}
bL(w,b)=x(i)My(i)∇_bL(w,b) = -\sum_{x^{(i)}∈M} y^{(i)}

Randomly choose a misclassified sample (x(i),y(i))(x^{(i)}, y^{(i)}) and updata ww and bb.

w:=w+α(y(i)x(i))w := w+α(y^{(i)}x^{(i)})
b:=b+αy(i)b := b+αy^{(i)}

The whole process is

Randomly choose w and b;
While(there are misclassified samples){
    choose a misclassified sample;
    updata w and b;
}

(Original Alogrithm Implementation)

Now we implement perceptron with original algorithm in python.

First, import libraries and get our silly database.

import numpy as np
import matplotlib.pyplot as plt

def getDataSet():
    X = np.array([
        [3, 3],
        [4, 3],
        [1, 1],
        [0.5, 0.5]
    ], dtype=np.float64)


    Y = np.array([1,1,-1,-1], dtype=np.float64)
    return X,Y

Plot data to have a look

X,Y = getDataSet()
X_red = []
X_blue = []

for i in range(len(Y)):
    if Y[i] > 0:
        X_red.append(X[i])
    else:
        X_blue.append(X[i])
X_red = np.array(X_red)
X_blue = np.array(X_blue)


plt.xlim([-5,5])
plt.ylim([-5,5])
plt.plot(X_red[:,0],X_red[:,1],'ro')
plt.plot(X_blue[:,0],X_blue[:,1],'bo')
plt.show()

It's clear that the simple dataset is divided into two groups.

We need a function to decide whether we have classify all data correctly, if not, return a misclassified data.

def done(X, Y, W, b):
    for x, y in zip(X, Y):
        if (W.T.dot(x) + b) * y <= 0:
            return False, x,y
    return True,None,None

Do initialization

W = np.array([0,0],dtype=np.float64)
b = 0.0
datas = []

while True:
    _done, x,y = done(X,Y,W,b)
    if _done:
        break
    else:
        W += y*x
        b += y
        datas.append([W.copy(),b])
print 'parameters: ',W,b

Updata WW and bb if dataset is not perfectly classified.

And plot the hyperplane after each iteration.

xs = np.array([np.min(X[:,1]), np.max(X[:,1])],dtype=np.float128)
for k,para in enumerate(datas):
    ys = -(para[0][0]*xs+para[1])/(para[0][1])
    plt.plot(xs,ys,label='iteraton {}'.format(k),lw='0.5')


plt.legend(loc='best')
plt.show()

The output is

parameters:  [1. 1.] -3.0

(Dual Problem)

In original problem, we set ww and bb to 0, and when there's a misclassified sample (x(i),y(i))(x^{(i)},y^{(i)}), with updata ww and bb with the following rule

w:=w+ηyixiw := w + ηy_ix_i
b:=b+ηyib := b + ηy_i

Suppose for each sample(x(i),y(i))(x^{(i)},y^{(i)}), update parameters nin_i times

w=Δw=i=1maiyixiw = Δw = \sum_{i=1}^m a_iy_ix_i
b=Δb=i=1maiyib = Δb = \sum_{i=1}^m a_iy_i

where

ai=niηa_i = n_i*η

The output model is

f(x)=sign(i=1maiyixix+b)f(x) = sign(\sum_{i=1}^ma_iy_ix_ix+b)

Formally, the algorithm is

a_i = 0, b=0
while(there are misclassified sample){
    choose a misclassified sample;
    a_i := a_i + η
    b   := b + ηy_i
}

Since all samples only appear as inner product in dual problem, we can store a matrix with inner products, this matrix is called Gram Matrix.

G=[xixj]NNG = [x_i x_j]_{N*N}

Now let's implement the dual problem in python

(Dual problem implementation)

let's create another file. First, we need our silly dataset.

import numpy as np

def getDataSet():
    X = np.array([
        [3, 3],
        [4, 3],
        [1, 1],
        [0.5, 0.5]
    ], dtype=np.float64)


    Y = np.array([1,1,-1,-1], dtype=np.float64)
    return X,Y

The done() function is a lot different from the one in original problem. Here aia_i is a python dictionary stores how many times a sample is trained on, since np.ndarray is unhashable, we use index instead; and the grammatgram_mat is the Gram Matrix stores the inner product of samples.

The second for loop is the accumulated sum

j=1majyjxjx\sum_{j=1}^m a_jy_jx_jx
def done(X, Y, a_i, b, gram_mat):
    for i in range(len(Y)):
        w_x = 0
        for j in range(len(Y)):
            w_x += a_i[j]*Y[j]*gram_mat[i][j]
        if Y[i]*(w_x+b)<=0:
            return False, i

    return True,None

And since we need to store a Gram Matrix, let code the perceptron as a class.

class Perceptron():
    def __init__(self,X):
        self.X = X
        self.gram = np.matmul(X, X.T)
        #self.W = np.zeros(shape=[2],dtype=np.float64)
        self.W = np.zeros(shape=[X.shape[1]], dtype=np.float64)
        self.b = 0.0
        self.a_i = dict()
        for i in range(len(X)):
            self.a_i[i] = 0

        print 'gram: ',self.gram

    def parameters(self):
        return self.W, self.b

    def train(self,Y):
        while True:
            _done,i = done(self.X, Y, self.a_i,self.b, self.gram)
            if _done:
                break
            else:
                self.a_i[i] += 1
                self.b += Y[i]

        for j in range(len(Y)):
            self.W += self.a_i[j]*Y[j]*self.X[j]


    def predict(self,inX):
        #pred = self.W[0]*inX[0] + self.W[1]*inX[1]+self.b
        pred = np.inner(self.W,inX) + self.b
        if pred > 0:
            return 1
        else:
            return -1


X,Y = getDataSet()
p = Perceptron(X)
p.train(Y)
print 'parameters: ',p.parameters()

The output is

gram:  [[18.  21.   6.   3. ]
 [21.  25.   7.   3.5]
 [ 6.   7.   2.   1. ]
 [ 3.   3.5  1.   0.5]]
parameters:  (array([1., 1.]), -3.0)

We got the same paramters one the same dataset, so the prediction and plot must be the same.