Perceptron
The perceptron is a linear classification model: its input is a feature vector and its output is the predicted label of that vector, either $+1$ or $-1$. The model is $f(x) = \operatorname{sign}(w \cdot x + b)$, where $w$ and $b$ are the parameters to learn.
Strategy
Suppose the training dataset is linearly separable; the goal of the perceptron is to find a hyperplane $w \cdot x + b = 0$ that separates the positive samples from the negative ones. The empirical risk function is the sum of the distances from the misclassified samples to the hyperplane.
The distance from a point $x_0$ to the hyperplane $w \cdot x + b = 0$ is
$$\frac{1}{\|w\|} \left| w \cdot x_0 + b \right|,$$
where $\|w\|$ is the $L_2$ norm of $w$.
For each misclassified sample $(x_i, y_i)$ we have $-y_i (w \cdot x_i + b) > 0$, so its distance to the hyperplane is
$$-\frac{1}{\|w\|} y_i (w \cdot x_i + b).$$
Suppose $M$ is the set of all misclassified samples; the sum of the distances from the misclassified samples to the hyperplane is
$$-\frac{1}{\|w\|} \sum_{x_i \in M} y_i (w \cdot x_i + b).$$
Leaving out the factor $\frac{1}{\|w\|}$, we get the loss function of the perceptron.
More formally, given the training dataset
$$T = \{(x_1, y_1), (x_2, y_2), \dots, (x_N, y_N)\},$$
where $x_i \in \mathbb{R}^n$ and $y_i \in \{+1, -1\}$, the loss function of the perceptron is
$$L(w, b) = -\sum_{x_i \in M} y_i (w \cdot x_i + b),$$
where $M$ is the set of misclassified samples.
Apparently the loss function is non-negative: the fewer misclassified samples there are, and the closer the misclassified samples lie to the hyperplane, the smaller the loss function is.
For a single misclassified sample, the loss is linear in $w$ and $b$, and it is $0$ when the sample is correctly classified, so for a fixed set $M$ the loss function is continuously differentiable in $w$ and $b$.
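To make this concrete, here is a minimal numpy sketch that evaluates the loss on a tiny example; the helper name perceptron_loss and the example parameters are my own, for illustration only.

import numpy as np

def perceptron_loss(X, Y, w, b):
    # margin y_i * (w . x_i + b); a sample is misclassified when its margin <= 0
    margins = Y * (X.dot(w) + b)
    # L(w, b) = -sum of y_i * (w . x_i + b) over the misclassified samples
    return -margins[margins <= 0].sum()

X = np.array([[3, 3], [4, 3], [1, 1], [0.5, 0.5]], dtype=np.float64)
Y = np.array([1, 1, -1, -1], dtype=np.float64)
print(perceptron_loss(X, Y, w=np.array([1.0, 0.0]), b=0.0))  # 1.5: both negative points are misclassified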
Algorithm
(Original Algorithm)
The strategy of the perceptron is to minimize this loss:
$$\min_{w, b} L(w, b) = -\sum_{x_i \in M} y_i (w \cdot x_i + b).$$
We can use stochastic gradient descent to do the minimization: first choose an initial hyperplane, then repeatedly pick a misclassified point and take a gradient step on it. The gradients of the loss are
$$\nabla_w L(w, b) = -\sum_{x_i \in M} y_i x_i, \qquad \nabla_b L(w, b) = -\sum_{x_i \in M} y_i.$$
So at each step we randomly choose a misclassified sample $(x_i, y_i)$ and update $w$ and $b$ with
$$w \leftarrow w + \eta y_i x_i, \qquad b \leftarrow b + \eta y_i,$$
where $0 < \eta \le 1$ is the learning rate.
The whole process is
Randomly choose w and b;
While(there are misclassified samples){
    choose a misclassified sample;
    update w and b;
}
(Original Algorithm Implementation)
Now we implement the perceptron in its original form in Python.
First, import the libraries and get our silly dataset.
import numpy as np
import matplotlib.pyplot as plt
def getDataSet():
    X = np.array([
        [3, 3],
        [4, 3],
        [1, 1],
        [0.5, 0.5]
    ], dtype=np.float64)
    Y = np.array([1,1,-1,-1], dtype=np.float64)
    return X,Y
Plot the data to have a look.
X,Y = getDataSet()
X_red = []
X_blue = []
for i in range(len(Y)):
    if Y[i] > 0:
        X_red.append(X[i])
    else:
        X_blue.append(X[i])
X_red = np.array(X_red)
X_blue = np.array(X_blue)
plt.xlim([-5,5])
plt.ylim([-5,5])
plt.plot(X_red[:,0],X_red[:,1],'ro')
plt.plot(X_blue[:,0],X_blue[:,1],'bo')
plt.show()

It's clear that the simple dataset is divided into two groups.
We need a function to decide whether we have classified all the data correctly; if not, it returns a misclassified sample.
def done(X, Y, W, b):
    # return (True, None, None) if every sample is correctly classified,
    # otherwise (False, x, y) for the first misclassified sample found
    for x, y in zip(X, Y):
        if (W.T.dot(x) + b) * y <= 0:  # y * (w . x + b) <= 0 means misclassified
            return False, x, y
    return True, None, None
Do the initialization and run the training loop.
W = np.array([0, 0], dtype=np.float64)
b = 0.0
datas = []  # keep (W, b) after every update so we can plot each hyperplane later
while True:
    _done, x, y = done(X, Y, W, b)
    if _done:
        break
    else:
        # gradient step on the misclassified sample, with learning rate eta = 1
        W += y * x
        b += y
        datas.append([W.copy(), b])
print('parameters: ', W, b)
We update $w$ and $b$ as long as the dataset is not perfectly classified.
And plot the hyperplane after each iteration.
xs = np.array([np.min(X[:, 0]), np.max(X[:, 0])], dtype=np.float64)
for k, para in enumerate(datas):
    if para[0][1] == 0:
        continue  # skip degenerate hyperplanes with W[1] == 0 to avoid dividing by zero
    # the line W[0]*x + W[1]*y + b = 0 rewritten as y = -(W[0]*x + b) / W[1]
    ys = -(para[0][0] * xs + para[1]) / para[0][1]
    plt.plot(xs, ys, label='iteration {}'.format(k), lw=0.5)
plt.legend(loc='best')
plt.show()
The output is
parameters:  [1. 1.] -3.0

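With the learned parameters we can already classify new points via $\operatorname{sign}(w \cdot x + b)$. Here is a minimal sketch; the test points below are made up for illustration and are not part of the original example.

def predict(W, b, x):
    # sign(w . x + b), mapping the boundary case to the negative class like the training rule
    return 1 if np.inner(W, x) + b > 0 else -1

print(predict(W, b, np.array([4.0, 4.0])))  # 1, on the positive side of the learned line
print(predict(W, b, np.array([1.0, 0.5])))  # -1, on the negative side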
(Dual Problem)
In the original form we set $w$ and $b$ to $0$, and whenever there is a misclassified sample $(x_i, y_i)$ we update $w$ and $b$ with the rule
$$w \leftarrow w + \eta y_i x_i, \qquad b \leftarrow b + \eta y_i.$$
Suppose the parameters are updated $n_i$ times on sample $i$; then the learned parameters can be written as
$$w = \sum_{i=1}^{N} \alpha_i y_i x_i, \qquad b = \sum_{i=1}^{N} \alpha_i y_i,$$
where $\alpha_i = n_i \eta$. The output model is
$$f(x) = \operatorname{sign}\Bigl(\sum_{j=1}^{N} \alpha_j y_j (x_j \cdot x) + b\Bigr).$$
Formally, the algorithm is
a_i = 0 for all i, b = 0
while(there are misclassified samples){
    choose a misclassified sample (x_i, y_i);
    a_i := a_i + η
    b   := b + η y_i
}
Since in the dual problem the samples appear only through inner products, we can precompute a matrix of all pairwise inner products $x_i \cdot x_j$; this matrix is called the Gram matrix.
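For instance, with numpy the Gram matrix of our toy dataset can be precomputed in one line; this is just a quick sketch, and the Perceptron class below does exactly the same thing in its constructor.

import numpy as np

X = np.array([[3, 3], [4, 3], [1, 1], [0.5, 0.5]], dtype=np.float64)
gram = np.matmul(X, X.T)  # gram[i][j] = x_i . x_j
print(gram)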
Now let's implement the dual problem in Python.
(Dual Problem Implementation)
Let's create another file. First, we need our silly dataset again.
import numpy as np
def getDataSet():
    X = np.array([
        [3, 3],
        [4, 3],
        [1, 1],
        [0.5, 0.5]
    ], dtype=np.float64)
    Y = np.array([1,1,-1,-1], dtype=np.float64)
    return X,Y
The done() function is quite different from the one in the original form. Here a_i is a Python dictionary that stores how many times each sample has triggered an update (i.e. the $\alpha_j$ with $\eta = 1$); since np.ndarray is unhashable, we key it by index instead. And gram_mat is the Gram matrix that stores the inner products of the samples.
The inner for loop accumulates the sum $\sum_{j} \alpha_j y_j (x_j \cdot x_i)$, which is $w \cdot x_i$ expressed in the dual parameters.
def done(X, Y, a_i, b, gram_mat):
    # return (True, None) if every sample is correctly classified,
    # otherwise (False, i) with the index of the first misclassified sample
    for i in range(len(Y)):
        w_x = 0
        for j in range(len(Y)):
            # accumulate sum_j a_j * y_j * (x_j . x_i), i.e. w . x_i in dual form
            w_x += a_i[j] * Y[j] * gram_mat[i][j]
        if Y[i] * (w_x + b) <= 0:
            return False, i
    return True, None
And since we need to store the Gram matrix, let's code the perceptron as a class.
class Perceptron():
    def __init__(self, X):
        self.X = X
        # Gram matrix: gram[i][j] = x_i . x_j
        self.gram = np.matmul(X, X.T)
        self.W = np.zeros(shape=[X.shape[1]], dtype=np.float64)
        self.b = 0.0
        # a_i[i] counts how many updates sample i has triggered (eta = 1)
        self.a_i = dict()
        for i in range(len(X)):
            self.a_i[i] = 0
        print('gram: ', self.gram)

    def parameters(self):
        return self.W, self.b

    def train(self, Y):
        while True:
            _done, i = done(self.X, Y, self.a_i, self.b, self.gram)
            if _done:
                break
            else:
                self.a_i[i] += 1
                self.b += Y[i]
        # recover w = sum_j a_j * y_j * x_j from the dual parameters
        for j in range(len(Y)):
            self.W += self.a_i[j] * Y[j] * self.X[j]

    def predict(self, inX):
        pred = np.inner(self.W, inX) + self.b
        if pred > 0:
            return 1
        else:
            return -1
X,Y = getDataSet()
p = Perceptron(X)
p.train(Y)
print('parameters: ', p.parameters())
The output is
gram:  [[18.  21.   6.   3. ]
 [21.  25.   7.   3.5]
 [ 6.   7.   2.   1. ]
 [ 3.   3.5  1.   0.5]]
parameters:  (array([1., 1.]), -3.0)
We get the same parameters on the same dataset, so the predictions and the plot are the same as in the original form.
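As a quick check, we can call predict() on a couple of points and verify that they fall on the expected sides of the learned hyperplane; these test points are made up for illustration.

print(p.predict(np.array([4.0, 4.0])))  # 1, the positive side
print(p.predict(np.array([1.0, 0.5])))  # -1, the negative side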