Multisite adaptation algorithm based on low-rank representation: Criticizing a paper and developing my own algorithm, which is also full of flaws.

You may want to read my note on low-rank representation and the paper on multi-site adaptation based on low-rank representation.

Part 1: Problems with this paper

In the paper, the basic setup is: data come from $T$ different sites, and each site may have some systematic bias which makes data from different sites somewhat "heterogeneous". From these $T$ sites, choose one as the "Target Site" $S_T$, and assume there is a projection matrix $P$ which maps data from $S_T$ to the common latent space. Then for each of the other sites $S_i$, called a "Source Site", there is a projection matrix $P_i$, obtained as $P_i = P + E_{P_i}$, which maps data from $S_i$ to the latent space. The mapped data from the target site is then used to linearly represent the mapped data from the source sites, that is, $P_i X_{S_i} = P X_T Z_i + E_{S_i}$, where $X_{S_i}$ is the data from source site $S_i$ and $X_T$ is the data from target site $S_T$. The setup is illustrated by the picture below.

[Figure: multisite.png — illustration of the multi-site setup]

It is assumed that the latent space is low-rank, and the following optimization problem is given:

$$\text{min}_{P,P_i,Z_i,E_{S_i},E_{P_i}}~~~ ||P||_* + \sum_{i=1}^K\bigg( ||Z_i||_* + \alpha ||E_{S_i}||_1 + \beta ||E_{P_i}||_1 \bigg)$$
$$\text{s.t.}~~~ P_i X_{S_i} = P X_T Z_i + E_{S_i},\qquad P_i = P + E_{P_i},\qquad PP^T = I$$

This problem can be solved by solving the equivalent problem

$$\text{min}_{P,J,P_i,F_i,Z_i,E_{S_i},E_{P_i}}~~~ ||J||_* + \sum_{i=1}^K\bigg( ||F_i||_* + \alpha ||E_{S_i}||_1 + \beta ||E_{P_i}||_1 \bigg)$$
$$\text{s.t.}~~~ P_i X_{S_i} = P X_T Z_i + E_{S_i},\qquad P_i = P + E_{P_i},\qquad P = J,\qquad Z_i = F_i,\qquad PP^T = I$$

The paper then solves this problem with the Augmented Lagrange Multiplier (ALM) method. First, form the augmented Lagrangian as if the constraint $PP^T = I$ did not exist.

$$\begin{aligned}
A_c(P,J,P_i,F_i,Z_i,E_{S_i},E_{P_i},Y_{1,i},Y_{2,i},Y_{3,i},Y_4) =\ & ||J||_* + \sum_{i=1}^K\bigg( ||F_i||_* + \alpha ||E_{S_i}||_1 + \beta ||E_{P_i}||_1 \bigg)\\
& + <Y_4, P-J> + \sum_{i=1}^K\bigg( <Y_{1,i}, Z_i - F_i> + <Y_{2,i}, P_i X_{S_i} - P X_T Z_i - E_{S_i}> + <Y_{3,i}, P_i - P - E_{P_i}> \bigg)\\
& + \frac{c}{2}||P-J||_F^2 + \frac{c}{2}\sum_{i=1}^K\bigg( ||Z_i - F_i||_F^2 + ||P_i X_{S_i} - P X_T Z_i - E_{S_i}||_F^2 + ||P_i - P - E_{P_i}||_F^2 \bigg)
\end{aligned}$$

Then the optimization algorithm is given

initialize all variables
while not converged {
    Fix other variables, minimize A_c over J
    Fix other variables, minimize A_c over F_i
    Fix other variables, minimize A_c over E_{S_i}
    Fix other variables, minimize A_c over E_{P_i}
    Fix other variables, minimize A_c over Z_i
    Fix other variables, minimize A_c over P_i
    Fix other variables, minimize A_c over P
    Update dual variable Y_{1,i} += c * (Z_i - F_i)
    Update dual variable Y_{2,i} += c * (P_i X_{S_i} - P X_T Z_i - E_{S_i})
    Update dual variable Y_{3,i} += c * (P_i - P - E_{P_i})
    Update dual variable Y_4     += c * (P - J)

    P = Orth(P) to satisfy PP^T = I
}

I found 3 fatal mistakes in this article.

1. About the objective

In this article, one term in the objective of the optimization problem is $||P||_*$, the sum of the singular values of $P$. But when the equality constraint $PP^T = I$ is satisfied, performing a singular value decomposition $P = U\Sigma V^T$ gives

$$PP^T = U\Sigma V^T (U\Sigma V^T)^T = U\Sigma V^T V\Sigma U^T = U\Sigma^2 U^T = I$$

Multiplying both sides on the right by $U$, we get $U\Sigma^2 = U$. Since $\Sigma$ is a diagonal matrix with nonnegative entries, the only solution is $\Sigma = I$. In that case $P = U\Sigma V^T$ has nuclear norm equal to its number of rows, so the term $||P||_*$ is a constant whenever $PP^T = I$ holds. The term $||P||_*$ in the objective is therefore redundant.
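Here is a quick numerical sanity check of this point (a throwaway NumPy snippet of my own, not from the paper): build a $P$ with orthonormal rows and verify that its nuclear norm equals its number of rows.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 5, 12                                      # d <= m so P can have orthonormal rows

# Random P with orthonormal rows (P P^T = I), via QR of a random matrix.
Q, _ = np.linalg.qr(rng.standard_normal((m, d)))  # Q: m x d, orthonormal columns
P = Q.T                                           # P: d x m, orthonormal rows

nuclear_norm = np.linalg.svd(P, compute_uv=False).sum()
print(np.allclose(P @ P.T, np.eye(d)))            # True
print(nuclear_norm)                               # 5.0 = number of rows of P
```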

2. About the optimization problem

The optimization algorithm given in this article is not quite justified.

They constructed the augmented Lagrangian as if the quadratic equality constraint $PP^T = I$ did not exist. After each iteration, in which the primal variables are minimized over and the dual variables updated, the constraint $PP^T = I$ is enforced by brute-force orthogonalization of $P$. This step changes $P$, so $P$ no longer minimizes the Lagrangian with the other variables fixed, and the convergence guarantees of the augmented Lagrange multiplier method no longer hold.

If this algorithm does converge thanks to some magic going on underneath, they should at least give a proof, or say something like "we cannot prove convergence, but the algorithm converged in every experiment".

3. About the convexity

Quoting the paper: "the optimization of Eq.(5) is convex and can be solved by iteratively updating each variable separately".

Even if we set aside the equality constraint $PP^T = I$, which is not included when forming the augmented Lagrangian, the equality constraint $P_i X_{S_i} = P X_T Z_i + E_{S_i}$ makes the problem nonconvex: it is a quadratic equality constraint because the variables $P$ and $Z_i$ appear in the same term. (As a scalar analogy, the set $\{(p, z) : pz = 1\}$ contains $(1,1)$ and $(-1,-1)$ but not their midpoint $(0,0)$, so it is not convex.)

Thus the optimization problem is in no sense convex, due to this quadratic equality constraint. There are convergence proofs for the augmented Lagrangian method on certain nonconvex problems, but the statement "the optimization of Eq.(5) is convex" is definitely wrong.


So there are two possibilities:

  1. I do not fully understand this paper. If you find where I went wrong, please email diqiudaozhezhuan@gmail.com; it really bothers me.
  2. This article is full of flaws.

When thinking through this article, I came up with my own idea, which also turns out to be a piece of garbage. But I still want to write it down.

Part 2: My own idea.

Actually, my idea is very simple. Suppose data from different sites come from a common space. For each site $S_i$, there is a projection matrix $P_i$ which projects the data into the common space, and we then perform low-rank representation in the common space. We can arrange the $X_i$'s into a block-diagonal matrix $X$ and horizontally stack the $P_i$'s into a matrix $P$. The optimization problem can then be written as

$$\text{min}_{P,Z,E}~~~ ||Z||_* + \lambda ||E||_1 \qquad \text{s.t.}~~~ PX = PXZ + E$$

This is identical to the basic low-rank representation except that here we perform low-rank representation with $PX$. The structure of the matrix multiplication is shown in the picture below.

[Figure: block structure of the product $PX$]
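To make the block structure concrete, here is a small NumPy/SciPy check of my own (shapes are made up, and I assume each $P_i$ is square, which the $||P_i - I||_*$ term below requires anyway): stacking the $P_i$ horizontally and the $X_i$ block-diagonally makes $PX$ exactly the per-site projections laid side by side.

```python
import numpy as np
from scipy.linalg import block_diag

rng = np.random.default_rng(1)
d = 4                          # data / common-space dimension (each P_i assumed d x d)
n = [5, 7, 6]                  # number of samples at each site (made-up sizes)

Xs = [rng.standard_normal((d, ni)) for ni in n]   # per-site data X_i
Ps = [rng.standard_normal((d, d)) for _ in n]     # per-site projections P_i

X = block_diag(*Xs)            # (K*d) x (sum of n_i): block-diagonal stack of the X_i
P = np.hstack(Ps)              # d x (K*d): horizontal stack of the P_i

# PX is just the per-site projections laid side by side.
assert np.allclose(P @ X, np.hstack([Pi @ Xi for Pi, Xi in zip(Ps, Xs)]))
```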

However, this optimization has a trivial solution $P = 0$, $Z = 0$, $E = 0$. This trivial solution exists because, when mapping the data to a common space, we can throw away all the information in $X$ by setting $P = 0$, and then the common space is trivially low-rank. So we want to project $X$ into the common space while keeping some of the information in $X$. Consider the following optimization problem

$$\text{min}_{P,Z,E}~~~ ||Z||_* + \sum_i \bigg( \alpha ||P_i - I||_* + \beta ||E_i||_1 \bigg) \qquad \text{s.t.}~~~ PX = PXZ + E$$

The term $||P_i - I||_*$ in the objective is quite heuristic, or, put another way, not fully justified. The motivation is: if $||P_i - I||_*$ is small, then the singular values of $P_i - I$ tend to 0, so the singular values of $P_i$ tend to 1, and we avoid the situation where $P_i$ throws away all the information in $X$.

This problem is nonconvex due to the quadratic equality constraint $PX = PXZ + E$, but we can still use the augmented Lagrangian method to reach a local minimum.

To solve the optimization problem, we can solve the equivalent problem

$$\text{min}_{P_i,Z,E_i,Q_i,J}~~~ ||J||_* + \sum_i \bigg( \alpha ||Q_i||_* + \beta ||E_i||_1 \bigg)$$
$$\text{s.t.}~~~ PX = PXZ + E : Y_1,\qquad J = Z : Y_2,\qquad Q_i = P_i - I : Y_{3,i}$$

where the symbol after each colon denotes the corresponding dual variable.

First, we form the augmented Lagrangian

$$\begin{aligned}
A(P_i, Z, E_i, Q_i, J, Y_1, Y_2, Y_{3,i}) =\ & ||J||_* + \sum_i \bigg( \alpha ||Q_i||_* + \beta ||E_i||_1 \bigg)\\
& + <Y_1, PX - PXZ - E> + <Y_2, J - Z> + \sum_i <Y_{3,i}, Q_i - P_i + I>\\
& + \frac{c}{2}||PX - PXZ - E||_F^2 + \frac{c}{2}||J - Z||_F^2 + \frac{c}{2}\sum_i ||Q_i - P_i + I||_F^2
\end{aligned}$$

We alternately minimize the Lagrangian over each variable.

1. minimize over $J$

$$\text{min}_{J}~~~ A(P_i, Z, E_i, Q_i, J, Y_1, Y_2, Y_{3,i})$$

This gives the optimization problem

$$\text{min}_{J}~~~ ||J||_* + <Y_2, J - Z> + \frac{c}{2}||J - Z||_F^2$$

The closed-form solution is given by singular value shrinkage

$$J = \mathcal{D}_{1/c}\Big(Z - \frac{Y_2}{c}\Big) \qquad (1)$$
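As a sketch, the shrinkage operator $\mathcal{D}_\tau$ is a few lines of NumPy (the helper name `svt` is mine); update (1) is then a single call.

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding D_tau(M): shrink every singular value of M by tau."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U * np.maximum(s - tau, 0.0)) @ Vt

# Update (1): J = D_{1/c}(Z - Y2 / c)
# J = svt(Z - Y2 / c, 1.0 / c)
```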

2. minimize over $Q_i$

$$\text{min}_{Q_i}~~~ A(P_i, Z, E_i, Q_i, J, Y_1, Y_2, Y_{3,i})$$

This gives the optimization problem

$$\text{min}_{Q_i}~~~ \alpha ||Q_i||_* + <Y_{3,i}, Q_i - P_i + I> + \frac{c}{2}||Q_i - P_i + I||_F^2$$

The closed-form solution is given by singular value shrinkage

$$Q_i = \mathcal{D}_{\alpha/c}\Big(P_i - I - \frac{Y_{3,i}}{c}\Big) \qquad (2)$$
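Update (2) reuses the same hypothetical `svt` helper from the sketch above, only with threshold $\alpha/c$ and a different argument.

```python
import numpy as np

def update_Qi(Pi, Y3i, alpha, c):
    """Q_i = D_{alpha/c}(P_i - I - Y_{3,i}/c); P_i is assumed square, svt as sketched above."""
    return svt(Pi - np.eye(Pi.shape[0]) - Y3i / c, alpha / c)
```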

3. minimize over $P$

$$\text{min}_{P}~~~ A(P_i, Z, E_i, Q_i, J, Y_1, Y_2, Y_{3,i})$$

This gives the optimization problem

$$\text{min}_P~~~ \frac{c}{2}||PX - PXZ - E||_F^2 + \frac{c}{2}\sum_i ||Q_i - P_i + I||_F^2 + <Y_1, PX - PXZ - E> + \sum_i <Y_{3,i}, Q_i - P_i + I>$$

This problem can be split into separate problems, one for each $P_i$:

$$\text{min}_{P_i}~~~ \frac{c}{2}\Big|\Big|P_iX_iF_i + \sum_{j \neq i} P_jX_jF_j - E\Big|\Big|_F^2 + \frac{c}{2}||Q_i - P_i + I||_F^2 + <Y_1, P_iX_iF_i> + <Y_{3,i}, Q_i - P_i + I>$$

where $F_i$ is the $i$th row block of $I - Z$.

The closed-form solution is obtained by solving the linear system

$$P_i \Big[(X_iF_i)(X_iF_i)^T + I\Big] = \Big(Q_i + I + \frac{Y_{3,i}}{c}\Big) - \Big(G + \frac{Y_1}{c}\Big)(X_iF_i)^T \qquad (3)$$

where $G = \sum_{j \neq i} P_jX_jF_j - E$.
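A sketch of update (3) in NumPy (function and variable names are mine): the coefficient matrix $(X_iF_i)(X_iF_i)^T + I$ is symmetric positive definite, so one linear solve per site suffices.

```python
import numpy as np

def update_Pi(Xi, Fi, Qi, Y3i, G, Y1, c):
    """Solve P_i [(X_i F_i)(X_i F_i)^T + I] = (Q_i + I + Y3i/c) - (G + Y1/c)(X_i F_i)^T,
    where G = sum_{j != i} P_j X_j F_j - E is computed by the caller."""
    d = Qi.shape[0]
    B = Xi @ Fi                              # d x N
    M = B @ B.T + np.eye(d)                  # symmetric positive definite, d x d
    rhs = Qi + np.eye(d) + Y3i / c - (G + Y1 / c) @ B.T
    # P_i M = rhs with M symmetric  =>  P_i = solve(M, rhs^T)^T
    return np.linalg.solve(M, rhs.T).T
```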

4. minimize over $Z$

$$\text{min}_{Z}~~~ A(P_i, Z, E_i, Q_i, J, Y_1, Y_2, Y_{3,i})$$

This gives the optimization problem

$$\text{min}_Z~~~ \frac{c}{2}||PX - PXZ - E||_F^2 + \frac{c}{2}||J - Z||_F^2 + <Y_1, PX - PXZ - E> + <Y_2, J - Z>$$

The closed-form solution is obtained by solving the linear system

$$\Big[(PX)^T(PX) + I\Big]Z = (PX)^T\Big[PX - E + \frac{Y_1}{c}\Big] + J + \frac{Y_2}{c} \qquad (4)$$

Implementation details: in (3) and (4) the "identity plus low-rank" structure of the coefficient matrices is crying out to be exploited; use the matrix inversion lemma or block elimination to solve these linear systems instead of factoring the full matrices.
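For (4), for example, $(PX)^T(PX) + I$ has the size of the total sample count $N$, while $PX$ has only $d$ rows, so the matrix inversion lemma reduces the work to a single $d \times d$ solve. A sketch under these assumptions (names are mine):

```python
import numpy as np

def update_Z(PX, E, J, Y1, Y2, c):
    """Solve [(PX)^T (PX) + I] Z = (PX)^T [PX - E + Y1/c] + J + Y2/c for Z without
    forming the N x N coefficient matrix, using the matrix inversion lemma
    (I + M^T M)^{-1} = I - M^T (I + M M^T)^{-1} M with M = PX (d x N, d << N)."""
    d = PX.shape[0]
    rhs = PX.T @ (PX - E + Y1 / c) + J + Y2 / c      # N x N right-hand side
    small = np.eye(d) + PX @ PX.T                    # d x d system instead of N x N
    return rhs - PX.T @ np.linalg.solve(small, PX @ rhs)
```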

5. minimize over $E$

$$\text{min}_{E}~~~ A(P_i, Z, E_i, Q_i, J, Y_1, Y_2, Y_{3,i})$$

This gives the optimization problem

$$\text{min}_{E}~~~ \beta ||E||_1 + \frac{c}{2}||PX - PXZ - E||_F^2 + <Y_1, PX - PXZ - E>$$

This problem can be split into separate problems, one for each $E_i$:

$$\text{min}_{E_i}~~~ \beta ||E_i||_1 + \frac{c}{2}||PXF_i - E_i||_F^2 + <Y_{1,i}, -E_i>$$

where $F_i$ is the $i$th column block of $I - Z$ and $Y_{1,i}$ is the $i$th column block of the dual variable $Y_1$.

This problem can be rewritten as

$$\text{min}_{E_i}~~~ \frac{\beta}{c}||E_i||_1 + \frac{1}{2}\Big|\Big|E_i - \Big(PXF_i + \frac{Y_{1,i}}{c}\Big)\Big|\Big|_F^2$$

which can be solved with a primal-dual interior point method. For the details, please refer to the appendix.
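As a bookkeeping note (my own sketch, with a placeholder `solve_l1_ls(k, v)` standing in for the interior point solver derived in the appendix): the columns of $E_i$ decouple, and multiplying the per-column objective by 2 matches the appendix form with $k = 2\beta/c$.

```python
import numpy as np

def update_Ei(PXFi, Y1i, beta, c, solve_l1_ls):
    """Column-wise E_i update. Each column solves (beta/c)*||e||_1 + 0.5*||e - v||_2^2,
    i.e. the appendix problem min_x k*||x||_1 + ||x - v||_2^2 with k = 2*beta/c.
    solve_l1_ls(k, v) is a placeholder for the solver from the appendix."""
    V = PXFi + Y1i / c
    return np.column_stack([solve_l1_ls(2.0 * beta / c, V[:, j]) for j in range(V.shape[1])])
```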

We alternately optimize over each variable and update the dual variables accordingly in each iteration of the augmented Lagrangian method.


This method also has some problems.

1. About the objective

The term $||P_i - I||_*$ pushes the singular values of $P_i$ toward 1. However, we have no reason to believe that the "right" $P$ has all singular values equal to 1; all we can do is prevent $P$ from throwing away all the information in $X$.

2. About the convexity

Due to the equality constraint $PX = PXZ + E$, the problem is nonconvex. Although we can reach a local minimum with the augmented Lagrangian method and obtain a "fairly good" solution by running multiple trials from different initial points, it is not as elegant as a convex problem.

3. About the complexity

In each iteration, assuming there are $K$ sites and each data point has dimension $n$, we need to compute $K$ SVDs of $n \times n$ matrices when evaluating (2), each of which costs $O(n^3)$. When the data are high-dimensional, the problem quickly becomes intractable.

Part 3: Results of my own idea.

Below are the results on the Yale face dataset when I set $\alpha = 0.5$ and $\beta = 0.05$.

[Result images from the Yale face dataset]

These images have lower quality after the low-rank representation, but remember that the purpose of the low-rank representation is to map data into a common space. In these results, the shadow at the eyes is removed in the first image and the shadow near the nose is removed in the second image, making those two images fairly "homogeneous".

Only one experiment was done, since solving the optimization problem is quite time-consuming. I think the results could be better if proper values of $\alpha$ and $\beta$ were chosen.

Part 4: Appendix: Solving the one-norm plus squared two-norm problem

Consider the optimization problem

$$\text{min}_x~~~ k||x||_1 + ||x - v||_2^2$$

This can be reformulated into an equivalent problem

$$\text{min}_{x,y}~~~ k\,\mathbf{1}^Ty + ||x - v||_2^2 \qquad \text{s.t.}~~~ -y \leq x \leq y$$

whose objective is differentiable (and whose constraints are linear).

Write this in a standard form:

$$\text{min}_{(x,y)}~~~ k\begin{pmatrix} \mathbf{0}^T & \mathbf{1}^T \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} + \bigg|\bigg|\begin{pmatrix} I & 0 \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} - v\bigg|\bigg|_2^2$$
$$\text{s.t.}~~~ \begin{pmatrix} -I & -I \\ I & -I \end{pmatrix}\begin{pmatrix} x \\ y \end{pmatrix} \leq \begin{pmatrix} 0 \\ 0 \end{pmatrix}$$

With a change of notation, we can write the problem as

$$\text{min}_x~~~ c^Tx + ||Ax - b||_2^2 \qquad \text{s.t.}~~~ Dx \leq 0 : \lambda$$

Form the Lagrangian

$$\mathcal{L}(x, \lambda) = c^Tx + x^T(A^TA)x - 2(A^Tb)^Tx + b^Tb + \lambda^TDx$$

The modified KKT conditions for the log barrier are

$$\begin{cases} \nabla_x \mathcal{L}(x,\lambda) = c + 2A^TAx - 2A^Tb + D^T\lambda = 0 \\ -\text{diag}(\lambda)Dx = \frac{1}{t}\mathbf{1} \end{cases}$$

The residual is given by

$$r(x,\lambda) = \begin{pmatrix} c + 2A^TAx - 2A^Tb + D^T\lambda \\ -\text{diag}(\lambda)Dx - \frac{1}{t}\mathbf{1} \end{pmatrix}$$

Then form the KKT system

$$\begin{pmatrix} 2A^TA & D^T \\ -\text{diag}(\lambda)D & -\text{diag}(Dx) \end{pmatrix}\begin{pmatrix} dx \\ d\lambda \end{pmatrix} = -\begin{pmatrix} c + 2A^TAx - 2A^Tb + D^T\lambda \\ -\text{diag}(\lambda)Dx - \frac{1}{t}\mathbf{1} \end{pmatrix} = \begin{pmatrix} -r_d \\ -r_c \end{pmatrix}$$

where $r_d$ and $r_c$ stand for the dual residual and the centrality residual, respectively. Then solve the KKT system by block elimination:

$$\big(2A^TA - D^T\text{diag}(Dx)^{-1}\text{diag}(\lambda)D\big)\,dx = -r_d - D^T\text{diag}(Dx)^{-1}r_c$$
$$\text{diag}(\lambda)D\,dx + \text{diag}(Dx)\,d\lambda = r_c$$

For more details about the primal-dual interior point method, refer to my note.

Implementation details: notice the structure of the matrices and make sure you exploit the sparsity when solving the KKT system.
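To make the block-elimination step concrete, here is a minimal NumPy sketch of a single primal-dual Newton step in the notation above (function and variable names are mine; the backtracking line search, the update of $t$, and the stopping test are omitted).

```python
import numpy as np

def newton_step(x, lam, t, cvec, A, b, D):
    """One primal-dual Newton step for min c^T x + ||Ax - b||_2^2 s.t. Dx <= 0,
    using the residuals and block elimination derived above.
    Requires a strictly feasible point: D @ x < 0 elementwise and lam > 0."""
    Dx = D @ x
    r_dual = cvec + 2 * A.T @ (A @ x) - 2 * A.T @ b + D.T @ lam
    r_cent = -lam * Dx - 1.0 / t

    # (2 A^T A - D^T diag(Dx)^{-1} diag(lam) D) dx = -r_dual - D^T diag(Dx)^{-1} r_cent
    H = 2 * A.T @ A - D.T @ ((lam / Dx)[:, None] * D)
    dx = np.linalg.solve(H, -r_dual - D.T @ (r_cent / Dx))

    # diag(lam) D dx + diag(Dx) dlam = r_cent
    dlam = (r_cent - lam * (D @ dx)) / Dx
    return dx, dlam

# For min_x k*||x||_1 + ||x - v||_2^2 via the (x, y) reformulation above, the data are:
#   n    = len(v)
#   cvec = np.concatenate([np.zeros(n), k * np.ones(n)])
#   A    = np.hstack([np.eye(n), np.zeros((n, n))])
#   b    = v
#   D    = np.block([[-np.eye(n), -np.eye(n)], [np.eye(n), -np.eye(n)]])
```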